Animal 911 Calls Extraction from Rainforest Cafe Report
This PDF is a service call report covering 911 incidents at the Rainforest Cafe in Niagara Falls, NY. We're hunting for animals! The data is formatted as a spreadsheet within the PDF, and challenges include varied column widths, borderless tables, and large swaths of missing data.
pages[-1].add_exclusion('text:regex(\\d+ Records Found)')
View full example →
pdf.add_exclusion(
lambda page: page.find_all('text:regex(\\d+ Records Found)')
)
View full example →
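To see what the pattern in these exclusions targets, the regular expression can be tried on its own with Python's standard `re` module. This is only a sketch: the sample strings are made up, and it assumes the `text:regex(...)` selector searches each text element's content with the given pattern.

```python
import re

# The same pattern used in the exclusion selector above.
# A raw string keeps the backslash in \d intact.
RECORDS_FOOTER = re.compile(r"\d+ Records Found")

# Hypothetical text lines; the real report's wording may differ.
sample_lines = [
    "Animal Complaint",
    "37 Records Found",
    "Records Found",
]

# Keep only the lines the pattern would flag for exclusion.
footer_lines = [line for line in sample_lines if RECORDS_FOOTER.search(line)]
print(footer_lines)
```

Only the line with a leading count matches, which is why the pattern is a reasonable way to isolate the record-count footer from the table rows.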
Complex Extraction of Law Enforcement Complaints
This PDF contains a set of complaint records from a local law enforcement agency. Challenges include its relational data structure, unusual formatting common in the region, and redactions that disrupt automatic parsing.
pdf.add_exclusion(lambda page: page.find(text='L.E.A. Data Technologies').below(include_source=True))
pdf.add_exclusion(lambda page: page.find(text='Complaints By Date').above(include_source=True))
page.show(exclusions='black')
View full example →
Extracting Business Insurance Details from BOP PDF
This PDF is a complex insurance policy document generated for small businesses requiring BOP coverage. It packs an overwhelming amount of information into 111 pages. Forms differ slightly between carriers, which makes extraction inconsistent, and templated layouts mean that even standard sections can shift when the document is generated by different software.
# pdf.add_exclusion('text[color~=red]')
pdf.find_all('text[color~=red]').exclude()
View full example →
pdf.add_exclusion(lambda page: page.region(bottom=100))
pdf.add_exclusion(lambda page: page.region(top=page.height-70))
View full example →
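The two lambdas above carve out one band at the top and one at the bottom of every page. The arithmetic can be sketched without the library. The `Band` class and `header_footer_bands` helper below are hypothetical, and the sketch assumes coordinates measured in points from the top-left corner of the page, that an omitted edge defaults to the page boundary, and a US Letter page height of 792.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """A horizontal strip of the page, measured from the top edge (hypothetical helper)."""
    top: float
    bottom: float

    def contains(self, y: float) -> bool:
        return self.top <= y <= self.bottom


def header_footer_bands(page_height: float, header: float = 100, footer: float = 70):
    """Mirror region(bottom=header) and region(top=page_height - footer)."""
    return Band(0, header), Band(page_height - footer, page_height)


header_band, footer_band = header_footer_bands(page_height=792)

# y-positions of three made-up text lines
for y in (40, 400, 760):
    excluded = header_band.contains(y) or footer_band.contains(y)
    print(y, "excluded" if excluded else "kept")
```

Computing the footer from `page.height` rather than a fixed coordinate keeps the band anchored to the bottom edge even when page sizes differ within a document.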
Extracting Data Tables from Oklahoma Booze Licensees PDF
This PDF contains detailed tables listing alcohol licensees in Oklahoma. Multi-line cells make accurate extraction difficult, and rows are separated by alternating background colors ("zebra stripes") rather than ruled lines, which complicates row detection.
print("Before exclusions:", page.extract_text()[:200])
# Add exclusions
pdf.add_exclusion(lambda page: page.find(text="PREMISE").above())
pdf.add_exclusion(lambda page: page.find("text:regex(Page \\d+ of)").expand())
print("After exclusions:", page.extract_text()[:200])
View full example →
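The page-number selector relies on the regex token `\d`. Inside an ordinary Python string literal, `\d` is not a recognized escape sequence, and recent Python versions warn about it. Doubling the backslash or using a raw string produces the same pattern text without the warning. A small standard-library check, using a made-up footer string:

```python
import re

escaped = "Page \\d+ of"   # backslash doubled in a normal string
raw = r"Page \d+ of"       # raw string, backslash left alone

# Both literals produce the identical pattern text.
assert escaped == raw

# Hypothetical footer text, for illustration only.
match = re.search(raw, "Page 3 of 12")
print(match.group(0) if match else None)
```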
Extracting Text from Georgia Legislative Bills
This PDF contains legislative bills from the Georgia legislature, published yearly. Challenges include extracting marked-up text such as underlines and strikethroughs, and handling the line numbers that interfere with text extraction.
pdf.add_exclusion('text:strikeout')
View full example →
pdf.add_exclusion(lambda page: page.region(right=70))
pdf.add_exclusion(lambda page: page.region(bottom=50))
pdf.add_exclusion(lambda page: page.region(top=page.height-100))
View full example →
Natural PDF basics with text and tables
Learn the fundamentals of Natural PDF: opening PDFs, extracting text with layout preservation, selecting elements by criteria, spatial navigation, and managing exclusion zones. A good starting point for PDF data extraction.
# Exclude top header area
page.add_exclusion(top)
# Exclude area below last line
page.add_exclusion(bottom)
# Now extract text without excluded areas
text = page.extract_text()
View full example →
print("BEFORE EXCLUSION:", pdf.pages[0].extract_text()[:200])
# Add header exclusion to all pages
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
print("AFTER EXCLUSION:", pdf.pages[0].extract_text()[:200])
View full example →