Animal 911 Calls Extraction from Rainforest Cafe Report
This PDF is a service call report covering 911 incidents at the Rainforest Cafe in Niagara Falls, NY. We're hunting for animals! The data is formatted as a spreadsheet within the PDF, and challenges include varied column widths, borderless tables, and large swaths of missing data.
# Exclude the "N Records Found" line on the last page only
pages[-1].add_exclusion('text:regex(\\d+ Records Found)')
View full example →
# Exclude every "N Records Found" match, evaluated on each page
pdf.add_exclusion(
    lambda page: page.find_all('text:regex(\\d+ Records Found)')
)
View full example →
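To see what the `\d+ Records Found` pattern in the snippets above would and would not catch, it can be tried on its own with Python's `re` module. This is only a sketch: the sample lines are made up rather than taken from the report, and it assumes the selector matches anywhere in the text (search semantics), which the snippets do not confirm.

```python
import re

# Same pattern as in the selector; a raw string avoids the doubled backslash.
FOOTER = re.compile(r"\d+ Records Found")

# Made-up sample lines, not taken from the actual report.
lines = [
    "Animal Complaint 02/14 Closed",
    "37 Records Found",
    "Records Found",  # no leading count, so the pattern does not match
]

# Keep only the lines the pattern does not match.
kept = [line for line in lines if not FOOTER.search(line)]
print(kept)
```

The count is required by `\d+`, so a bare "Records Found" line would survive the exclusion.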
Complex Extraction of Law Enforcement Complaints
This PDF contains a set of complaint records from a local law enforcement agency. Challenges include its relational data structure, unusual formatting common in the region, and redactions that disrupt automatic parsing.
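# Exclude the 'L.E.A. Data Technologies' line and everything below it
pdf.add_exclusion(lambda page: page.find(text='L.E.A. Data Technologies').below(include_source=True))
# Exclude the 'Complaints By Date' line and everything above it
pdf.add_exclusion(lambda page: page.find(text='Complaints By Date').above(include_source=True))
# Preview the page with the excluded areas blacked out
page.show(exclusions='black')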
View full example →
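The effect of pairing `above(include_source=True)` with `below(include_source=True)` can be pictured on a plain list of text lines. The sketch below is a pure-Python model, not Natural PDF's implementation: `excluded_indices` is a hypothetical helper, the sample lines are invented, and exact string equality stands in for whatever matching `find(text=...)` actually performs.

```python
def excluded_indices(lines, header_text, footer_text):
    """Indices dropped when the header anchor and everything above it,
    and the footer anchor and everything below it, are excluded."""
    header = lines.index(header_text)
    footer = lines.index(footer_text)
    # "include_source=True" is modeled by keeping the anchor line itself
    # inside each excluded range.
    above = set(range(0, header + 1))
    below = set(range(footer, len(lines)))
    return above | below


# Invented page content, top to bottom.
lines = [
    "Sample Police Department",
    "Complaints By Date",
    "Complaint #1",
    "Complaint #2",
    "L.E.A. Data Technologies",
    "Page 1 of 3",
]

drop = excluded_indices(lines, "Complaints By Date", "L.E.A. Data Technologies")
kept = [text for i, text in enumerate(lines) if i not in drop]
print(kept)
```

Only the lines strictly between the two anchors remain, which is the body of the page that the extraction is after.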
Extracting Text from Georgia Legislative Bills
This PDF contains bills from the Georgia legislature, published yearly. Challenges include extracting marked-up text such as underlines and strikethroughs, as well as line numbers that complicate text extraction.
pdf.add_exclusion('text:strikeout')
View full example →
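# Exclude a strip along the left edge (everything left of x=70)
pdf.add_exclusion(lambda page: page.region(right=70))
# Exclude the top of the page (everything above y=50)
pdf.add_exclusion(lambda page: page.region(bottom=50))
# Exclude the bottom 100 points of the page
pdf.add_exclusion(lambda page: page.region(top=page.height-100))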
View full example →
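The three margin regions above can be modeled as bounding boxes to check which words they would remove. This is a minimal sketch under stated assumptions, not the library's code: coordinates are taken to start at the top-left corner, unspecified edges default to the page edge, a word is dropped if it overlaps any excluded box, and `Box`, `region`, `overlaps`, the page size, and the word positions are all hypothetical.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    x0: float
    top: float
    x1: float
    bottom: float


PAGE_W, PAGE_H = 612.0, 792.0  # assumed US Letter page, in points


def region(left: float = 0.0, top: float = 0.0,
           right: Optional[float] = None,
           bottom: Optional[float] = None) -> Box:
    """Build a box; edges that are not given default to the page edge."""
    return Box(left, top,
               PAGE_W if right is None else right,
               PAGE_H if bottom is None else bottom)


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share some area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.top < b.bottom and b.top < a.bottom


exclusions = [
    region(right=70),           # left strip
    region(bottom=50),          # top strip
    region(top=PAGE_H - 100),   # bottom strip
]

# Invented words with invented positions.
words = [
    ("12", Box(40, 300, 55, 312)),           # in the left strip
    ("SECTION 1.", Box(100, 300, 180, 312)),  # body text
    ("House Bill", Box(250, 20, 350, 34)),    # in the top strip
    ("- 3 -", Box(290, 740, 320, 752)),       # in the bottom strip
]

kept = [text for text, box in words
        if not any(overlaps(box, ex) for ex in exclusions)]
print(kept)
```

Writing the bottom strip relative to the page height, as the original snippet does with `page.height-100`, keeps the exclusion correct when pages differ in size.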
Natural PDF basics with text and tables
Learn the fundamentals of Natural PDF: opening PDFs, extracting text with layout preservation, selecting elements by criteria, spatial navigation, and managing exclusion zones. A good starting point for PDF data extraction.
# Exclude top header area
page.add_exclusion(top)
# Exclude area below last line
page.add_exclusion(bottom)
View full example →
page.add_exclusion(top)
# Exclude area below last line
page.add_exclusion(bottom)
# Now extract text without excluded areas
text = page.extract_text()
View full example →
print("BEFORE EXCLUSION:", pdf.pages[0].extract_text()[:200])
# Add header exclusion to all pages
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
print("AFTER EXCLUSION:", pdf.pages[0].extract_text()[:200])
View full example →