add_exclusion

31 usages across 4 PDFs

Animal 911 Calls Extraction from Rainforest Cafe Report

This PDF is a service call report covering 911 incidents at the Rainforest Cafe in Niagara Falls, NY. We're hunting for animals! The data is formatted as a spreadsheet within the PDF, and challenges include varied column widths, borderless tables, and large swaths of missing data.

pages[-1].add_exclusion('text:regex(\\d+ Records Found)')

pdf.add_exclusion(
  lambda page: page.find_all('text:regex(\\d+ Records Found)')
)
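To see what the pattern in the selector above would match, here is a minimal sketch using only Python's standard `re` module. The sample lines are made up for illustration, and the sketch assumes search-style matching (the pattern may appear anywhere in the text); how Natural PDF applies the pattern internally is not stated here. Note that `'\\d+'` in a regular Python string is the same pattern as `r'\d+'`.

```python
import re

# Same pattern as in the selector, written as a raw string.
RECORDS_FOOTER = re.compile(r"\d+ Records Found")

# Hypothetical sample lines, not taken from the actual report.
lines = [
    "34 Records Found",
    "Records Found",
    "Total: 120 Records Found",
    "ANIMAL COMPLAINT",
]

# Keep only the lines where the pattern occurs somewhere in the text.
matches = [line for line in lines if RECORDS_FOOTER.search(line)]
print(matches)
```

The line without a leading count is not matched, because `\d+` requires at least one digit before the space.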

Complex Extraction of Law Enforcement Complaints

This PDF contains a set of complaint records from a local law enforcement agency. Challenges include its relational data structure, unusual formatting common in the region, and redactions that disrupt automatic parsing.

pdf.add_exclusion(lambda page: page.find(text='L.E.A. Data Technologies').below(include_source=True))
pdf.add_exclusion(lambda page: page.find(text='Complaints By Date').above(include_source=True))

page.show(exclusions='black')

Extracting Text from Georgia Legislative Bills

This PDF contains bills from the Georgia legislature, published yearly. Challenges include extracting marked-up text such as underlines and strikethroughs, and line numbers that complicate text extraction.

pdf.add_exclusion('text:strikeout')

pdf.add_exclusion(lambda page: page.region(right=70))
pdf.add_exclusion(lambda page: page.region(bottom=50))
pdf.add_exclusion(lambda page: page.region(top=page.height-100))
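The three region-based exclusions above can be pictured with a toy model. The sketch below is plain Python, not Natural PDF's implementation: `Rect`, `Word`, `ToyPage`, and `extract_text` are hypothetical names, the coordinates assume a top-left origin, and the sample words are invented. Each exclusion is a callable that receives a page and returns a zone, mirroring the lambda pattern used in the calls above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # horizontal position, measured from the left edge
    y: float  # vertical position, measured from the top edge


@dataclass(frozen=True)
class ToyPage:
    width: float
    height: float
    words: Tuple[Word, ...]


# One callable per exclusion, evaluated separately for each page.
exclusions: List[Callable[[ToyPage], Rect]] = [
    lambda page: Rect(0, 0, 70, page.height),                          # left strip
    lambda page: Rect(0, 0, page.width, 50),                           # top strip
    lambda page: Rect(0, page.height - 100, page.width, page.height),  # bottom strip
]


def extract_text(page: ToyPage, exclusions: List[Callable[[ToyPage], Rect]]) -> str:
    """Join the words that fall outside every exclusion zone."""
    zones = [make_zone(page) for make_zone in exclusions]
    kept = [w.text for w in page.words
            if not any(zone.contains(w.x, w.y) for zone in zones)]
    return " ".join(kept)


page = ToyPage(612, 792, (
    Word("House Bill 1", 300, 30),    # falls in the top strip
    Word("1", 40, 120),               # line number in the left strip
    Word("A BILL", 150, 120),
    Word("2", 40, 140),               # line number in the left strip
    Word("to be entitled", 150, 140),
    Word("- 1 -", 300, 750),          # falls in the bottom strip
))

result = extract_text(page, exclusions)
print(result)
```

Because each exclusion is a function of the page rather than a fixed rectangle, the bottom strip adapts to pages of different heights.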

Natural PDF basics with text and tables

Learn the fundamentals of Natural PDF: opening PDFs, extracting text with layout preservation, selecting elements by criteria, spatial navigation, and managing exclusion zones. A good starting point for PDF data extraction.

# Exclude top header area
page.add_exclusion(top)

# Exclude area below last line
page.add_exclusion(bottom)

# Now extract text without excluded areas
text = page.extract_text()

# Preview the first 200 characters before any exclusion is registered
print("BEFORE EXCLUSION:", pdf.pages[0].extract_text()[:200])
# Add header exclusion to all pages
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
# Preview the same page again, now with the header area excluded
print("AFTER EXCLUSION:", pdf.pages[0].extract_text()[:200])