Complex Extraction of Law Enforcement Complaints
This PDF contains a set of complaint records from a local law enforcement agency. Challenges include its relational data structure, unusual formatting common in the region, and redactions that disrupt automatic parsing.
pdf.add_exclusion(lambda page: page.find(text='L.E.A. Data Technologies').below(include_source=True))
pdf.add_exclusion(lambda page: page.find(text='Complaints By Date').above(include_source=True))
page.show(exclusions='black')
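The idea behind these two exclusions can be sketched without any PDF library. This is a minimal illustration, not the library's implementation of `add_exclusion`: the `Word` type, the `apply_exclusions` helper, the sample text, and all coordinates are hypothetical. It drops everything at or above a header anchor and at or below a footer anchor, which is roughly what `include_source=True` asks for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    top: float     # distance from the top of the page, in points
    bottom: float


def apply_exclusions(words, header_bottom, footer_top):
    """Keep only words that lie entirely between the header and footer zones."""
    return [w for w in words if w.top >= header_bottom and w.bottom <= footer_top]


# Hypothetical page content; text and coordinates are made up for illustration.
words = [
    Word("Complaints By Date", 20, 32),
    Word("Sample record A", 60, 72),
    Word("Sample record B", 80, 92),
    Word("L.E.A. Data Technologies", 760, 772),
]

# Mimic include_source=True: the anchor elements themselves are excluded too.
header_bottom = words[0].bottom
footer_top = words[-1].top

kept = apply_exclusions(words, header_bottom, footer_top)
print([w.text for w in kept])
```

Only the two sample records survive, since the header and footer anchors fall inside the excluded zones.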
Extracting Data Tables from Oklahoma Booze Licensees PDF
This PDF contains detailed tables listing alcohol licensees in Oklahoma. Multi-line cells make the data hard to extract accurately, and rows are separated by alternating background colors ("zebra stripes") rather than ruling lines, which complicates telling one row from the next.
header = page.find(text="PREMISE").above()
footer = page.find(r"text:regex(Page \d+ of)")  # raw string keeps \d intact for the regex
(header + footer).show()
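One way to think about the zebra-stripe problem is to treat each shaded rectangle, and each unshaded gap between two shaded rectangles, as a row band, then assign every text line to the band containing its vertical midpoint. The sketch below is library-independent and rests on assumptions: the helpers `row_bands` and `group_by_band` are hypothetical, every gap is assumed to hold exactly one row, and the names and coordinates are placeholders rather than data from the PDF.

```python
def row_bands(shaded, table_top, table_bottom):
    """Turn shaded stripes into a full list of row bands, filling unshaded gaps."""
    bands = []
    cursor = table_top
    for top, bottom in sorted(shaded):
        if top > cursor:
            bands.append((cursor, top))   # unshaded row between stripes
        bands.append((top, bottom))       # shaded row
        cursor = bottom
    if cursor < table_bottom:
        bands.append((cursor, table_bottom))
    return bands


def group_by_band(lines, bands):
    """Join text lines that fall in the same band into a single cell value."""
    rows = [[] for _ in bands]
    for text, top, bottom in lines:
        mid = (top + bottom) / 2
        for index, (band_top, band_bottom) in enumerate(bands):
            if band_top <= mid < band_bottom:
                rows[index].append(text)
                break
    return [" ".join(parts) for parts in rows if parts]


# Placeholder geometry: two shaded stripes inside a table spanning y=100..200.
shaded = [(100, 130), (150, 180)]
bands = row_bands(shaded, table_top=100, table_bottom=200)

# Placeholder text lines as (text, top, bottom); two cells wrap onto two lines.
lines = [
    ("EXAMPLE LIQUOR", 102, 112),
    ("STORE LLC", 115, 125),
    ("SAMPLE BAR", 133, 143),
    ("DEMO GRILL", 152, 162),
    ("AND TAVERN", 165, 175),
]

print(group_by_band(lines, bands))
```

Grouping by band rather than by line is what keeps a wrapped cell such as the first one together as a single value.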
print("Before exclusions:", page.extract_text()[:200])
# Add exclusions
pdf.add_exclusion(lambda page: page.find(text="PREMISE").above())
pdf.add_exclusion(lambda page: page.find(r"text:regex(Page \d+ of)").expand())  # raw string keeps \d intact
print("After exclusions:", page.extract_text()[:200])
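The footer pattern used in the exclusion above can be checked on its own with Python's standard `re` module. This is a small sketch assuming the extracted text is available as a list of lines; `strip_footers` and the sample lines are hypothetical. Writing the pattern as a raw string matters because `\d` is not a valid escape in an ordinary Python string literal and triggers a warning in recent Python versions.

```python
import re

# Raw string so that \d reaches the regex engine unchanged.
FOOTER = re.compile(r"Page \d+ of")


def strip_footers(lines):
    """Drop any line that contains a 'Page N of' footer marker."""
    return [line for line in lines if not FOOTER.search(line)]


# Made-up lines standing in for extracted page text.
sample = [
    "Sample licensee row",
    "Page 3 of 12",
    "Another sample row",
]

print(strip_footers(sample))
```

The pattern is case-sensitive and requires at least one digit, so ordinary body text that merely mentions a page is left alone.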