Extracting Business Insurance Details from BOP PDF
This PDF is a complex insurance policy document generated for small businesses requiring BOP coverage. It contains an overwhelming amount of information across 111 pages. Forms can differ slightly between carriers, which makes extraction inconsistent, and templated layouts mean that even standard sections can shift when the document is generated by different software.
header = page.region(bottom=100)
footer = page.region(top=page.height-70)
(header + footer).show()
View full example →
pdf.add_exclusion(lambda page: page.region(bottom=100))
pdf.add_exclusion(lambda page: page.region(top=page.height-70))
View full example →
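To make the geometry of those header and footer exclusions concrete, here is a minimal pure-Python sketch that does not call Natural PDF at all. It assumes a top-left origin with y increasing downward. The word dicts (`text`, `top`, `bottom`), the sample values, and the helper `drop_header_footer` are all hypothetical; only the 100 and 70 point bands come from the snippet above.

```python
def drop_header_footer(words, page_height, header_height=100, footer_height=70):
    """Keep only words that lie entirely between the header and footer bands.

    Assumes y grows downward from the top of the page (an assumption,
    not something this sketch reads from a real PDF).
    """
    footer_top = page_height - footer_height
    return [
        w for w in words
        if w["top"] >= header_height and w["bottom"] <= footer_top
    ]


# Hypothetical sample words on a 792-point-tall page.
words = [
    {"text": "header line", "top": 20, "bottom": 32},
    {"text": "body line", "top": 150, "bottom": 162},
    {"text": "footer line", "top": 760, "bottom": 772},
]

body = drop_header_footer(words, page_height=792)
print([w["text"] for w in body])
```

A word that straddles a band boundary is dropped here; a real pipeline might prefer an overlap threshold instead.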
Extracting Text from Georgia Legislative Bills
This PDF contains legal bills from the Georgia legislature, published yearly. Challenges include extracting marked-up text such as underlines and strikethroughs, and handling the line numbers that complicate text extraction.
page.region(right=70).show()
View full example →
(
page
.region(right=70)
.find_all('text')
.show(crop='wide')
)
View full example →
(
page
.region(right=70)
.find_all('text')
.right()
.show(crop='wide')
)
View full example →
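One way to think about separating those line numbers from the bill text is as a simple position-and-content filter. The sketch below is plain Python rather than Natural PDF code. The word dicts (`text`, `x0`, `x1`), the sample values, and the helper `split_line_numbers` are hypothetical, and the 70-point gutter is borrowed from the snippets above.

```python
def split_line_numbers(words, gutter_right=70):
    """Separate margin line numbers from body text.

    A word counts as a line number only if it is purely digits AND it
    ends at or before the gutter boundary. Both rules are assumptions.
    """
    numbers, body = [], []
    for w in words:
        if w["text"].isdigit() and w["x1"] <= gutter_right:
            numbers.append(w)
        else:
            body.append(w)
    return numbers, body


# Hypothetical sample words (not taken from a real bill).
words = [
    {"text": "1", "x0": 50, "x1": 56},
    {"text": "Section", "x0": 90, "x1": 130},
    {"text": "2", "x0": 135, "x1": 141},  # a digit in the body stays as text
    {"text": "12", "x0": 46, "x1": 56},
]

numbers, body = split_line_numbers(words)
print([w["text"] for w in numbers])
print(" ".join(w["text"] for w in body))
```

Requiring both conditions keeps numerals that appear inside the bill text from being mistaken for line numbers.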
ICE Detention Facilities Compliance Report Extraction
This PDF is an ICE report on compliance among detention facilities over the last 20-30 years. Our aim is to extract facility statuses and contract signatories' names and dates. Challenges include strange redactions, blobby text, poor contrast, and ineffective OCR. It has handwritten signatures and dates that are redacted.
left_col = page.region(right=page.width/2 - 15)
left_col.show()
View full example →
Natural PDF basics with text and tables
Learn the fundamentals of Natural PDF: opening PDFs, extracting text with layout preservation, selecting elements by criteria, spatial navigation, and managing exclusion zones. A good starting point for PDF data extraction.
top = page.region(top=0, left=0, height=80)
bottom = page.find_all("line")[-1].below()
(top + bottom).show()
View full example →
print("BEFORE EXCLUSION:", pdf.pages[0].extract_text()[:200])
# Add header exclusion to all pages
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
print("AFTER EXCLUSION:", pdf.pages[0].extract_text()[:200])
View full example →
Working with page structure
Extract text from complex multi-column layouts while maintaining proper reading order. Learn techniques for handling academic papers, newsletters, and documents with intricate column structures using Natural PDF's layout detection features.
left = page.region(left=0, right=page.width/3, top=0, bottom=page.height)
mid = page.region(left=page.width/3, right=page.width/3*2, top=0, bottom=page.height)
right = page.region(left=page.width/3*2, right=page.width, top=0, bottom=page.height)
page.highlight(left, mid, right)
View full example →
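The three equal-width regions above imply a reading order: everything in the left column, then the middle, then the right. A minimal pure-Python sketch of that ordering follows. It does not use Natural PDF; the word dicts (`text`, `x0`, `x1`, `top`), the sample values, and the helper `reading_order` are hypothetical, and columns are assumed to be equal in width.

```python
def reading_order(words, page_width, n_cols=3):
    """Sort words column by column, then top to bottom within a column.

    Each word is assigned to a column by the horizontal centre of its
    box; equal-width columns are an assumption of this sketch.
    """
    col_width = page_width / n_cols

    def sort_key(w):
        center = (w["x0"] + w["x1"]) / 2
        col = min(int(center // col_width), n_cols - 1)
        return (col, w["top"], w["x0"])

    return sorted(words, key=sort_key)


# Hypothetical sample words on a 600-point-wide page.
words = [
    {"text": "C1", "x0": 420, "x1": 440, "top": 50},
    {"text": "A2", "x0": 20, "x1": 40, "top": 80},
    {"text": "B1", "x0": 220, "x1": 240, "top": 50},
    {"text": "A1", "x0": 20, "x1": 40, "top": 50},
]

ordered = [w["text"] for w in reading_order(words, page_width=600)]
print(ordered)
```

Assigning by box centre rather than left edge is a design choice that tolerates words sitting slightly over a column boundary.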