Complex Extraction of Law Enforcement Complaints
This PDF contains a set of complaint records from a local law enforcement agency. Challenges include its relational data structure, unusual formatting common in the region, and redactions that disrupt automatic parsing.
# Build vertical guidelines from lines
guides = Guides(table)
guides.vertical.from_lines(n=4)
# Use the guides to extract the table
guides = Guides(table)
guides.vertical.from_lines(n=8)
columns = ['Name', 'ID No.', 'Rank', 'Division', 'Officer Disposition', 'Action Taken', 'Body Cam']
officer_df = (
table
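Eight vertical guides bound seven columns, which presumably is why `n=8` is paired with the seven names in `columns`. The sketch below illustrates that idea in plain Python and is not Natural PDF's implementation: the `assign_columns` helper, the guide positions, and the sample words are all hypothetical.

```python
from bisect import bisect_right

columns = ['Name', 'ID No.', 'Rank', 'Division',
           'Officer Disposition', 'Action Taken', 'Body Cam']

def assign_columns(words, guides, columns):
    """Group (x, text) words into named columns bounded by consecutive guides."""
    if len(guides) != len(columns) + 1:
        raise ValueError("need exactly one more guide than columns")
    cells = {name: [] for name in columns}
    for x, text in words:
        # Index of the guide immediately to the left of (or at) the word
        idx = bisect_right(guides, x) - 1
        if 0 <= idx < len(columns):
            cells[columns[idx]].append(text)
    return {name: " ".join(parts) for name, parts in cells.items()}

# Hypothetical x-positions for 8 guides and a few placeholder words
guides = [0, 100, 150, 200, 280, 380, 470, 540]
words = [(10, "Officer"), (50, "A"), (110, "0000"), (160, "Sgt."), (480, "Yes")]

row = assign_columns(words, guides, columns)
```

Columns with no words between their guides come back as empty strings, which is one simple way to represent a blank or redacted cell.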
Extracting State Agency Call Center Wait Times from FOIA PDF
This PDF contains data on wait times at a state agency call center. The main focus is the data on the first two pages, which matches other states' submission formats. The later pages provide granular breakdowns over several years. The scan is heavily pixelated, which makes numbers and text hard to read, and its charts are inconsistent and unreadable.
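guide = Guides(table_area)
# Start with evenly spaced vertical guides, then nudge them into the gaps between text
guide.vertical.divide(3)
guide.vertical.snap_to_whitespace(detection_method='text')
# Add horizontal guides from the lines detected in the table area
guide.horizontal.from_lines()
# Preview the guides
guide.show()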
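The divide-then-snap idea can be sketched in plain Python. This is an illustration under stated assumptions, not Natural PDF's implementation: `divide_evenly`, `snap_to_gaps`, and the gap coordinates are hypothetical, and the sketch assumes that dividing by 3 yields three equal parts and that the outer guides stay on the region's edges.

```python
def divide_evenly(x0, x1, n):
    """Return n + 1 evenly spaced boundaries that split [x0, x1] into n parts."""
    step = (x1 - x0) / n
    return [x0 + i * step for i in range(n + 1)]

def snap_to_gaps(guides, gaps):
    """Move each interior guide to the centre of the nearest whitespace gap."""
    centres = [(a + b) / 2 for a, b in gaps]
    last = len(guides) - 1
    snapped = []
    for i, g in enumerate(guides):
        if i in (0, last) or not centres:
            snapped.append(g)  # keep the outer edges where they are
        else:
            snapped.append(min(centres, key=lambda c: abs(c - g)))
    return snapped

even = divide_evenly(0, 300, 3)
# Hypothetical whitespace gaps (start, end) found between text columns
gaps = [(88, 96), (210, 220)]
snapped = snap_to_gaps(even, gaps)
```

Starting from an even split and then snapping is useful on pixelated scans, where ruled lines may be too degraded to detect but the gaps between text columns are still visible.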
OCR and AI magic
Master OCR techniques with Natural PDF - from basic text recognition to advanced LLM-powered corrections. Learn to extract text from image-based PDFs, handle tables without proper boundaries, and leverage AI for accuracy improvements.
guides.vertical.snap_to_whitespace(detection_method='text')
# add in horizontal lines in places where 80% of the pixels are 'used'
guides.horizontal.from_lines(threshold=0.8)
# Honestly you could have done the same thing for the vertical lines
# but it isn't as fun as .from_content, you know?
# n=5 finds the 5 most likely places based on pixel density
# guides.vertical.from_lines(n=5)
guides.show()
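The two selection strategies mentioned in the comments, a density threshold and a fixed count `n`, can be illustrated on a tiny binary image. This is a minimal sketch of the concept rather than the library's actual algorithm; the helper names and the sample image are hypothetical, and a real detector would also merge adjacent rows or columns that belong to the same line.

```python
def row_density(image):
    """Fraction of dark ('used') pixels in each row of a binary image."""
    return [sum(row) / len(row) for row in image]

def column_density(image):
    """Fraction of dark pixels in each column (rows of the transposed image)."""
    return row_density([list(col) for col in zip(*image)])

def lines_by_threshold(density, threshold):
    """Keep every position whose density meets the threshold."""
    return [i for i, d in enumerate(density) if d >= threshold]

def lines_by_count(density, n):
    """Keep the n densest positions, returned in positional order."""
    ranked = sorted(range(len(density)), key=lambda i: density[i], reverse=True)
    return sorted(ranked[:n])

# 1 = dark pixel, 0 = blank
image = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]

horizontal = lines_by_threshold(row_density(image), 0.8)
vertical = lines_by_count(column_density(image), 2)
```

A threshold works well when lines are clean and you do not know how many to expect, while a fixed count suits tables whose column count is known in advance.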
Working with page structure
Extract text from complex multi-column layouts while maintaining proper reading order. Learn techniques for handling academic papers, newsletters, and documents with intricate column structures using Natural PDF's layout detection features.
from natural_pdf.analyzers import Guides
guides = Guides(table_area)
guides.vertical.from_lines(threshold=0.6)
guides.horizontal.from_lines(threshold=0.6)
guides.show()
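Once guides exist on both axes, each pair of neighbouring vertical guides and neighbouring horizontal guides bounds one cell. The sketch below shows that relationship in plain Python; `grid_cells` and the coordinates are hypothetical and are not part of Natural PDF.

```python
def grid_cells(vertical, horizontal):
    """Return rows of (x0, y0, x1, y1) boxes bounded by consecutive guides."""
    xs, ys = sorted(vertical), sorted(horizontal)
    return [
        [(x0, y0, x1, y1) for x0, x1 in zip(xs, xs[1:])]
        for y0, y1 in zip(ys, ys[1:])
    ]

# Three guides per axis give a 2 x 2 grid
cells = grid_cells(vertical=[0, 50, 120], horizontal=[0, 20, 40])
```

With `k` vertical and `m` horizontal guides the grid has `(k - 1)` columns and `(m - 1)` rows, so a missed or spurious guide changes the table shape directly.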