apply_ocr - Method Usage

Extracting State Agency Call Center Wait Times from FOIA PDF

This PDF contains wait-time data from a state agency call center. The focus is on the first two pages, which match the format other states used for their submissions; the later pages provide granular breakdowns over several years. The scan is heavily pixelated, which makes numbers and text hard to read, and the charts are inconsistent and unreadable.

# Run OCR with the Surya engine, then display the resulting text elements
page.apply_ocr('surya')
page.find_all('text').show(crop=True)

View full example →

ICE Detention Facilities Compliance Report Extraction

This PDF is an ICE report on compliance among detention facilities over the last 20-30 years. The aim is to extract facility statuses along with the names and dates of contract signatories. Challenges include strange redactions, blobby text, poor contrast, and OCR that performs poorly. It also contains handwritten signatures and dates that are redacted.

# OCR this page only; pdf.apply_ocr(resolution=192) would process the whole document
page.apply_ocr(resolution=192)
# Preview the first 200 characters of the recognized text
text = page.extract_text()[:200]
print(text)

View full example →
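
Once OCR text is available, signatory dates can be pulled out with a pattern match. The following is a minimal sketch, not part of Natural PDF: it assumes dates are written as MM/DD/YYYY, the helper name `find_dates` is hypothetical, and the sample string is invented for illustration rather than taken from the report. Redacted dates will of course not appear in the OCR output at all.

```python
import re
from datetime import datetime

# Assumed date format: MM/DD/YYYY (one or two digits for month and day)
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def find_dates(text):
    """Return every valid MM/DD/YYYY date found in `text` as date objects."""
    dates = []
    for match in DATE_PATTERN.finditer(text):
        try:
            dates.append(datetime.strptime(match.group(0), "%m/%d/%Y").date())
        except ValueError:
            # Looks like a date but is not one (often an OCR misread); skip it
            continue
    return dates

# Invented sample text standing in for page.extract_text()
sample = "Signed by the contracting officer on 03/14/2012. Stamp reads 13/45/2012."
print(find_dates(sample))
```

Validating each match with `strptime` drops impossible dates, which is a cheap way to catch some OCR misreads before they reach a spreadsheet.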

OCR and AI magic

Master OCR techniques with Natural PDF - from basic text recognition to advanced LLM-powered corrections. Learn to extract text from image-based PDFs, handle tables without proper boundaries, and leverage AI for accuracy improvements.

page.apply_ocr()

View full example →

page.apply_ocr('surya', resolution=192)

View full example →

page.apply_ocr(resolution=50)
page.find_all('text').inspect()

View full example →

# Re-apply OCR at a very low resolution to break the text again
page.apply_ocr('surya', resolution=15)

View full example →
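
The `resolution` values in the snippets above (192, 50 and 15) are easier to compare as pixel dimensions. The following is a minimal sketch, assuming `resolution` is the DPI at which the page is rasterized before OCR and that the page is US Letter (612 x 792 points). The helper `rendered_size` is hypothetical and not part of Natural PDF.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch

def rendered_size(width_pt, height_pt, dpi):
    """Return (width_px, height_px) of a page rasterized at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )

LETTER = (612, 792)  # assumed page size in points

for dpi in (192, 50, 15):
    width_px, height_px = rendered_size(*LETTER, dpi)
    print(f"{dpi:>3} DPI -> {width_px} x {height_px} px")
```

Under these assumptions a 10-point glyph is only about 2 pixels tall at 15 DPI, which is consistent with the `resolution=15` call being used to break the OCR on purpose.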

page.apply_ocr('surya', detect_only=True)
page.find_all('text').show()

View full example →

Working with page structure

Extract text from complex multi-column layouts while maintaining proper reading order. Learn techniques for handling academic papers, newsletters, and documents with intricate column structures using Natural PDF's layout detection features.
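
Reading order in a multi-column layout can be pictured as sorting text boxes by column first and by vertical position second. The following is a minimal standard-library sketch of that idea, not Natural PDF's layout detection: the box dictionaries, the `column_boundaries` argument and the helper `reading_order` are all hypothetical, and the column boundary is supplied by hand rather than detected.

```python
from bisect import bisect_right

def reading_order(boxes, column_boundaries):
    """Sort boxes column by column, top to bottom within each column.

    boxes: dicts with 'x0' (left edge), 'top' and 'text'.
    column_boundaries: sorted x positions at which a new column starts.
    """
    def sort_key(box):
        # Index of the column this box falls into, counted from the left
        column = bisect_right(column_boundaries, box["x0"])
        return (column, box["top"], box["x0"])

    return sorted(boxes, key=sort_key)

# Invented boxes: two columns, split at x = 300
boxes = [
    {"x0": 320, "top": 100, "text": "right column, line 1"},
    {"x0": 50, "top": 120, "text": "left column, line 2"},
    {"x0": 50, "top": 100, "text": "left column, line 1"},
    {"x0": 320, "top": 120, "text": "right column, line 2"},
]

ordered = [box["text"] for box in reading_order(boxes, column_boundaries=[300])]
print(ordered)
```

A plain top-to-bottom sort would interleave the two columns line by line; assigning each box to a column before sorting is what keeps the left column together.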

page.find('table').apply_ocr()
text = page.extract_text()
print(text)

View full example →

cols = page.find_all('region[type=table-column]')

# Take one of the columns and apply OCR to it
cols[2].apply_ocr()
text = cols[2].extract_text()
print(text)

View full example →

table_area = page.find("region[type=table]")
table_area.apply_ocr()

View full example →