bad pdfs

Extracting Data Tables from Oklahoma Booze Licensees PDF

This PDF contains detailed tables listing alcohol licensees in Oklahoma. It has multi-line cells making it hard to extract data accurately. Challenges include alternative row colors instead of lines ("zebra stripes"), complicating row differentiation and extraction.

Guides PDF above +10 more

24480polcompleted.pdf

Animal 911 Calls Extraction from Rainforest Cafe Report

This PDF is a service call report covering 911 incidents at the Rainforest Cafe in Niagara Falls, NY. We're hunting for animals! The data is formatted as a spreadsheet within the PDF, and challenges include varied column widths, borderless tables, and large swaths of missing data.

Guides PDF add_exclusion +5 more

k046682-111320-opa-lea-database-install_1.pdf

Complex Extraction of Law Enforcement Complaints

This PDF contains a set of complaint records from a local law enforcement agency. Challenges include its relational data structure, unusual formatting common in the region, and redactions that disrupt automatic parsing.

Guides PDF above +13 more

use-of-force-raw.pdf

Extracting Use-of-Force Records from Vancouver Police PDF

This PDF contains detailed records of Vancouver Police's use-of-force incidents, provided after a public records request by journalists. Challenges include its very very very small font size and lots of empty whitespace.

Guides PDF extract_each_text +5 more

sample-bop-policy-restaurant.pdf

Extracting Business Insurance Details from BOP PDF

This PDF is a complex insurance policy document generated for small businesses requiring BOP coverage. It contains an overwhelming amount of information across 111 pages. Challenges include varied forms that may differ slightly between carriers, making extraction inconsistent. It has to deal with different templated layouts, meaning even standard parts can shift when generated by different software.

Guides PDF add_exclusion +8 more

liberty-county-boe.pdf

Bad OCR in a board of education annual financial report

This PDF is all sorts of information about the Board of Education in Liberty County, Georgia

PDF extract_text find +3 more

ocr-example.pdf

OCR and AI magic

Master OCR techniques with Natural PDF - from basic text recognition to advanced LLM-powered corrections. Learn to extract text from image-based PDFs, handle tables without proper boundaries, and leverage AI for accuracy improvements.

Guides PDF apply_ocr +12 more

mednine.pdf

Arabic Election Results Table Extraction from Mednine PDF

This PDF has a data table showing election results from the Tunisian region of Mednine. Challenges include spanning header cells and rotated headers. It has Arabic script.

Flow PDF apply +5 more

multicolumn.pdf

Working with page structure

Extract text from complex multi-column layouts while maintaining proper reading order. Learn techniques for handling academic papers, newsletters, and documents with intricate column structures using Natural PDF's layout detection features.

Flow Guides PDF +12 more

ICE Detention Facilities Compliance Report Extraction

pomonajailpomonaca06212004.pdf

Has Errors

ICE Detention Facilities Compliance Report Extraction

This PDF is an ICE report on compliance among detention facilities over the last 20-30 years. Our aim is to extract facility statuses and contract signatories' names and dates. Challenges include strange redactions, blobby text, poor contrast, and ineffective OCR. It has handwritten signatures and dates that are redacted.

Judge PDF add +10 more

basics.pdf

Natural PDF basics with text and tables

Learn the fundamentals of Natural PDF - opening PDFs, extracting text with layout preservation, selecting elements by criteria, spatial navigation, and managing exclusion zones. Perfect starting point for PDF data extraction.

PDF add_exclusion extract_each_text +8 more

serbia-zakon-o-naknadama-za-koriscenje-javnih.pdf

Extracting Complex Data from Serbian Regulatory PDF

This PDF contains parts of Serbian policy documents, crucial for a research project analyzing industry policies across countries. The challenge lies in extracting a large table that spans pages (page 90 to 97) and a math formula on page 98, all in Serbian. Both elements lack clear boundaries between pages, complicating extraction.

Guides PDF below +4 more

20252026-236232.pdf

Extracting Text from Georgia Legislative Bills

This PDF contains legal bills from the Georgia legislature, published yearly. Challenges include extracting marked-up text like underlines and strikethroughs. It has line numbers that complicate text extraction.

PDF add_exclusion apply +7 more

cia-doc.pdf

CIA Document Analysis

Extracting information from declassified CIA documents using AI

PDF classify classify_pages +5 more

statecallcenterdata_redacted.pdf

Extracting State Agency Call Center Wait Times from FOIA PDF

This PDF contains data on wait times at a state agency call center. The main focus is on the data on the first two pages, which matches other states' submission formats. The later pages provide granular breakdowns over several years. Challenges include it being heavily pixelated, making it hard to read numbers and text, with inconsistent and unreadable charts.

Guides PDF apply_ocr +11 more

czech-republic-pisa2012_zakovsky_dotaznik_a.pdf

Complex Table Extraction from OECD Czech PISA Assessment

This PDF is a document from the OECD regarding the PISA assessment, provided in Czech. The main extraction goal is to get the survey question table found on page 9. Challenges include the weird table format, making it hard to extract automatically.

PDF dissolve extract_text +3 more

focus.pdf

Extracting Economic Data from Brazil's Central Bank PDF

This PDF is the weekly “Focus” report from Brazil’s central bank with economic projections and statistics. Challenges include commas instead of decimal points, images showing projection changes, and tables without border lines that merge during extraction.

PDF assign below +10 more