about bad pdfs

bad pdfs is a collection of real-world PDF files that demonstrate the challenges of PDF data extraction. Each example shows how to extract the data using Natural PDF, with complete, working code.

The gallery started as a way to test natural-pdf on difficult PDFs, but it turns out researchers and journalists deal with these problematic files every day. So we've collected them here to help everyone.

Why "bad" PDFs?

PDFs were designed to display consistently across devices, not to make data extraction easy. Here are the common challenges you'll find in this gallery:

  • Complex layouts with multiple columns
  • Tables without proper structure
  • Scanned images requiring OCR
  • Forms with inconsistent formatting
  • Mixed content types on single pages

We show you how to solve each of these issues with Natural PDF.

How to Use This Gallery

Browse Examples
Find PDFs similar to the ones you're working with. Each example includes the original PDF and extraction approaches.

View Extraction Code
Every example shows complete Python code with the output it produces. The code is tested and ready to use.

Search by Method
Use the search to find examples of specific natural-pdf methods like extract_table() or apply_ocr().

Export to Colab
Run examples directly in Google Colab to experiment with the code on your own PDFs.

Contributing

Have a challenging PDF that others could learn from? We're always looking for new examples, especially from real-world use cases. Submit your own bad PDF here.

About natural-pdf

natural-pdf is a Python library designed to make PDF extraction more intuitive. It combines traditional extraction methods with modern capabilities, including:

  • Text and table extraction
  • Spatial navigation (find content relative to other elements)
  • OCR for scanned documents
  • AI-powered extraction for complex layouts
  • Multi-column layout handling
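
To make "spatial navigation" concrete, here is a minimal pure-Python sketch of the idea: find a label on the page, then read whatever sits to its right on the same line. This is not natural-pdf's API or implementation. The Word class, the find_word and right_of helpers, and the sample coordinates are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A piece of text plus its bounding box, measured from the top-left of the page."""
    text: str
    x0: float      # left edge
    top: float     # top edge
    x1: float      # right edge
    bottom: float  # bottom edge


def find_word(words, text):
    """Return the first word whose text matches exactly, or None if there is none."""
    return next((w for w in words if w.text == text), None)


def right_of(words, anchor):
    """Return words that start at or past the anchor's right edge and
    vertically overlap it, ordered left to right."""
    same_line = [
        w for w in words
        if w.x0 >= anchor.x1 and w.top < anchor.bottom and w.bottom > anchor.top
    ]
    return sorted(same_line, key=lambda w: w.x0)


# Made-up sample data standing in for the words on a form-like page.
words = [
    Word("Inspection", 50, 100, 110, 112),
    Word("Date:", 115, 100, 145, 112),
    Word("2023-04-01", 150, 100, 210, 112),
    Word("Violations:", 50, 120, 115, 132),
    Word("3", 120, 120, 126, 132),
]

label = find_word(words, "Date:")
value = " ".join(w.text for w in right_of(words, label))
print(value)
```

The point of anchoring on a label rather than on fixed page coordinates is that the same lookup keeps working when the value shifts around between documents that share a layout.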

Why not Azure AI Document Intelligence or Marker or Docling or whatever else?

These tools are built for big workflows that need 99% accuracy across zillions of kinds of documents.

In data journalism, though, you usually have one specific PDF format and 10,000 versions of it. You're also aiming for 100% accuracy, so general-purpose solutions don't work as well.

Credits

bad pdfs is created and maintained by Jonathan Soma, who teaches data journalism at Columbia University and is the author of natural-pdf.

The project is open source and available on GitHub.