Animal 911 Calls Extraction from Rainforest Cafe Report
This PDF is a service call report covering 911 incidents at the Rainforest Cafe in Niagara Falls, NY. We're hunting for animals! The data is formatted as a spreadsheet within the PDF, and challenges include varied column widths, borderless tables, and large swaths of missing data.
(
    pages[-1]
    .find_all('text:regex(\d+ Records Found)')
    .show(crop=100)
)
View full example →
pdf.add_exclusion(
    lambda page: page.find_all('text:regex(\d+ Records Found)')
)
View full example →
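The `text:regex(...)` selector above takes a standard Python regular expression, here matching the record-count footer so it can be excluded. Outside Natural PDF, the same pattern behaves like this (the sample lines are invented):

```python
import re

# Pattern used by the selector: one or more digits followed by "Records Found"
pattern = re.compile(r"\d+ Records Found")

lines = [
    "NF-2023-001234  01/02  RACCOON IN KITCHEN",  # invented data row
    "347 Records Found",                          # invented footer line
]
footers = [line for line in lines if pattern.search(line)]
print(footers)  # ['347 Records Found']
```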
columns = ['Number', 'Date Occurred', 'Time Occurred', 'Location', 'Call Type', 'Description', 'Disposition', 'Main Officer']
guide.vertical.from_content(columns, outer="last")
guide.horizontal.from_content(
    lambda p: p.find_all('text:starts-with(NF-)')
)
guide.show()
View full example →
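Once the guides carve the page into a grid, each extracted row can be paired with the header list above to build records. A minimal sketch (the row values here are invented, not from the actual report):

```python
columns = ['Number', 'Date Occurred', 'Time Occurred', 'Location',
           'Call Type', 'Description', 'Disposition', 'Main Officer']

# One extracted row (invented values matching the report's shape)
row = ['NF-2023-001234', '01/02/2023', '14:35', 'RAINFOREST CAFE',
       'ANIMAL COMPLAINT', 'PARROT LOOSE IN DINING ROOM', 'RESOLVED', 'SMITH']

record = dict(zip(columns, row))
print(record['Call Type'])  # ANIMAL COMPLAINT
```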
Extracting Business Insurance Details from BOP PDF
This PDF is a complex insurance policy document generated for small businesses requiring BOP coverage, packing an overwhelming amount of information into 111 pages. Challenges include forms that differ slightly between carriers, making extraction inconsistent, and templated layouts that shift even in standard sections depending on the software that generated them.
page.find_all('text[color~=red]').show()
View full example →
# pdf.add_exclusion('text[color~=red]')
pdf.find_all('text[color~=red]').exclude()
View full example →
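The `~=` operator matches attribute values approximately rather than exactly, which helps when "red" text is rendered with slightly different RGB values across pages. Conceptually it is a per-channel tolerance comparison like this sketch (the tolerance value is an assumption, not Natural PDF's actual threshold):

```python
def is_close_color(color, target=(1.0, 0.0, 0.0), tol=0.25):
    """True if each RGB channel is within `tol` of the target color."""
    return all(abs(c - t) <= tol for c, t in zip(color, target))

print(is_close_color((0.9, 0.1, 0.1)))  # True  -- near-red passes
print(is_close_color((0.2, 0.2, 0.9)))  # False -- blue does not
```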
Extracting Data Tables from Oklahoma Booze Licensees PDF
This PDF contains detailed tables listing alcohol licensees in Oklahoma, with multi-line cells that make accurate extraction difficult. Challenges include alternating row colors ("zebra stripes") instead of ruled lines, which complicates telling rows apart.
headers = (
    page
    .find(text="NUMBER")
    .right(include_source=True)
    .expand(top=3, bottom=3)
    .find_all('text')
)
headers.show(crop=100)
View full example →
        width='element',
        include_source=True
    )
    .find_all('text', overlap='partial')
)
rows.show(crop=100, width=700)
View full example →
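With zebra stripes instead of ruling lines, rows have to be inferred from word positions. One common approach is clustering words by vertical proximity, sketched here over plain (text, y0) tuples (the tolerance and sample data are invented):

```python
def cluster_rows(words, tol=3):
    """Group (text, y0) tuples into rows of words with nearby baselines."""
    rows = []
    for text, y0 in sorted(words, key=lambda w: w[1]):
        if rows and abs(y0 - rows[-1][1]) <= tol:
            rows[-1][0].append(text)  # close enough: same row
        else:
            rows.append(([text], y0))  # too far: start a new row
    return [texts for texts, _ in rows]

words = [('ACME', 100.0), ('LIQUORS', 100.5), ('TULSA', 101.0),
         ('BAR', 120.0), ('NONE', 120.2)]
print(cluster_rows(words))  # [['ACME', 'LIQUORS', 'TULSA'], ['BAR', 'NONE']]
```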
Extracting Economic Data from Brazil's Central Bank PDF
This PDF is the weekly “Focus” report from Brazil’s central bank with economic projections and statistics. Challenges include commas instead of decimal points, images showing projection changes, and tables without border lines that merge during extraction.
row_names = (
    page
    .find(text='IPCA')
    .below(width='element', include_source=True)
    .clip(data)
    .find_all('text', overlap='partial')
)
headers = row_names.extract_each_text()
headers
View full example →
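Extracted cells still carry the decimal commas mentioned above, which need normalizing before any numeric work. A minimal conversion helper (the sample values are invented):

```python
def br_number(value: str) -> float:
    """Convert Brazilian number formatting ('1.234,56') to a float (1234.56)."""
    return float(value.replace('.', '').replace(',', '.'))

print(br_number('3,25'))      # 3.25   -- e.g. an inflation projection
print(br_number('1.234,56'))  # 1234.56
```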
Extracting State Agency Call Center Wait Times from FOIA PDF
This PDF contains data on wait times at a state agency call center. The main focus is on the data on the first two pages, which matches other states' submission formats; the later pages provide granular breakdowns over several years. Challenges include heavy pixelation that makes numbers and text hard to read, along with inconsistent, unreadable charts.
page.apply_ocr('surya')
page.find_all('text').show(crop=True)
View full example →
table_area.find_all('text').show(crop=True)
View full example →
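Pixelated scans often make the OCR confuse look-alike characters in numeric fields. A small post-processing pass can repair the common ones (this substitution table is an assumption; tune it to your document):

```python
# Common OCR look-alike fixes for fields known to be numeric
LOOKALIKES = str.maketrans({'O': '0', 'o': '0', 'l': '1', 'I': '1', 'S': '5'})

def clean_numeric(cell: str) -> str:
    """Apply look-alike substitutions to a cell that should be numeric."""
    return cell.translate(LOOKALIKES)

print(clean_numeric('1O:3S'))  # 10:35  -- a wait time misread by OCR
```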
Extracting Text from Georgia Legislative Bills
This PDF contains bills from the Georgia legislature, published yearly. Challenges include extracting marked-up text such as underlines and strikethroughs, plus line numbers that complicate text extraction.
page.find_all('text:strikeout').show(crop='wide')
View full example →
underlined = page.find_all('text:underline')
print("Underlined text is", underlined.extract_text())
underlined.show(crop='wide')
View full example →
text = pdf.find_all('text:underline').extract_text()
print(text)
View full example →
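Since underlines mark inserted language and strikethroughs mark deletions, the "as amended" text can be assembled by keeping everything except struck-through spans. A sketch over (text, style) tuples (the sample spans are invented):

```python
def as_amended(spans):
    """Join span text, dropping strikeout spans (deleted language)."""
    return ' '.join(text for text, style in spans if style != 'strikeout')

spans = [('The fee shall be', 'plain'),
         ('ten', 'strikeout'),       # deleted language
         ('twenty', 'underline'),    # inserted language
         ('dollars.', 'plain')]
print(as_amended(spans))  # The fee shall be twenty dollars.
```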
Extracting Use-of-Force Records from Vancouver Police PDF
This PDF contains detailed records of Vancouver Police's use-of-force incidents, provided after a public records request by journalists. Challenges include its extremely small font size and lots of empty whitespace.
headers = page.find_all('text[y0=min()]')
headers.extract_each_text()
View full example →
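The `text[y0=min()]` selector keeps only elements whose `y0` equals the page minimum, i.e. the topmost line, which is where the headers sit. In plain Python over hypothetical element dicts the idea amounts to (the tolerance and sample data are invented):

```python
def topmost(elements, tol=0.5):
    """Return elements whose y0 is within `tol` of the page minimum."""
    top = min(e['y0'] for e in elements)
    return [e for e in elements if e['y0'] - top <= tol]

elements = [{'text': 'Date', 'y0': 40.0},
            {'text': 'Force Type', 'y0': 40.2},
            {'text': '2021-03-04', 'y0': 55.0}]
print([e['text'] for e in topmost(elements)])  # ['Date', 'Force Type']
```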
Natural PDF basics with text and tables
Learn the fundamentals of Natural PDF - opening PDFs, extracting text with layout preservation, selecting elements by criteria, spatial navigation, and managing exclusion zones. Perfect starting point for PDF data extraction.
page.find_all('text').show()
View full example →
texts = page.find_all('text').extract_each_text()
for t in texts[:5]:  # Show first 5
    print(t)
View full example →
top = page.region(top=0, left=0, height=80)
bottom = page.find_all("line")[-1].below()
(top + bottom).show()
View full example →
OCR and AI magic
Master OCR techniques with Natural PDF - from basic text recognition to advanced LLM-powered corrections. Learn to extract text from image-based PDFs, handle tables without proper boundaries, and leverage AI for accuracy improvements.
page.apply_ocr(resolution=50)
page.find_all('text').inspect()
View full example →
page.find_all('text').inspect()
View full example →
page.apply_ocr('surya', detect_only=True)
page.find_all('text').show()
View full example →
Working with page structure
Extract text from complex multi-column layouts while maintaining proper reading order. Learn techniques for handling academic papers, newsletters, and documents with intricate column structures using Natural PDF's layout detection features.
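Natural PDF's flows handle reading order for you, but the underlying idea — read the left column fully before moving to the right — can be sketched over hypothetical word dicts with a known column split (the split position and sample words are invented):

```python
def two_column_order(words, split_x):
    """Sort words left column first, then right, each top-to-bottom."""
    key = lambda w: (round(w['y0']), w['x0'])
    left = sorted((w for w in words if w['x0'] < split_x), key=key)
    right = sorted((w for w in words if w['x0'] >= split_x), key=key)
    return [w['text'] for w in left + right]

words = [{'text': 'Column one,', 'x0': 50, 'y0': 100},
         {'text': 'right side,', 'x0': 320, 'y0': 100},
         {'text': 'continued.', 'x0': 50, 'y0': 112},
         {'text': 'also continued.', 'x0': 320, 'y0': 112}]
print(' '.join(two_column_order(words, split_x=300)))
# Column one, continued. right side, also continued.
```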
(
    flow
    .find_all('text[width>10]:bold')
    .show()
)
View full example →
regions = (
    flow
    .find_all('text[width>10]:bold')
    .below(
        until='text[width>10]:bold|text:contains("Here is a bit")',
        include_endpoint=False
    )
)
View full example →
# default is YOLO
page.analyze_layout()
page.find_all('region').show(group_by='type')
View full example →
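`show(group_by='type')` colors the detected regions by their layout class. The same grouping over hypothetical region dicts is just a bucket-by-key pass:

```python
from collections import defaultdict

def group_regions(regions):
    """Bucket layout regions by their detected type."""
    groups = defaultdict(list)
    for r in regions:
        groups[r['type']].append(r)
    return dict(groups)

regions = [{'type': 'title'}, {'type': 'table'}, {'type': 'table'}]
print({k: len(v) for k, v in group_regions(regions).items()})
# {'title': 1, 'table': 2}
```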