expand

163 usages across 6 PDFs

Complex Extraction of Law Enforcement Complaints

This PDF contains a set of complaint records from a local law enforcement agency. Challenges include its relational data structure, unusual formatting common in the region, and redactions that disrupt automatic parsing.

    .find_all('text:contains(Complaint #)')
    .right(include_source=True)
    .merge()
    .expand(top=5, bottom=7)
    .show(crop=section)
)
View full example →
    .find_all('text:contains(Complaint #)')
    .right(include_source=True)
    .merge()
    .expand(top=5, bottom=7)
    .extract_table()
    .to_df(header=['Type of Complaint', 'Description', 'Complaint Disposition'])
)
View full example →
    .find_all('text:contains(Complaint #)')
    .right(include_source=True)
    .merge()
    .expand(top=5, bottom=7)
)

# Build vertical guidelines from lines
View full example →

Extracting Business Insurance Details from BOP PDF

This PDF is a complex insurance policy document generated for small businesses requiring BOP coverage. It contains an overwhelming amount of information across 111 pages. Challenges include varied forms that may differ slightly between carriers, making extraction inconsistent. It has to deal with different templated layouts, meaning even standard parts can shift when generated by different software.

(
    page
    .find(text="Mailing Address")
    .expand(bottom='text')
    .show()
)
View full example →
(
    page
    .find(text="Mailing Address")
    .expand(bottom='text')
    .right()
    .extract_text()
)
View full example →

Extracting Data Tables from Oklahoma Booze Licensees PDF

This PDF contains detailed tables listing alcohol licensees in Oklahoma. It has multi-line cells making it hard to extract data accurately. Challenges include alternative row colors instead of lines ("zebra stripes"), complicating row differentiation and extraction.


# Add exclusions
pdf.add_exclusion(lambda page: page.find(text="PREMISE").above())
pdf.add_exclusion(lambda page: page.find("text:regex(Page \d+ of)").expand())

print("After exclusions:", page.extract_text()[:200])
View full example →
    page
    .find(text="NUMBER")
    .right(include_source=True)
    .expand(top=3, bottom=3)
    .find_all('text')
)
headers.show(crop=100)
View full example →

Extracting Economic Data from Brazil's Central Bank PDF

This PDF is the weekly “Focus” report from Brazil’s central bank with economic projections and statistics. Challenges include commas instead of decimal points, images showing projection changes, and tables without border lines that merge during extraction.

(
    sections[0]
    .expand(top=-50)
    .show()
)
View full example →
(
    sections[0]
    .expand(top=-50, right=0)
    .extract_table('stream')
    .to_df(header=False)
    .dropna(axis=0, how='all')
View full example →
dataframes = sections.apply(lambda section: (
    section
        .expand(top=-50, right=0)
        .extract_table('stream')
        .to_df(header=False)
        .dropna(axis=0, how='all')
View full example →

Extracting State Agency Call Center Wait Times from FOIA PDF

This PDF contains data on wait times at a state agency call center. The main focus is on the data on the first two pages, which matches other states' submission formats. The later pages provide granular breakdowns over several years. Challenges include it being heavily pixelated, making it hard to read numbers and text, with inconsistent and unreadable charts.

        until='text:contains(Please use the comments)',
        include_endpoint=False
    )
    .expand(
        right=-(page.width * 0.58),
        left=-30,
        bottom=3
View full example →

ICE Detention Facilities Compliance Report Extraction

This PDF is an ICE report on compliance among detention facilities over the last 20-30 years. Our aim is to extract facility statuses and contract signatories' names and dates. Challenges include strange redactions, blobby text, poor contrast, and ineffective OCR. It has handwritten signatures and dates that are redacted.

with left_col.within() as col:
    label = left_col.find("text:closest(Previous Rating)")
    answer = label.below(until='text')
checkbox_region = answer.expand(5).trim()
checkbox_region.show(crop=True)
View full example →
def get_checkbox(region):
  return region.left(20).expand(top=3)

# overlap='partial' because the OCR might be a few pixels off
acceptable = checkbox_region.find(text='Acceptable', overlap='partial')
View full example →