Bad OCR in a board of education annual financial report

We have a reasonably long PDF (72 pages) and want to grab a single page of information from it. To make matters worse, the existing text recognition (OCR) is bad. We'll need to redo it, so we start by reading the PDF in with text_layer=False, which discards the incorrect text.

from natural_pdf import PDF

pdf = PDF("liberty-county-boe.pdf", text_layer=False)
pdf.pages.show(cols=6)

Now we need to apply new OCR to it.

We're impatient and only care about one specific page, and we know the page is somewhere near the front. To speed things up, we'll apply OCR to a subset of the pages.

pdf.pages[5:20].apply_ocr()
Output
Rendering pages:   0%|          | 0/15 [00:00<?, ?it/s]
Rendering pages:  60%|######    | 9/15 [00:00<00:00, 83.46it/s]
                                                               
Using CPU. Note: This module is much faster with a GPU.
/home/runner/work/badpdfs-site/badpdfs-site/processor/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py:666: UserWarning: 'pin_memory' argument is set as true but no accelerator is found, then device pinned memory won't be used.
  warnings.warn(warn_msg)
<PageCollection(count=15)>
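As a quick sanity check on that slice: assuming the page collection slices like an ordinary Python list (the count=15 in the output is consistent with that), [5:20] is zero-indexed and excludes the end index, so it covers the 6th through 20th pages of the file. The sketch below uses a plain list of page numbers as a stand-in for the PDF, not a real PageCollection.

```python
# Stand-in for the 72-page document: page positions 1..72.
# A plain list is used here instead of a real PageCollection (an assumption).
page_numbers = list(range(1, 73))

# Same slice as pdf.pages[5:20]: zero-indexed start, exclusive end.
subset = page_numbers[5:20]

print(len(subset))   # how many pages the slice selects
print(subset[0])     # position of the first selected page
print(subset[-1])    # position of the last selected page
```

If the page we want turned out to sit outside that range, widening the slice would be the only change needed.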

Now we can look for the content we're interested in.

pdf.find(text="FINANCIAL HIGHLIGHTS").show()

Granted, if the OCR were off we might not be able to find what we're looking for this easily, but the page is printed very cleanly, so the text is very likely to come through well.
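If the OCR did mangle the heading, an exact text search would miss it. One possible fallback, sketched here with only the standard library rather than any natural_pdf feature, is to score every line of the extracted text against the heading and keep the closest one. The helper name best_matching_line and the sample text are made up for illustration.

```python
from difflib import SequenceMatcher


def best_matching_line(text, target):
    """Return (score, line) for the line in `text` most similar to `target`.

    Hypothetical helper: compares case-insensitively and skips blank lines.
    """
    best = (0.0, "")
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue
        score = SequenceMatcher(None, candidate.upper(), target.upper()).ratio()
        if score > best[0]:
            best = (score, candidate)
    return best


# Made-up sample with a deliberately mangled heading ("HIGHL1GHTS").
sample = "INTRODUCTION\nFINANCIAL HIGHL1GHTS\nKey highlights for fiscal year 2019"
score, line = best_matching_line(sample, "FINANCIAL HIGHLIGHTS")
print(round(score, 2), line)
```

In practice you would run this over the text of each OCR'd page and pick the page with the highest score, with a cutoff to reject pages where nothing resembles the heading.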

We can preview to make sure the page looks right...

page = pdf.find(text="FINANCIAL HIGHLIGHTS").page
page.show()

...and then pull out the text, save it to a file, whatever we want.

text = page.extract_text()
print(text)

with open("content.txt", "w", encoding="utf-8") as fp:
    fp.write(text)
Output
LIBERTY COUNTY BOARD OF EDUCATION
MANAGEMENT'S DISCUSSION AND ANALYSIS
FOR THE YEAR ENDED JUNE 30,2019
INTRODUCTION
Thediscussionand analysis of Liberty CountyBoardof Education's (School  District)   financial
performance provides an overall review of the School District's financial activities for the fiscal year
ended June 30, 2019_ The intent of this discussion and analysis is to look at the School District's
financial  performance asawhole: Readers shouldalso review the notes to the basic financial
statements to enhance their understanding of the School District's financial performance_
FINANCIAL HIGHLIGHTS
Keyfinancial highlights for fiscal year 2019 are as follows:
On the government-wide financial statements; the assets and deferred outflows of the
School District exceeded liabilities and deferred inflows by $52.7 million:
General revenues accounted for $51.0 million in revenue or 42.7% of all revenues_
Program specific revenues in the form of charges for services, operating and capital
grants and contributions accounted for $68.6 million in revenue or 57.3% 0f total
revenues.  Total revenues were $119.6 million.
The School Districthad$113.5 million in expenses relating to governmental activities;
only $68.6 million of these expenses are offset by program specific charges for
services and grants and contributions_General revenues (primarily property taxes and
sales taxes) of $51.0 million, along with the School District's beginning net position;
were adequate to provide for these programs
On the government-wide financial statements, the School District reported deferred
inflows of resources of $18.9 million and deferred outflows of resources of $23.6
million related to defined benefit pension plans recognized by the implementation of
GASB Statements No_68 and No. 71 and other postemployment benefits recognized
by the implementation of GASB Statement No. 75
OVERVIEW OF THE FINANCIAL STATEMENTS
This annual report consists of several parts including management's discussion and analysis, the basic
financial statements and required supplementary information: The basic financial statements include
two levelsof statements that present different  views of theSchool District.These include the
government-wide and fund financial statementsThis discussion and analysis of the School District's
financial statements provides an overview of its financial activities for the year. Comparative data is
provided for fiscal year 2019and 2018
The government-wide financial statements include the Statement of Net Position and the Statement
of   Activities:Thesestatements   provideinformationabout theactivitiesof the SchoolDistrict
presenting both short-term and long-term information about the School District's overall financial
status
The fund financial statements focus on the individual parts of the School District, reporting the School
District 's operation in more detail:The governmental fund financial statements disclose how basic
services are financed in the short-term as well as what remains for future spending:The fiduciary
funds statement provides information about the financial relationships in which the School District
acts solely as an agent for the benefit of others: The fund financial statements reflect the School
District's mostsignificant funds: In the case of the Liberty County Board of Education, the general fund
and capital projects fund are the most significant funds:
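The output is readable but carries a few typical OCR slips: underscores where periods should be ("2019_"), "0f" for "of", and runs of extra spaces. If that matters for what you do with the text, a small cleanup pass is one option. This is a minimal sketch with the standard library only; tidy_ocr_text is a hypothetical helper, and its rules are guesses based on the output above, not a general OCR corrector.

```python
import re


def tidy_ocr_text(text):
    """Apply a few conservative fixes for the OCR slips seen in the output above."""
    # An underscore after a letter or digit, followed by whitespace, an
    # uppercase letter, or the end of the text, is most likely a misread period.
    # Cases like "No_68" are left alone because the intent is ambiguous.
    text = re.sub(r"(?<=[A-Za-z0-9])_(?=\s|[A-Z]|$)", ".", text)
    # "0f" as a standalone word is a misread "of".
    text = re.sub(r"\b0f\b", "of", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


print(tidy_ocr_text("ended June 30, 2019_ The intent"))
print(tidy_ocr_text("in revenue or 57.3% 0f  total"))
```

Missing spaces between words ("Thediscussionand") are not handled here; fixing those reliably needs a dictionary or a second OCR pass rather than a regular expression.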

If we wanted to pass this to someone else to double-check, we could even save the page itself as an image. We use .render() instead of .show() because, by default, it leaves out highlights and other annotations.

page.render().save("output.png")