Extracting Business Insurance Details from BOP PDF

This PDF is a complex insurance policy document generated for small businesses requiring BOP coverage. It contains an overwhelming amount of information across 111 pages. Challenges include varied forms that may differ slightly between carriers, making extraction inconsistent. It has to deal with different templated layouts, meaning even standard parts can shift when generated by different software.

from natural_pdf import PDF
from natural_pdf.analyzers.guides import Guides

pdf = PDF("sample-bop-policy-restaurant.pdf")
page = pdf.pages[0]
page.show()

Look at that watermark!

page.find_all('text[color~=red]').show()

Let's exclude it by finding all reddish text and removing it on each page. We can do this pdf-wide.

# pdf.add_exclusion('text[color~=red]')
pdf.find_all('text[color~=red]').exclude()
<ElementCollection[TextElement](count=1443)>

We can get the policy number by going to the right of the label.

(
    page
    .find(text="POLICY NUMBER")
    .right(until='text')
    .show()
)
(
    page
    .find(text="POLICY NUMBER")
    .right(until='text')
    .extract_text()
)
'DEMO0001-00000-01'

The address is a little different since it spans two (or more? or fewer?) lines. We'll start by grabbing it, and expanding downwards until we hit the next text label.

(
    page
    .find(text="Mailing Address")
    .expand(bottom='text')
    .show()
)

Then we just swing to the right and grab the text across the rest of the page.

(
    page
    .find(text="Mailing Address")
    .expand(bottom='text')
    .right()
    .extract_text()
)
'9 West Mechanic Street\nNew Hope PA 18938'

Hmm what else do we have?

pdf.pages[:10].show(cols=2)

Hmmm let's go to the Service of Suit page. I don't want to think abotu guessing what page it is, so I'll just find the text on it.

page = pdf.find(text="SERVICE OF SUIT").page
page.show()

We probably want to get rid of those headers and footers.

header = page.region(bottom=100)
footer = page.region(bottom=page.height-70)
(header + footer).show()

Might as well get rid of them on every single page while we're at it.

pdf.add_exclusion(lambda page: page.region(bottom=100))
pdf.add_exclusion(lambda page: page.region(top=page.height-70))
<PDF source='sample-bop-policy-restaurant.pdf' pages=111>

And now we can grab the text!

text = page.extract_text()
print(text)
Output
HU 01 05 01 18
SERVICE OF SUIT
This endorsement modifies insurance provided under the following:
COMMERCIAL PROPERTY COVERAGE PART
COMMERCIAL GENERAL LIABILITY COVERAGE PART
COMMERCIAL INLAND COVERAGE PART
BUSINESSOWNERS COVERAGE FORM
Pursuant to any statute of any state, territory or district of the United States which makes provision therefore we
hereby designate the Commissioner, Superintendent or Director of Insurance or other officer specified for that
purpose in the statute, and his successor or successors in office, as our true and lawful attorney upon whom may
be served any lawful process in any action, suit, contract of insurance and hereby designate the Corporate
Secretary of Blackboard Insurance Company, 1209 Orange Street, Wilmington, DE 19801, as the entity to whom
said officer is authorized to mail such process or a true copy thereof.

The rest of the PDF is a low of finding and .below() and .right() and all of that.