Extracting Business Insurance Details from BOP PDF
This PDF is a complex insurance policy document generated for a small business that needs Businessowners Policy (BOP) coverage. It packs a large amount of information into 111 pages. Forms can differ slightly between carriers, which makes extraction inconsistent. The layouts are also templated, so even standard sections can shift when a different piece of software generates the document.
Look at that watermark!
Let's exclude it by finding all reddish text and removing it on each page. We can do this pdf-wide.
# pdf.add_exclusion('text[color~=red]')
pdf.find_all('text[color~=red]').exclude()
<ElementCollection[TextElement](count=1443)>
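What counts as "reddish" is decided by the library, and the exact rule isn't shown in this notebook. As rough intuition only, here is a toy sketch of one way approximate color matching could work: treat a color as red-ish when it sits close to pure red in RGB space. The function name `is_reddish` and the `tolerance` value are made up for illustration and are not part of the library.

```python
def is_reddish(rgb, tolerance=0.35):
    """Toy check: is an (r, g, b) color, with channels in 0..1, close to pure red?

    This is an illustrative assumption, not the library's actual matching rule.
    """
    r, g, b = rgb
    # Euclidean distance from pure red (1, 0, 0)
    distance = ((r - 1.0) ** 2 + g ** 2 + b ** 2) ** 0.5
    return distance <= tolerance


# A slightly washed-out red passes; black and white do not.
print(is_reddish((0.9, 0.1, 0.1)))
print(is_reddish((0.0, 0.0, 0.0)))
print(is_reddish((1.0, 1.0, 1.0)))
```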
We can get the policy number by going to the right of the label.
(
page
.find(text="POLICY NUMBER")
.right(until='text')
.extract_text()
)
'DEMO0001-00000-01'
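If the "go right until you hit text" idea feels like magic, here is a toy sketch of the geometry behind it, written without the PDF library. The `Word` class, the `first_word_to_right` helper, and every coordinate are hypothetical, made up for illustration. The real library handles many more cases.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x0: float      # left edge
    x1: float      # right edge
    top: float     # distance from the top of the page
    bottom: float


def first_word_to_right(label: Word, words: List[Word]) -> Optional[Word]:
    """Return the nearest word that starts right of `label` and overlaps it vertically."""
    candidates = [
        w for w in words
        if w is not label
        and w.x0 >= label.x1                             # starts to the right of the label
        and w.top < label.bottom and w.bottom > label.top  # shares the label's row
    ]
    return min(candidates, key=lambda w: w.x0, default=None)


# Hypothetical coordinates, invented for this example only.
words = [
    Word("POLICY NUMBER", 40, 130, 100, 112),
    Word("DEMO0001-00000-01", 150, 260, 100, 112),
    Word("Mailing Address", 40, 125, 130, 142),
]
label = next(w for w in words if w.text == "POLICY NUMBER")
match = first_word_to_right(label, words)
print(match.text if match else None)
```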
The address is a little different since it spans two (or more? or fewer?) lines. We'll start by grabbing the "Mailing Address" label and expanding it downwards until we hit the next text label.
Then we just swing to the right and grab the text across the rest of the page.
(
page
.find(text="Mailing Address")
.expand(bottom='text')
.right()
.extract_text()
)
'9 West Mechanic Street\nNew Hope PA 18938'
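The extracted address is still one string with a newline in it. A minimal sketch of splitting it into parts, assuming the last line looks like "City ST ZIP" (a US-style address). The `split_address` helper and its field names are hypothetical, and anything that doesn't fit the pattern is left in `street` rather than guessed at.

```python
import re

# Last line of a US-style address: "City ST 12345" or "City ST 12345-6789"
_CITY_STATE_ZIP = re.compile(
    r"(?P<city>.+?)\s+(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)"
)


def split_address(raw: str) -> dict:
    """Split a multi-line address string into street, city, state and zip."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    empty = {"street": "", "city": "", "state": "", "zip": ""}
    if not lines:
        return empty
    m = _CITY_STATE_ZIP.fullmatch(lines[-1])
    if m is None:
        # The last line isn't "City ST ZIP": keep everything as street
        return {**empty, "street": ", ".join(lines)}
    # Everything above the last line is treated as the street part
    return {"street": ", ".join(lines[:-1]), **m.groupdict()}


print(split_address("9 West Mechanic Street\nNew Hope PA 18938"))
```

Because the street is "everything except the last line", this copes with one, two, or more lines without caring how many there are.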
Hmm what else do we have?
Hmmm, let's go to the Service of Suit page. I don't want to guess which page it's on, so I'll just find the text on it.
We probably want to get rid of those headers and footers.
Might as well get rid of them on every single page while we're at it.
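# Header: everything from the top of the page down to y=100
pdf.add_exclusion(lambda page: page.region(bottom=100))
# Footer: the bottom 70 units of each page
pdf.add_exclusion(lambda page: page.region(top=page.height-70))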
<PDF source='sample-bop-policy-restaurant.pdf' pages=111>
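The two exclusions above are just arithmetic on the page height: a band at the top and a band at the bottom. Here is a toy sketch of that arithmetic, assuming coordinates are measured from the top of the page. The `in_excluded_band` helper is hypothetical, the 792 page height is an assumed US Letter size, and how the library treats elements that only partly overlap a band is not covered here.

```python
def in_excluded_band(top, bottom, page_height, header=100, footer=70):
    """True if a box spanning [top, bottom] lies entirely in the header or footer band.

    Coordinates are measured downward from the top of the page.
    """
    in_header = bottom <= header               # fully above y=100
    in_footer = top >= page_height - footer    # fully inside the bottom 70 units
    return in_header or in_footer


PAGE_HEIGHT = 792  # assumed US Letter height, for illustration only

print(in_excluded_band(20, 32, PAGE_HEIGHT))    # a header line
print(in_excluded_band(400, 412, PAGE_HEIGHT))  # body text
print(in_excluded_band(740, 752, PAGE_HEIGHT))  # a footer line
```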
And now we can grab the text!
text = page.extract_text()
print(text)
HU 01 05 01 18 SERVICE OF SUIT This endorsement modifies insurance provided under the following: COMMERCIAL PROPERTY COVERAGE PART COMMERCIAL GENERAL LIABILITY COVERAGE PART COMMERCIAL INLAND COVERAGE PART BUSINESSOWNERS COVERAGE FORM Pursuant to any statute of any state, territory or district of the United States which makes provision therefore we hereby designate the Commissioner, Superintendent or Director of Insurance or other officer specified for that purpose in the statute, and his successor or successors in office, as our true and lawful attorney upon whom may be served any lawful process in any action, suit, contract of insurance and hereby designate the Corporate Secretary of Blackboard Insurance Company, 1209 Orange Street, Wilmington, DE 19801, as the entity to whom said officer is authorized to mail such process or a true copy thereof.
The rest of the PDF is a lot of finding and .below() and .right() and all of that.
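Since so much of this is the same find-then-go-right move, it can be wrapped in a small loop. A minimal sketch, using only the calls already seen in this notebook (`find`, `right`, `extract_text`). The `extract_fields` helper is hypothetical, and it assumes `find()` returns `None` when a label isn't on the page.

```python
def extract_fields(page, labels):
    """Map each label to the text found to its right, or None if the label is missing."""
    results = {}
    for label in labels:
        element = page.find(text=label)
        if element is None:  # assumption: find() returns None when nothing matches
            results[label] = None
            continue
        # Same pattern as the policy number: go right until the next text element
        results[label] = element.right(until='text').extract_text().strip()
    return results


# Usage, given a `page` object like the one used throughout this notebook:
# fields = extract_fields(page, ["POLICY NUMBER"])
```

Multi-line values like the mailing address still need their own handling, so this only covers the simple label-then-value case.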