Opening a PDF
Let's start by opening a PDF. Natural PDF can work with local files or URLs.
Grabbing Page Text
Pass layout=True to extract_text() to keep the spatial arrangement of the text on the page instead of collapsing it into a plain stream of words.
text = page.extract_text(layout=True)
print(text)
Output
Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham’s Pure Leaf Lard!
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
Jungle Health and Safety Inspection Service
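To see what "preserving the layout" means in practice, here is a minimal sketch that places words on a character grid according to their coordinates. It is not Natural PDF's implementation: the render_layout helper, the character width, the line height, and the sample coordinates are all assumptions made up for illustration.

```python
def render_layout(words, char_width=6.0, line_height=12.0):
    """Place (text, x, y) words on a character grid based on their position.

    Hypothetical helper: char_width and line_height are assumed values,
    not anything Natural PDF exposes.
    """
    rows = {}
    for text, x, y in words:
        row = round(y / line_height)   # which output line the word lands on
        col = round(x / char_width)    # which character column it starts at
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for col, text in sorted(rows[row]):
            if line and len(line) >= col:
                # The previous word already reaches this column: keep one space.
                line += " "
            else:
                # Pad with spaces so the word starts at its column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


# Made-up coordinates for two table rows
words = [
    ("Statute", 0, 0),
    ("Level", 120, 0),
    ("4.12.7", 0, 12),
    ("Critical", 120, 12),
]
layout_text = render_layout(words)
print(layout_text)
```

Both "Level" and "Critical" start in the same column, so the table columns stay aligned in the plain-text output.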
Selecting Elements and Text
Natural PDF uses CSS-like selectors to find elements on the page by type, attribute, or content.
Select text in a rectangle
text = page.find('rect').extract_text()
print(text)
Output
Jungle Health and Safety Inspection Service INS-UP70N51NCL41R
Find all text elements
texts = page.find_all('text').extract_each_text()
for t in texts[:5]:  # Show first 5
    print(t)
Output
Jungle Health and Safety Inspection Service INS-UP70N51NCL41R Site: Durham’s Meatpacking Chicago, Ill.
Find colored text
# Find red text
red_text = page.find('text[color~=red]')
print(red_text.extract_text())
Output
INS-UP70N51NCL41R
Find text by content
# Find text containing a specific string
text = page.find('text:contains("INS-")')
print(text.extract_text())
Output
INS-UP70N51NCL41R
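If selectors are new to you, it can help to think of them as filters over a list of page elements. The toy version below is only a sketch of that idea, not Natural PDF's selector engine: the dictionaries, the find and find_all helpers, and their kind, contains, and color parameters are all hypothetical.

```python
# Toy page elements; the keys and values are made up for illustration.
elements = [
    {"kind": "text", "text": "Jungle Health and Safety Inspection Service", "color": "black"},
    {"kind": "text", "text": "INS-UP70N51NCL41R", "color": "red"},
    {"kind": "rect", "text": "", "color": "black"},
]


def find_all(elements, kind=None, contains=None, color=None):
    """Return every element that passes all of the given filters."""
    matches = []
    for el in elements:
        if kind is not None and el["kind"] != kind:
            continue
        if contains is not None and contains not in el["text"]:
            continue
        if color is not None and el["color"] != color:
            continue
        matches.append(el)
    return matches


def find(elements, **filters):
    """Return the first match, or None when nothing matches."""
    matches = find_all(elements, **filters)
    return matches[0] if matches else None


# Roughly what 'text:contains("INS-")' expresses
print(find(elements, kind="text", contains="INS-")["text"])
# Roughly what 'text[color~=red]' expresses (exact match here, for simplicity)
print(find(elements, kind="text", color="red")["text"])
```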
Spatial Navigation
Natural PDF can navigate from one element to another by position on the page, for example from a label to the text beside it.
Extract text to the right of a label
# `date` is the region to the right of the "Date:" label (its definition is not shown here)
date.extract_text()
'February 3, 1905'
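The idea behind "to the right of a label" can be sketched with plain bounding boxes: keep the words that start past the label's right edge and overlap it vertically. This is an illustration only. The Word class, the right_of helper, and every coordinate below are assumptions, not Natural PDF's API or the real positions in this PDF.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word with a bounding box (hypothetical structure)."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def right_of(label, words):
    """Words that begin at or after the label's right edge on the same line."""
    hits = [
        w for w in words
        if w is not label
        and w.x0 >= label.x1        # starts to the right of the label
        and w.top < label.bottom    # overlaps the label vertically...
        and w.bottom > label.top    # ...so it sits on the same line
    ]
    return sorted(hits, key=lambda w: w.x0)


# Made-up coordinates
words = [
    Word("Date:", 50, 100, 80, 112),
    Word("February", 85, 100, 130, 112),
    Word("3,", 134, 100, 144, 112),
    Word("1905", 148, 100, 172, 112),
    Word("Violation", 50, 120, 95, 132),  # next line down, should be ignored
]
label = words[0]
date_text = " ".join(w.text for w in right_of(label, words))
print(date_text)  # February 3, 1905
```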
Extract tables
table = page.extract_table()
if table:
    df = table.to_df()
    print(df.head())
Output
   Statute                                     Description     Level  Repeat?
0   4.12.7                  Unsanitary Working Conditions.  Critical     <NA>
1    5.8.3                Inadequate Protective Equipment.   Serious     <NA>
2    6.3.9                  Ineffective Injury Prevention.   Serious     <NA>
3    7.1.5  Failure to Properly Store Hazardous Materials.  Critical     <NA>
4    8.9.2          Lack of Adequate Fire Safety Measures.   Serious     <NA>
Exclusion Zones
Sometimes you need to exclude headers, footers, or other unwanted areas from extraction.
Exclude specific regions
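# Exclude the header area at the top of the page
# (`top` and `bottom` are regions selected beforehand; their definitions are not shown here)
page.add_exclusion(top)
# Exclude the area below the last line
page.add_exclusion(bottom)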
# Now extract text without excluded areas
text = page.extract_text()
print(text)
Output
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham’s Pure Leaf Lard!
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
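Conceptually, an exclusion zone is a rectangle, and any word that overlaps one gets dropped before the text is assembled. The sketch below shows that idea with made-up boxes on a 612 x 792 page. The overlaps and extract_text functions here are hypothetical stand-ins, not the library's own code.

```python
def overlaps(a, b):
    """True when two (x0, top, x1, bottom) boxes intersect."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]


def extract_text(words, exclusions=()):
    """Join the words that do not touch any exclusion rectangle."""
    kept = [
        text for text, box in words
        if not any(overlaps(box, zone) for zone in exclusions)
    ]
    return " ".join(kept)


# Made-up words and boxes
words = [
    ("Header", (50, 20, 120, 32)),
    ("Site:", (50, 100, 80, 112)),
    ("Chicago", (85, 100, 130, 112)),
    ("Footer", (50, 760, 100, 772)),
]
page_width = 612
exclusions = [
    (0, 0, page_width, 80),      # header band at the top of the page
    (0, 740, page_width, 792),   # footer band at the bottom
]

print(extract_text(words))              # Header Site: Chicago Footer
print(extract_text(words, exclusions))  # Site: Chicago
```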
PDF-level exclusions
Apply exclusions to all pages in a PDF:
print("BEFORE EXCLUSION:", pdf.pages[0].extract_text()[:200])
# Add header exclusion to all pages
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
print("AFTER EXCLUSION:", pdf.pages[0].extract_text()[:200])
Output
BEFORE EXCLUSION: Site: Durham’s Meatpacking Chicago, Ill. Date: February 3, 1905 Violation Count: 7 Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people coul
AFTER EXCLUSION: Site: Durham’s Meatpacking Chicago, Ill. Date: February 3, 1905 Violation Count: 7 Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people coul
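Passing a function instead of a fixed region lets each page compute its own exclusion, which matters when pages differ in size. Here is a rough sketch of that pattern under simplified assumptions: the Page and Document classes below are toy stand-ins invented for this example (not Natural PDF's classes), and exclusions are reduced to vertical bands.

```python
class Page:
    """Toy page: a height plus (text, top, bottom) words. Hypothetical."""
    def __init__(self, height, words):
        self.height = height
        self.words = words


class Document:
    """Toy document that stores exclusion rules and applies them per page."""
    def __init__(self, pages):
        self.pages = pages
        self.exclusion_rules = []

    def add_exclusion(self, rule):
        # rule is a callable: page -> (band_top, band_bottom)
        self.exclusion_rules.append(rule)

    def extract_text(self, page_index):
        page = self.pages[page_index]
        # Evaluate every rule against this specific page
        bands = [rule(page) for rule in self.exclusion_rules]
        kept = [
            text for text, top, bottom in page.words
            if not any(top < band_bottom and bottom > band_top
                       for band_top, band_bottom in bands)
        ]
        return " ".join(kept)


doc = Document([
    Page(792, [("Header", 20, 32), ("Body-1", 100, 112), ("Footer", 760, 772)]),
    Page(500, [("Header", 20, 32), ("Body-2", 100, 112), ("Footer", 470, 482)]),
])
doc.add_exclusion(lambda page: (0, 80))                          # top 80 units
doc.add_exclusion(lambda page: (page.height - 50, page.height))  # bottom 50 units

texts = [doc.extract_text(i) for i in range(len(doc.pages))]
print(texts)  # ['Body-1', 'Body-2']
```

The footer rule reads page.height, so it lands in the right place on both the tall and the short page.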