Opening a PDF

Let's start by opening a PDF. Natural PDF can work with local files or URLs.

from natural_pdf import PDF

# Load the PDF and take its first page
pdf = PDF("basics.pdf")
page = pdf.pages[0]
# Display a rendering of the page
page.show()

Grabbing Page Text

Passing layout=True to extract_text() preserves the spatial arrangement of text on the page, padding the output with whitespace so that words stay roughly where they appear.

text = page.extract_text(layout=True)
print(text)
Output
                                                                                    
                                                                                    
                                                                                    
                                                     Jungle Health and Safety Inspection Service
                                                     INS-UP70N51NCL41R              
                                                                                    
       Site: Durham’s Meatpacking Chicago, Ill.                                     
                                                                                    
       Date: February 3, 1905                                                       
                                                                                    
       Violation Count: 7                                                           
       Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
       These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
                                                                                    
       visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
       some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
       into the vats; and when they were fished out, there was never enough of them left to be worth
       exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
       to the world as Durham’s Pure Leaf Lard!                                     
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
       Violations                                                                   
                                                                                    
        Statute Description                                    Level  Repeat?       
        4.12.7 Unsanitary Working Conditions.                  Critical             
                                                                                    
        5.8.3 Inadequate Protective Equipment.                 Serious              
        6.3.9 Ineffective Injury Prevention.                   Serious              
                                                                                    
        7.1.5 Failure to Properly Store Hazardous Materials.   Critical             
        8.9.2 Lack of Adequate Fire Safety Measures.           Serious              
                                                                                    
        9.6.4 Inadequate Ventilation Systems.                  Serious              
        10.2.7 Insufficient Employee Training for Safe Work Practices. Serious      
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                Jungle Health and Safety Inspection Service         
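
To make the idea behind layout-preserving output concrete, here is a minimal sketch of one way such text could be produced: each word is placed on a character grid according to its position. This is not Natural PDF's implementation. The `Word` record, the coordinates, and the `char_width` and `line_height` values are all hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical word record: text plus its left (x0) and top coordinates.
Word = namedtuple("Word", ["text", "x0", "top"])


def render_layout(words, char_width=6.0, line_height=12.0):
    """Place each word on a character grid based on its position."""
    rows = {}
    for word in words:
        row = int(word.top // line_height)
        col = int(word.x0 // char_width)
        rows.setdefault(row, []).append((col, word.text))

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            # Pad with spaces up to the target column, keeping at least
            # one space between neighbouring words.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)


# Made-up coordinates for illustration only.
words = [
    Word("Site:", 42, 24),
    Word("Durham's", 78, 24),
    Word("Date:", 42, 48),
    Word("February", 78, 48),
]
layout_text = render_layout(words)
print(layout_text)
```

Rows with no words come out as empty lines, which is why layout-style output like the one above contains long runs of blank lines.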

Selecting Elements and Text

Natural PDF uses CSS-like selectors to find specific elements on the page, such as rectangles, lines, and text.

Select text in a rectangle

page.find('rect').show()
text = page.find('rect').extract_text()
print(text)
Output
Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R

Find all text elements

page.find_all('text').show()
texts = page.find_all('text').extract_each_text()
for t in texts[:5]:  # Show first 5
    print(t)
Output
Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site:
Durham’s Meatpacking
Chicago, Ill.

Find colored text

# Find red text
red_text = page.find('text[color~=red]')
print(red_text.extract_text())
Output
INS-UP70N51NCL41R

Find text by content

# Find text containing a specific string
text = page.find('text:contains("INS-")')
print(text.extract_text())
Output
INS-UP70N51NCL41R

Spatial Navigation

Natural PDF can navigate from one element to the area around it, for example the region to the right of or below a label.

Extract text to the right of a label

# Extract text to the right of "Date:"
date = page.find(text="Date:").right(height='element')
date.show()
print(date.extract_text())
Output
February 3, 1905
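
The geometry behind "to the right of a label" can be sketched with plain bounding boxes. The example below is a conceptual illustration, not the library's code: the `Box` class, the helper functions, and every coordinate are hypothetical. It builds a region that starts at the label's right edge, spans the label's height, and collects the words whose centre falls inside.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box; top is smaller than bottom."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def right_of(label, page_width):
    """Region from the label's right edge to the page edge, as tall as the label."""
    return Box("", label.x1, label.top, page_width, label.bottom)


def words_inside(region, words):
    """Keep words whose centre point falls inside the region, left to right."""
    hits = []
    for w in words:
        cx = (w.x0 + w.x1) / 2
        cy = (w.top + w.bottom) / 2
        if region.x0 <= cx <= region.x1 and region.top <= cy <= region.bottom:
            hits.append(w)
    return sorted(hits, key=lambda w: w.x0)


# Made-up coordinates for illustration only.
words = [
    Box("Date:", 50, 100, 80, 112),
    Box("February", 85, 100, 135, 112),
    Box("3,", 139, 100, 149, 112),
    Box("1905", 153, 100, 178, 112),
    Box("Violation", 50, 124, 100, 136),
]
label = words[0]
region = right_of(label, page_width=612)
value = " ".join(w.text for w in words_inside(region, words))
print(value)  # February 3, 1905
```

Limiting the region to the label's own height is what keeps text from the lines above and below out of the result.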

Extract tables

table = page.extract_table()
if table:
    df = table.to_df()
    print(df.head())
Output
  Statute                                     Description     Level Repeat?
0  4.12.7                  Unsanitary Working Conditions.  Critical    <NA>
1   5.8.3                Inadequate Protective Equipment.   Serious    <NA>
2   6.3.9                  Ineffective Injury Prevention.   Serious    <NA>
3   7.1.5  Failure to Properly Store Hazardous Materials.  Critical    <NA>
4   8.9.2          Lack of Adequate Fire Safety Measures.   Serious    <NA>
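
Once a table is available as rows, ordinary Python is enough for quick summaries. The sketch below assumes the rows are lists of strings with the header first. They are typed in by hand from the page text shown earlier, not produced by calling the library. It tallies the violations by level.

```python
from collections import Counter

# Rows typed in by hand from the page text shown earlier; header first.
rows = [
    ["Statute", "Description", "Level", "Repeat?"],
    ["4.12.7", "Unsanitary Working Conditions.", "Critical", None],
    ["5.8.3", "Inadequate Protective Equipment.", "Serious", None],
    ["6.3.9", "Ineffective Injury Prevention.", "Serious", None],
    ["7.1.5", "Failure to Properly Store Hazardous Materials.", "Critical", None],
    ["8.9.2", "Lack of Adequate Fire Safety Measures.", "Serious", None],
    ["9.6.4", "Inadequate Ventilation Systems.", "Serious", None],
    ["10.2.7", "Insufficient Employee Training for Safe Work Practices.", "Serious", None],
]

# Turn each row into a dict keyed by column name.
header, *body = rows
records = [dict(zip(header, row)) for row in body]

# Count how many violations fall under each level.
level_counts = Counter(record["Level"] for record in records)
print(len(records), level_counts["Critical"], level_counts["Serious"])  # 7 2 5
```

The total of seven rows matches the "Violation Count: 7" line on the page, which makes a handy sanity check on an extraction.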

Exclusion Zones

Sometimes you need to exclude headers, footers, or other unwanted areas from extraction.

Exclude specific regions

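# Define a band 80 units tall at the top of the page, and the area below the last line
top = page.region(top=0, left=0, height=80)
bottom = page.find_all("line")[-1].below()
# Preview both regions before excluding them
(top + bottom).show()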
# Exclude top header area
page.add_exclusion(top)

# Exclude area below last line
page.add_exclusion(bottom)

# Now extract text without excluded areas
text = page.extract_text()
print(text)
Output
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
into the vats; and when they were fished out, there was never enough of them left to be worth
exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
to the world as Durham’s Pure Leaf Lard!
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
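
Conceptually, an exclusion zone is a rectangle whose contents are skipped during extraction. The following sketch shows that idea with plain tuples. It is not how Natural PDF implements exclusions, and the page size, zone sizes, and word coordinates are all made up for illustration.

```python
def overlaps(a, b):
    """True if two (x0, top, x1, bottom) rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def extract_without(words, exclusions):
    """Join the text of words that touch none of the excluded rectangles."""
    kept = [
        text
        for text, box in words
        if not any(overlaps(box, zone) for zone in exclusions)
    ]
    return " ".join(kept)


# Hypothetical page geometry: a header band at the top and a footer band at the bottom.
page_width, page_height = 612, 792
header_zone = (0, 0, page_width, 80)
footer_zone = (0, 740, page_width, page_height)

# Hypothetical words as (text, (x0, top, x1, bottom)).
words = [
    ("INS-UP70N51NCL41R", (300, 40, 420, 52)),  # falls in the header band
    ("Site:", (50, 100, 80, 112)),
    ("Violations", (50, 400, 110, 412)),
    ("Jungle", (200, 760, 240, 772)),  # falls in the footer band
]
body_text = extract_without(words, [header_zone, footer_zone])
print(body_text)  # Site: Violations
```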

PDF-level exclusions

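Apply exclusions to all pages in a PDF by passing a function that takes a page and returns the region to exclude. In this walkthrough the first page already has the page-level exclusions added above, so the before and after output below is identical: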

print("BEFORE EXCLUSION:", pdf.pages[0].extract_text()[:200])
# Add header exclusion to all pages
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
print("AFTER EXCLUSION:", pdf.pages[0].extract_text()[:200])
Output
BEFORE EXCLUSION: Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people coul
AFTER EXCLUSION: Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people coul