CIA Document Classification
Let's work with a declassified CIA document and use AI to classify and extract information.
Just like we did above, we can ask which category the PDF belongs to.
pdf.classify(
    ['slaughterhouse report', 'dolphin training manual', 'basketball', 'birding'],
    using='text'
)
(pdf.category, pdf.category_confidence)
('birding', 0.5170503258705139)
I promise birding is real! The PDF is about using pigeons to take surveillance photos.
Beyond the text content, notice how different the pages look from one another. We can also categorize each page using vision!
pdf.classify_pages(
    ['diagram', 'text', 'invoice', 'blank'],
    using='vision'
)
for page in pdf.pages:
    print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Output
Page 1 is text - 0.633
Page 2 is text - 0.957
Page 3 is text - 0.921
Page 4 is diagram - 0.895
Page 5 is diagram - 0.891
Page 6 is invoice - 0.919
Page 7 is text - 0.834
Page 8 is invoice - 0.594
Page 9 is invoice - 0.971
Page 10 is invoice - 0.987
Page 11 is invoice - 0.994
Page 12 is invoice - 0.992
Page 13 is text - 0.822
Page 14 is text - 0.936
Page 15 is diagram - 0.913
Page 16 is text - 0.617
Page 17 is invoice - 0.868
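A few of those confidence scores are noticeably lower than the rest. As a rough sketch of how you might flag them for a manual look, here is plain Python over the numbers copied from the output above. The `results` dict and the `THRESHOLD` cutoff are illustrative assumptions, not part of the library.

```python
# Page number -> (category, confidence), copied from the output above.
results = {
    1: ("text", 0.633), 2: ("text", 0.957), 3: ("text", 0.921),
    4: ("diagram", 0.895), 5: ("diagram", 0.891), 6: ("invoice", 0.919),
    7: ("text", 0.834), 8: ("invoice", 0.594), 9: ("invoice", 0.971),
    10: ("invoice", 0.987), 11: ("invoice", 0.994), 12: ("invoice", 0.992),
    13: ("text", 0.822), 14: ("text", 0.936), 15: ("diagram", 0.913),
    16: ("text", 0.617), 17: ("invoice", 0.868),
}

THRESHOLD = 0.7  # hypothetical cutoff; pick whatever suits your documents

# Keep only the pages whose confidence falls below the cutoff.
uncertain = {n: r for n, r in results.items() if r[1] < THRESHOLD}

for number, (category, confidence) in uncertain.items():
    print(f"Page {number}: {category} ({confidence:.3f}) - worth a manual look")
```

With a cutoff of 0.7 this flags pages 1, 8 and 16.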
And if we just want to see the pages that are diagrams, we can .filter for them.
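To make the idea concrete without needing the PDF on hand, here is a minimal stand-in in plain Python. The `Page` dataclass is hypothetical and only mimics the `number` and `category` attributes used above. The categories are copied from the vision output. With the library itself, the predicate would be the same `lambda page: page.category == 'diagram'` passed to `.filter` in the save example at the end of this section.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a classified page (not the library's class)."""
    number: int
    category: str


# Categories for pages 1-17, copied from the classification output.
categories = [
    "text", "text", "text", "diagram", "diagram", "invoice", "text",
    "invoice", "invoice", "invoice", "invoice", "invoice", "text",
    "text", "diagram", "text", "invoice",
]
pages = [Page(number, category) for number, category in enumerate(categories, start=1)]

# Filtering means keeping the pages for which the predicate is true.
diagrams = [page for page in pages if page.category == "diagram"]

print([page.number for page in diagrams])  # [4, 5, 15]
```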
We can also put them into groups.
groups = pdf.pages.groupby(lambda page: page.category)
groups.info()
Output
PageGroupBy with 3 groups:
----------------------------------------
[0] 'text': 7 pages
[1] 'diagram': 3 pages
[2] 'invoice': 7 pages
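If you're curious what grouping by a function amounts to, here is a small sketch in plain Python. It is not how the library implements `groupby`. The `group_by` helper is a hypothetical illustration, and pages are represented as simple `(number, category)` tuples built from the vision output.

```python
# Categories for pages 1-17, copied from the classification output.
categories = [
    "text", "text", "text", "diagram", "diagram", "invoice", "text",
    "invoice", "invoice", "invoice", "invoice", "invoice", "text",
    "text", "diagram", "text", "invoice",
]
pages = list(enumerate(categories, start=1))  # (number, category) tuples


def group_by(items, key):
    """Bucket items by key(item), keeping groups in first-seen order."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return groups


groups = group_by(pages, key=lambda page: page[1])

for index, (name, members) in enumerate(groups.items()):
    print(f"[{index}] {name!r}: {len(members)} pages")
```

The counts come out the same as in the summary above: 7 text pages, 3 diagrams and 7 invoices.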
And if the diagrams are all we're interested in? We can save a new PDF of just those pages!
(
    pdf.pages
    .filter(lambda page: page.category == 'diagram')
    .save_pdf("diagrams.pdf", original=True)
)