Complex Table Extraction from OECD Czech PISA Assessment
This PDF is an OECD document about the PISA assessment, written in Czech. The goal is to extract the survey question table on page 9. The main challenge is the table's irregular layout, which makes automatic extraction difficult.
pdf.pages[7].find_all("text:bold:not-empty").dissolve().show()

questions = (
    pdf
    .pages[6:15]
    .find_all('text:bold[size~=14][x0>100]:not-empty')
    .dissolve(padding=5)
)
questions.show()
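
The snippets above call dissolve() on the matched bold text, optionally with a padding value. As a rough illustration of the idea only, and not natural-pdf's actual implementation, the sketch below assumes that dissolving means merging bounding boxes that lie within a padding distance of each other. The Box class and the near, union, and merge_nearby helpers are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box in PDF points (origin at top-left)."""
    x0: float
    top: float
    x1: float
    bottom: float


def near(a: Box, b: Box, padding: float) -> bool:
    """True if b overlaps a after a is grown by `padding` on every side."""
    return (
        a.x0 - padding <= b.x1
        and b.x0 <= a.x1 + padding
        and a.top - padding <= b.bottom
        and b.top <= a.bottom + padding
    )


def union(a: Box, b: Box) -> Box:
    """Smallest box that covers both a and b."""
    return Box(
        min(a.x0, b.x0),
        min(a.top, b.top),
        max(a.x1, b.x1),
        max(a.bottom, b.bottom),
    )


def merge_nearby(boxes: list[Box], padding: float = 5.0) -> list[Box]:
    """Repeatedly merge any two boxes within `padding` until none remain."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if near(merged[i], merged[j], padding):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


# Two fragments on the same line (3 points apart) and one far below.
fragments = [
    Box(100, 50, 140, 64),
    Box(143, 50, 200, 64),
    Box(100, 300, 180, 314),
]
regions = merge_nearby(fragments, padding=5)
```

With these made-up coordinates, the two fragments on the first line collapse into one region and the distant fragment stays separate, which is the kind of grouping a larger padding value would make more aggressive.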