dissolve

4 usages across 1 PDF

Complex Table Extraction from OECD Czech PISA Assessment

This PDF is an OECD document about the PISA assessment, written in Czech. The main extraction goal is the survey question table on page 9. The challenge is the table's unusual layout, which makes automatic extraction difficult.

pdf.pages[7].find_all("text:bold:not-empty").dissolve().show()
questions = (
    pdf
    .pages[6:15]
    .find_all('text:bold[size~=14][x0>100]:not-empty')
    .dissolve(padding=5)
)
questions.show()
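
To make the effect of `padding` in the snippets above more concrete, here is a minimal, library-independent sketch of the general idea of dissolving: bounding boxes that touch, or sit within `padding` points of each other, are merged into one enclosing box. This is an illustration only, not the library's actual implementation; the `Box` layout and the helper names `boxes_touch` and `dissolve_boxes` are hypothetical and chosen just for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout for this sketch: (x0, top, x1, bottom) in PDF points.
Box = Tuple[float, float, float, float]


def boxes_touch(a: Box, b: Box, padding: float) -> bool:
    """Return True if box a, grown by `padding` on every side, overlaps box b."""
    return (
        a[0] - padding <= b[2]
        and b[0] <= a[2] + padding
        and a[1] - padding <= b[3]
        and b[1] <= a[3] + padding
    )


def dissolve_boxes(boxes: List[Box], padding: float = 0.0) -> List[Box]:
    """Merge boxes that are connected (directly or via neighbours) into one box each."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of boxes that lie within `padding` of each other.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_touch(boxes[i], boxes[j], padding):
                parent[find(i)] = find(j)

    # Grow one enclosing box per connected group.
    merged: Dict[int, Box] = {}
    for i, (x0, top, x1, bottom) in enumerate(boxes):
        root = find(i)
        if root in merged:
            mx0, mtop, mx1, mbottom = merged[root]
            merged[root] = (min(mx0, x0), min(mtop, top), max(mx1, x1), max(mbottom, bottom))
        else:
            merged[root] = (x0, top, x1, bottom)

    # Return groups in reading order: top to bottom, then left to right.
    return sorted(merged.values(), key=lambda b: (b[1], b[0]))


# Two bold words on one line (3 pt apart) and a third word further down the page.
words = [(100, 50, 140, 64), (143, 50, 200, 64), (100, 120, 180, 134)]
regions = dissolve_boxes(words, padding=5)
```

In this sketch a larger `padding` bridges wider gaps, so fragments of one heading collapse into a single region, while a value of 0 keeps boxes separate unless they actually overlap or share an edge.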