Extracting State Agency Call Center Wait Times from FOIA PDF
This PDF contains data on wait times at a state agency call center. The main focus is the data on the first two pages, which follows the same submission format other states used. The later pages provide granular breakdowns over several years. The main challenge is that the scan is heavily pixelated, which makes numbers and text hard to read, and the charts are inconsistent and in places unreadable.
The submission said "the first two pages" so I'm going with that. The rest of the pages are insane and will need a wholly separate writeup.
The pages are images so they don't have text, but we can always double-check.
# No results? Needs OCR!
print(page.extract_text())
I love surya so I'm going to use it instead of the default of easyocr. Two ways to check the results: look at where it found text and look at what the text is.
page.apply_ocr('surya')
page.find_all('text').show(crop=True)
Detecting bboxes: 100%|##########| 1/1 [00:03<00:00, 3.09s/it]
Recognizing Text: 100%|##########| 73/73 [02:13<00:00, 1.82s/it]
And now we'll look at what the text is.
print(page.extract_text(layout=True))
On-Demand Interviews - Interim and Final Reporting Complete all shaded fields Figure Comments* State: Average call wait time for interview in minutes 19 Minutes <b>Report Start Date</b> 9/1/2019 Number of all calls that result in a completed .interview 34,396 <b>Report End Date</b> 3/31/2020 Percent 98.31% Number of applications Average call completion time in minutes 29 filed 34438 Number of recertification: Number of dropped calls 627 filed 17604 These were the completed Face to face Number of applications Number of requests for an in-person interview 16982 Intervews. There is no way to track in the intervlewed on 1st day 22409 Number of NOMIs sent for failure to complete initial application interview 7809 Percent 22.60% Number of NOMIs sent for failure to complete- recertification interview. 13007 Percent 73.80% Number of applications denied for failure to complete the interview in 30 days 3885 Percent 45.00% Number of recertifications denied for failure to complete the interview in 30 days. 74 Percent 10.00% *Please use the comments field for any clarifications or context that are needed for any data points. <b>Notes on Measures</b> 1. Average call wait time for interview: Do not include abandoned and dropped calls. "Wat time" ends when the eligibility worker answers the call to begin the interview 2. Number and percent of all calls that result in a completed interview; Include abandoned and eropped calls in the denominator when calculating the percentage. 3. Average call completion time in minutes: "Completion time" means the full duration of time the client spends on the call, beginning when the client. nters the call center queue until the Interview is completed 4. Number of dropped rails: Include all calls disconnected due to call center error, lack of call center capacity, etc. Do not include abandoned calls, uring which the client terminates the call before completion. 5. 
Number of requests for an in-person interview: This figure should represent the number of clients who were given directions to complete their interview through the call certer, but requested an in-person interview instead. 6. Number and percent of NOMIs sent for failure to complete initial application interview: Include all applications across the casebad in the ferominator when calculating the percentage. 7. Number and percent of NOMIs sent for failure to complete the recertification interview: Include all recertifications across the caseload in the nator when calculating the percentage. 8. Number and percent of all applications denied that were denied due to failure to complete the interview in 30 days: include all applications across the caselcad for which a denial was ssued in the denominator when calculating the percentage. 9. Number and percent of all recertifications denied that were denied due to failure to complete the interviev in 30 days: Include all recertifications across the caselead for which a denial was issued in the denominator when calculating the percentage.
To get the table area, we get everything from the "Figure" header down to "Please use the comments field".
We need to cut it in on the sides a little bit, and expand it on the bottom. I just picked some manual values because I'm lazy; there should probably be a better way to resize things based on selectors.
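To make the geometry concrete, here is a minimal sketch of the idea in plain Python. It is not the natural-pdf API: the `Box` class, the `region_between` and `adjust` helpers, and every coordinate are hypothetical stand-ins for "take the band between two anchors, then pull in the sides and push out the bottom".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A bounding box in page coordinates (origin at the top-left)."""
    x0: float
    top: float
    x1: float
    bottom: float


def region_between(start: Box, end: Box, page_width: float) -> Box:
    """Full-width band from the top of `start` down to the top of `end`."""
    return Box(0.0, start.top, page_width, end.top)


def adjust(box: Box, left: float = 0.0, right: float = 0.0,
           top: float = 0.0, bottom: float = 0.0) -> Box:
    """Grow each edge outward by the given amount; negative values shrink it."""
    return Box(box.x0 - left, box.top - top, box.x1 + right, box.bottom + bottom)


# Hypothetical anchor positions, invented for illustration only.
figure_header = Box(300, 120, 340, 132)   # where "Figure" might sit
footnote = Box(40, 520, 500, 532)         # where "*Please use the comments field" might sit

band = region_between(figure_header, footnote, page_width=612)
# Cut in 30 units on each side, expand 10 units on the bottom.
table_area = adjust(band, left=-30, right=-30, bottom=10)
print(table_area)
```

The manual numbers (30 and 10 here) play the same role as the hand-picked values above: they are tuned by eye against the rendered page, not derived from the content.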
Now we can see all the text in our area.
For some reason we can't just use .extract_table('stream') on this, even though there are some nice gaps between each column. Oh well!
Instead we'll throw three vertical dividers in and then shuffle them around until they don't intersect any of the text. The horizontal borders are easier because they're just lines.
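The "shuffle until nothing intersects" step can be sketched in plain Python. This is an illustration of the idea only, not how natural-pdf implements it: `nudge_divider`, `settle_dividers`, the `pad` margin, and all the coordinates are assumptions made up for the example.

```python
def overlaps(x, spans, pad=0.0):
    """True if a vertical line at x falls inside any (x0, x1) text span."""
    return any(x0 - pad < x < x1 + pad for x0, x1 in spans)


def nudge_divider(x, spans, step=1.0, max_shift=50.0, pad=2.0):
    """Shift x left or right in growing steps until it clears every text span."""
    if not overlaps(x, spans, pad):
        return x
    shift = step
    while shift <= max_shift:
        # Try the left candidate first, then the right one, at each distance.
        for candidate in (x - shift, x + shift):
            if not overlaps(candidate, spans, pad):
                return candidate
        shift += step
    raise ValueError(f"no clear position within {max_shift} of x={x}")


def settle_dividers(guesses, spans):
    """Nudge every starting guess into the nearest gap between text spans."""
    return [nudge_divider(x, spans) for x in guesses]


# Hypothetical horizontal extents of the text in each column.
text_spans = [(30, 280), (300, 360), (400, 570)]

# Three rough starting positions; the first two cut through text.
dividers = settle_dividers([270, 350, 390], text_spans)
print(dividers)
```

A divider that already sits in a gap is left alone, and one that cannot find a gap within `max_shift` raises instead of silently slicing through a word.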
And now we can grab the table!
df = (
    guide
    .extract_table()
    .to_df(
        header=['value', 'amount', 'comments']
    )
)
df
| | value | amount | comments |
|---|---|---|---|
| 0 | Average call wait time for interview in minute... | 19 Minutes | None |
| 1 | Number of all calls that result in a completed... | 34,396 | None |
| 2 | Percent | 98.31% | None |
| 3 | Average call completion time in minutes | 29 | None |
| 4 | Number of dropped calls | 627 | None |
| 5 | Number of requests for an in-person interview\... | 16982 Intervews. There is no way to track in the | These were the completed Face to face\nInterve... |
| 6 | Number of NOMIs sent for failure to complete\n... | 7809 | None |
| 7 | Percent | 22.60% | None |
| 8 | Number of NOMIs sent for failure to complete-\... | 13007 | None |
| 9 | Percent\nNumber of applications denied for fai... | 73.80% | None |
| 10 | Number of applications denied for failure to\n... | 3885 | None |
| 11 | Percent | 45.00% | None |
| 12 | Number of recertifications denied for failure ... | 74 | None |
| 13 | Percent | 10.00% | None |
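The amount column is still text, and a few cells carry extras ("19 Minutes", thousands separators, percent signs, a stray bit of comment in row 5). One possible cleanup, sketched with only the standard library: pull the leading number out of each cell and note whether it was a percentage. The `parse_amount` helper is hypothetical, and it assumes the number always comes first in the cell, which holds for the rows shown above but is not guaranteed in general.

```python
import re

# Leading number with optional thousands separators and decimals, then an optional %.
LEADING_NUMBER = re.compile(r"^\s*(\d[\d,]*(?:\.\d+)?)\s*(%?)")


def parse_amount(raw):
    """Return (value, is_percent) for the number at the start of a cell, or None."""
    match = LEADING_NUMBER.match(raw)
    if match is None:
        return None
    digits, percent = match.groups()
    return float(digits.replace(",", "")), percent == "%"


# Cell values copied from the amount column above.
samples = [
    "19 Minutes",
    "34,396",
    "98.31%",
    "16982 Intervews. There is no way to track in the",
]
for cell in samples:
    print(repr(cell), "->", parse_amount(cell))
```

This only tidies what OCR produced. It cannot tell a misread digit from a correct one, so the numbers still need a check against the page image.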
The next page is....... too hard for now.