Scanning documents with Tesseract with HOCR

Han Oostdijk

2019/12/07

Date last run: 07Dec2019

Introduction

I posted before about using magick and tesseract in postings here and here. Looking again at tesseract because the inputs were changed, I noticed two possibilities to improve the workflow:

When I used the tesseract::ocr_data, I needed to derive the line number from the bounding box, so I decided to go for the tesseract::ocr(HOCR=T) option. A part of the XHTML document is listed here:

<span class='ocr_line' id='line_1_5' title="bbox 19 227 1087 251; baseline 0 -8; x_size 14.113145; x_descenders 3.1131446; x_ascenders 3.6521802">
      <span class='ocrx_word' id='word_1_28' title='bbox 19 227 147 251; x_wconf 90'>Standaardbereik</span>
      <span class='ocrx_word' id='word_1_29' title='bbox 198 232 242 243; x_wconf 86'>1418</span>

On the internet I found a snippet by Jeroen Ooms that converts the XHTML document to a data.frame. I created a function from the snippet in which I also extracted the line number and made the extraction of the confidence rate conditional. In the previous blogs I used scans to extract the results of medical (blood) tests.

Including cleanup_bw, scan_with_hocr and extract_table in package

The functions used to do the extraction of the medical test results, were generalized for extracting a table from a scan. And being more general they were included in the R-package HOQCutil . Using among others these functions the total workflow then is:

The functions are (of course) described in the help of the HOQCutil package under the name of the function or under scanner_functions. The remainder of this blog shows how this works on an example file.

Example

We will try to scan the table in Figure 1 . We created the .PNG file ourselves (from an MS Excel sheet) so we could give it some ‘features’:

plot of chunk g1

Figure 1: Example image (in .PNG file)

Read the image from file

filename = 'example.PNG'
img1 = magick::image_read(filename) 

Cleanup the file

The result can be seen in Figure 2 .

cln_options1 = list(resize="4000x",trim=10,enhance=TRUE,sharpen=1) 
img2 = HOQCutil::cleanup_bw(img1,cln_options1) 
img2

plot of chunk unnamed-chunk-7

Figure 2: Example image after cleanup_bw processing

OCR the cleansed image

In this step the actual scanning takes place. The results are not too bad but the first and last (is this a coincidence?) row of the table are not correctly scanned. We will show the result (inTable 1) after the next step.

df1  = HOQCutil::scan_with_hocr(img2,add_header_cols = T) 

Prepare for extracting the table

In the previous step we added two columns (header_col and header_col_seq) to the scan results. We use these to indicate which elements will be the header of the table. This could be done interactively by using edit but here we do it by coding:

df1$header_col[c(9,10,11,12,13,14,15,16,17)]= c(1,2,3,4,4,1,2,3,4)

The resulting data.frame can be seen in Table 1:.

line fld_nr word x1 y1 y2 x2 header_col header_col_seq
1 1 Tnis 9 0 101 251 0 1
1 2 is 303 0 101 389 0 1
1 3 an 442 27 101 577 0 1
1 4 example 630 0 120 1143 0 1
1 5 of 1192 0 101 1311 0 1
1 6 a 1350 27 101 1415 0 1
1 7 table 1461 0 101 1765 0 1
1 8 (Verdana) 1815 0 120 2412 0 1
2 1 jan 1299 314 437 1481 1 1
2 2 feb 2005 310 441 2176 2 1
2 3 mar 2693 310 441 2908 3 1
2 4 first 3259 314 417 3497 4 1
2 5 quarter 3543 325 436 3992 4 1
3 1 2019 1253 478 609 1533 1 1
3 2 2019 1954 478 609 2223 2 1
3 3 2019 2665 478 609 2945 3 1
3 4 2019 3475 492 585 3767 4 1
4 1 city 575 646 777 789 0 1
4 2 101 1302 660 753 1495 0 1
4 3 102 2004 660 753 2216 0 1
4 4 103 2705 660 753 2917 0 1
4 5 306 3515 660 753 3728 0 1
5 1 1 591 819 911 643 0 1
6 1 city 575 957 1078 789 0 1
6 2 2 838 966 1059 902 0 1
6 3 203 1292 966 1059 1505 0 1
6 4 205 2695 966 1059 2916 0 1
6 5 408 3513 966 1059 3727 0 1
7 1 provinc 580 1115 1236 1018 0 1
7 2 complete 1118 1115 1236 1676 0 1
7 3 incomplete 1775 1115 1236 2436 0 1
7 4 complete 2531 1115 1236 3078 0 1
7 5 incomplete 3296 1115 1236 3947 0 1
8 1 el 581 1283 1373 762 0 1
Table 1: Result table after scanning cleansed image and inserting header info

Extract the table

The last step is done with the HOQCutil::extract_table function. The resulting table is listed in Table 2. We use desc_above=F because we see in Figure 1 (and Figure 2) that the descriptions are not above the data line. See e.g. the ‘1’ for the first city line and the ‘e 1’ for the last data line.

df2 = HOQCutil::extract_table(df1, headers=NULL, lastline = Inf, desc_above=F)

We also see in Table 2 that the description ‘provinc el’ is not correct. With a perfect scan this would have been ‘provinc e 1’, because descriptions are pasted with a space as separator. So in this case a small correction will have to be made. Also the NA (because no data was present in that cell) should be replaced by whatever the user thinks feasible.
Lesson: always check the results of the scan.

desc jan 2019 feb 2019 mar 2019 first quarter 2019
city 1 101 102 103 306
city 2 203 NA 205 408
provinc el complete incomplete complete incomplete
Table 2: Extracted table

Alternative for the extract

In the previous step we could use the column header_col that we filled in an earlier step with the information concerning headers. However we can also specify this information in list-form:

hdr_desc = list(
  list(c(2,1),c(3,1)),
  list(c(2,2),c(3,2)),
  list(c(2,3),c(3,3)),
  list(c(2,4),c(2,5),c(3,4))
)
df3 = HOQCutil::extract_table(df1,headers=hdr_desc, lastline = Inf, desc_above=T)

In Table 3 we see that this leads to the same headers. We also show the effect of specifying (wrongly) desc_above=T.

desc jan 2019 feb 2019 mar 2019 first quarter 2019
city 101 102 103 306
1 city 2 203 NA 205 408
provinc complete incomplete complete incomplete
Table 3: Extracted table with 'wrong' descriptions

Session Info

This document was produced on 07Dec2019 with the following R environment:

  #> R version 3.6.0 (2019-04-26)
  #> Platform: x86_64-w64-mingw32/x64 (64-bit)
  #> Running under: Windows 10 x64 (build 18362)
  #> 
  #> Matrix products: default
  #> 
  #> locale:
  #> [1] LC_COLLATE=English_United States.1252 
  #> [2] LC_CTYPE=English_United States.1252   
  #> [3] LC_MONETARY=English_United States.1252
  #> [4] LC_NUMERIC=C                          
  #> [5] LC_TIME=English_United States.1252    
  #> 
  #> attached base packages:
  #> [1] stats     graphics  grDevices utils     datasets  methods   base     
  #> 
  #> loaded via a namespace (and not attached):
  #>  [1] Rcpp_1.0.3       knitr_1.26       xml2_1.2.2       magrittr_1.5    
  #>  [5] rappdirs_0.3.1   tidyselect_0.2.5 tesseract_4.1    HOQCutil_0.1.15 
  #>  [9] R6_2.4.1         rlang_0.4.2      stringr_1.4.0    highr_0.8       
  #> [13] dplyr_0.8.3      tools_3.6.0      xfun_0.10        ellipsis_0.3.0  
  #> [17] htmltools_0.4.0  digest_0.6.23    assertthat_0.2.1 lifecycle_0.1.0 
  #> [21] tibble_2.1.3     crayon_1.3.4     tidyr_1.0.0      purrr_0.3.3     
  #> [25] captioner_2.2.3  vctrs_0.2.0      zeallot_0.1.0    glue_1.3.1      
  #> [29] evaluate_0.14    rmarkdown_1.18   stringi_1.4.3    pillar_1.4.2    
  #> [33] compiler_3.6.0   backports_1.1.5  magick_2.2       pkgconfig_2.0.3