Scanning documents with Tesseract with HOCR

Date last run: 07Dec2019

Introduction

I posted before about using magick and tesseract in postings here and here. Looking again at tesseract because the inputs were changed, I noticed two possibilities to improve the workflow:

using the tesseract::ocr_data function that returns a data.frame with the scanned text and a confidence rate and a bounding box for this text
the HOCR argument of the tesseract::ocr function that returns an XHTML document with the same elements as tesseract::ocr_data but with an additional line number

When I used the tesseract::ocr_data, I needed to derive the line number from the bounding box, so I decided to go for the tesseract::ocr(HOCR=T) option. A part of the XHTML document is listed here:

<span class='ocr_line' id='line_1_5' title="bbox 19 227 1087 251; baseline 0 -8; x_size 14.113145; x_descenders 3.1131446; x_ascenders 3.6521802">
      <span class='ocrx_word' id='word_1_28' title='bbox 19 227 147 251; x_wconf 90'>Standaardbereik</span>
      <span class='ocrx_word' id='word_1_29' title='bbox 198 232 242 243; x_wconf 86'>1418</span>

On the internet I found a snippet by Jeroen Ooms that converts the XHTML document to a data.frame. I created a function from the snippet in which I also extracted the line number and made the extraction of the confidence rate conditional. In the previous blogs I used scans to extract the results of medical (blood) tests.

Including `cleanup_bw`, `scan_with_hocr` and `extract_table` in package

The functions used to do the extraction of the medical test results, were generalized for extracting a table from a scan. And being more general they were included in the R-package HOQCutil . Using among others these functions the total workflow then is:

read an image from file with magick::image_read
e.g. img1 = magick::image_read('example1.png')
define the list with cleanup options
e.g.cln_options1 = list(resize="4000x",trim=10,enhance=TRUE,sharpen=1)
use the HOQCutil::cleanup_bw function with this list
e.g. img2 = HOQCutil::cleanup_bw (img1,cln_options1)
scan (OCR) the cleansed image with HOQCutil::scan_with_hocr
e.g.df1 = HOQCutil::scan_with_hocr(img2,add_header_cols=F)
indicate in the columns of df1 which fields belong to the table headers (or alternatively define a headers list)
extract the table with the HOQCutil::extract_table function
e.g. df2= HOQCutil::extract_table(df1, headers=NULL,lastline = Inf, desc_above=T) or alternatively
df2= HOQCutil::extract_table(df1, headers=hdr_desc,lastline = Inf, desc_above=T)

The functions are (of course) described in the help of the HOQCutil package under the name of the function or under scanner_functions. The remainder of this blog shows how this works on an example file.

Example

We will try to scan the table in Figure 1 . We created the .PNG file ourselves (from an MS Excel sheet) so we could give it some ‘features’:

headers are placed over two lines and contain two words (except last header)
the last header contains three words
two descriptions are placed over two lines
we used the ‘Verdana’ font. With other fonts the scans were far worse!

plot of chunk g1

Figure 1: Example image (in .PNG file)

Read the image from file

filename = 'example.PNG'

img1 = magick::image_read(filename)

Cleanup the file

The result can be seen in Figure 2 .

cln_options1 = list(resize="4000x",trim=10,enhance=TRUE,sharpen=1) 
img2 = HOQCutil::cleanup_bw(img1,cln_options1) 
img2

plot of chunk unnamed-chunk-7

Figure 2: Example image after cleanup_bw processing

OCR the cleansed image

In this step the actual scanning takes place. The results are not too bad but the first and last (is this a coincidence?) row of the table are not correctly scanned. We will show the result (inTable 1) after the next step.

df1  = HOQCutil::scan_with_hocr(img2,add_header_cols = T)

Prepare for extracting the table

In the previous step we added two columns (header_col and header_col_seq) to the scan results. We use these to indicate which elements will be the header of the table. This could be done interactively by using edit but here we do it by coding:

df1$header_col[c(9,10,11,12,13,14,15,16,17)]= c(1,2,3,4,4,1,2,3,4)

The resulting data.frame can be seen in Table 1:.

line	fld_nr	word	x1	y1	y2	x2	header_col	header_col_seq
1	1	Tnis	9	0	101	251	0	1
1	2	is	303	0	101	389	0	1
1	3	an	442	27	101	577	0	1
1	4	example	630	0	120	1143	0	1
1	5	of	1192	0	101	1311	0	1
1	6	a	1350	27	101	1415	0	1
1	7	table	1461	0	101	1765	0	1
1	8	(Verdana)	1815	0	120	2412	0	1
2	1	jan	1299	314	437	1481	1	1
2	2	feb	2005	310	441	2176	2	1
2	3	mar	2693	310	441	2908	3	1
2	4	first	3259	314	417	3497	4	1
2	5	quarter	3543	325	436	3992	4	1
3	1	2019	1253	478	609	1533	1	1
3	2	2019	1954	478	609	2223	2	1
3	3	2019	2665	478	609	2945	3	1
3	4	2019	3475	492	585	3767	4	1
4	1	city	575	646	777	789	0	1
4	2	101	1302	660	753	1495	0	1
4	3	102	2004	660	753	2216	0	1
4	4	103	2705	660	753	2917	0	1
4	5	306	3515	660	753	3728	0	1
5	1	1	591	819	911	643	0	1
6	1	city	575	957	1078	789	0	1
6	2	2	838	966	1059	902	0	1
6	3	203	1292	966	1059	1505	0	1
6	4	205	2695	966	1059	2916	0	1
6	5	408	3513	966	1059	3727	0	1
7	1	provinc	580	1115	1236	1018	0	1
7	2	complete	1118	1115	1236	1676	0	1
7	3	incomplete	1775	1115	1236	2436	0	1
7	4	complete	2531	1115	1236	3078	0	1
7	5	incomplete	3296	1115	1236	3947	0	1
8	1	el	581	1283	1373	762	0	1

Table 1: Result table after scanning cleansed image and inserting header info

Extract the table

The last step is done with the HOQCutil::extract_table function. The resulting table is listed in Table 2. We use desc_above=F because we see in Figure 1 (and Figure 2) that the descriptions are not above the data line. See e.g. the ‘1’ for the first city line and the ‘e 1’ for the last data line.

df2 = HOQCutil::extract_table(df1, headers=NULL, lastline = Inf, desc_above=F)

We also see in Table 2 that the description ‘provinc el’ is not correct. With a perfect scan this would have been ‘provinc e 1’, because descriptions are pasted with a space as separator. So in this case a small correction will have to be made. Also the NA (because no data was present in that cell) should be replaced by whatever the user thinks feasible.
Lesson: always check the results of the scan.

desc	jan 2019	feb 2019	mar 2019	first quarter 2019
city 1	101	102	103	306
city 2	203	NA	205	408
provinc el	complete	incomplete	complete	incomplete

Table 2: Extracted table

Alternative for the extract

In the previous step we could use the column header_col that we filled in an earlier step with the information concerning headers. However we can also specify this information in list-form:

hdr_desc = list(
  list(c(2,1),c(3,1)),
  list(c(2,2),c(3,2)),
  list(c(2,3),c(3,3)),
  list(c(2,4),c(2,5),c(3,4))
)

df3 = HOQCutil::extract_table(df1,headers=hdr_desc, lastline = Inf, desc_above=T)

In Table 3 we see that this leads to the same headers. We also show the effect of specifying (wrongly) desc_above=T.

desc	jan 2019	feb 2019	mar 2019	first quarter 2019
city	101	102	103	306
1 city 2	203	NA	205	408
provinc	complete	incomplete	complete	incomplete

Table 3: Extracted table with 'wrong' descriptions

Session Info

This document was produced on 07Dec2019 with the following R environment:

  #> R version 3.6.0 (2019-04-26)
  #> Platform: x86_64-w64-mingw32/x64 (64-bit)
  #> Running under: Windows 10 x64 (build 18362)
  #> 
  #> Matrix products: default
  #> 
  #> locale:
  #> [1] LC_COLLATE=English_United States.1252 
  #> [2] LC_CTYPE=English_United States.1252   
  #> [3] LC_MONETARY=English_United States.1252
  #> [4] LC_NUMERIC=C                          
  #> [5] LC_TIME=English_United States.1252    
  #> 
  #> attached base packages:
  #> [1] stats     graphics  grDevices utils     datasets  methods   base     
  #> 
  #> loaded via a namespace (and not attached):
  #>  [1] Rcpp_1.0.3       knitr_1.26       xml2_1.2.2       magrittr_1.5    
  #>  [5] rappdirs_0.3.1   tidyselect_0.2.5 tesseract_4.1    HOQCutil_0.1.15 
  #>  [9] R6_2.4.1         rlang_0.4.2      stringr_1.4.0    highr_0.8       
  #> [13] dplyr_0.8.3      tools_3.6.0      xfun_0.10        ellipsis_0.3.0  
  #> [17] htmltools_0.4.0  digest_0.6.23    assertthat_0.2.1 lifecycle_0.1.0 
  #> [21] tibble_2.1.3     crayon_1.3.4     tidyr_1.0.0      purrr_0.3.3     
  #> [25] captioner_2.2.3  vctrs_0.2.0      zeallot_0.1.0    glue_1.3.1      
  #> [29] evaluate_0.14    rmarkdown_1.18   stringi_1.4.3    pillar_1.4.2    
  #> [33] compiler_3.6.0   backports_1.1.5  magick_2.2       pkgconfig_2.0.3

Scanning documents with Tesseract with HOCR

Han Oostdijk

2019/12/07

Date last run: 07Dec2019

Introduction

Including `cleanup_bw`, `scan_with_hocr` and `extract_table` in package

Example

Read the image from file

Cleanup the file

OCR the cleansed image

Prepare for extracting the table

Extract the table

Alternative for the extract

Session Info

Scanning documents with Tesseract with HOCR

Han Oostdijk

2019/12/07

Date last run: 07Dec2019

Introduction

Including cleanup_bw, scan_with_hocr and extract_table in package

Example

Read the image from file

Cleanup the file

OCR the cleansed image

Prepare for extracting the table

Extract the table

Alternative for the extract

Session Info

Including `cleanup_bw`, `scan_with_hocr` and `extract_table` in package