Scanning documents with Tesseract after preprocessing with Magick (again)

Han Oostdijk

2019/08/28

Date last run: 15Sep2019

Since I last posted about this subject, new versions of the magick and tesseract packages have become available. I will therefore redo the previous analysis. In particular, I want to see whether I can use the Tesseract 4 engine with a whitelist.

Prepare for scanning (OCR)

filename = 'uitslag1.png'
img      = magick::image_read(filename) 
magick::image_info(img)
#> # A tibble: 1 x 7
#>   format width height colorspace matte filesize density
#>   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
#> 1 PNG     1145    374 sRGB       TRUE     27658 38x38

Figure 1: part of image

Do the scan (OCR)

The magick::image_ocr function uses the tesseract package to do the actual scan. According to the vignette: “The package provides R bindings for Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages.” In this post I use only the standard English language (‘eng’).

Use Tesseract 4 engine

Triggered by the underscore (see the previous post), I tried to find a way to tell the OCR engine which characters it should recognize. This ‘whitelist’ functionality is now available again with the Tesseract 4 engine, so it is no longer necessary to specify tessedit_ocr_engine_mode='0' (which selects the legacy engine).

In the previous post we started with magick::image_ocr(img,language='nld') and magick::image_ocr(img,language='eng'), which call the tesseract::ocr function with the default engine (Tesseract 4) and the given language. Now we will call the tesseract functions explicitly: we define the engine engine4 with the tesseract::tesseract function and also specify (redundantly) the datapath to the version 4 files.

tesseract4 = "C:\\Users\\Han\\AppData\\Local\\tesseract4\\tesseract4\\tessdata"
whitelist  = "abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -()',.</"
tess_opts  = list(tessedit_char_whitelist = whitelist)
engine4    = tesseract::tesseract(language='eng', datapath = tesseract4, options = tess_opts) 

`%>%` = magrittr::`%>%`
txt        = tesseract::ocr(img, engine = engine4) %>%
       stringr::str_split_fixed(., '\n', Inf) %>%
       as.character(.)

stringr::str_sub(txt[5],61,-1)  # (part of) line 5 scanned with eng engine
#> [1] "8 25-7-18 12-11-18 243-19 27-5-19"
stringr::str_sub(txt[6],51,-1)  # (part of) line 6 scanned with eng engine
#> [1] " 1,25 114 113 0,87"
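
Because the whitelist is typed by hand, a quick sanity check can help before it is passed to the engine. The following is a minimal base R sketch, not part of the original analysis. The set of `expected` characters is my own assumption about what this document needs (letters, digits and the separators that occur in the dates and decimal numbers).

```r
# Whitelist copied from the post
whitelist <- "abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -()',.</"

# Split the string into single characters
wl_chars <- strsplit(whitelist, "", fixed = TRUE)[[1]]

# Assumption: the characters we expect to need for this document
expected <- c(letters, LETTERS, as.character(0:9), "-", ",", " ")

missing_chars    <- setdiff(expected, wl_chars)      # expected but absent
duplicated_chars <- wl_chars[duplicated(wl_chars)]   # typed more than once

length(missing_chars)      # 0 means nothing is missing
length(duplicated_chars)   # 0 means no character was typed twice
```

Only membership is tested here, so the swapped `ts` in the lowercase run does not affect the outcome of this check.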

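We see that this is still not good enough when compared with Figure 1: the date 24-3-19 is read as 243-19, and the values 1,14 and 1,13 lose their decimal comma. Just as in the earlier post, we need to improve the results by preprocessing the image before doing the actual scan.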
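
Spotting such errors by eye does not scale to many documents. As a rough sketch (not something the original analysis did), the scanned tokens could be tested against the formats we expect. The two patterns and the helper functions `tokens` and `suspicious` are my own assumptions for illustration: d-m-yy dates, and numbers with a decimal comma and two decimals.

```r
# Output of the first scan, copied from the post
dates_line   <- "8 25-7-18 12-11-18 243-19 27-5-19"
numbers_line <- " 1,25 114 113 0,87"

# Split a line into tokens on runs of whitespace, dropping empty tokens
tokens <- function(line) {
  parts <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  parts[nzchar(parts)]
}

# Assumed formats: d-m-yy dates and numbers with a decimal comma
date_pattern   <- "^[0-9]{1,2}-[0-9]{1,2}-[0-9]{2}$"
number_pattern <- "^[0-9]+,[0-9]{2}$"

# Return the tokens that do not match the expected pattern
suspicious <- function(line, pattern) {
  tk <- tokens(line)
  tk[!grepl(pattern, tk)]
}

suspicious(dates_line, date_pattern)
suspicious(numbers_line, number_pattern)
```

For these two lines the check flags `8` and `243-19` among the dates, and `114` and `113` among the numbers. The `8` is presumably the tail of a preceding column that was cut off by str_sub, so a flagged token still needs a human look. A pattern check like this only tests the shape of a token, not whether the value is right.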

Preprocess the image

The package magick has a lot of functions to handle images. I combined some of these in a function to improve the readability of the image. Because I did not know beforehand which of them I would use, I parametrised the function with a specification list.

# Clean up an image for OCR; myoptions is a list of optional settings
clean_up <- function (img,myoptions) {
	force(myoptions)
	# optionally trim the borders (fuzz = tolerance for the border color)
	if (!is.null(myoptions$trim)) {
		img = magick::image_trim(img,fuzz = myoptions$trim)
	}
	# optionally resize, e.g. "4000x" for a width of 4000 pixels
	if (!is.null(myoptions$resize)) {
		img = magick::image_resize(img,myoptions$resize)
	}
	# brightness, saturation and hue are percentages: 100 means no change
	if (!is.null(myoptions$brightness)) {
		brightness = myoptions$brightness
	} else {
		brightness = 100
	}
	if (!is.null(myoptions$saturation)) {
		saturation = myoptions$saturation
	} else {
		saturation = 100
	}
	if (!is.null(myoptions$hue)) {
		hue = myoptions$hue
	} else {
		hue = 100
	}
	img = magick::image_modulate(img,
		brightness=brightness, saturation=saturation, hue=hue)
	# optionally increase the contrast
	if (!is.null(myoptions$sharpen))  {
		img = magick::image_contrast(img,sharpen=myoptions$sharpen)
	}
	# turn near-white pixels into pure white
	img = magick::image_background(
		magick::image_transparent(img, 'white', fuzz = 25), 'white')
	# convert to gray
	img = magick::image_quantize(img,colorspace ="gray")
	# turn near-black pixels into pure black
	img = magick::image_background(
		magick::image_transparent(img, 'black', fuzz =75), 'black')
	# optionally apply the enhance filter (only when enhance is TRUE)
	if (isTRUE(myoptions$enhance)) {
		img = magick::image_enhance(img)
	}
	img
}
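
The three if/else blocks for brightness, saturation and hue all follow the same "use the option if given, otherwise a default" pattern. A more compact way to express that, shown here only as a sketch, is to merge the user's list over a list of defaults with utils::modifyList. The helper name `with_defaults` is hypothetical and not part of the function above.

```r
# Hypothetical helper: merge user-supplied options over a list of defaults
with_defaults <- function(myoptions,
                          defaults = list(brightness = 100,
                                          saturation = 100,
                                          hue        = 100)) {
  utils::modifyList(defaults, myoptions)
}

opts <- with_defaults(list(resize = "4000x", brightness = 120))

opts$brightness      # value supplied by the caller
opts$hue             # default value
opts$resize          # other options are passed through unchanged
is.null(opts$trim)   # options that were not supplied stay NULL
```

One thing to keep in mind with this approach: an option that is explicitly set to NULL in the user's list removes the corresponding default instead of keeping it.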

Clean the image with the following parameters

# convert_type is not read by clean_up (it always converts to gray)
clean_options = list(resize="4000x",convert_type='Grayscale',
	trim=10,enhance=TRUE,sharpen=1)
img2 = clean_up(img, clean_options)
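
To get a feeling for what resize="4000x" does: the geometry string asks for a width of 4000 pixels and leaves the height to follow from the aspect ratio. The sketch below works out the approximate enlargement from the dimensions reported by magick::image_info earlier. It ignores the trim step, which runs first in clean_up and changes the dimensions somewhat, so the numbers are only indicative.

```r
# Dimensions reported by magick::image_info(img) earlier in the post
width  <- 1145
height <- 374

# "4000x" requests a width of 4000 pixels, aspect ratio preserved
target_width <- 4000
scale_factor <- target_width / width

# Approximate height after resizing (assumption: trimming is ignored)
new_height <- round(height * scale_factor)

scale_factor   # roughly 3.5
new_height     # 1307
```

So the image is enlarged by a factor of about 3.5 in both directions before it is scanned.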

The relevant part of the image then looks like Figure 2 :

Figure 2: part of image (cleansed)

Scan the cleansed image with Tesseract 4

txt        = tesseract::ocr(img2, engine = engine4) %>%
       stringr::str_split_fixed(., '\n', Inf) %>%
       as.character(.)

stringr::str_sub(txt[5],63,-1)  # line 5 scanned with eng engine
#> [1] " 25-17-18 12-11-18 24-3-19 27-5-19"
stringr::str_sub(txt[6],51,-1)  # line 6 scanned with eng engine
#> [1] " 1,25 1,14 1,13 0,87"

All characters are now correctly converted. For privacy reasons only about 25% of the document is shown, but all dates and numbers were converted correctly for this document (and for 16 others with the same characteristics).

Conclusion

It was necessary to clean the image to get a good scan.
However, there is no need to fall back to an older version of Tesseract to use a whitelist.

Session Info

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] HOQCutil_0.1.10 jsonlite_1.6    glue_1.3.1      purrr_0.3.2    
#>  [5] xml2_1.2.2      ggspatial_1.0.3 ggplot2_3.2.1   sf_0.7-7       
#>  [9] dplyr_0.8.3     stringr_1.4.0   osmdata_0.1.1  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.2         lubridate_1.7.4    lattice_0.20-38   
#>  [4] tidyr_1.0.0        png_0.1-7          class_7.3-15      
#>  [7] assertthat_0.2.1   zeallot_0.1.0      digest_0.6.20     
#> [10] utf8_1.1.4         R6_2.4.0           cellranger_1.1.0  
#> [13] plyr_1.8.4         backports_1.1.4    evaluate_0.14     
#> [16] e1071_1.7-0        httr_1.4.1         highr_0.8         
#> [19] blogdown_0.15      pillar_1.4.2       rlang_0.4.0       
#> [22] lazyeval_0.2.1     curl_4.0           readxl_1.3.1      
#> [25] magick_2.2         rmarkdown_1.15     rgdal_1.4-4       
#> [28] munsell_0.5.0      rosm_0.2.5         compiler_3.6.0    
#> [31] xfun_0.8           pkgconfig_2.0.2    prettymapr_0.2.2  
#> [34] htmltools_0.3.6    tidyselect_0.2.5   tibble_2.1.3      
#> [37] fansi_0.4.0        crayon_1.3.4       withr_2.1.2       
#> [40] rappdirs_0.3.1     grid_3.6.0         gtable_0.3.0      
#> [43] lifecycle_0.1.0    DBI_1.0.0          magrittr_1.5      
#> [46] units_0.6-2        scales_1.0.0       KernSmooth_2.23-15
#> [49] cli_1.1.0          stringi_1.4.3      fs_1.3.1          
#> [52] sp_1.3-1           vctrs_0.2.0        captioner_2.2.3   
#> [55] tools_3.6.0        tesseract_4.1      colorspace_1.4-1  
#> [58] classInt_0.3-3     rvest_0.3.4        knitr_1.24