Scanning documents with Tesseract after preprocessing with Magick

Han Oostdijk

2019/06/28

Date last run: 27Aug2019

From a website I could download the results of various laboratory tests on blood samples. Apart from the latest result each document also lists the previous ones.

The best way I could find to download these documents was to use the Microsoft Screenshot Snipping Tool and save the snip as a png file.

In the remainder I describe how I handled these files with the R packages magick and tesseract. For details, see the reference manual and the vignette of each package (magick and tesseract).

Scanning with Magick

The easiest way to extract the test results from the png file is to use the image_ocr function of the magick package. This function takes a magick image as input, so we start by reading the png file into an image object.

Prepare for scanning (OCR)

filename = 'uitslag1.png'
img      = magick::image_read(filename) 

The print function shows the image in a graphical window and also prints its characteristics. The latter can also be requested with the magick::image_info function.

# print(img)
magick::image_info(img)
#>   format width height colorspace matte filesize density
#> 1    PNG  1145    374       sRGB  TRUE    27658   38x38

To give an impression of the image, and to be able to show what initially went wrong while scanning it, we show part of the image:

Figure 1: part of image

Do the scan (OCR)

The magick::image_ocr function uses the tesseract package to do the actual scan. According to the vignette: “The package provides R bindings for Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages.” Because I recently installed Tesseract with the language Dutch (‘nld’) in addition to the standard language English (‘eng’), I tried both languages.

`%>%` = magrittr::`%>%`
txt1 = magick::image_ocr(img,language='nld') %>%
       stringr::str_split_fixed(., '\n', Inf) %>%
       as.character(.)

txt2 = magick::image_ocr(img,language='eng')  %>%
       stringr::str_split_fixed(., '\n', Inf) %>%
       as.character(.)
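
Before requesting a language other than ‘eng’ it can help to check that its training data is actually installed. The sketch below is not part of the original workflow: missing_languages is a hypothetical helper, the list of installed languages in the example is made up, and the commented lines assume the tesseract package functions tesseract_info and tesseract_download.

```r
# Hypothetical helper (not from the original post): which of the wanted
# languages are absent from the installed training data?
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

# Example with a made-up list of installed languages
missing_languages(c("eng", "nld"), available = c("eng", "osd"))
#> [1] "nld"

# With the tesseract package installed, the installed languages can be
# queried and a missing one downloaded (not run here):
# available <- tesseract::tesseract_info()$available
# for (lang in missing_languages(c("eng", "nld"), available)) {
#   tesseract::tesseract_download(lang)
# }
```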

Compare the result of the engines

In the next code section I show the fifth line (dates) and the sixth line (test results) because these contain the information I am interested in. Figure 1 shows part of lines 3 (without relevant data), 5 and 6.

c( 
  stringr::str_sub(txt1[5],63,-1),  # (part of) line 5 scanned with nld engine
  stringr::str_sub(txt2[5],62,-1) ) # (part of) line 5 scanned with eng engine
#> [1] " 25-71-18 12-11-18 24-3-19 27-519" " 25-7-18 12-11-18 243-19 27-5-19"
c(
  stringr::str_sub(txt1[6],51,-1),  # (part of) line 6 scanned with nld engine
  stringr::str_sub(txt2[6],51,-1) ) # (part of) line 6 scanned with eng engine
#> [1] " 1,25 114 113 _ 0,87" " 1,25 114 113 0,87"

So we see that the engines for both languages produce results that differ from the image: e.g. ‘1,14’ is read as ‘114’. We also see that the two engines deliver different outcomes: see the first date (‘25-71-18’ versus ‘25-7-18’), the last two dates (‘24-3-19’ versus ‘243-19’ and ‘27-519’ versus ‘27-5-19’) and the underscore character in the ‘nld’ scan of the sixth line.
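
Differences like these are easier to spot when the two results are put side by side token by token. The following base R sketch does this for line 5, with the strings copied from the output above. The helper name tokens is my own, and the comparison assumes that both lines split into the same number of tokens (which is not the case for line 6, where the ‘nld’ scan has an extra underscore).

```r
# OCR output for line 5, copied from the results above
nld <- " 25-71-18 12-11-18 24-3-19 27-519"
eng <- " 25-7-18 12-11-18 243-19 27-5-19"

# Split a line into whitespace-separated tokens
tokens <- function(x) strsplit(trimws(x), "\\s+")[[1]]

# Side-by-side comparison; assumes an equal number of tokens in both lines
cmp <- data.frame(nld = tokens(nld), eng = tokens(eng),
                  stringsAsFactors = FALSE)
cmp$same <- cmp$nld == cmp$eng
cmp
```

Only the second date (‘12-11-18’) is read identically by both engines.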

Use Tesseract 3 engine

Triggered by the underscore I tried to find a way to tell the OCR engine which characters it should recognize. This ‘whitelist’ functionality is indeed available, but no longer with the default setting of the current version of Tesseract. In Tesseract 4, however, one can make the older functionality usable by setting tessedit_ocr_engine_mode='0', which selects the older (Tesseract 3) engine.

Note that in that case one should use the version 3 training data. For that version I have no ‘nld’ training data available, so I will use the default ‘eng’ data. NB. training data for all languages and versions can be downloaded from the Tesseract Wiki. In the variables tesseract3 and tesseract4 I specify the location of the folder for the training data for the two versions.
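
A whitelist is simply a string of allowed characters. As a small illustration (not from the original post; outside_whitelist is a hypothetical helper), the following base R sketch assembles such a string and lists the characters in an OCR result that fall outside it. Applied to the ‘nld’ scan of line 6 it flags the underscore.

```r
# Build the whitelist from letters, digits and a few punctuation marks
whitelist <- paste0(
  c(letters, LETTERS, 0:9, " ", "-", "(", ")", "'", ",", ".", "<", "/"),
  collapse = "")

# Hypothetical helper: characters of txt that are not in the allowed set
outside_whitelist <- function(txt, allowed) {
  chars <- strsplit(txt, "")[[1]]
  unique(chars[!chars %in% strsplit(allowed, "")[[1]]])
}

outside_whitelist(" 1,25 114 113 _ 0,87", whitelist)
#> [1] "_"
```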

Up to now we used magick::image_ocr(img,language='nld') and magick::image_ocr(img,language='eng'), which call the tesseract::ocr function with the default engine (Tesseract 4) and the given language. Now we will use the Tesseract functions explicitly. First we define the engines engine3 and engine4 for the two Tesseract versions (with the tesseract::tesseract function). Then we use engine3 to OCR the image (with the tesseract::ocr function). engine4 will be used later on.

tesseract3 = "C:\\Users\\Han\\AppData\\Local\\tesseract\\tesseract\\tessdata"
tesseract4 = "C:\\Users\\Han\\AppData\\Local\\tesseract4\\tesseract4\\tessdata"

whitelist  = "abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -()',.</"
tess_opts  = list(tessedit_ocr_engine_mode='0',tessedit_char_whitelist = whitelist)
engine3    = tesseract::tesseract(language='eng', datapath = tesseract3, options = tess_opts)
engine4    = tesseract::tesseract(language='eng', datapath = tesseract4)  

txt        = tesseract::ocr(img, engine = engine3) %>%
       stringr::str_split_fixed(., '\n', Inf) %>%
       as.character(.)

stringr::str_sub(txt[5],61,-1)  # (part of) line 5 scanned with eng engine
#> [1] "7577718 12711718 2473719 2775719"
stringr::str_sub(txt[6],51,-1)  # (part of) line 6 scanned with eng engine
#> [1] " 1,25 1,14 1,13 0,37"

We see that this is no improvement on the Tesseract 4 result. Comparing with Figure 1 we see that ‘-’ in the dates is read as ‘7’ and some ‘8’ characters are scanned as ‘3’. We will try to improve the results by preprocessing the image before doing the actual scan.
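
Misreads like these can partly be detected automatically, because the dates and test results have a fixed format. The base R sketch below is an illustration and not part of the original workflow: is_valid_date and is_valid_value are hypothetical helpers, and I assume dates of the form day-month-year with a two-digit year and values with a decimal comma.

```r
# Hypothetical validators (not from the original post)

# A date token is accepted if it looks like d-m-yy and parses as a real date
is_valid_date <- function(x) {
  grepl("^[0-9]{1,2}-[0-9]{1,2}-[0-9]{2}$", x) &
    !is.na(as.Date(x, format = "%d-%m-%y"))
}

# A result token is accepted if it is a number with a decimal comma
is_valid_value <- function(x) grepl("^[0-9]+,[0-9]+$", x)

is_valid_date(c("25-7-18", "25-71-18", "243-19", "2473719"))
#> [1]  TRUE FALSE FALSE FALSE
is_valid_value(c("1,25", "114", "0,87"))
#> [1]  TRUE FALSE  TRUE
```

Such checks only catch format errors: a misread like ‘0,37’ for ‘0,87’ still looks like a valid value and passes unnoticed.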

Preprocess the image

The package magick has a lot of functions to handle images. I combined some of these in a function to improve the readability of the image. Because I did not know beforehand which of them I would use, I parametrised the function with a specification list.

clean_up <- function (img,myoptions) {
	force(myoptions)
	if (!is.null(myoptions$trim)) {
		img = magick::image_trim(img,fuzz = myoptions$trim)
	}
	if (!is.null(myoptions$resize)) {
		img = magick::image_resize(img,myoptions$resize)
	}
	if (!is.null(myoptions$brightness)) {
		brightness = myoptions$brightness
	} else {
		brightness = 100
	}
	if (!is.null(myoptions$saturation)) {
		saturation = myoptions$saturation
	} else {
		saturation = 100
	}
	if (!is.null(myoptions$hue)) {
		hue = myoptions$hue
	} else {
		hue = 100
	}
	img = magick::image_modulate(img,
		brightness=brightness, saturation=saturation, hue=hue)
	if (!is.null(myoptions$sharpen))  {
		img = magick::image_contrast(img,sharpen=myoptions$sharpen)
	}
	img = magick::image_background(
		magick::image_transparent(img, 'white', fuzz = 25), 'white')
	img = magick::image_quantize(img,colorspace ="gray")
	img = magick::image_background(
		magick::image_transparent(img, 'black', fuzz =75), 'black')
	if ( (!is.null(myoptions$enhance)) && myoptions$enhance == TRUE) {
		img = magick::image_enhance(img)
	}
	img
}
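
The three if/else blocks that supply the defaults for image_modulate could be written more compactly. One possible sketch, assuming the same default of 100 for each setting, merges the caller's list into a list of defaults with the base R function modifyList. The name with_defaults is hypothetical and not part of the function above.

```r
# Hypothetical helper: fill in defaults for options the caller did not set
with_defaults <- function(myoptions) {
  defaults <- list(brightness = 100, saturation = 100, hue = 100)
  utils::modifyList(defaults, myoptions)
}

opts <- with_defaults(list(resize = "4000x", trim = 10, brightness = 120))
opts$brightness  # the caller's value wins
#> [1] 120
opts$hue         # the default is filled in
#> [1] 100
```

Options without a default, such as resize and trim, are passed through unchanged, so the is.null checks for them keep working.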

Clean the image with the following parameters

# NB. convert_type is not read by clean_up: the conversion to gray is always done
clean_options = list(resize="4000x",convert_type='Grayscale',
	trim=10,enhance=TRUE,sharpen=1)
img2 = clean_up(img, clean_options)

The relevant part of the image then looks like Figure 2 :

Figure 2: part of image (cleansed)

Scan the cleansed image with Tesseract 3

txt        = tesseract::ocr(img2, engine = engine3) %>%
       stringr::str_split_fixed(., '\n', Inf) %>%
       as.character(.)

stringr::str_sub(txt[5],63,-1)  # line 5 scanned with eng engine
#> [1] " 25-7-18 12-11-18 24-3-19 27-5-19"
stringr::str_sub(txt[6],51,-1)  # line 6 scanned with eng engine
#> [1] " 1,25 1,14 1,13 0,87"

All characters are now correctly converted. For privacy reasons only about 25% of the document was shown but all dates and numbers were converted correctly for this document (and 16 others with the same characteristics).
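
Once the text is read correctly, it can be converted to dates and numbers. A minimal base R sketch with the two strings copied from the output above, assuming the dates are in day-month-year order with a two-digit year and the values use a decimal comma (the names tokens and results are my own):

```r
# OCR output copied from the results above
dates_txt  <- " 25-7-18 12-11-18 24-3-19 27-5-19"
values_txt <- " 1,25 1,14 1,13 0,87"

# Split a line into whitespace-separated tokens
tokens <- function(x) strsplit(trimws(x), "\\s+")[[1]]

results <- data.frame(
  date  = as.Date(tokens(dates_txt), format = "%d-%m-%y"),
  # replace the decimal comma by a point before converting to numeric
  value = as.numeric(sub(",", ".", tokens(values_txt), fixed = TRUE))
)
results
```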

Scan the cleansed image with Tesseract 4

Could this be done with Tesseract 4?

txt        = tesseract::ocr(img2, engine = engine4) %>%
       stringr::str_split_fixed(., '\n', Inf) %>%
       as.character(.)

stringr::str_sub(txt[5],63,-1)  # line 5 scanned with eng engine
#> [1] " 25-17-18 12-11-18 24-3-19 27-5-19"
stringr::str_sub(txt[6],51,-1)  # line 6 scanned with eng engine
#> [1] " 1,25 1,14 1,13 0,87"

We see that for this document Tesseract 4 gives almost the same result as Tesseract 3: in the output shown here only the first date differs (‘25-17-18’ instead of ‘25-7-18’). However, not all other documents could be processed without errors.

Conclusion

It was necessary to clean the image to get a good scan of the image.

In this case the two versions gave nearly the same result on the cleansed image (in the output shown above Tesseract 4 misread one date). In tests with documents of other blood tests, the Tesseract 4 engine sometimes gave a small number of errors. That is why I will work with the combination of the cleaning function and Tesseract 3 for this set of documents.
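
Applying the same steps to a whole set of documents could look like the sketch below. It is an illustration only: ocr_files is a hypothetical helper and the folder name in the commented usage is a placeholder. The read, clean and OCR steps are passed in as functions, so other engines or cleaning options can be tried without changing the helper.

```r
# Hypothetical batch helper (not from the original post).
# The default reader and OCR function are only evaluated when they are
# actually used, so replacements can be supplied for any of the steps.
ocr_files <- function(files,
                      read  = magick::image_read,
                      clean = identity,
                      ocr   = tesseract::ocr) {
  out <- lapply(files, function(f) {
    txt <- ocr(clean(read(f)))
    strsplit(txt, "\n", fixed = TRUE)[[1]]   # one element per text line
  })
  names(out) <- basename(files)
  out
}

# Intended use with the objects defined earlier (not run here;
# "path/to/snips" is a placeholder):
# files <- list.files("path/to/snips", pattern = "\\.png$", full.names = TRUE)
# res   <- ocr_files(files,
#                    clean = function(img) clean_up(img, clean_options),
#                    ocr   = function(img) tesseract::ocr(img, engine = engine3))
```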

Session Info

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.2      digest_0.6.20   rappdirs_0.3.1  magrittr_1.5   
#>  [5] evaluate_0.14   highr_0.8       rlang_0.4.0     stringi_1.4.3  
#>  [9] magick_2.1      captioner_2.2.3 tools_3.6.0     stringr_1.4.0  
#> [13] glue_1.3.1      purrr_0.3.2     xfun_0.8        tesseract_4.1  
#> [17] compiler_3.6.0  knitr_1.24