Using RSelenium to scrape a webpage

Han Oostdijk

2020/03/14

Date last run: 14Mar2020

In an RStudio Community message the question was raised how to retrieve a table from a webpage when that table is generated by JavaScript. The problem was that the page source did not contain the table itself but only a reference to the JavaScript code that builds it. Because I was busy with a similar project, I decided to see if I could solve it. A suggested solution was described in a Stack Overflow entry, but it worked neither for the questioner nor for me. In that entry camile mentioned Selenium.
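To make the problem concrete, here is a minimal sketch of why a plain fetch finds no table. Both HTML strings are hand-written stand-ins (the script name `search.js` and the `results` container are assumptions, not the actual source of the EPPO page): the first mimics what the server sends, the second what the browser holds after the script has run.

```r
library(xml2)
library(rvest)

# Stand-in for the page as delivered by the server: an empty
# container and a reference to a script (hypothetical names).
static_html <- '<html><body>
  <div id="results"></div>
  <script src="search.js"></script>
</body></html>'

# Stand-in for the same page after a browser has executed the script.
rendered_html <- '<html><body>
  <div id="results">
    <table>
      <tr><th>EPPOCode</th><th>Name</th></tr>
      <tr><td>SAPETR</td><td>Saperda tridentata</td></tr>
    </table>
  </div>
</body></html>'

# html_table() returns one element per <table> found in the document
n_static   <- length(html_table(read_html(static_html)))    # no table yet
n_rendered <- length(html_table(read_html(rendered_html)))  # table present
```

A tool that only downloads the page sees the first variant, which is why something that actually runs the JavaScript (a real browser driven by Selenium) is needed to reach the second.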

Therefore I decided to use the R package RSelenium. The following code extracts the table. The only problem is that it does not free the port: after running the code it is necessary to restart RStudio. Restarting the R session or closing the R project (when more sessions are open) is not enough to free the port. In my latest experiments I could no longer create this blog entry until I restarted the computer. I am beginning to see the attraction of Docker for these use cases.

HOQCutil::silent_library(c('RSelenium','rvest'))

# start a Selenium server with a Firefox client on port 4567
rD <- rsDriver(browser = 'firefox', port = 4567L, verbose = FALSE)
remDr <- rD[["client"]]

pest.name <- "saperda+tridentata"
url <- paste0("https://gd.eppo.int/search?k=", pest.name)
remDr$navigate(url)

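# switch to the top-level document and parse the rendered page source
remDr$switchToFrame(NULL)
doc <- xml2::read_html(remDr$getPageSource()[[1]])

# take the first table found in the rendered page
df <- rvest::html_table(doc)[[1]]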

# close the browser session
remDr$close()
# stop the selenium server
rD[["server"]]$stop()
#> [1] TRUE
rm(rD)
gc(verbose = FALSE)
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  799500 42.7    1561221 83.4  1134935 60.7
#> Vcells 1421543 10.9    8388608 64.0  2309339 17.7
# the port is still in use (it becomes available again only after an RStudio restart)
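
To check whether the port is really still occupied without restarting anything, one possible approach is to try to connect to it. This is only a sketch in base R: `port_in_use` is a hypothetical helper (it is not part of RSelenium), and it assumes that a successful connection to `localhost` means some process, presumably the Selenium server, is still listening.

```r
# Sketch: TRUE if something accepts connections on the given local port.
# `port_in_use` is a hypothetical helper, not an RSelenium function.
port_in_use <- function(port, host = "localhost", timeout = 1) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, server = FALSE,
                       blocking = TRUE, open = "r+", timeout = timeout)
    ),
    # the connection could not be opened: nothing is listening
    error = function(e) NULL
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# the port used in the code above
in_use <- port_in_use(4567L)
```

Calling `port_in_use(4567L)` right after `rD[["server"]]$stop()` would show whether the situation described above applies on a given machine. It only diagnoses the problem and does not free the port.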

The table:

knitr::kable(df)
|EPPOCode |Name               |Type   |Language   |Preferred |
|:--------|:------------------|:------|:----------|:---------|
|SAPETR   |Saperda tridentata |animal |Scientific |NA        |

Session Info

This document was produced on 14Mar2020 with the following R environment:

  #> R version 3.6.0 (2019-04-26)
  #> Platform: x86_64-w64-mingw32/x64 (64-bit)
  #> Running under: Windows 10 x64 (build 18363)
  #> 
  #> Matrix products: default
  #> 
  #> locale:
  #> [1] LC_COLLATE=English_United States.1252 
  #> [2] LC_CTYPE=English_United States.1252   
  #> [3] LC_MONETARY=English_United States.1252
  #> [4] LC_NUMERIC=C                          
  #> [5] LC_TIME=English_United States.1252    
  #> 
  #> attached base packages:
  #> [1] stats     graphics  grDevices utils     datasets  methods   base     
  #> 
  #> other attached packages:
  #> [1] rvest_0.3.5     xml2_1.2.5      RSelenium_1.7.7
  #> 
  #> loaded via a namespace (and not attached):
  #>  [1] Rcpp_1.0.3       knitr_1.28       magrittr_1.5     rappdirs_0.3.1  
  #>  [5] HOQCutil_0.1.19  R6_2.4.1         rlang_0.4.5      highr_0.8       
  #>  [9] httr_1.4.1       stringr_1.4.0    caTools_1.17.1.1 tools_3.6.0     
  #> [13] xfun_0.10        binman_0.1.1     selectr_0.4-1    semver_0.2.0    
  #> [17] htmltools_0.4.0  askpass_1.1      yaml_2.2.0       openssl_1.4.1   
  #> [21] digest_0.6.23    assertthat_0.2.1 processx_3.4.1   purrr_0.3.3     
  #> [25] ps_1.3.0         bitops_1.0-6     curl_4.3         glue_1.3.1      
  #> [29] evaluate_0.14    wdman_0.2.5      rmarkdown_2.1    stringi_1.4.6   
  #> [33] compiler_3.6.0   XML_3.98-1.20    jsonlite_1.6.1