Date last run: 19Mar2020
In a previous blog entry I showed how to scrape a page that uses JavaScript to fill a table. Doing that, I realized that this also explains the difference I saw between two ways of saving the html of a page:
- display the page in the (Firefox) browser and use File | Save As
- load the page with xml2::read_html and save it with xml2::write_html
In the first case the browser executes the JavaScript code; in the second case the JavaScript code is not executed. To get it executed I turned to the R package RSelenium.
In this blog entry I read a webpage both with and without using RSelenium. I then use the xml2 package to extract the links from the page. The number of links differs considerably.
Load the packages we will use
HOQCutil::silent_library(c('xml2','magrittr','purrr','dplyr','tibble','RSelenium'))
Define function get_data
The function get_data retrieves an html document from the internet or from a local file, depending on the argument use_cache :
- use_cache == T: the html document is read from the local file with name filename
- use_cache == F: the html document is read from the internet resource named url. In this case the html document is written to the local file with name filename (when specified). Furthermore:
  - if rD == NULL (i.e. is.null(rD) == T): the normal xml2::read_html is used
  - if rD != NULL (i.e. is.null(rD) != T) and rD is a rsClientServer object: the package RSelenium is used to read the page
get_data <-
  function(filename = NULL,
           use_cache = F,
           url = NULL,
           rD = NULL) {
    # read the url directly or, when rD is given, via RSelenium
    read_html2 <- function(url) {
      if (is.null(rD)) {
        # plain read: the JavaScript code is not executed
        xml2::read_html(url)
      } else if ("rsClientServer" %in% class(rD)) {
        # let the browser render the page (JavaScript code is executed)
        remDr <- rD[["client"]]
        remDr$navigate(url)
        remDr$switchToFrame(NULL)
        doc = xml2::read_html(remDr$getPageSource()[[1]])
        # release the RSelenium resources
        remDr$close()
        rD[["server"]]$stop() 
        doc
      } else {
        doc = NULL
      }
    }
    if (use_cache == F) {
      doc = read_html2(url)
      if (!is.null(filename)) {
        # keep a local copy for later use with use_cache == T
        xml2::write_html(doc, filename)
      }
    } else {
      doc = xml2::read_html(filename)
    }
    doc
  }
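As an illustration of the caching mechanism, a minimal sketch (the file name 'gt_cache.html' is just an example, any path will do):
# first run: read the page live and keep a local copy
doc <- get_data(filename = 'gt_cache.html', use_cache = F,
                url = 'http://www.globaltimes.cn/', rD = NULL)
# later runs: reread the local copy, no internet access needed
doc <- get_data(filename = 'gt_cache.html', use_cache = T)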
Define function unpack_a
This function extracts the ‘description’ and ‘link’ parts of an anchor element.
From the element <a class="latest-column" href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a> it will extract ‘VIEWPOINT’ and ‘https://www.globaltimes.cn/opinion/viewpoint/’.
unpack_a <- function (doc) {
  if (length(doc) > 0) {
    c1 = xml2::xml_text(doc)         # the 'description' of the anchor(s)
    c2 = xml2::xml_attr(doc, 'href') # the 'link' of the anchor(s)
    # a missing text or attribute yields NA (not NULL), so test with is.na
    c1 = ifelse(is.na(c1), '', c1)
    c2 = ifelse(is.na(c2), '', c2)
  } else {
    c1 = ''
    c2 = ''
  }
  tibble::tibble(c1 = c1, c2 = c2)
}
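As a quick check we can apply unpack_a to the example element from above, constructed in memory (this fragment is made up for the illustration):
frag <- xml2::read_html(paste0(
  '<a class="latest-column" ',
  'href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a>'))
# expected: a one-row tibble with c1 = "VIEWPOINT" and
#   c2 = "https://www.globaltimes.cn/opinion/viewpoint/"
unpack_a(xml2::xml_find_all(frag, '//a'))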
Retrieve a page without RSelenium
doc1   = get_data(filename = NULL, use_cache = F,
                  url = 'http://www.globaltimes.cn/', rD = NULL)
doc1_a = xml2::xml_find_all(doc1, "//a")
xx1    = purrr::map_dfr(doc1_a,unpack_a) 
dim(xx1)
#> [1] 46  2
print(xx1)
#> # A tibble: 46 x 2
#>    c1                                 c2                                        
#>    <chr>                              <chr>                                     
#>  1 ""                                 https://www.globaltimes.cn/               
#>  2 " "                                https://www.globaltimes.cn//special-cover~
#>  3 "OP-ED"                            https://www.globaltimes.cn/opinion/       
#>  4 "EDITORIAL"                        https://www.globaltimes.cn/opinion/editor~
#>  5 "US government loses decency unde~ https://www.globaltimes.cn/content/118319~
#>  6 "OBSERVER"                         https://www.globaltimes.cn/opinion/observ~
#>  7 "Trump puts Asian Americans at ri~ https://www.globaltimes.cn/content/118319~
#>  8 "VIEWPOINT"                        https://www.globaltimes.cn/opinion/viewpo~
#>  9 "Germans hear little of China's a~ https://www.globaltimes.cn/content/118314~
#> 10 "GT VOICE"                         https://www.globaltimes.cn/source/GT-Voic~
#> # ... with 36 more rows
We see that in this way we retrieve only 46 anchor elements.
Retrieve a page with RSelenium
The only difference with the previous section is the creation of the rD object and its use in the get_data call. Afterwards we try to get rid of the RSelenium resources. This succeeds, with the annoying exception of the port number (in this case 4568), which is not freed (as described in
Using RSelenium to scrape a webpage).
selenium_port = 4568L
rD = rsDriver(browser = 'firefox', port = selenium_port, verbose = F) 
doc2   = get_data(filename = NULL, use_cache = F,
                  url = 'http://www.globaltimes.cn/', rD = rD)
doc2_a = xml2::xml_find_all(doc2, "//a")
xx2    = purrr::map_dfr(doc2_a,unpack_a)
dim(xx2)
#> [1] 467   2
print(xx2)
#> # A tibble: 467 x 2
#>    c1                                          c2                               
#>    <chr>                                       <chr>                            
#>  1 " "                                         https://www.globaltimes.cn/      
#>  2 "\n    "                                    https://www.globaltimes.cn/      
#>  3 " \n       \n       load_file(\"/includes/~ https://www.globaltimes.cn/      
#>  4 " "                                         https://www.globaltimes.cn/      
#>  5 "E-Paper"                                   http://epaper.globaltimes.cn     
#>  6 "Mobile"                                    http://mobile.globaltimes.cn     
#>  7 "Apps"                                      http://www.globaltimes.cn/conten~
#>  8 "Sina Weibo"                                http://weibo.com/globaltimescn/  
#>  9 "Facebook"                                  http://www.facebook.com/globalti~
#> 10 "Twitter"                                   https://twitter.com/globaltimesn~
#> # ... with 457 more rows
rm(rD)
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  868429 46.4    1743240 93.1  1164778 62.3
#> Vcells 1528778 11.7    8388608 64.0  2363835 18.1
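If the port must be reused in the same session, a blunt workaround (my assumption, not part of the run above) is to kill the Java process that hosts the Selenium server. On Windows:
# kills ALL running java.exe processes, so use with care
system('taskkill /im java.exe /f')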
We see that in this way we retrieve 467 anchor elements, about ten times as many as without RSelenium.
Summary
With little additional effort one can extract more data (possibly the whole page) from a webpage by using the RSelenium package.
Session Info
This document was produced on 19Mar2020 with the following R environment:
  #> R version 3.6.0 (2019-04-26)
  #> Platform: x86_64-w64-mingw32/x64 (64-bit)
  #> Running under: Windows 10 x64 (build 18363)
  #> 
  #> Matrix products: default
  #> 
  #> locale:
  #> [1] LC_COLLATE=English_United States.1252 
  #> [2] LC_CTYPE=English_United States.1252   
  #> [3] LC_MONETARY=English_United States.1252
  #> [4] LC_NUMERIC=C                          
  #> [5] LC_TIME=English_United States.1252    
  #> 
  #> attached base packages:
  #> [1] stats     graphics  grDevices utils     datasets  methods   base     
  #> 
  #> other attached packages:
  #> [1] RSelenium_1.7.7 tibble_2.1.3    dplyr_0.8.5     purrr_0.3.3    
  #> [5] magrittr_1.5    xml2_1.2.5     
  #> 
  #> loaded via a namespace (and not attached):
  #>  [1] Rcpp_1.0.4       pillar_1.4.3     compiler_3.6.0   bitops_1.0-6    
  #>  [5] tools_3.6.0      digest_0.6.25    jsonlite_1.6.1   evaluate_0.14   
  #>  [9] pkgconfig_2.0.3  rlang_0.4.5      cli_2.0.2        curl_4.3        
  #> [13] yaml_2.2.0       xfun_0.10        binman_0.1.1     httr_1.4.1      
  #> [17] stringr_1.4.0    knitr_1.28       rappdirs_0.3.1   vctrs_0.2.4     
  #> [21] askpass_1.1      caTools_1.17.1.1 tidyselect_1.0.0 glue_1.3.2      
  #> [25] R6_2.4.1         processx_3.4.1   fansi_0.4.1      XML_3.98-1.20   
  #> [29] rmarkdown_2.1    semver_0.2.0     ps_1.3.0         htmltools_0.4.0 
  #> [33] assertthat_0.2.1 utf8_1.1.4       stringi_1.4.6    openssl_1.4.1   
  #> [37] HOQCutil_0.1.19  wdman_0.2.5      crayon_1.3.4