Scraping links from a webpage

Han Oostdijk

2020/03/16

Date last run: 19Mar2020

In a previous blog entry I showed how to scrape a page that uses JavaScript to fill a table. Doing that, I realized why I saw a difference between two ways of displaying the HTML of a page: when the browser renders the page it executes the JavaScript code, but when the raw HTML is read directly the JavaScript code is not executed. It was only executed once I decided to use the R package RSelenium.

In this blog entry I read a webpage with and without using RSelenium. I then use the xml2 package to extract the links from the page. The number of links found in the two cases differs considerably.

Load the packages we will use

HOQCutil::silent_library(c('xml2','magrittr','purrr','dplyr','tibble','RSelenium'))

Define function get_data

The function get_data will retrieve an HTML document from the internet or from a local file, depending on the argument use_cache:

get_data <-
  function(filename = NULL,
           use_cache = FALSE,
           url = NULL,
           rD = NULL) {
    # read the page directly with xml2 or, when an rsDriver object is
    # passed in rD, via the RSelenium client so that the JavaScript on
    # the page is executed before the page source is read
    read_html2 <- function(url) {
      if (is.null(rD)) {
        xml2::read_html(url)
      } else if ("rsClientServer" %in% class(rD)) {
        remDr <- rD[["client"]]
        remDr$navigate(url)
        remDr$switchToFrame(NULL)
        doc <- xml2::read_html(remDr$getPageSource()[[1]])
        # release the browser session and stop the Selenium server
        remDr$close()
        rD[["server"]]$stop()
        doc
      } else {
        NULL
      }
    }
    if (!use_cache) {
      doc <- read_html2(url)
      if (!is.null(filename)) {
        # save the retrieved page so it can be reused as a cache
        xml2::write_html(doc, filename)
      }
    } else {
      doc <- xml2::read_html(filename)
    }
    doc
  }
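
As an aside (not part of the original run), the use_cache argument can be used to avoid repeated downloads; a minimal sketch, assuming a hypothetical local file name globaltimes.html:

# first run: fetch the page and write it to the cache file
doc <- get_data(filename = 'globaltimes.html', use_cache = FALSE,
                url = 'http://www.globaltimes.cn/')
# later runs: read the cached copy instead of hitting the site again
doc <- get_data(filename = 'globaltimes.html', use_cache = TRUE)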

Define function unpack_a

This function extracts the 'description' and 'link' parts of an anchor element. From the element <a class="latest-column" href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a> it will extract 'VIEWPOINT' and 'https://www.globaltimes.cn/opinion/viewpoint/'.

unpack_a <- function(doc) {
  if (length(doc) > 0) {
    c1 <- xml2::xml_text(doc)          # the text between <a> and </a>
    c2 <- xml2::xml_attr(doc, 'href')  # the link in the href attribute
    # xml_attr returns NA when the attribute is absent, so test for NA
    c1 <- ifelse(is.na(c1), '', c1)
    c2 <- ifelse(is.na(c2), '', c2)
  } else {
    c1 <- ''
    c2 <- ''
  }
  tibble::tibble(c1 = c1, c2 = c2)
}
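
As a quick check (not part of the original run), unpack_a can be applied to the example element shown above; a minimal sketch:

node <- xml2::xml_find_first(
  xml2::read_html(
    '<a class="latest-column" href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a>'),
  '//a')
unpack_a(node)
# expected: a 1 x 2 tibble with c1 = "VIEWPOINT" and
# c2 = "https://www.globaltimes.cn/opinion/viewpoint/"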

Retrieve a page without RSelenium

doc1   = get_data(filename = NULL, use_cache = F,
                  url = 'http://www.globaltimes.cn/', rD = NULL)
doc1_a = xml2::xml_find_all(doc1, "//a")
xx1    = purrr::map_dfr(doc1_a,unpack_a) 
dim(xx1)
#> [1] 46  2
print(xx1)
#> # A tibble: 46 x 2
#>    c1                                 c2                                        
#>    <chr>                              <chr>                                     
#>  1 ""                                 https://www.globaltimes.cn/               
#>  2 " "                                https://www.globaltimes.cn//special-cover~
#>  3 "OP-ED"                            https://www.globaltimes.cn/opinion/       
#>  4 "EDITORIAL"                        https://www.globaltimes.cn/opinion/editor~
#>  5 "US government loses decency unde~ https://www.globaltimes.cn/content/118319~
#>  6 "OBSERVER"                         https://www.globaltimes.cn/opinion/observ~
#>  7 "Trump puts Asian Americans at ri~ https://www.globaltimes.cn/content/118319~
#>  8 "VIEWPOINT"                        https://www.globaltimes.cn/opinion/viewpo~
#>  9 "Germans hear little of China's a~ https://www.globaltimes.cn/content/118314~
#> 10 "GT VOICE"                         https://www.globaltimes.cn/source/GT-Voic~
#> # ... with 36 more rows

We see that in this way we retrieve 46 anchor elements.

Retrieve a page with RSelenium

The only difference from the previous section is the creation of the rD object and its use in the get_data call. Afterwards we try to get rid of the RSelenium resources. This succeeds, with the annoying exception of the port number (in this case 4568), which is not freed (as described in Using RSelenium to scrape a webpage).

selenium_port = 4568L
# start a Selenium server and a firefox browser session on that port
rD = rsDriver(browser = 'firefox', port = selenium_port, verbose = F)

doc2   = get_data(filename = NULL, use_cache = F,
                  url = 'http://www.globaltimes.cn/', rD = rD)
doc2_a = xml2::xml_find_all(doc2, "//a")
xx2    = purrr::map_dfr(doc2_a,unpack_a)
dim(xx2)
#> [1] 467   2
print(xx2)
#> # A tibble: 467 x 2
#>    c1                                          c2                               
#>    <chr>                                       <chr>                            
#>  1 " "                                         https://www.globaltimes.cn/      
#>  2 "\n    "                                    https://www.globaltimes.cn/      
#>  3 " \n       \n       load_file(\"/includes/~ https://www.globaltimes.cn/      
#>  4 " "                                         https://www.globaltimes.cn/      
#>  5 "E-Paper"                                   http://epaper.globaltimes.cn     
#>  6 "Mobile"                                    http://mobile.globaltimes.cn     
#>  7 "Apps"                                      http://www.globaltimes.cn/conten~
#>  8 "Sina Weibo"                                http://weibo.com/globaltimescn/  
#>  9 "Facebook"                                  http://www.facebook.com/globalti~
#> 10 "Twitter"                                   https://twitter.com/globaltimesn~
#> # ... with 457 more rows

rm(rD)
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  868429 46.4    1743240 93.1  1164778 62.3
#> Vcells 1528778 11.7    8388608 64.0  2363835 18.1

We see that in this way we retrieve 467 anchor elements, about ten times as many as the 46 we found without RSelenium.
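
As noted above, the Selenium port is not freed by this cleanup. A heavy-handed workaround (an assumption on my part, Windows only, and not part of the original run) is to kill the java process that runs the Selenium server:

# kills ALL java.exe processes, so only do this when no other
# Java application is running (Windows only)
system2('taskkill', args = c('/im', 'java.exe', '/f'))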

Summary

With little additional effort one can extract much more data (the whole rendered page?) from a webpage by using the RSelenium package.
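
For instance, the links that only appear after the JavaScript has run can be listed by comparing the two result sets; a minimal sketch using the tibbles xx1 and xx2 from above:

# anchor elements whose link occurs only in the RSelenium result
xx2 %>%
  dplyr::anti_join(xx1, by = 'c2') %>%
  dplyr::distinct(c2)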

Session Info

This document was produced on 19Mar2020 with the following R environment:

  #> R version 3.6.0 (2019-04-26)
  #> Platform: x86_64-w64-mingw32/x64 (64-bit)
  #> Running under: Windows 10 x64 (build 18363)
  #> 
  #> Matrix products: default
  #> 
  #> locale:
  #> [1] LC_COLLATE=English_United States.1252 
  #> [2] LC_CTYPE=English_United States.1252   
  #> [3] LC_MONETARY=English_United States.1252
  #> [4] LC_NUMERIC=C                          
  #> [5] LC_TIME=English_United States.1252    
  #> 
  #> attached base packages:
  #> [1] stats     graphics  grDevices utils     datasets  methods   base     
  #> 
  #> other attached packages:
  #> [1] RSelenium_1.7.7 tibble_2.1.3    dplyr_0.8.5     purrr_0.3.3    
  #> [5] magrittr_1.5    xml2_1.2.5     
  #> 
  #> loaded via a namespace (and not attached):
  #>  [1] Rcpp_1.0.4       pillar_1.4.3     compiler_3.6.0   bitops_1.0-6    
  #>  [5] tools_3.6.0      digest_0.6.25    jsonlite_1.6.1   evaluate_0.14   
  #>  [9] pkgconfig_2.0.3  rlang_0.4.5      cli_2.0.2        curl_4.3        
  #> [13] yaml_2.2.0       xfun_0.10        binman_0.1.1     httr_1.4.1      
  #> [17] stringr_1.4.0    knitr_1.28       rappdirs_0.3.1   vctrs_0.2.4     
  #> [21] askpass_1.1      caTools_1.17.1.1 tidyselect_1.0.0 glue_1.3.2      
  #> [25] R6_2.4.1         processx_3.4.1   fansi_0.4.1      XML_3.98-1.20   
  #> [29] rmarkdown_2.1    semver_0.2.0     ps_1.3.0         htmltools_0.4.0 
  #> [33] assertthat_0.2.1 utf8_1.1.4       stringi_1.4.6    openssl_1.4.1   
  #> [37] HOQCutil_0.1.19  wdman_0.2.5      crayon_1.3.4