Scraping links from a webpage

Han Oostdijk

2020/03/16

Date last run: 19Mar2020

In a previous blog entry I showed how to scrape a page that uses JavaScript to fill a table. Doing that, I realized why I saw a difference between two ways of displaying the HTML of a page: when the browser renders the page it executes the JavaScript code, but when the raw HTML is read directly the JavaScript code is not executed. It was only executed once I decided to use the R package RSelenium.

In this blog entry I read a webpage with and without using RSelenium. I then use the xml2 package to extract the links from the page. The number of links found in the two cases differs considerably.

Load the packages we will use

HOQCutil::silent_library(c('xml2','magrittr','purrr','dplyr','tibble','RSelenium'))

Define function get_data

The function get_data will retrieve an HTML document from the internet or from a local file, depending on the argument use_cache:

get_data <-
  function(filename = NULL,
           use_cache = FALSE,
           url = NULL,
           rD = NULL) {
    # read the page directly with xml2 or, when an rsDriver object is
    # passed in rD, via the RSelenium client so that the JavaScript on
    # the page is executed before the page source is read
    read_html2 <- function(url) {
      if (is.null(rD)) {
        xml2::read_html(url)
      } else if ("rsClientServer" %in% class(rD)) {
        remDr <- rD[["client"]]
        remDr$navigate(url)
        remDr$switchToFrame(NULL)
        doc <- xml2::read_html(remDr$getPageSource()[[1]])
        # release the browser session and stop the Selenium server
        remDr$close()
        rD[["server"]]$stop()
        doc
      } else {
        NULL
      }
    }
    if (!use_cache) {
      doc <- read_html2(url)
      if (!is.null(filename)) {
        # save the retrieved page so it can be reused as a cache
        xml2::write_html(doc, filename)
      }
    } else {
      doc <- xml2::read_html(filename)
    }
    doc
  }
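
As an aside (not part of the original run), the use_cache argument can be used to avoid repeated downloads; a minimal sketch, assuming a hypothetical local file name globaltimes.html:

# first run: fetch the page and write it to the cache file
doc <- get_data(filename = 'globaltimes.html', use_cache = FALSE,
                url = 'http://www.globaltimes.cn/')
# later runs: read the cached copy instead of hitting the site again
doc <- get_data(filename = 'globaltimes.html', use_cache = TRUE)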

Define function unpack_a

This function extracts the 'description' and 'link' parts of an anchor element. From the element <a class="latest-column" href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a> it will extract 'VIEWPOINT' and 'https://www.globaltimes.cn/opinion/viewpoint/'.

unpack_a <- function(doc) {
  if (length(doc) > 0) {
    c1 <- xml2::xml_text(doc)          # the text between <a> and </a>
    c2 <- xml2::xml_attr(doc, 'href')  # the link in the href attribute
    # xml_attr returns NA when the attribute is absent, so test for NA
    c1 <- ifelse(is.na(c1), '', c1)
    c2 <- ifelse(is.na(c2), '', c2)
  } else {
    c1 <- ''
    c2 <- ''
  }
  tibble::tibble(c1 = c1, c2 = c2)
}
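
As a quick check (not part of the original run), unpack_a can be applied to the example element shown above; a minimal sketch:

node <- xml2::xml_find_first(
  xml2::read_html(
    '<a class="latest-column" href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a>'),
  '//a')
unpack_a(node)
# expected: a 1 x 2 tibble with c1 = "VIEWPOINT" and
# c2 = "https://www.globaltimes.cn/opinion/viewpoint/"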

Retrieve a page without RSelenium

doc1   = get_data(filename = NULL, use_cache = F,
                  url = 'http://www.globaltimes.cn/', rD = NULL)
doc1_a = xml2::xml_find_all(doc1, "//a")
xx1    = purrr::map_dfr(doc1_a,unpack_a) 
dim(xx1)
#> [1] 46  2
print(xx1)
#> # A tibble: 46 x 2
#>    c1                                 c2                                        
#>    <chr>                              <chr>                                     
#>  1 ""                                 https://www.globaltimes.cn/               
#>  2 " "                                https://www.globaltimes.cn//special-cover~
#>  3 "OP-ED"                            https://www.globaltimes.cn/opinion/       
#>  4 "EDITORIAL"                        https://www.globaltimes.cn/opinion/editor~
#>  5 "US government loses decency unde~ https://www.globaltimes.cn/content/118319~
#>  6 "OBSERVER"                         https://www.globaltimes.cn/opinion/observ~
#>  7 "Trump puts Asian Americans at ri~ https://www.globaltimes.cn/content/118319~
#>  8 "VIEWPOINT"                        https://www.globaltimes.cn/opinion/viewpo~
#>  9 "Germans hear little of China's a~ https://www.globaltimes.cn/content/118314~
#> 10 "GT VOICE"                         https://www.globaltimes.cn/source/GT-Voic~
#> # ... with 36 more rows

We see that in this way we retrieve 46 anchor elements.

Retrieve a page with RSelenium

The only difference from the previous section is the creation of the rD object and its use in the get_data call. Afterwards we try to get rid of the RSelenium resources. This succeeds, with the annoying exception of the port number (in this case 4568), which is not freed (as described in Using RSelenium to scrape a webpage).

selenium_port = 4568L
# start a Selenium server and a firefox browser session on that port
rD = rsDriver(browser = 'firefox', port = selenium_port, verbose = F)

doc2   = get_data(filename = NULL, use_cache = F,
                  url = 'http://www.globaltimes.cn/', rD = rD)
doc2_a = xml2::xml_find_all(doc2, "//a")
xx2    = purrr::map_dfr(doc2_a,unpack_a)
dim(xx2)
#> [1] 467   2
print(xx2)
#> # A tibble: 467 x 2
#>    c1                                          c2                               
#>    <chr>                                       <chr>                            
#>  1 " "                                         https://www.globaltimes.cn/      
#>  2 "\n    "                                    https://www.globaltimes.cn/      
#>  3 " \n       \n       load_file(\"/includes/~ https://www.globaltimes.cn/      
#>  4 " "                                         https://www.globaltimes.cn/      
#>  5 "E-Paper"                                   http://epaper.globaltimes.cn     
#>  6 "Mobile"                                    http://mobile.globaltimes.cn     
#>  7 "Apps"                                      http://www.globaltimes.cn/conten~
#>  8 "Sina Weibo"                                http://weibo.com/globaltimescn/  
#>  9 "Facebook"                                  http://www.facebook.com/globalti~
#> 10 "Twitter"                                   https://twitter.com/globaltimesn~
#> # ... with 457 more rows

rm(rD)
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  868429 46.4    1743240 93.1  1164778 62.3
#> Vcells 1528778 11.7    8388608 64.0  2363835 18.1

We see that in this way we retrieve 467 anchor elements, about ten times as many as the 46 we found without RSelenium.
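
As noted above, the Selenium port is not freed by this cleanup. A heavy-handed workaround (an assumption on my part, Windows only, and not part of the original run) is to kill the java process that runs the Selenium server:

# kills ALL java.exe processes, so only do this when no other
# Java application is running (Windows only)
system2('taskkill', args = c('/im', 'java.exe', '/f'))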

Summary

With little additional effort one can extract much more data (the whole rendered page?) from a webpage by using the RSelenium package.
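
For instance, the links that only appear after the JavaScript has run can be listed by comparing the two result sets; a minimal sketch using the tibbles xx1 and xx2 from above:

# anchor elements whose link occurs only in the RSelenium result
xx2 %>%
  dplyr::anti_join(xx1, by = 'c2') %>%
  dplyr::distinct(c2)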

Session Info

This document was produced on 19Mar2020 with the following R environment:

  #> R version 3.6.0 (2019-04-26)
  #> Platform: x86_64-w64-mingw32/x64 (64-bit)
  #> Running under: Windows 10 x64 (build 18363)
  #> 
  #> Matrix products: default
  #> 
  #> locale:
  #> [1] LC_COLLATE=English_United States.1252 
  #> [2] LC_CTYPE=English_United States.1252   
  #> [3] LC_MONETARY=English_United States.1252
  #> [4] LC_NUMERIC=C                          
  #> [5] LC_TIME=English_United States.1252    
  #> 
  #> attached base packages:
  #> [1] stats     graphics  grDevices utils     datasets  methods   base     
  #> 
  #> other attached packages:
  #> [1] RSelenium_1.7.7 tibble_2.1.3    dplyr_0.8.5     purrr_0.3.3    
  #> [5] magrittr_1.5    xml2_1.2.5     
  #> 
  #> loaded via a namespace (and not attached):
  #>  [1] Rcpp_1.0.4       pillar_1.4.3     compiler_3.6.0   bitops_1.0-6    
  #>  [5] tools_3.6.0      digest_0.6.25    jsonlite_1.6.1   evaluate_0.14   
  #>  [9] pkgconfig_2.0.3  rlang_0.4.5      cli_2.0.2        curl_4.3        
  #> [13] yaml_2.2.0       xfun_0.10        binman_0.1.1     httr_1.4.1      
  #> [17] stringr_1.4.0    knitr_1.28       rappdirs_0.3.1   vctrs_0.2.4     
  #> [21] askpass_1.1      caTools_1.17.1.1 tidyselect_1.0.0 glue_1.3.2      
  #> [25] R6_2.4.1         processx_3.4.1   fansi_0.4.1      XML_3.98-1.20   
  #> [29] rmarkdown_2.1    semver_0.2.0     ps_1.3.0         htmltools_0.4.0 
  #> [33] assertthat_0.2.1 utf8_1.1.4       stringi_1.4.6    openssl_1.4.1   
  #> [37] HOQCutil_0.1.19  wdman_0.2.5      crayon_1.3.4