Date last run: 19Mar2020
In a previous blog entry I showed how to scrape a page that uses javascript to fill a table. Doing that, I realized that this also explains why I saw a difference between two ways to save the html of a page:
- display the page in the (Firefox) browser and use File | Save As
- load the page with xml2::read_html and save it with xml2::write_html

In the first case the browser executes the javascript code; in the second case the javascript code is not executed. That changed when I decided to use the R package RSelenium. In this blog entry I read a webpage with and without RSelenium and then use the xml2 package to extract the links from the page. The number of links differs considerably.
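To make the second way concrete: this is roughly what loading and saving a page with xml2 looks like (a minimal sketch; the URL is just a stand-in):

library(xml2)
doc = read_html('https://www.example.com/')  # any javascript in the page is not executed
write_html(doc, 'page_without_js.html')      # the saved html lacks js-generated content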
Load the packages we will use
HOQCutil::silent_library(c('xml2','magrittr','purrr','dplyr','tibble','RSelenium'))
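The silent_library function comes from my personal HOQCutil package. If you do not have it, a base-R equivalent (a sketch that, I assume, matches its behavior of loading packages without startup noise) is:

pkgs = c('xml2', 'magrittr', 'purrr', 'dplyr', 'tibble', 'RSelenium')
for (p in pkgs) {
  # load each package while suppressing its startup messages
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}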
Define function get_data
The function get_data will retrieve an html document from the internet or from a local file, depending on the argument use_cache:
- use_cache == T: the html document is read from the local file with name filename
- use_cache == F: the html document is read from the internet resource named url. In this case the html document is written to the local file with name filename (when specified). If rD == NULL (i.e. is.null(rD) == T) the normal xml2::read_html is used; if rD != NULL (i.e. is.null(rD) == F) and rD is an rsClientServer object, the package RSelenium is used to read the page.
get_data <- function(filename = NULL,
                     use_cache = F,
                     url = NULL,
                     rD = NULL) {
  # helper: read an html document, with or without RSelenium
  read_html2 <- function(url) {
    if (is.null(rD)) {
      # plain read: the javascript in the page is not executed
      xml2::read_html(url)
    } else if ("rsClientServer" %in% class(rD)) {
      # let the Selenium-driven browser render the page, then parse its source
      remDr <- rD[["client"]]
      remDr$navigate(url)
      remDr$switchToFrame(NULL)
      doc = xml2::read_html(remDr$getPageSource()[[1]])
      remDr$close()
      rD[["server"]]$stop()
      doc
    } else {
      doc = NULL
    }
  }
  if (use_cache == F) {
    doc = read_html2(url)
    if (!is.null(filename)) {
      xml2::write_html(doc, filename)
    }
  } else {
    doc = xml2::read_html(filename)
  }
  doc
}
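As a usage sketch (the file name globaltimes.html is hypothetical): the first call reads the page from the internet and caches it on disk, the second call reads the cached copy without touching the network:

doc_live  = get_data(filename = 'globaltimes.html', use_cache = F,
                     url = 'http://www.globaltimes.cn/')
doc_cache = get_data(filename = 'globaltimes.html', use_cache = T)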
Define function unpack_a
This function extracts the 'description' and 'link' parts of an anchor element. From the element <a class="latest-column" href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a> it will extract 'VIEWPOINT' and 'https://www.globaltimes.cn/opinion/viewpoint/'.
unpack_a <- function (doc) {
  if (length(doc) > 0) {
    c1 = xml2::xml_text(doc)
    c2 = xml2::xml_attr(doc, 'href')
    # xml_text and xml_attr return NA (not NULL) when nothing is found
    c1 = ifelse(is.na(c1), '', c1)
    c2 = ifelse(is.na(c2), '', c2)
  } else {
    c1 = ''
    c2 = ''
  }
  tibble::tibble(c1 = c1, c2 = c2)
}
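A quick self-contained check on the anchor element from the example above:

snippet = xml2::read_html(
  '<a class="latest-column" href="https://www.globaltimes.cn/opinion/viewpoint/">VIEWPOINT</a>')
unpack_a(xml2::xml_find_first(snippet, '//a'))
# a 1 x 2 tibble with c1 = "VIEWPOINT" and
# c2 = "https://www.globaltimes.cn/opinion/viewpoint/"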
Retrieve a page without RSelenium
doc1 = get_data(filename = NULL, use_cache = F,
url = 'http://www.globaltimes.cn/', rD = NULL)
doc1_a = xml2::xml_find_all(doc1, "//a")
xx1 = purrr::map_dfr(doc1_a, unpack_a)
dim(xx1)
#> [1] 46 2
print(xx1)
#> # A tibble: 46 x 2
#> c1 c2
#> <chr> <chr>
#> 1 "" https://www.globaltimes.cn/
#> 2 " " https://www.globaltimes.cn//special-cover~
#> 3 "OP-ED" https://www.globaltimes.cn/opinion/
#> 4 "EDITORIAL" https://www.globaltimes.cn/opinion/editor~
#> 5 "US government loses decency unde~ https://www.globaltimes.cn/content/118319~
#> 6 "OBSERVER" https://www.globaltimes.cn/opinion/observ~
#> 7 "Trump puts Asian Americans at ri~ https://www.globaltimes.cn/content/118319~
#> 8 "VIEWPOINT" https://www.globaltimes.cn/opinion/viewpo~
#> 9 "Germans hear little of China's a~ https://www.globaltimes.cn/content/118314~
#> 10 "GT VOICE" https://www.globaltimes.cn/source/GT-Voic~
#> # ... with 36 more rows
We see that in this way we retrieve only 46 anchor elements.
Retrieve a page with RSelenium
The only difference with the previous section is the creation of the rD object and its use in the get_data call. Afterwards we try to get rid of the RSelenium resources. This succeeds, with the annoying exception of the port number (in this case 4568) that is not freed (as described in Using RSelenium to scrape a webpage).
selenium_port = 4568L
rD = rsDriver(browser = 'firefox', port = selenium_port, verbose = F)
doc2 = get_data(filename = NULL, use_cache = F,
url = 'http://www.globaltimes.cn/', rD = rD)
doc2_a = xml2::xml_find_all(doc2, "//a")
xx2 = purrr::map_dfr(doc2_a, unpack_a)
dim(xx2)
#> [1] 467 2
print(xx2)
#> # A tibble: 467 x 2
#> c1 c2
#> <chr> <chr>
#> 1 " " https://www.globaltimes.cn/
#> 2 "\n " https://www.globaltimes.cn/
#> 3 " \n \n load_file(\"/includes/~ https://www.globaltimes.cn/
#> 4 " " https://www.globaltimes.cn/
#> 5 "E-Paper" http://epaper.globaltimes.cn
#> 6 "Mobile" http://mobile.globaltimes.cn
#> 7 "Apps" http://www.globaltimes.cn/conten~
#> 8 "Sina Weibo" http://weibo.com/globaltimescn/
#> 9 "Facebook" http://www.facebook.com/globalti~
#> 10 "Twitter" https://twitter.com/globaltimesn~
#> # ... with 457 more rows
rm(rD)
gc()
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 868429 46.4 1743240 93.1 1164778 62.3
#> Vcells 1528778 11.7 8388608 64.0 2363835 18.1
We see that in this way we retrieve 467 anchor elements, roughly ten times as many as before.
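If the occupied port bothers you, a blunt workaround on Windows (my assumption, not part of the code above: it kills every running java process, the Selenium server among them) is:

system('taskkill /im java.exe /f')  # assumption: frees the port by killing all java processes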
Summary
With little additional effort one can extract much more data (possibly the whole page) from a webpage by using the RSelenium package.
Session Info
This document was produced on 19Mar2020 with the following R environment:
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] RSelenium_1.7.7 tibble_2.1.3 dplyr_0.8.5 purrr_0.3.3
#> [5] magrittr_1.5 xml2_1.2.5
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.0 bitops_1.0-6
#> [5] tools_3.6.0 digest_0.6.25 jsonlite_1.6.1 evaluate_0.14
#> [9] pkgconfig_2.0.3 rlang_0.4.5 cli_2.0.2 curl_4.3
#> [13] yaml_2.2.0 xfun_0.10 binman_0.1.1 httr_1.4.1
#> [17] stringr_1.4.0 knitr_1.28 rappdirs_0.3.1 vctrs_0.2.4
#> [21] askpass_1.1 caTools_1.17.1.1 tidyselect_1.0.0 glue_1.3.2
#> [25] R6_2.4.1 processx_3.4.1 fansi_0.4.1 XML_3.98-1.20
#> [29] rmarkdown_2.1 semver_0.2.0 ps_1.3.0 htmltools_0.4.0
#> [33] assertthat_0.2.1 utf8_1.1.4 stringi_1.4.6 openssl_1.4.1
#> [37] HOQCutil_0.1.19 wdman_0.2.5 crayon_1.3.4