The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely.
For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike tools that only fetch a page’s raw HTML, splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running and it’s best to use it in a Docker container.
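As a rough illustration of the authenticated connection described above, here is a minimal sketch that builds the connection object from environment variables so the credentials stay out of the script. The helper name `splash_from_env()` and the variable names `SPLASH_HOST`, `SPLASH_USER` and `SPLASH_PASS` are assumptions made for this example; only `splash()` and `splash_active()` come from splashr.

```r
# Hypothetical helper (not part of splashr): read the connection details from
# environment variables and fail loudly if any of them is missing.
splash_from_env <- function() {
  creds <- Sys.getenv(c("SPLASH_HOST", "SPLASH_USER", "SPLASH_PASS"))
  missing <- names(creds)[creds == ""]
  if (length(missing) > 0) {
    stop("Unset environment variables: ", paste(missing, collapse = ", "))
  }
  splashr::splash(
    host = creds[["SPLASH_HOST"]],
    port = 8050L,                    # default Splash API port
    user = creds[["SPLASH_USER"]],
    pass = creds[["SPLASH_PASS"]]
  )
}

# Usage (needs a running Splash instance):
# sp <- splash_from_env()
# splashr::splash_active(sp)   # pings the instance to see if it is reachable
```

Hard-coding `user` and `pass` directly in the `splash()` call works just as well for a quick test; the environment variables are only one way to avoid committing credentials.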
If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to set up your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium, which uses docker-compose to stand up multiple Splash instances behind a pre-configured HAProxy instance so you can take advantage of parallel requests to the Splash API. As long as you have docker-compose handy (and Python), following the steps on the Aquarium GitHub page should have you up and running in minutes. You use the same default port (8050) to access the Splash API and you get a bonus port of 8036 to watch in your browser (the HAProxy stats page).
This works well when combined with furrr, an R package that makes parallel operations very tidy.
One way to use splashr and Aquarium might look like this:
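What follows is a minimal sketch of such a setup, not necessarily the author's original listing. The host name, credentials, output directory, worker count and the helper names `fetch_har()`, `scrape_one()` and `scrape_all()` are placeholders chosen for illustration. Saving each result as a HAR file via HARtools is likewise an assumption, and the native pipe `|>` assumes R 4.1 or later (`%>%` would work the same way).

```r
# Fetch one URL through the Splash API and return its HAR object.
# Host and credentials are placeholders for your own Aquarium setup.
fetch_har <- function(url) {
  splashr::splash(
    host = "aquarium-host.example",
    user = "your-splash-user",
    pass = "your-splash-password"
  ) |>
    splashr::splash_response_body(TRUE) |>              # keep response bodies
    splashr::splash_user_agent(splashr::ua_win10_ie11) |>
    splashr::splash_go(url) |>
    splashr::splash_wait(5) |>                          # let JavaScript settle
    splashr::splash_har()
}

# Scrape a single URL and always return a status string, never an error,
# so one bad site cannot abort the whole parallel run.
scrape_one <- function(url, fetch, out_dir) {
  har <- tryCatch(fetch(url), error = function(e) e)
  if (inherits(har, "error")) {
    return(sprintf("Error retrieving %s (%s)", url, conditionMessage(har)))
  }
  out_file <- file.path(out_dir, sprintf("%s.har", urltools::domain(url)))
  HARtools::writeHAR(har, out_file)
  sprintf("Successfully retrieved %s", url)
}

# Run the scrape in parallel; the previous future plan is restored on exit.
scrape_all <- function(urls, workers = 5, out_dir = tempdir()) {
  old_plan <- future::plan(future::multisession, workers = workers)
  on.exit(future::plan(old_plan), add = TRUE)
  furrr::future_map(urls, scrape_one, fetch = fetch_har, out_dir = out_dir)
}

# Usage (needs Aquarium running and a vector of URLs):
# results <- scrape_all(c("http://www.1.example.com", "http://www.a.example.com"))
```

Keeping the worker count modest avoids overwhelming a default Aquarium setup or your own internet connection.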
(Those with a keen eye will grok why splashr supports Splash API basic authentication now.)
The parallel iterator returns a list. (I don’t flatten it by default, since it’s safer to get a list back: it can hold anything, and map_chr() likes to check for proper objects.) We can flatten it to a character vector and check for errors with something like:
flatten_chr(results) %>% keep(str_detect, "Error")
## "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)"
## "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
## "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)"
## "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)"
## "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"
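To see at a glance which failure mode dominates, the error strings can be tallied by HTTP status. This is a small sketch in base R; the helper name `tally_status()` is made up for the example, and it assumes the messages end in the "(HTTP nnn)" form shown in the output above.

```r
# Hypothetical helper: extract the three-digit status code that follows
# "HTTP " in each error string and count how often each code occurs.
tally_status <- function(errors) {
  status <- sub(".*\\(HTTP ([0-9]{3})\\).*", "\\1", errors)
  table(status)
}

# Sample strings copied from the output above.
errors <- c(
  "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)",
  "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)",
  "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)",
  "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)",
  "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"
)

tally_status(errors)
```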
Timeouts suggest you may need to raise the timeout parameter in your Splash call. Service unavailable or bad gateway errors may suggest you need to tweak the Aquarium configuration to add more Splash workers or reduce the parallelism in your plan(…). It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
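One way such a retry step could be sketched in base R is shown below. The wrapper name `with_retries()`, the default of three attempts and the linear pause between attempts are all assumptions for illustration, not something splashr or Aquarium prescribes.

```r
# Hypothetical wrapper: returns a version of `f` that is tried up to
# `max_tries` times, pausing a little longer after each failed attempt.
with_retries <- function(f, max_tries = 3, pause = 2) {
  function(url) {
    last_error <- NULL
    for (attempt in seq_len(max_tries)) {
      res <- tryCatch(
        list(result = f(url), error = NULL),
        error = function(e) list(result = NULL, error = e)
      )
      if (is.null(res$error)) {
        return(res$result)
      }
      last_error <- res$error
      if (attempt < max_tries) {
        Sys.sleep(pause * attempt)   # linear back-off between attempts
      }
    }
    stop(sprintf(
      "Giving up on %s after %d tries: %s",
      url, max_tries, conditionMessage(last_error)
    ))
  }
}
```

The idea is to wrap whatever function performs a single Splash request, for example `with_retries(my_fetch)` where `my_fetch` is your own function, so transient 502/503/504 responses get another chance before being reported as errors. purrr's `insistently()` offers similar behaviour if you would rather not roll your own.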
If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.
*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2018/08/13/in-brief-splashr-update-high-performance-scraping-with-splashr-furrr-teamhg-memexs-aquarium/