
{catchpole} Redux and Hashing Files & Websites with {ssdeepr}
Über Tuesday has come and almost gone (some state results will take a while to coalesce), and I’m relieved to say that {catchpole} did indeed work, with the example code from before producing this on first run:
If we tweak the buffer space around the squares, I think the cartogram looks better:
But you should likely use a different palette (see this Twitter thread for examples; a quick sketch of one alternative follows the code below).
I noted in the previous post that borders might be possible. While I haven’t solved that use-case for individual states, I did manage to come up with a method for making a light version of the cartogram usable:
library(sf)
library(hrbrthemes)
library(catchpole)
library(tidyverse)

delegates <- read_delegates()

candidates_expanded <- expand_candidates()

gsf <- left_join(delegates_map(), candidates_expanded, by = c("state", "idx"))

m <- delegates_map()

# split off each "area" on the map so we can make a border+background
list(
  setdiff(state.abb, c("HI", "AK")),
  "AK", "HI", "DC", "VI", "PR", "MP", "GU", "DA", "AS"
) %>%
  map(~{
    suppressWarnings(suppressMessages(st_buffer(
      x = st_union(m[m$state %in% .x, ]),
      dist = 0.0001,
      endCapStyle = "SQUARE"
    )))
  }) -> m_borders

gg <- ggplot()

for (mb in m_borders) {
  gg <- gg + geom_sf(data = mb, col = "#2b2b2b", size = 0.125)
}

gg +
  geom_sf(
    data = gsf,
    aes(fill = candidate),
    col = "white", shape = 22, size = 3, stroke = 0.125
  ) +
  scale_fill_manual(
    name = NULL,
    na.value = "#f0f0f0",
    values = c(
      "Biden" = '#f0027f',
      "Sanders" = '#7fc97f',
      "Warren" = '#beaed4',
      "Buttigieg" = '#fdc086',
      "Klobuchar" = '#ffff99',
      "Gabbard" = '#386cb0',
      "Bloomberg" = '#bf5b17'
    ),
    limits = intersect(unique(delegates$candidate), names(delegates_pal))
  ) +
  guides(
    fill = guide_legend(
      override.aes = list(size = 4)
    )
  ) +
  coord_sf(datum = NA) +
  theme_ipsum_es(grid = "") +
  theme(legend.position = "bottom")
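If you do take that palette advice, swapping one in is just a matter of handing scale_fill_manual() a different values vector. Here’s a quick sketch using the Okabe-Ito colorblind-friendly palette; the candidate-to-color assignments are arbitrary choices on my part, not a recommendation:

# a sketch only: Okabe-Ito colors, assigned to candidates arbitrarily
okabe_ito <- c(
  "Biden"     = "#0072B2",
  "Sanders"   = "#E69F00",
  "Warren"    = "#009E73",
  "Buttigieg" = "#CC79A7",
  "Klobuchar" = "#F0E442",
  "Gabbard"   = "#56B4E9",
  "Bloomberg" = "#D55E00"
)

# drop this in place of the scale_fill_manual() call above
scale_fill_manual(
  name = NULL,
  na.value = "#f0f0f0",
  values = okabe_ito,
  limits = intersect(unique(delegates$candidate), names(okabe_ito))
)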
{ssdeepr}
Researcher pals over at Binary Edge added web page hashing (pre- and post-javascript scraping) to their platform using ssdeep. This approach falls into the category of context triggered piecewise hashing (CTPH), also known as locality-sensitive hashing, and is similar to my R adaptation/packaging of Trend Micro’s tlsh.
Since I’ll be working with BE’s data off-and-on and the ssdeep project has a well-crafted library (plus we might add ssdeep support at $DAYJOB), I went ahead and packaged that up as well.
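If CTPH is new to you, here’s a minimal sketch of the core behavior; the blobs are synthetic and purely illustrative. A small edit to otherwise-identical content should still yield a high match score, while unrelated content scores at or near 0.

library(ssdeepr)

# CTPH needs a decent amount of input to build a meaningful
# signature, so generate a few KB of synthetic content
set.seed(1492)
blob_a <- paste(sample(c(letters, LETTERS, " "), 10000, replace = TRUE), collapse = "")

# the same content with a small edit appended
blob_b <- paste0(blob_a, " plus one small edit")

ha <- hash_raw(charToRaw(blob_a))
hb <- hash_raw(charToRaw(blob_b))

# match scores run 0-100; near-identical inputs should score high
hash_compare(ha, hb)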
I recommend using the hash_con() function if you need to read large blobs, since it doesn’t require you to read everything into memory first (hash_file() doesn’t either, but that’s a direct low-level call to the underlying ssdeep library’s file reader and isn’t as flexible as R connections are).
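Here’s a quick sketch of that pattern (the file path is hypothetical):

library(ssdeepr)

# any R connection (file(), gzfile(), url(), etc.) should work
# the same way as the url() examples below
h_big <- hash_con(file("big-crawl-dump.json"))

# hash_file() hands the path straight to the ssdeep library's reader
h_big2 <- hash_file("big-crawl-dump.json")

# same bytes either way, so these should compare as identical (100)
hash_compare(h_big, h_big2)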
These types of hashes are great for seeing if something has changed on a website (or for seeing how similar two things are to each other). For instance, how closely do CRAN mirrors match the mothership?
library(ssdeepr) # see the links above for installation

cran1 <- hash_con(url("https://cran.r-project.org/web/packages/available_packages_by_date.html"))
cran2 <- hash_con(url("https://cran.biotools.fr/web/packages/available_packages_by_date.html"))
cran3 <- hash_con(url("https://cran.rstudio.org/web/packages/available_packages_by_date.html"))

hash_compare(cran1, cran2)
## [1] 0

hash_compare(cran1, cran3)
## [1] 94
I picked on cran.biotools.fr since I saw it was well behind CRAN proper on the monitoring page.
I noted that BE was doing pre- and post-javascript hashing as well. Why, you may ask? Well, websites behave differently with javascript running, plus they can behave differently when different user-agents are set. Let’s fetch a page from Wikipedia a few different ways to show how the results are not alike at all, depending on the retrieval context. First, let’s grab some web content!
library(httr)
library(ssdeepr)
library(splashr)

# regular grab
h1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth"))

# you need Splash running for javascript-enabled scraping this way
sp <- splash(host = "mysplashhost", user = "splashuser", pass = "splashpass")

# js-enabled with one ua
sp %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js1

# js-enabled with another ua
sp %>%
  splash_user_agent(ua_ios_safari) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js2

h2 <- hash_raw(js1)
h3 <- hash_raw(js2)

# same way {rvest} does it
res <- httr::GET("https://en.wikipedia.org/wiki/Donald_Knuth")

h4 <- hash_raw(content(res, as = "raw"))
Now, let’s compare them:
hash_compare(h1, h4) # {ssdeepr} built-in vs httr::GET() => not surprising that they're equal
## [1] 100

# things look way different with js-enabled
hash_compare(h1, h2)
## [1] 0

hash_compare(h1, h3)
## [1] 0

# and with variations between user-agents
hash_compare(h2, h3)
## [1] 0

hash_compare(h2, h4)
## [1] 0

# only doing this for completeness
hash_compare(h3, h4)
## [1] 0
For this example, content size alone would have been enough to tell the difference (mostly; note how the first and last hashes still compare as equal despite more characters coming back via the {httr} method):
length(js1)
## [1] 432914

length(js2)
## [1] 270538

nchar(
  paste0(
    readLines(url("https://en.wikipedia.org/wiki/Donald_Knuth")),
    collapse = "\n"
  )
)
## [1] 373078

length(content(res, as = "raw"))
## [1] 374099
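To close the loop on the “has this site changed?” use case, here’s a minimal sketch of a baseline-and-recheck workflow. The cache file name and the alert threshold are arbitrary choices on my part, not anything {ssdeepr} prescribes:

library(ssdeepr)

page <- "https://en.wikipedia.org/wiki/Donald_Knuth"
baseline_rds <- "knuth-baseline.rds" # arbitrary local cache file

# store a baseline hash on the first run
if (!file.exists(baseline_rds)) {
  saveRDS(hash_con(url(page)), baseline_rds)
}

# then, on whatever schedule you like, re-hash and compare
baseline <- readRDS(baseline_rds)
current  <- hash_con(url(page))

score <- hash_compare(baseline, current)

# 90 is an arbitrary threshold; tune it to how much churn you expect
if (score < 90) message("Page changed meaningfully (score: ", score, ")")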
FIN
If you were in a U.S. state with a primary yesterday and were eligible to vote (and had something to vote for, either a (D) candidate or a state/local bit of business), I sure hope you did!
The ssdeep library works on Windows, so I’ll be figuring out how to get that going in {ssdeepr} fairly soon (mostly as an excuse to try out the Rtools 4.0 toolchain, not out of any deliberate desire to support legacy platforms).
As usual, drop issues/PRs/feature requests where you’re comfortable for any of these or other packages.