{catchpole} Redux and Hashing Files & Websites with {ssdeepr}

Über Tuesday has come and almost gone (some state results will take a while to coalesce) and I’m relieved to say that {catchpole} did indeed work, with the example code from before producing this on first run:

If we tweak the buffer space around the squares, I think the cartogram looks better:

That said, you should likely use a different palette (see this Twitter thread for examples).

I noted in the previous post that borders might be possible. While I haven’t solved that use-case for individual states, I did manage to come up with a method for making a light version of the cartogram usable:

library(sf)
library(hrbrthemes)
library(catchpole)
library(tidyverse)

delegates <- read_delegates()

candidates_expanded <- expand_candidates()

gsf <- left_join(delegates_map(), candidates_expanded, by = c("state", "idx"))

m <- delegates_map()

# split off each "area" on the map so we can make a border+background
list(
  setdiff(state.abb, c("HI", "AK")),
  "AK", "HI", "DC", "VI", "PR", "MP", "GU", "DA", "AS"
) %>%
  map(~{
    suppressWarnings(suppressMessages(st_buffer(
      x = st_union(m[m$state %in% .x, ]),
      dist = 0.0001,
      endCapStyle = "SQUARE"
    )))
  }) -> m_borders

gg <- ggplot()
for (mb in m_borders) {
  gg <- gg + geom_sf(data = mb, col = "#2b2b2b", size = 0.125)
}

gg +
  geom_sf(
    data = gsf,
    aes(fill = candidate),
    col = "white", shape = 22, size = 3, stroke = 0.125
  ) +
  scale_fill_manual(
    name = NULL,
    na.value = "#f0f0f0",
    values = c(
      "Biden" = '#f0027f',
      "Sanders" = '#7fc97f',
      "Warren" = '#beaed4',
      "Buttigieg" = '#fdc086',
      "Klobuchar" = '#ffff99',
      "Gabbard" = '#386cb0',
      "Bloomberg" = '#bf5b17'
    ),
    limits = intersect(unique(delegates$candidate), names(delegates_pal))
  ) +
  guides(
    fill = guide_legend(
      override.aes = list(size = 4)
    )
  ) +
  coord_sf(datum = NA) +
  theme_ipsum_es(grid="") +
  theme(legend.position = "bottom")


Researcher pals over at Binary Edge added web page hashing (pre- and post-JavaScript scraping) to their platform using ssdeep. This approach falls in the category of context-triggered piecewise hashes (CTPH), or locality-sensitive hashing, similar to my R adaptation/packaging of Trend Micro’s tlsh.

Since I’ll be working with BE’s data off-and-on and the ssdeep project has a well-crafted library (plus we might add ssdeep support at $DAYJOB), I went ahead and packaged that up as well.

I recommend using the hash_con() function if you need to read large blobs, since it doesn’t require you to read everything into memory first. hash_file() doesn’t either, but it is a direct low-level call to the underlying ssdeep library’s file reader and is not as flexible as R connections are.
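To make the "flexible as R connections" point concrete, here is a minimal sketch that hashes the same text stored both plain and gzip-compressed. The `hash_path()` helper is hypothetical, and the sketch assumes hash_con() will open and read an unopened `file()` or `gzfile()` connection the same way it handles `url()` elsewhere in this post.

```r
# Hypothetical helper: choose a connection type from the file extension, then
# hand it to hash_con(). Assumes hash_con() opens and reads the unopened
# connection itself, as it does with url() elsewhere in this post.
hash_path <- function(path) {
  con <- if (grepl("\\.gz$", path)) gzfile(path) else file(path)
  ssdeepr::hash_con(con)
}

# throwaway data: the same text stored plain and gzip-compressed
txt   <- sprintf("record %05d", 1:5000)
plain <- tempfile(fileext = ".txt")
gz    <- tempfile(fileext = ".txt.gz")
writeLines(txt, plain)
gz_out <- gzfile(gz, "w")
writeLines(txt, gz_out)
close(gz_out)

# only attempt the hashing step when {ssdeepr} is installed
if (requireNamespace("ssdeepr", quietly = TRUE)) {
  # gzfile() decompresses on read, so both hashes describe the same content
  ssdeepr::hash_compare(hash_path(plain), hash_path(gz))
}
```

A path-only reader such as hash_file() would see the compressed bytes of the .gz file instead, which is the kind of flexibility gap the paragraph above has in mind.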

These types of hashes are great at seeing if something has changed on a website (or at seeing how similar two things are to each other). For instance, how closely do CRAN mirrors match the mothership?

library(ssdeepr) # see the links above for installation

cran1 <- hash_con(url("https://cran.r-project.org/web/packages/available_packages_by_date.html"))
cran2 <- hash_con(url("https://cran.biotools.fr/web/packages/available_packages_by_date.html"))
cran3 <- hash_con(url("https://cran.rstudio.org/web/packages/available_packages_by_date.html"))

hash_compare(cran1, cran2)
## [1] 0

hash_compare(cran1, cran3)
## [1] 94

I picked on cran.biotools.fr as I saw they were well behind CRAN-proper on the monitoring page.
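If you want to turn scores like these into something a monitoring script can act on, a rough sketch follows. The `label_similarity()` helper and its cut points are hypothetical and purely illustrative; the only thing taken from ssdeep itself is that scores run from 0 (no match) to 100.

```r
# Hypothetical helper: bucket ssdeep similarity scores (0-100) into labels.
# The cut points are illustrative assumptions, not ssdeep recommendations.
label_similarity <- function(score, near = 90, related = 50) {
  stopifnot(is.numeric(score), all(score >= 0 & score <= 100))
  ifelse(score == 100, "match",
    ifelse(score >= near, "near match",
      ifelse(score >= related, "related", "different")))
}

# scores taken from the CRAN mirror comparison above
label_similarity(c(biotools = 0, rstudio = 94))
```

Sensible thresholds depend on how noisy the page is (timestamps, rotating banners, and the like), so treat the defaults as a starting point to tune.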

I noted that BE was doing pre- and post-JavaScript hashing as well. Why, you may ask? Well, websites behave differently with JavaScript running, plus they can behave differently when different user-agents are set. Let’s grab a page from Wikipedia a few different ways to show how they are not alike at all, depending on the retrieval context. First, let’s grab some web content!

library(httr)
library(ssdeepr)
library(splashr)

# regular grab
h1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth"))

# you need Splash running for javascript-enabled scraping this way
sp <- splash(host = "mysplashhost", user = "splashuser", pass = "splashpass")

# js-enabled with one ua
sp %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js1

# js-enabled with another ua
sp %>%
  splash_user_agent(ua_ios_safari) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js2

h2 <- hash_raw(js1)
h3 <- hash_raw(js2)

# same way {rvest} does it
res <- httr::GET("https://en.wikipedia.org/wiki/Donald_Knuth")
h4 <- hash_raw(content(res, as = "raw"))

Now, let’s compare them:

hash_compare(h1, h4) # {ssdeepr} built-in vs httr::GET() => not surprising that they're equal
## [1] 100

# things look way different with js-enabled
hash_compare(h1, h2)
## [1] 0

hash_compare(h1, h3)
## [1] 0

# and with variations between user-agents
hash_compare(h2, h3)
## [1] 0

hash_compare(h2, h4)
## [1] 0

# only doing this for completeness
hash_compare(h3, h4)
## [1] 0

For this example, content size alone would have been enough to tell most of them apart (though note how the hashes are equal despite more characters coming back with the {httr} method):

length(js1)
## [1] 432914

length(js2)
## [1] 270538

nchar(
  paste0(
    readLines(url("https://en.wikipedia.org/wiki/Donald_Knuth")),
    collapse = "\n"
  )
)
## [1] 373078

length(content(res, as = "raw"))
## [1] 374099
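One plausible contributor to the small gap between the last two numbers (an assumption on my part, not something measured here) is that they count different things: nchar() counts characters by default, while length() on a raw vector counts bytes, and readLines() drops line terminators. A tiny base R sketch of both effects, using made-up strings:

```r
# a short string with one non-ASCII character (e-acute, two bytes in UTF-8)
x <- "caf\u00e9"

nchar(x)                    # characters (the default, type = "chars")
nchar(x, type = "bytes")    # bytes
length(charToRaw(x))        # bytes, like length() on a raw vector

# readLines() strips line terminators, so pasting the lines back together
# with "\n" restores every newline except a trailing one
tmp <- tempfile()
writeBin(charToRaw("line one\nline two\n"), tmp)
file.size(tmp)
nchar(paste0(readLines(tmp), collapse = "\n"))
```

On a page with plenty of non-ASCII text, the character count will come in under the byte count even when the content is identical, which would be consistent with the hashes matching.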


If you were in a U.S. state with a primary yesterday and were eligible to vote (and had something to vote for, either a (D) candidate or a state/local bit of business) I sure hope you did!

The ssdeep library works on Windows, so I’ll be figuring out how to get that going in {ssdeepr} fairly soon (mostly to try out the Rtools 4.0 toolchain vs deliberately wanting to support legacy platforms).

As usual, drop issues/PRs/feature requests where you’re comfortable for any of these or other packages.

*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2020/03/04/catchpole-redux-and-hashing-files-websites-with-ssdeepr/