{catchpole} Redux and Hashing Files & Websites with {ssdeepr}

by hrbrmstr on March 4, 2020

Über Tuesday has come and almost gone (some state results will take a while to coalesce) and I’m relieved to say that {catchpole} did indeed work, with the example code from before producing this on first run:

If we tweak the buffer space around the squares, I think the cartogram looks better:

but you should probably use a different palette (see this Twitter thread for examples).

I noted in the previous post that borders might be possible. While I haven’t solved that use-case for individual states, I did manage to come up with a method for making a light version of the cartogram usable:

    library(sf)
    library(hrbrthemes)
    library(catchpole)
    library(tidyverse)

    delegates <- read_delegates()

    candidates_expanded <- expand_candidates()

    gsf <- left_join(delegates_map(), candidates_expanded, by = c("state", "idx"))

    m <- delegates_map()

    # split off each "area" on the map so we can make a border+background
    list(
      setdiff(state.abb, c("HI", "AK")),
      "AK", "HI", "DC", "VI", "PR", "MP", "GU", "DA", "AS"
    ) %>%
      map(~{
        suppressWarnings(suppressMessages(st_buffer(
          x = st_union(m[m$state %in% .x, ]),
          dist = 0.0001,
          endCapStyle = "SQUARE"
        )))
      }) -> m_borders

    gg <- ggplot()
    for (mb in m_borders) {
      gg <- gg + geom_sf(data = mb, col = "#2b2b2b", size = 0.125)
    }

    gg +
      geom_sf(
        data = gsf,
        aes(fill = candidate),
        col = "white", shape = 22, size = 3, stroke = 0.125
      ) +
      scale_fill_manual(
        name = NULL,
        na.value = "#f0f0f0",
        values = c(
          "Biden" = '#f0027f',
          "Sanders" = '#7fc97f',
          "Warren" = '#beaed4',
          "Buttigieg" = '#fdc086',
          "Klobuchar" = '#ffff99',
          "Gabbard" = '#386cb0',
          "Bloomberg" = '#bf5b17'
        ),
        limits = intersect(unique(delegates$candidate), names(delegates_pal))
      ) +
      guides(
        fill = guide_legend(
          override.aes = list(size = 4)
        )
      ) +
      coord_sf(datum = NA) +
      theme_ipsum_es(grid="") +
      theme(legend.position = "bottom")

{ssdeepr}

Researcher pals over at Binary Edge added web page hashing (pre- and post-JavaScript scraping) to their platform using ssdeep. This approach belongs to the family of context triggered piecewise hashes (CTPH), a form of locality-sensitive hashing, similar to my R adaptation/packaging of Trend Micro’s tlsh.
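To make the CTPH idea a bit more concrete, here is a toy sketch in base R. It is not ssdeep and does not produce ssdeep-compatible output; the function name `toy_ctph()`, the window size, the trigger modulus and the piece hash are all made up for illustration. The only point is the mechanism: a rolling value computed from the last few bytes decides where a piece ends, and each piece contributes one symbol to the signature, so a local edit should only disturb the symbols near it.

```r
# Toy context-triggered piecewise hash (illustrative only, NOT ssdeep).
toy_ctph <- function(bytes, window = 7L, trigger = 16L) {
  stopifnot(is.raw(bytes))
  alphabet <- c(LETTERS, letters, 0:9, "+", "/")  # 64 output symbols
  x <- as.integer(bytes)
  out <- character(0)
  piece <- 0L  # running hash of the current piece
  for (i in seq_along(x)) {
    # fold the byte into the piece hash; the modulus keeps it well below
    # the integer overflow limit
    piece <- (piece * 31L + x[i]) %% 65521L
    # rolling value: sum of the last `window` bytes, so it depends only on
    # local context and not on anything earlier in the input
    lo <- max(1L, i - window + 1L)
    roll <- sum(x[lo:i])
    if (roll %% trigger == trigger - 1L) {
      # the context "triggered": close the piece and emit one symbol for it
      out <- c(out, alphabet[piece %% 64L + 1L])
      piece <- 0L
    }
  }
  # emit a final symbol for whatever is left in the trailing piece
  out <- c(out, alphabet[piece %% 64L + 1L])
  paste(out, collapse = "")
}

a <- charToRaw(strrep("The quick brown fox jumps over the lazy dog. ", 20))
b <- a
b[400] <- as.raw(0x58)  # change a single byte in the middle

sig_a <- toy_ctph(a)
sig_b <- toy_ctph(b)

# edit distance between the two signatures; smaller means more similar
utils::adist(sig_a, sig_b)
```

Because piece boundaries are decided by local content, the two signatures should share everything before the edit and resynchronise shortly after it. ssdeep itself uses a different rolling hash, adapts the block size to the input length, and scores similarity on a 0 to 100 scale.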

Since I’ll be working with BE’s data off-and-on and the ssdeep project has a well-crafted library (plus we might add ssdeep support at $DAYJOB), I went ahead and packaged that up as well.

I recommend using the hash_con() function if you need to read large blobs, since it doesn’t require you to read everything into memory first. hash_file() doesn’t either, but it is a direct low-level call to the underlying ssdeep library’s file reader and is not as flexible as R connections.

These types of hashes are great for seeing whether something has changed on a website (or how similar two things are to each other). For instance, how closely do CRAN mirrors match the mothership?

    library(ssdeepr) # see the links above for installation

    cran1 <- hash_con(url("https://cran.r-project.org/web/packages/available_packages_by_date.html"))
    cran2 <- hash_con(url("https://cran.biotools.fr/web/packages/available_packages_by_date.html"))
    cran3 <- hash_con(url("https://cran.rstudio.org/web/packages/available_packages_by_date.html"))

    hash_compare(cran1, cran2)
    ## [1] 0

    hash_compare(cran1, cran3)
    ## [1] 94

I picked on cran.biotools.fr because I saw it was well behind CRAN proper on the monitoring page.
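If you wanted to turn that kind of comparison into a repeatable check, one option is a small wrapper around the score. This is only a sketch: `has_drifted()` is a hypothetical helper (not part of {ssdeepr}), the threshold of 90 is an arbitrary assumption, and the only {ssdeepr} call it relies on is `hash_compare()` as used above. The comparator is passed in as an argument so the logic can be tried without network access.

```r
# Hypothetical helper: flag a mirror as "drifted" when its similarity score
# against a reference hash falls below a threshold.
# `compare` defaults to ssdeepr::hash_compare(); the default is only
# evaluated if no other comparator is supplied.
# The threshold of 90 is an arbitrary assumption, not an ssdeep recommendation.
has_drifted <- function(reference_hash, mirror_hash,
                        compare = ssdeepr::hash_compare,
                        threshold = 90) {
  score <- compare(reference_hash, mirror_hash)
  list(score = score, drifted = score < threshold)
}

# Stub comparators standing in for the two scores seen in the CRAN example
# (0 and 94), so this runs without {ssdeepr} or a network connection:
has_drifted("ref", "mirror", compare = function(a, b) 0)$drifted
has_drifted("ref", "mirror", compare = function(a, b) 94)$drifted
```

With real hashes you would drop the `compare` argument and pass the values returned by `hash_con()`.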

I noted that BE was doing pre- and post-JavaScript hashing as well. Why, you may ask? Well, websites behave differently with JavaScript running, plus they can behave differently when different user-agents are set. Let’s grab a page from Wikipedia a few different ways to show how they are not alike at all, depending on the retrieval context. First, let’s grab some web content!

    library(httr)
    library(ssdeepr)
    library(splashr)

    # regular grab
    h1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth"))

    # you need Splash running for javascript-enabled scraping this way
    sp <- splash(host = "mysplashhost", user = "splashuser", pass = "splashpass")

    # js-enabled with one ua
    sp %>%
      splash_user_agent(ua_macos_chrome) %>%
      splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
      splash_wait(2) %>%
      splash_html(raw_html = TRUE) -> js1

    # js-enabled with another ua
    sp %>%
      splash_user_agent(ua_ios_safari) %>%
      splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
      splash_wait(2) %>%
      splash_html(raw_html = TRUE) -> js2

    h2 <- hash_raw(js1)
    h3 <- hash_raw(js2)

    # same way {rvest} does it
    res <- httr::GET("https://en.wikipedia.org/wiki/Donald_Knuth")
    h4 <- hash_raw(content(res, as = "raw"))

Now, let’s compare them:

    hash_compare(h1, h4) # {ssdeepr} built-in vs httr::GET() => not surprising that they're equal
    ## [1] 100

    # things look way different with js-enabled
    hash_compare(h1, h2)
    ## [1] 0
    hash_compare(h1, h3)
    ## [1] 0

    # and with variations between user-agents
    hash_compare(h2, h3)
    ## [1] 0
    hash_compare(h2, h4)
    ## [1] 0

    # only doing this for completeness
    hash_compare(h3, h4)
    ## [1] 0

For this example, content size alone would have been enough to tell most of the retrievals apart (mostly: note how h1 and h4 compare as identical even though the {httr} method reports a larger size):

    length(js1)
    ## [1] 432914

    length(js2)
    ## [1] 270538

    nchar(
      paste0(
        readLines(url("https://en.wikipedia.org/wiki/Donald_Knuth")),
        collapse = "\n"
      )
    )
    ## [1] 373078

    length(content(res, as = "raw"))
    ## [1] 374099
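One plausible reason the last two numbers differ is that they measure different things: nchar() counts characters by default, while the length of a raw vector counts bytes, and non-ASCII characters take more than one byte in UTF-8. A minimal base R sketch with a made-up string (not the Wikipedia page) shows the effect:

```r
# A short string with two non-ASCII characters, written with \u escapes so
# the example does not depend on the encoding of the source file.
x <- "caf\u00e9 \u03c0"

nchar(x, type = "chars")  # counts characters
## [1] 6

nchar(x, type = "bytes")  # counts bytes of the UTF-8 encoding
## [1] 8

length(charToRaw(x))      # a raw vector's length is also a byte count
## [1] 8
```

So a character count and a byte count of the same page can disagree without the content being different, which is one more reason to compare hashes rather than sizes.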

FIN

If you were in a U.S. state with a primary yesterday and were eligible to vote (and had something to vote for, either a (D) candidate or a state/local bit of business) I sure hope you did!

The ssdeep library works on Windows, so I’ll be figuring out how to get that going in {ssdeepr} fairly soon (mostly to try out the Rtools 4.0 toolchain vs deliberately wanting to support legacy platforms).

As usual, drop issues/PRs/feature requests where you’re comfortable for any of these or other packages.
