ggplot “Doodling” with HIBP Breaches

After reading this interesting analysis of “How Often Are Americans’ Accounts Breached?” by Gaurav Sood (which we need more of in cyber-land), I gave in to the impulse to do some gg-doodling with the “Have I Been Pwned” JSON data he used.

It’s just some basic data manipulation with some heavy ggplot2 styling customization, so no real need for exposition beyond noting that there are many other ways to view the data. I just settled on centered segments early on and went from there. If you do a bit of gg-doodling yourself, drop a note in the comments with a link.

You can see a full-size version of the image via this link.

library(hrbrthemes) # use github or gitlab version
library(tidyverse)

# get the data
dat_url <- "https://raw.githubusercontent.com/themains/pwned/master/data/breaches.json"

jsonlite::fromJSON(dat_url) %>%
  mutate(BreachDate = as.Date(BreachDate)) %>%
  tbl_df() -> breaches

# selected breach labels df
group_by(breaches, year = lubridate::year(BreachDate)) %>%
  top_n(1, wt=PwnCount) %>%
  ungroup() %>%
  filter(year %in% c(2008, 2015, 2016, 2017)) %>% # pick years where labels will fit nicely
  mutate(
    lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount/1000000))
  ) %>%
  arrange(year) -> labs

# num of known breaches in that year for labels
count(breaches, year = lubridate::year(BreachDate)) %>%
  mutate(nlab = sprintf("n=%s", n)) %>%
  mutate(lab_x = as.Date(sprintf("%s-07-02", year))) -> year_cts

mutate(breaches, p_half = PwnCount/2) %>% # for centered segments
  ggplot() +
  geom_segment( # centered segments
    aes(BreachDate, p_half, xend=BreachDate, yend=-p_half),
    color = ft_cols$yellow, size = 0.3
  ) +
  geom_text( # selected breach labels
    data = labs, aes(BreachDate, PwnCount/2, label = lab),
    lineheight = 0.875, size = 3.25, family = font_rc,
    hjust = c(0, 1, 1, 0), vjust = 1, nudge_x = c(25, -25, -25, 25),
    nudge_y = 0, color = ft_cols$slate
  ) +
  geom_text( # top year labels
    data = year_cts, aes(lab_x, Inf, label = year), family = font_rc,
    size = 4, vjust = 1, lineheight = 0.875, color = ft_cols$gray
  ) +
  geom_text( # bottom known breach count totals
    data = year_cts, aes(lab_x, -Inf, label = nlab, size = n),
    vjust = 0, lineheight = 0.875, color = ft_cols$peach,
    family = font_rc, show.legend = FALSE
  ) +
  scale_x_date( # break on year
    name = NULL, date_breaks = "1 year", date_labels = "%Y"
  ) +
  scale_y_comma(name = NULL, limits = c(-450000000, 450000000)) + # make room for labels
  scale_size_continuous(range = c(3, 4.5)) + # tolerable font sizes
  labs(
    title = "HIBP (Known) Breach Frequency & Size",
    subtitle = "Segment length is number of accounts; n=# known account breaches that year",
    caption = "Source: HIBP via "
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_blank())
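
If you want to poke at the centered-segment idea without pulling down the HIBP JSON, here is a minimal sketch on made-up numbers. Everything in the `toy` data frame (breach names, dates, counts) is invented for illustration and is not HIBP data; only the column names mirror the ones in the script. It also assumes plain ggplot2 with no hrbrthemes styling, and it uses a base R stand-in for the group_by() + top_n() step, so treat it as a rough approximation rather than the chart itself.

```r
library(ggplot2)

# Hypothetical toy data: names, dates, and counts are made up, not real HIBP records
toy <- data.frame(
  Name = c("AlphaCorp", "BetaSoft", "GammaNet", "DeltaShop"),
  BreachDate = as.Date(c("2015-03-01", "2015-09-15", "2016-05-20", "2016-11-02")),
  PwnCount = c(4000000, 12000000, 150000000, 30000000),
  stringsAsFactors = FALSE
)

# Half of each count, so a segment can run from -p_half to +p_half around zero
toy$p_half <- toy$PwnCount / 2
toy$year <- as.integer(format(toy$BreachDate, "%Y"))

# Largest breach per year (base R stand-in for group_by() + top_n(1))
biggest <- do.call(
  rbind,
  lapply(split(toy, toy$year), function(d) d[which.max(d$PwnCount), ])
)
biggest$lab <- sprintf(
  "%s\n%sM accounts", biggest$Name, as.integer(biggest$PwnCount / 1000000)
)

p <- ggplot(toy) +
  geom_segment( # one vertical segment per breach, centered on y = 0
    aes(BreachDate, p_half, xend = BreachDate, yend = -p_half)
  ) +
  geom_text( # label only the largest breach in each year
    data = biggest, aes(BreachDate, p_half, label = lab),
    hjust = 0, vjust = 1, nudge_x = 10, size = 3
  ) +
  scale_x_date(name = NULL, date_breaks = "1 year", date_labels = "%Y") +
  scale_y_continuous(name = NULL, labels = NULL) # sign is meaningless here, so hide it

p
```

Because each segment spans -p_half to +p_half, its total length equals PwnCount, which is why the y-axis text can be dropped: only the length carries meaning, not the position.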


*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2018/07/29/ggplot-doodling-with-hibp-breaches/