On MIMEs, software versions and web site promiscuity (a.k.a. three new packages to round out the week)

A quick Friday post to let folks know about three in-development R packages that you’re encouraged to poke the tyres of, and to jump in and file issues or PRs for.

Alleviating aversion to versions

I introduced a “version chart” in a recent post. One key element of tagging years (which really help you get a feel for the scope of exposure and technical/cyber-debt) is knowing the release dates of product versions. You can pay for such a database, but it’s also possible to cobble one together, and that activity will get much easier over time with the vershist package.

Here’s a sample:

apache_httpd_version_history()
## # A tibble: 29 x 8
##    vers   rls_date   rls_year major minor patch prerelease build
##    <fct>  <date>        <dbl> <int> <int> <int> <chr>      <chr>
##  1 1.3.0  1998-06-05     1998     1     3     0 ""         ""   
##  2 1.3.1  1998-07-22     1998     1     3     1 ""         ""   
##  3 1.3.2  1998-09-21     1998     1     3     2 ""         ""   
##  4 1.3.3  1998-10-09     1998     1     3     3 ""         ""   
##  5 1.3.4  1999-01-10     1999     1     3     4 ""         ""   
##  6 1.3.6  1999-03-23     1999     1     3     6 ""         ""   
##  7 1.3.9  1999-08-19     1999     1     3     9 ""         ""   
##  8 1.3.11 2000-01-22     2000     1     3    11 ""         ""   
##  9 1.3.12 2000-02-25     2000     1     3    12 ""         ""   
## 10 1.3.14 2000-10-10     2000     1     3    14 ""         ""   
## # ... with 19 more rows

Not all vendored software uses semantic versioning, and many products have terrible schemes that make it really hard to create an ordered factor. When it is possible, though, you get a nice data frame with an ordered factor you can use for all sorts of fun and useful things.
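For version strings that do follow a dotted numeric scheme, here is a minimal sketch of how such an ordered factor can be built in base R. It is not how vershist does it internally; the input vector is just a handful of the Apache versions from the output above, deliberately shuffled.

```r
# Version strings in arbitrary order (taken from the sample output above)
vers <- c("1.3.11", "1.3.2", "1.3.9", "1.3.0")

# numeric_version() (base R) compares component by component,
# so 1.3.11 sorts after 1.3.9 (a plain character sort would get this wrong)
sorted <- vers[order(numeric_version(vers))]

# Ordered factor whose levels follow release order
vers_fct <- factor(vers, levels = sorted, ordered = TRUE)

# Ordered factors support comparisons against a level
older_than_139 <- vers_fct < "1.3.9"
```

With the factor in place, filtering or plotting "everything older than X" becomes a one-liner. Schemes with letters, dates, or build tags need custom parsing before this trick applies.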

It has current support for:

  • Apache httpd
  • Apple iOS
  • Google Chrome
  • lighttpd
  • memcached
  • MongoDB
  • MySQL
  • nginx
  • openresty
  • openssh
  • sendmail
  • SQLite

and I’ll add more over time.

Thanks to @bikesRdata there will be a …_latest() function for each vendor. I’ll likely add some helper functions so you only need to call one function with a parameter instead of individual ones for each vendor, and I’ll also likely add a caching layer so you don’t have to scrape/clone/munge every time you need versions (seriously: look at the code to see what you have to do to collect some of this data).
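As a rough illustration of what "one function with a parameter, plus a cache" could look like, here is a sketch in base R. Everything in it is hypothetical: `version_history()`, the `fetchers` list, and the stand-in fetcher are made-up names for illustration, not part of vershist.

```r
# In-memory cache keyed by product name (lives for the R session only)
.cache <- new.env(parent = emptyenv())

# Hypothetical wrapper: `fetchers` maps a product name to a function
# that returns that product's version-history data frame
version_history <- function(product, fetchers, refresh = FALSE) {
  if (!product %in% names(fetchers)) stop("unsupported product: ", product)
  if (refresh || !exists(product, envir = .cache, inherits = FALSE)) {
    assign(product, fetchers[[product]](), envir = .cache)
  }
  get(product, envir = .cache, inherits = FALSE)
}

# Stand-in fetcher that counts how often it is actually invoked
counter <- new.env()
counter$n <- 0
fake_fetcher <- function() {
  counter$n <- counter$n + 1
  data.frame(vers = c("1.3.0", "1.3.1"), stringsAsFactors = FALSE)
}

fetchers <- list(apache_httpd = fake_fetcher)
first  <- version_history("apache_httpd", fetchers)  # runs the fetcher
second <- version_history("apache_httpd", fetchers)  # served from cache
```

In real use the list entry could point at something like `vershist::apache_httpd_version_history`, and a disk-backed cache would survive between sessions, which this sketch does not attempt.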

And, they call it a MIME…a MIME!

I’ve had the wand package out for a while but have never been truly happy with it. It uses libmagic on unix-ish systems but requires Rtools on Windows and relies on a system call to file.exe on that platform. Plus, the “magic” database is too big to embed in the package and, given the (very, very, very good and necessary) privacy/safety practices of CRAN, writing the boilerplate code to deal with compilation or downloading of the magic database is not something I have time for (and the database really needs regular updates for consistent output on all platforms).

A very helpful chap, @VincentGuyader, was lamenting some of the Windows issues, which spawned a quick release of simplemagic. The goal of this package is to be a zero-dependency install with no reliance on external databases. It has built-in support for basing MIME-type “guesses” on the file headers of a handful of the more common types folks might want to use this package for, plus a built-in “database” of over 1,500 file type-to-MIME mappings for guessing based solely on extension.

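library(magrittr) # provides the %>% pipe

# Guess the MIME type of every sample file shipped with the package
list.files(system.file("extdat", package="simplemagic"), full.names=TRUE) %>% 
  purrr::map_df(~{
    tibble::tibble( # dplyr::data_frame() is deprecated in favour of tibble()
      fil = basename(.x),
      mime = list(simplemagic::get_content_type(.x))
    )
  }) %>% 
  tidyr::unnest(mime) # name the list column explicitly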
## # A tibble: 85 x 2
##    fil                        mime                                                                     
##    <chr>                      <chr>                                                                    
##  1 actions.csv                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  2 actions.txt                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  3 actions.xlsx               application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  4 test_1.2.class             application/java-vm                                                      
##  5 test_1.3.class             application/java-vm                                                      
##  6 test_1.4.class             application/java-vm                                                      
##  7 test_1.5.class             application/java-vm                                                      
##  8 test_128_44_jstereo.mp3    audio/mp3                                                                
##  9 test_excel_2000.xls        application/msword                                                       
## 10 test_excel_spreadsheet.xml application/xml      
## ...

File issues or PRs if you need more header-magic introspected guesses.
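For a feel of the general "sniff the header, then fall back to the extension" approach, here is a toy sketch in base R. It is not simplemagic's implementation: `guess_mime()` is a hypothetical name, and both lookup tables are tiny illustrative samples (the four signatures are well-known magic numbers).

```r
# Tiny illustrative table of well-known file signatures ("magic numbers")
signatures <- list(
  "application/pdf" = as.raw(c(0x25, 0x50, 0x44, 0x46)),  # %PDF
  "image/png"       = as.raw(c(0x89, 0x50, 0x4e, 0x47)),  # \x89PNG
  "image/gif"       = as.raw(c(0x47, 0x49, 0x46, 0x38)),  # GIF8
  "application/zip" = as.raw(c(0x50, 0x4b, 0x03, 0x04))   # PK\x03\x04
)

# Tiny illustrative extension-to-MIME map
ext_map <- c(csv = "text/csv", txt = "text/plain", json = "application/json")

guess_mime <- function(path) {
  # 1. Compare the first few bytes against each known signature
  hdr <- readBin(path, what = "raw", n = 8)
  for (mime in names(signatures)) {
    sig <- signatures[[mime]]
    if (length(hdr) >= length(sig) && identical(hdr[seq_along(sig)], sig)) {
      return(mime)
    }
  }
  # 2. No header match: fall back to the file extension
  ext <- tolower(tools::file_ext(path))
  if (ext %in% names(ext_map)) return(unname(ext_map[ext]))
  NA_character_
}

# Usage: a file with a misleading extension but a PDF header
pdf_like <- tempfile(fileext = ".dat")
writeBin(as.raw(c(0x25, 0x50, 0x44, 0x46, 0x2d)), pdf_like)
pdf_guess <- guess_mime(pdf_like)

# Usage: plain text with no known signature, resolved by extension
csv_like <- tempfile(fileext = ".csv")
writeLines("a,b\n1,2", csv_like)
csv_guess <- guess_mime(csv_like)
```

One wrinkle worth knowing: Office Open XML files (.xlsx, .docx) are ZIP containers, so a header check alone only gets you as far as the ZIP signature. Telling them apart requires looking inside the archive.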

NOTE: The rtika package could theoretically do a more comprehensive job since Apache Tika has an amazing assortment of file-type introspect-ors. Also, an interesting academic exercise might be to collect a sufficiently large corpus of varied files, pull the first 512-4096 bytes of each, do some feature generation and write an ML-based classifier that outputs a MIME type along with a confidence level.
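The feature-generation step of that exercise might start with something as simple as a byte-frequency histogram. The sketch below is one possible choice of features, not a tested recipe; `byte_features()` is a hypothetical helper and the 512-byte default is the low end of the range mentioned above.

```r
# Turn the first `n` bytes of a file into a 256-element vector holding
# the relative frequency of each byte value (0x00 through 0xff)
byte_features <- function(path, n = 512) {
  bytes <- readBin(path, what = "raw", n = n)
  if (length(bytes) == 0) return(numeric(256))  # empty file: all zeros
  # as.integer() maps raw bytes to 0..255; shift to 1..256 for tabulate()
  counts <- tabulate(as.integer(bytes) + 1L, nbins = 256)
  counts / length(bytes)
}

# Usage: a four-byte sample file
sample_file <- tempfile()
writeBin(as.raw(c(0x00, 0x00, 0xff, 0x41)), sample_file)
feats <- byte_features(sample_file)
```

Each file becomes one fixed-length row, so a corpus of files stacks neatly into a matrix that any off-the-shelf classifier can consume. Richer features (byte n-grams, entropy, offsets of printable runs) would be natural next steps.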

Site promiscuity detection

urlscan is a fun site since it frees you from the tedium (and expense/privacy-concerns) of using a JavaScript-enabled scraping setup to pry into the makeup of a target URL and find out all sorts of details about it, including how many sites it lets track you. You can do the same with my splashr package, but with urlscan.io you have the benefit of a third party making the connection, versus requests coming from your own IP space.

I’m waiting on an API key so I can write the “submit a scan request programmatically” function but, until then, you can retrieve existing sites in their database or manually enter one for later retrieval.

The package is a WIP but has enough bits to be useful now to, say, see just how promiscuous cnn.com makes you:

cnn_db <- urlscan::urlscan_search("domain:cnn.com")

latest_scan_results <- urlscan::urlscan_result(cnn_db$results$`_id`[1], TRUE, TRUE)

latest_scan_results$scan_result$lists$ips
##  [1] "151.101.65.67"   "151.101.113.67"  "2.19.34.83"     
##  [4] "2.20.22.7"       "2.16.186.112"    "54.192.197.56"  
##  [7] "151.101.114.202" "83.136.250.242"  "157.166.238.142"
## [10] "13.32.217.114"   "23.67.129.200"   "2.18.234.21"    
## [13] "13.32.145.105"   "151.101.112.175" "172.217.21.194" 
## [16] "52.73.250.52"    "172.217.18.162"  "216.58.210.2"   
## [19] "172.217.23.130"  "34.238.24.243"   "13.107.21.200"  
## [22] "13.32.159.194"   "2.18.234.190"    "104.244.43.16"  
## [25] "54.192.199.124"  "95.172.94.57"    "138.108.6.20"   
## [28] "63.140.33.27"    "2.19.43.224"     "151.101.114.2"  
## [31] "74.201.198.92"   "54.76.62.59"     "151.101.113.194"
## [34] "2.18.233.186"    "216.58.207.70"   "95.172.94.20"   
## [37] "104.244.42.5"    "2.18.234.36"     "52.94.218.7"    
## [40] "62.67.193.96"    "62.67.193.41"    "69.172.216.55"  
## [43] "13.32.145.124"   "50.31.185.52"    "54.210.114.183" 
## [46] "74.120.149.167"  "64.202.112.28"   "185.60.216.19"  
## [49] "54.192.197.119"  "185.60.216.35"   "46.137.176.25"  
## [52] "52.73.56.77"     "178.250.2.67"    "54.229.189.67"  
## [55] "185.33.223.197"  "104.244.42.3"    "50.16.188.173"  
## [58] "50.16.238.189"   "52.59.88.2"      "52.38.152.125"  
## [61] "185.33.223.80"   "216.58.207.65"   "2.18.235.40"    
## [64] "69.172.216.58"   "107.23.150.218"  "34.192.246.235" 
## [67] "107.23.209.129"  "13.32.145.107"   "35.157.255.181" 
## [70] "34.228.72.179"   "69.172.216.111"  "34.205.202.95"

latest_scan_results$scan_result$lists$countries
## [1] "US" "EU" "GB" "NL" "IE" "FR" "DE"

latest_scan_results$scan_result$lists$domains
##  [1] "cdn.cnn.com"                    "edition.i.cdn.cnn.com"         
##  [3] "edition.cnn.com"                "dt.adsafeprotected.com"        
##  [5] "pixel.adsafeprotected.com"      "securepubads.g.doubleclick.net"
##  [7] "tpc.googlesyndication.com"      "z.moatads.com"                 
##  [9] "mabping.chartbeat.net"          "fastlane.rubiconproject.com"   
## [11] "b.sharethrough.com"             "geo.moatads.com"               
## [13] "static.adsafeprotected.com"     "beacon.krxd.net"               
## [15] "revee.outbrain.com"             "smetrics.cnn.com"              
## [17] "pagead2.googlesyndication.com"  "secure.adnxs.com"              
## [19] "0914.global.ssl.fastly.net"     "cdn.livefyre.com"              
## [21] "logx.optimizely.com"            "cdn.krxd.net"                  
## [23] "s0.2mdn.net"                    "as-sec.casalemedia.com"        
## [25] "errors.client.optimizely.com"   "social-login.cnn.com"          
## [27] "invocation.combotag.com"        "sb.scorecardresearch.com"      
## [29] "secure-us.imrworldwide.com"     "bat.bing.com"                  
## [31] "jadserve.postrelease.com"       "ssl.cdn.turner.com"            
## [33] "cnn.sdk.beemray.com"            "static.chartbeat.com"          
## [35] "native.sharethrough.com"        "www.cnn.com"                   
## [37] "btlr.sharethrough.com"          "platform-cdn.sharethrough.com" 
## [39] "pixel.moatads.com"              "www.summerhamster.com"         
## [41] "mms.cnn.com"                    "ping.chartbeat.net"            
## [43] "analytics.twitter.com"          "sharethrough.adnxs.com"        
## [45] "match.adsrvr.org"               "gum.criteo.com"                
## [47] "www.facebook.com"               "d3qdfnco3bamip.cloudfront.net" 
## [49] "connect.facebook.net"           "log.outbrain.com"              
## [51] "serve2.combotag.com"            "rva.outbrain.com"              
## [53] "odb.outbrain.com"               "dynaimage.cdn.cnn.com"         
## [55] "data.api.cnn.io"                "aax.amazon-adsystem.com"       
## [57] "cdns.gigya.com"                 "t.co"                          
## [59] "pixel.quantserve.com"           "ad.doubleclick.net"            
## [61] "cdn3.optimizely.com"            "w.usabilla.com"                
## [63] "amplifypixel.outbrain.com"      "tr.outbrain.com"               
## [65] "mab.chartbeat.com"              "data.cnn.com"                  
## [67] "widgets.outbrain.com"           "secure.quantserve.com"         
## [69] "static.ads-twitter.com"         "amplify.outbrain.com"          
## [71] "tag.bounceexchange.com"         "adservice.google.com"          
## [73] "adservice.google.com.ua"        "www.googletagservices.com"     
## [75] "cdn.adsafeprotected.com"        "js-sec.indexww.com"            
## [77] "ads.rubiconproject.com"         "c.amazon-adsystem.com"         
## [79] "www.ugdturner.com"              "a.postrelease.com"             
## [81] "cdn.optimizely.com"             "cnn.com"

O_o
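To put a rough number on that reaction, here is a sketch that collapses contacted hosts to a base domain and counts the ones that aren't the site itself. The `domains` vector is a small sample copied from the output above (in practice you would pass `latest_scan_results$scan_result$lists$domains`), and `base_domain()` is a hypothetical helper that naively keeps the last two labels.

```r
# A small sample of the hosts listed above
domains <- c(
  "cdn.cnn.com", "edition.cnn.com", "www.cnn.com",
  "securepubads.g.doubleclick.net", "ad.doubleclick.net",
  "z.moatads.com", "www.facebook.com", "t.co"
)

# Naive "registered domain": keep the last two dot-separated labels.
# This is wrong for multi-part suffixes such as .com.ua or .co.uk;
# a public-suffix-aware parser is needed for anything serious.
base_domain <- function(x) {
  vapply(
    strsplit(x, ".", fixed = TRUE),
    function(parts) paste(tail(parts, 2), collapse = "."),
    character(1)
  )
}

bases <- base_domain(domains)

# Everything that is not the first-party domain
third_party <- unique(bases[bases != "cnn.com"])
n_third_party <- length(third_party)
```

Whether a host such as a corporate sibling or CDN counts as "third party" is a judgment call that a string comparison can't make for you, so treat the count as an upper-ish bound rather than gospel.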

FIN

Again, kick the tyres, file issues/PRs and drop a note if you’ve found something interesting as a result of any (or all!) of the packages.



*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2018/03/23/on-mimes-software-versions-and-web-site-promiscuity-a-k-a-three-new-packages-to-round-out-the-week/