Two new Apache Drill UDFs for Processing UR[IL]s and Internet Domain Names

Continuing the blog’s UDF theme of late, there are two new UDF kids in town:

  • drill-url-tools🔗 for slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post)
  • drill-domain-tools🔗 for slicing & dicing internet domain names (IDNs).

Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a parse_url()🔗 function already?” My answer is “Sure, but it’s based on java.net.URL which is fundamentally broken.”

Slicing & dicing URLs and IDNs is a large part of the $DAYJOB and they go together pretty well, hence the joint UDF release.

Rather than just use boring SQL for an example, we’ll start with some SQL and use R for a decent example of working with the two, new UDFs.

Counting Lying Lock Icons

SSL/TLS is all the craze these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator, plus it will help show how the URL tools pick up ports even when they’re not there).

If you go to the aforementioned URL it instructs us that the most current GFG dataset URL can be retrieved by inspecting the contents of this metadata URL

There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.

Let’s get the data:

library(sergeant)library(tidyverse)read_delim(  file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt",   delim = " ",   col_names = FALSE,  col_types = "ccc") -> gfg_updatedl_path <- file.path("~/Data/gfg_links.tsv.gz")if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)

Those operations have placed the GFG data set in a place where my local Drill instance can get to them. It's a tab separated file (TSV) which — while not a great data format — is workable with Drill.

Now we'll setup a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr. We'll use the second column since a significant percentage of the URLs in column 6 are malformed:

db <- src_drill()tbl(db, "(SELECT   b.host,  port,  b.rec.hostname AS hostname,  b.rec.assigned AS assigned,  b.rec.tld AS tld,  b.rec.subdomain AS subdomainFROM  (SELECT    host, port, suffix_extract(host) AS rec             -- break the hostname into components  FROM    (SELECT      a.rec.host AS host, a.rec.port AS port    FROM      (SELECT         columns[1] AS url, url_parse(columns[1]) AS rec -- break the URL into components      FROM dfs.d.`/gfg_links.tsv.gz`) a    WHERE a.rec.port IS NOT NULL                        -- filter out URL parsing failures    )  ) bWHERE b.rec.tld IS NOT NULL                             -- filter out domain parsing failures)") -> gfg_dfgfg_df## # Database: DrillConnection##    hostname  port host              subdomain assigned      tld  ##                                    ##  1 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  2 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  3 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  4 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  5 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  6 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  7 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  8 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ##  9 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ## 10 www         80 www.eestikirik.ee NA        eestikirik.ee ee   ## # ... with more rows

While we could have done it all in SQL, we saved some bits for R:

distinct(gfg_df, assigned, port) %>%   count(port) %>%   collect() -> port_countsport_counts# A tibble: 2 x 2   port     n*  1    80 206482   443 22178

You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).

FIN

Now, R can do URL & IDN slicing, but Drill can operate at-scale. That is, R's urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when part of a cluster.

I'm not 100% settled on the galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.

Kick the tyres & file issues & PRs as necessary.



*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2018/07/26/two-new-apache-drill-udfs-for-processing-urils-and-internet-domain-names/