Home » Security Bloggers Network » Two new Apache Drill UDFs for Processing UR[IL]s and Internet Domain Names

Two new Apache Drill UDFs for Processing UR[IL]s and Internet Domain Names

by hrbrmstr on July 26, 2018

Continuing the blog’s UDF theme of late, there are two new UDF kids in town:

drill-url-tools for slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post)
drill-domain-tools for slicing & dicing internet domain names (IDNs).

Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a parse_url() function already?” My answer is “Sure, but it’s based on java.net.URL which is fundamentally broken.”

Slicing & dicing URLs and IDNs is a large part of the $DAYJOB and they go together pretty well, hence the joint UDF release.

Rather than just use boring SQL for an example, we’ll start with some SQL and use R for a decent example of working with the two, new UDFs.

Counting Lying Lock Icons

SSL/TLS is all the craze these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator, plus it will help show how the URL tools pick up ports even when they’re not there).

If you go to the aforementioned URL it instructs us that the most current GFG dataset URL can be retrieved by inspecting the contents of this metadata URL

There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.

Let’s get the data:

library(sergeant)
library(tidyverse)

read_delim(
  file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt", 
  delim = " ", 
  col_names = FALSE,
  col_types = "ccc"
) -> gfg_update

dl_path <- file.path("~/Data/gfg_links.tsv.gz")

if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)

Those operations have placed the GFG data set in a place where my local Drill instance can get to them. It's a tab separated file (TSV) which — while not a great data format — is workable with Drill.

Now we'll setup a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr. We'll use the second column since a significant percentage of the URLs in column 6 are malformed:

db <- src_drill()

tbl(db, "(
SELECT 
  b.host,
  port,
  b.rec.hostname AS hostname,
  b.rec.assigned AS assigned,
  b.rec.tld AS tld,
  b.rec.subdomain AS subdomain
FROM
  (SELECT
    host, port, suffix_extract(host) AS rec             -- break the hostname into components
  FROM
    (SELECT
      a.rec.host AS host, a.rec.port AS port
    FROM
      (SELECT 
        columns[1] AS url, url_parse(columns[1]) AS rec -- break the URL into components
      FROM dfs.d.`/gfg_links.tsv.gz`) a
    WHERE a.rec.port IS NOT NULL                        -- filter out URL parsing failures
    )
  ) b
WHERE b.rec.tld IS NOT NULL                             -- filter out domain parsing failures
)") -> gfg_df

gfg_df
## # Database: DrillConnection
##    hostname  port host              subdomain assigned      tld  
##                                    
##  1 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  2 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  3 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  4 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  5 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  6 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  7 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  8 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  9 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## 10 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## # ... with more rows

While we could have done it all in SQL, we saved some bits for R:

distinct(gfg_df, assigned, port) %>% 
  count(port) %>% 
  collect() -> port_counts

port_counts
# A tibble: 2 x 2
   port     n
*  
1    80 20648
2   443 22178

You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).

FIN

Now, R can do URL & IDN slicing, but Drill can operate at-scale. That is, R's urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when part of a cluster.

I'm not 100% settled on the galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.

Kick the tyres & file issues & PRs as necessary.

*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2018/07/26/two-new-apache-drill-udfs-for-processing-urils-and-internet-domain-names/

July 26, 2018July 26, 2018 hrbrmstr Apache Drill, drill, R

Two new Apache Drill UDFs for Processing UR[IL]s and Internet Domain Names

Counting Lying Lock Icons

FIN

Senator Sanders Wants to Own AI Companies — and Hand America’s Adversaries the Keys

NIST’s Nine: The PQC Signature Race Moves to Round Three

The Quantum Arms Race: Why Washington Just Wrote a $2 Billion Check to Nine Companies

Beyond Moore’s Law: The Hyper-Acceleration of Autonomous AI Cyber Capabilities

The Exception Economy: When Security Teams Stop Protecting and Start Negotiating

GoPlus’s Latest Report Highlights How Blockchain Communities Are Leveraging Critical API Security Data To Mitigate Web3 Threats

C2A Security’s EVSec Risk Management and Automation Platform Gains Traction in Automotive Industry as Companies Seek to Efficiently Meet Regulatory Requirements

Zama Raises $73M in Series A Lead by Multicoin Capital and Protocol Labs to Commercialize Fully Homomorphic Encryption

RSM US Deploys Stellar Cyber Open XDR Platform to Secure Clients

ThreatHunter.ai Halts Hundreds of Attacks in the past 48 hours: Combating Ransomware and Nation-State Cyber Threats Head-On

Randall Munroe’s XKCD ‘Soniferous Aether’