Home » Security Bloggers Network » Wicked Fast, Accurate Quantiles Using ‘t-Digests’ in R with the {tdigest} Package

Wicked Fast, Accurate Quantiles Using ‘t-Digests’ in R with the {tdigest} Package

by hrbrmstr on April 3, 2019

@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.

T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (ref any in a comment if you do know of some) so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately named tdigest:

SourceHut
GitLab
CINC (The CINC repo has R 3.5 package binaries for Windows and macOS.)
GitUgh

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest S3 method for stats::quantile()).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")

will install from source or binaries onto your system(s).

Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03)td <- td_create()for (i in 1:100000) {  td_add(td, sample(100, 1), 1)}quantile(td)## [1]   1.00000  25.62222  53.09883  74.75522 100.00000

More R-like Ops

Vectorisation is the name of the game in R and we can use tdigest() to work in a vectorised manner:

set.seed(2019-04-03)x <- sample(100, 1000000, replace=TRUE)td <- tdigest(x)quantile(td)## [1]   1.00000  25.91914  50.79468  74.76439 100.00000

Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It’s pretty, darned fast:

microbenchmark::microbenchmark(  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)))## Unit: microseconds##        expr      min         lq        mean    median       uq       max neval##     tdigest    22.81    26.6525    48.70123    53.355    63.31    151.29   100##  r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50   100

Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.

FIN

This was a quick (but, complete) wrapper and could use some tyre kicking. I’ve a mind to add serialization to the C implementation so I can then enable [de]serialization on the R-side since that would (IMO) make t-digest ops more useful in an R-context, especially since you can merge two different t-digests.

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whomever started the braces notation for package names (e.g. {ggplot2}): brilliant!

Wicked Fast, Accurate Quantiles Using ‘t-Digests’ in R with the {tdigest} Package

Basic Ops

More R-like Ops

Need for Speed

FIN

Senator Sanders Wants to Own AI Companies — and Hand America’s Adversaries the Keys

NIST’s Nine: The PQC Signature Race Moves to Round Three

The Quantum Arms Race: Why Washington Just Wrote a $2 Billion Check to Nine Companies

Beyond Moore’s Law: The Hyper-Acceleration of Autonomous AI Cyber Capabilities

The Exception Economy: When Security Teams Stop Protecting and Start Negotiating

GoPlus’s Latest Report Highlights How Blockchain Communities Are Leveraging Critical API Security Data To Mitigate Web3 Threats

C2A Security’s EVSec Risk Management and Automation Platform Gains Traction in Automotive Industry as Companies Seek to Efficiently Meet Regulatory Requirements

Zama Raises $73M in Series A Lead by Multicoin Capital and Protocol Labs to Commercialize Fully Homomorphic Encryption

RSM US Deploys Stellar Cyber Open XDR Platform to Secure Clients

ThreatHunter.ai Halts Hundreds of Attacks in the past 48 hours: Combating Ransomware and Nation-State Cyber Threats Head-On

Fortinet® Follies