
Quick Hit: Comparison of “Whole File Reading” Methods

by hrbrmstr on August 7, 2020

(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.)

I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). They look like this:

HTTP/1.1 200 OK
Date: Mon, 08 Jun 2020 14:40:45 GMT
Server: Apache
Last-Modified: Sun, 26 Apr 2020 00:06:47 GMT
ETag: "ace-ec1a0-5a4265fd413c0"
Accept-Ranges: bytes
Content-Length: 967072
X-Frame-Options: SAMEORIGIN
Content-Type: application/x-msdownload

I do this quite a bit in R when we create new studies at work, but I’m usually only working with a few files. In this case I had to go through all these files to determine if a condition hypothesis (more on that in one of the future posts) was accurate.

Reading in a bunch of files (each one into a string) is fairly straightforward in R since readChar() will do the work of reading and we just wrap that in an iterator:

length(fils)
## [1] 73514

# check file size distribution
summary(
  vapply(
    X = fils,
    FUN = file.size,
    FUN.VALUE = numeric(1),
    USE.NAMES = FALSE
  )
)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    19.0   266.0   297.0   294.8   330.0  1330.0

# they're all super small

system.time(
  vapply(
    X = fils,
    FUN = function(.f) readChar(.f, file.size(.f)),
    FUN.VALUE = character(1),
    USE.NAMES = FALSE
  ) -> tmp
)
##  user  system elapsed
## 2.754   1.716   4.475

NOTE: You can use lapply() or sapply() to equal effect as they all come in around 5 seconds on a modern SSD-backed system.

Now, five seconds is completely acceptable (though that brief pause does feel awfully slow for some reason), but can we do better? I mean we do have some choices when it comes to slurping up the contents of a file into a length 1 character vector:

base::readChar()
readr::read_file()
stringi::stri_read_raw() (+ rawToChar())

Do any of them beat {base}? Let’s see (using the largest of the files):

library(stringi)
library(readr)
library(microbenchmark)

largest <- fils[which.max(sapply(fils, file.size))]

file.size(largest)
## [1] 1330

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  79.862  93.5040  98.02751  95.3840 105.0125 161.566  1000
##    readr 163.874 186.3145 190.49073 189.1825 192.1675 421.256  1000
##  stringi  52.113  60.9690  67.17392  64.4185  74.9895 249.427  1000

I had predicted that the {stringi} approach would be slower given that we have to explicitly turn the raw vector into a character vector, but it is modestly faster. ({readr} has quite a bit of functionality baked into it — for good reasons — which doesn’t help it win any performance contests).

I still felt there had to be an even faster method, especially since I knew that the files all had HTTP response headers and that every one of them could be easily read into a string in (pretty much) one file read operation. That knowledge will let us make a C++ function that cuts some corners (more like “sands” some corners, really). We’ll do that right in R via {Rcpp} in this function (annotated in C++ code comments):

library(Rcpp)

cppFunction(code = '
String cpp_read_file(std::string fil) {

  // our input stream
  std::ifstream in(fil, std::ios::in | std::ios::binary);

  if (in) { // we can work with the file

  #ifdef _WIN32
    struct _stati64 st; // gosh i hate windows
    _stati64(fil.c_str(), &st); // this shld work but I did not test it
  #else
    struct stat st;
    stat(fil.c_str(), &st);
  #endif

    std::string out; // where we will store the contents of the file
    out.resize(st.st_size); // make string size == file size (reserve() would leave size() at 0)

    in.seekg(0, std::ios::beg); // ensure we are at the beginning
    in.read(&out[0], out.size()); // read in the file
    in.close();

    return(out);

  } else {
    return(NA_STRING); // file missing or other errors returns NA
  }

}
', includes = c(
  "#include <fstream>",
  "#include <string>",
  "#include <sys/stat.h>"
))

Is that going to be faster?

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  rcpp = cpp_read_file(largest),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  80.500  91.6910  96.82752  94.3475 100.6945 295.025  1000
##    readr 161.679 175.6110 185.65644 186.7620 189.7930 399.850  1000
##  stringi  51.959  60.8115  66.24508  63.9250  71.0765 171.644  1000
##     rcpp  15.072  18.3485  21.20275  21.0930  22.6360  62.988  1000

It sure looks like it, but let’s put it to the test:

system.time(
  vapply(
    X = fils,
    FUN = cpp_read_file,
    FUN.VALUE = character(1),
    USE.NAMES = FALSE
  ) -> tmp
)
##  user  system elapsed
## 0.446   1.244   1.693

I’ll take a two-second wait over a five-second wait any day!
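If you want to poke at the "ask for the size, then do one read" idea outside of R, here is a minimal standalone C++ sketch of it. It is not the function from this post: `read_whole_file()` and `round_trip_ok()` are hypothetical names made up for illustration, it assumes a POSIX system (so no Windows branch), and it reports failure with a `bool` rather than an R `NA`. The one thing it does show is that the string has to be given a real size, not just capacity, before `read()` is pointed at its buffer.

```cpp
#include <cassert>
#include <cstdio>     // std::remove
#include <fstream>
#include <string>
#include <sys/stat.h> // POSIX stat(); assumption: not building on Windows

// Hypothetical helper (not from the post): reads the whole file at `path`
// into `out` with a single read() call. Returns false if the file is
// missing or cannot be opened.
bool read_whole_file(const std::string& path, std::string& out) {
  struct stat st;
  if (stat(path.c_str(), &st) != 0) return false; // no such file, etc.

  std::ifstream in(path, std::ios::in | std::ios::binary);
  if (!in) return false;

  // resize() changes size(); reserve() would only change capacity()
  out.resize(static_cast<std::size_t>(st.st_size));
  in.read(&out[0], static_cast<std::streamsize>(out.size()));

  // keep only the bytes that were actually read
  out.resize(static_cast<std::size_t>(in.gcount()));
  return true;
}

// Hypothetical self-check: write a small header-like file, read it back,
// remove it, and report whether the bytes survived the round trip.
bool round_trip_ok(const std::string& path) {
  const std::string payload = "HTTP/1.1 200 OK\r\nServer: Apache\r\n";
  {
    std::ofstream tmp(path, std::ios::out | std::ios::binary);
    assert(tmp && "could not create the scratch file");
    tmp << payload;
  } // stream is flushed and closed here

  std::string got;
  const bool ok = read_whole_file(path, got);
  std::remove(path.c_str());
  return ok && got == payload;
}
```

Calling `round_trip_ok()` with any writable scratch path should return `true`, and `read_whole_file()` on a path that does not exist should return `false` and leave the output string untouched.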

FIN

I have a few more cases coming up where there will be 3-5x the number of (similar) files that I’ll need to process, and this optimization will shave time off as I iterate through each analysis, so the trivial benefits here will pay off more down the road.

The next post in this particular series will show how to use the {future} family to reduce the time it takes to turn those HTTP headers into data we can use.

If I missed your favorite file slurping function, drop a note in the comments and I’ll update the post with new benchmarks.
