The Fix Is In: Finding infix functions inside contributed R package “utilities” files

Regular readers will recall the “utility belt” post from back in April of this year. This is a follow-up to a request made asking for a list of all the % infix functions in those files.

We’re going to:

  • collect up all of the sources
  • parse them
  • find all the definitions of % infix functions
  • write them to a file

We’ll start by grabbing the data from the previous post and look at it as a refresher:


library(stringi)
library(tidyverse)

utils <- read_rds(url("https://rud.is/dl/utility-belt.rds"))

utils
## # A tibble: 1,746 x 13
##    permsissions links owner     group      size month day   year_hr path   date       pkg   fil   file_src             
##  1 -rw-r--r--       0 hornik    users      1658 Jun   05    2016    AHR/R… 2016-06-05 AHR   util… "## \\int f(x)dg(x) …
##  2 -rw-r--r--       0 ligges    users     12609 Dec   13    2016    ALA4R… 2016-12-13 ALA4R util… "## some utility fun…
##  3 -rw-r--r--       0 hornik    users         0 Feb   24    2017    AWR.K… 2017-02-24 AWR.… util… ""                   
##  4 -rw-r--r--       0 ligges    users      4127 Aug   30    2017    Alpha… 2017-08-30 Alph… util… "#\n#' Assign API ke…
##  5 -rw-r--r--       0 ligges    users       121 Jan   19    2017    Amylo… 2017-01-19 Amyl… util… "make_decision <- fu…
##  6 -rw-r--r--       0 herbrandt herbrandt    52 Aug   10    2017    BANES… 2017-08-10 BANE… util… "#' @importFrom dply…
##  7 -rw-r--r--       0 ripley    users     36977 Jan   06    2015    BEQI2… 2015-01-06 BEQI2 util… "#' \tRemove Redunda…
##  8 -rw-r--r--       0 hornik    users     34198 May   10    2017    BGDat… 2017-05-10 BGDa… util… "# A more memory-eff…
##  9 -rwxr-xr-x       0 ligges    users      3676 Aug   14    2016    BGLR/… 2016-08-14 BGLR  util… "\n readBinMat=funct…
## 10 -rw-r--r--       0 ripley    users      2547 Feb   04    2015    BLCOP… 2015-02-04 BLCOP util… "###################…
## # ... with 1,736 more rows

Note that we somewhat expected the file source to potentially come in handy at a later date and also expected the need to revisit that post, so the R data file [←direct link to RDS] included a file_src column.

Now, let’s find all the source files with at least one infix definition, collect them together and parse them so we can do more code spelunking:


filter(utils, stri_detect_fixed(file_src, "`%")) %>% # only find sources with infix definitions
  pull(file_src) %>%
  paste0(collapse="\n\n") %>%
  parse(text = ., keep.source=TRUE) -> infix_src

str(infix_src, 1)
## length 1364 expression(dplyr::`%>%`, `%||%` <- function(a, b) if (is.null(a)) b else a, get_pkg_path <- function(ctx) {  pkg_| __truncated__ ...
##  - attr(*, "srcref")=List of 1364
##  - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'  
##  - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 15768 0 0 0 1 15768
##   ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' 

We can now take all of that lovely parsed source and tokenize it to work with the discrete elements in a very tidy manner:


infix_parsed <- tbl_df(getParseData(infix_src)) # tbl_df() is mainly for pretty printing 

infix_parsed
## # A tibble: 118,242 x 9
##    line1  col1 line2  col2    id parent token          terminal text      
##  1     1     1     1    24     1    -10 COMMENT        TRUE     #' @impor…
##  2     2     1     2    10     4    -10 COMMENT        TRUE     #' @export
##  3     3     1     3    12    10      0 expr           FALSE    ""        
##  4     3     1     3     5     7     10 SYMBOL_PACKAGE TRUE     dplyr     
##  5     3     6     3     7     8     10 NS_GET         TRUE     ::        
##  6     3     8     3    12     9     10 SYMBOL         TRUE     `%>%`     
##  7     5     1     5    49    51      0 expr           FALSE    ""        
##  8     5     1     5     6    16     18 SYMBOL         TRUE     `%||%`    
##  9     5     1     5     6    18     51 expr           FALSE    ""        
## 10     5     8     5     9    17     51 LEFT_ASSIGN    TRUE     <-        
## # ... with 118,232 more rows

We just need to find a sequence of tokens that make up a function definition, then whittle those down to ones that look like our % infix names:


pat <- c("SYMBOL", "expr", "LEFT_ASSIGN", "expr", "FUNCTION") # pattern for function definition

# find all of ^^ sequences (there's a good twitter discussion on this abt a month ago)
idx <- which(infix_parsed$token == pat[1]) # find location of match of start of seq

# look for the rest of the sequences starting at each idx position
map_lgl(idx, ~{
  all(infix_parsed$token[.x:(.x+(length(pat)-1))] == pat)
}) -> found

f_defs <- idx[found] # starting indices of all the places where functions are defined

# filter ^^ to only find infix ones
infix_defs <- f_defs[stri_detect_regex(infix_parsed$text[f_defs], "^`\\%")]

# there aren't too many, but remember we're just searching `util` functions
length(infix_defs)
## [1] 106

Now, write it out to a file so we can peruse the infix functions:


# nuke a file and fill it with the function definition
cat("", sep="", file="infix_functions.R")
walk2(
  getParseText(infix_parsed, infix_parsed$id[infix_defs]),     # extract the infix name
  getParseText(infix_parsed, infix_parsed$id[infix_defs + 3]), # extract the function definition body
  ~{
    cat(.x, " <- ", .y, "\n\n", sep="", file="infix_functions.R", append=TRUE)
  }
)

There are 106 of them so you can find the extracted ones in this gist.

Here's an overview of what you can expect to find:

# A tibble: 39 x 2
   name                 n
 1 `%||%`              47
 2 `%+%`                7
 3 `%AND%`              4
 4 `%notin%`            4
 5 `%:::%`              3
 6 `%==%`               3
 7 `%!=%`               2
 8 `%*diag%`            2
 9 `%diag*%`            2
10 `%nin%`              2
11 `%OR%`               2
12 `%::%`               1
13 `%??%`               1
14 `%.%`                1
15 `%@%`                1
16 `%&&%`               1
17 `%&%`                1
18 `%+&%`               1
19 `%++%`               1
20 `%+|%`               1
21 `%<<%`               1
22 `%>>%`               1
23 `%~~%`               1
24 `%assert_class%`     1
25 `%contains%`         1
26 `%din%`              1
27 `%fin%`              1
28 `%identical%`        1
29 `%In%`               1
30 `%inr%`              1
31 `%M%`                1
32 `%notchin%`          1
33 `%or%`               1
34 `%p%`                1
35 `%pin%`              1
36 `%R%`                1
37 `%s%`                1
38 `%sub_in%`           1
39 `%sub_nin%`          1

FIN

If any of those are useful, feel free to PR them in to https://github.com/hrbrmstr/freebase/blob/master/inst/templates/infix-helpers.R (and add yourself to the DESCRIPTION if you do).

Hopefully this provided some further inspiration to continue to use R not only as your language of choice but also as a fun data source.

*** This is a Security Bloggers Network syndicated blog from rud.is authored by hrbrmstr. Read the original post at: https://rud.is/b/2018/05/29/the-fix-is-in-finding-infix-functions-inside-contributed-r-package-utilities-files/