Freeing PDF Data to Account for the Unaccounted

I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.

When I finally jumped on Twitter today, I saw this:

and went into action and figured I should blog the results as one can never have too many “convert this PDF to usable data” examples.

The Problem

The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US Unaccounted-For” PDF direct link is in the code below) and needed to get the PDF into a usable form and (as you can see if you read through the Twitter thread) both Tabulizer and other tools were introducing sufficient errors that the resultant extracted data was either not complete or trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).

There PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:

Click to view slideshow.

We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.

The Solution

We’ll need some packages and the file itself, so let’s get that bit out of the way first:

library(stringi)library(pdftools)library(hrbrthemes)library(ggpomological)library(tidyverse)# grab the PDF textmia_url <- ""mia_fil <- "~/Data/pmkor_una_all.pdf"if (!file.exists(mia_fil)) download.file(mia_url, mia_fil)# read it indoc <- pdf_text(mia_fil) 

Let's look at those three example pages:

cat(doc[[1]])##                                   Defense POW/MIA Accounting Agency##                                       Personnel Missing - Korea (PMKOR)##                                        (Reported for ALL Unaccounted For)##                                                                                                Total Unaccounted: 7,699## Name                       Rank/Rate     Branch                           Year State/Territory## ABBOTT, RICHARD FRANK      M/Sgt         UNITED STATES ARMY               1950 VERMONT## ABEL, DONALD RAYMOND       Pvt           UNITED STATES ARMY               1950 PENNSYLVANIA## ...## AKERS, HERBERT DALE        Cpl           UNITED STATES ARMY               1950 INDIANA## AKERS, JAMES FRANCIS       Cpl           UNITED STATES MARINE CORPS       1950 VIRGINIAcat(doc[[2]])## Name                          Rank/Rate Branch                     Year State/Territory## AKERS, RICHARD ALLEN          1st Lt    UNITED STATES ARMY         1951 PENNSYLVANIA## AKI, CLARENCE HALONA          Sgt       UNITED STATES ARMY         1950 HAWAII...## AMIDON, DONALD PRENTICE       PFC       UNITED STATES MARINE CORPS 1950 TEXAS## AMOS, CHARLES GEARL           Cpl       UNITED STATES ARMY         1951 NORTH CAROLINAcat(doc[[length(doc)]])## Name                                                Rank/Rate           Branch                                              Year         State/Territory## ZAVALA, FREDDIE                                     Cpl                 UNITED STATES ARMY                                  1951         CALIFORNIA## ZAWACKI, FRANK JOHN                                 Sgt                 UNITED STATES ARMY                                  1950         OHIO## ...## ZUVER, ROBERT LEONARD                               Pfc                 UNITED STATES ARMY                                  1950         CALIFORNIA## ZWILLING, LOUIS JOSEPH                              Cpl                 UNITED STATES ARMY                                  1951         ILLINOIS##                                       This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA).##                Please visit our web site at for updates to this list and other official missing personnel data lists.## Report Prepared: 06/19/2018 11:25

The poppler library's "layout" mode (which pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:

  • there's a uniform header on each page
  • the "layout" mode returned uniform per-page, fixed-width columns
  • there's no "special column tweaks" that some folks use to make PDFs more readable by humans

There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:

  • convert the text to lines
  • find the header line
  • find the column start/end positions from the header on the page (since they are different for each page)
  • reading it in with readr::read_fwf()
  • remove headers, preamble and epilogue cruft
  • turn it all into one data frame
# we're going to process each page and read_fwf will complain violently# when it hits header/footer rows vs data rows and we rly don't need to# see all those warningsread_fwf_q <- quietly(read_fwf)# go through each pagemap_df(doc, ~{    stri_split_lines(.x) %>%     flatten_chr() -> lines # want the lines from each page    # find the header on the page and get the starting locations for each column  keep(lines, stri_detect_regex, "^Name") %>%     stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>%     map(`[`, 1) %>%     flatten_int() -> starts    # now get the ending locations; cheating and using `NA` for the last column    ends <- c(starts[-1] - 1, NA)  # since each page has a lovely header and poppler's "layout" mode creates   # a surprisingly usable fixed-width table, the core idiom is to find the start/end  # of each column using the header as a canary  cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state"))  paste0(lines, collapse="\n") %>%        # turn it into something read_fwf() can read     read_fwf_q(col_positions = cols) %>%  # read it!    .$result %>%                          # need to do this b/c of `quietly()`    filter(! %>%              # non-data lines    filter(name != "Name") %>%            # remove headers from each page    filter(!stri_detect_regex(name, "^(^This|Please|Report)")) # non-data lines (the last pg footer, rly)  }) -> xdfxdf## # A tibble: 7,699 x 5##    name                       rank   branch                  year  state        ##                                                        ##  1 ABBOTT, RICHARD FRANK      M/Sgt  UNITED STATES ARMY      1950  VERMONT      ##  2 ABEL, DONALD RAYMOND       Pvt    UNITED STATES ARMY      1950  PENNSYLVANIA ##  3 ABELE, FRANCIS HOWARD      Sfc    UNITED STATES ARMY      1950  CONNECTICUT  ##  4 ABELES, GEORGE ELLIS       Pvt    UNITED STATES ARMY      1950  CALIFORNIA   ##  5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950  ALABAMA      ##  6 ABREU, MANUEL Jr.          Pfc    UNITED STATES ARMY      1950  MASSACHUSETTS##  7 ACEVEDO, ISAAC             Sgt    UNITED STATES ARMY      1952  PUERTO RICO  ##  8 ACINELLI, BILL JOSEPH      Pfc    UNITED STATES ARMY      1951  MISSOURI     ##  9 ACKLEY, EDWIN FRANCIS      Pfc    UNITED STATES ARMY      1950  NEW YORK     ## 10 ACKLEY, PHILIP WARREN      Pfc    UNITED STATES ARMY      1950  NEW HAMPSHIRE## # ... with 7,689 more rows

Now the data is both usable and sobering:

title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea"subtitle <- "Reported for ALL Unaccounted For"caption <-  "Source:"mutate(xdf, year = factor(year)) %>%   mutate(branch = stri_trans_totitle(branch)) -> xdfordr <- count(xdf, branch, sort=TRUE)mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>%   ggplot(aes(year)) +  geom_bar(aes(fill = branch), width=0.65) +  scale_y_comma(name = "# POW/MIA") +  scale_fill_pomological(name=NULL, ) +  labs(x = NULL, title = title, subtitle = subtitle) +  theme_ipsum_rc(grid="Y") +  theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) +  theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))

*** This is a Security Bloggers Network syndicated blog from authored by hrbrmstr. Read the original post at: