Garbage In, Gospel Out: The Security Problem of Data Accuracy
The accuracy or integrity of data is only as good as its source
In two separate incidents, one in Colorado and one in Washington, D.C., police stopped people who were not committing any crime, ordered young families out of their cars at gunpoint, and further ordered them to lie on the ground. In the Colorado case, the police blamed the stop on the fact that the SUV reportedly had the same license plate number as a motorcycle previously reported stolen (the SUV itself had also been reported stolen and then reported recovered). In the D.C. case, Secret Service agents told the women they arrested that their car had been reported stolen and that the suspects were two Black men. In another recent case, Florida's elected State Attorney Aramis Ayala was stopped by police after officers "ran her tag" and the tag "did not come back" in the computer database. The police explained that they routinely "run the tags" of passing vehicles and pull cars over to interrogate drivers and passengers based on what the computer database shows.
These cases illustrate a problem often ignored when it comes to computer and information security: the problem of bad data.
Typically, infosec relies on the "CIA" triad: Confidentiality, Integrity, and Availability. But "integrity" and accuracy are not the same thing.
Data integrity typically means that the data in the computer, log, database, or whatever has not been altered without authorization. But it says nothing about data accuracy. Protecting databases that themselves contain inaccurate information may create a false sense that the information, being "secure," is therefore "reliable." And this can mean that people get killed.
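The distinction can be made concrete. In the sketch below (the record text and signing key are invented for illustration), an HMAC verifies that a database record is byte-for-byte unchanged since it was entered, which is exactly what "integrity" promises, while doing nothing to establish that the record was ever true:

```python
import hashlib
import hmac

KEY = b"demo-signing-key"  # hypothetical key, for illustration only

def sign(record: str) -> str:
    """Return an integrity tag for a record."""
    return hmac.new(KEY, record.encode(), hashlib.sha256).hexdigest()

# The record was wrong at the moment of entry: no such warrant existed.
record = "Isaac Evans: outstanding arrest warrant"
tag = sign(record)

# Years later, the integrity check still passes...
assert hmac.compare_digest(tag, sign(record))

# ...because integrity only detects alteration after entry. A record
# that was false to begin with verifies just as cleanly as a true one.
```

Any tampering after entry would be caught (the tags would differ), but an error made at the source sails through every integrity check the system will ever run.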
In 1991, Isaac Evans was pulled over by the Phoenix police for driving the wrong way on a one-way street. A database check showed an outstanding arrest warrant. Evans was arrested and his car searched, and the police found a small amount of marijuana. Of course, there was no arrest warrant; the computer (or more accurately, whoever put the data into the computer) was wrong. But the U.S. Supreme Court, in Arizona v. Evans, found that the search was proper because the police were entitled to rely in "good faith" on the accuracy of inaccurate information. In fact, the Court held that the arresting officer would have been "derelict in his duty" had he not arrested Evans on the erroneously entered computer information as part of the "often competitive enterprise of ferreting out crime." Three justices, while agreeing with the result, noted that "it would not be reasonable for the police to rely, say, on a recordkeeping system, their own or some other agency's, that has no mechanism to ensure its accuracy over time and that routinely leads to false arrests, even years after the probable cause for any such arrest has ceased to exist (if it ever existed)."
The dissenting justices went further, observing that:
Widespread reliance on computers to store and convey information generates, along with manifold benefits, new possibilities of error, due to both computer malfunctions and operator mistakes. Most germane to this case, computerization greatly amplifies an error’s effect, and correspondingly intensifies the need for prompt correction; for inaccurate data can infect not only one agency, but the many agencies that share access to the data base. The computerized data bases of the Federal Bureau of Investigation’s National Crime Information Center (NCIC), to take a conspicuous example, contain over 23 million records, identifying, among other things, persons and vehicles sought by law enforcement agencies nationwide. NCIC information is available to approximately 71,000 federal, state, and local agencies. Thus, any mistake entered into the NCIC spreads nationwide in an instant.
The dissent then relays a litany of cases of people being erroneously arrested based on incorrect information in the NCIC databases, often with that incorrect information being retained for years. In another case, a database erroneously showed a driver's insurance status as "unconfirmed," and the court nevertheless held that the stop, the arrest, and the subsequent search were justified. What becomes important is not what is true but what the computer says is true.
In many ways, information security, by focusing on availability and “integrity,” tends to exacerbate this problem. If the computer says it, it must be gospel, since we have systems in place to check and recheck the security of the system. That may be true (usually it isn’t), but it’s essentially irrelevant.
The same problem exists for computer forensics. Forensics, if done correctly, simply shows that the data presented in court or to a tribunal or factfinder is, in fact, the same data as existed on the computers or networks at the time the forensic examiner got to the scene. “These ARE the logs generated by the router/firewall.” “They are in the same condition as when I seized them.” They have integrity. Are they accurate? Who the hell knows? It’s not that the question is unanswerable—it’s that it is rarely asked and even more rarely answered.
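What that forensic claim actually checks can be sketched in a few lines (the log entry here is invented for illustration): the hash recorded at acquisition is compared against the evidence copy presented later. A match proves the bytes are unchanged since seizure, and nothing more:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Standard forensic practice: hash the evidence with SHA-256."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical firewall log entry, as found on the seized device.
log = b"10.0.0.5 - - [12/Mar/2017:09:14:02] GET /admin 200\n"

# Hash recorded at acquisition time, when the examiner imaged the system.
acquisition_hash = sha256_hex(log)

# Later, in court: re-hash the evidence copy and compare.
# A match shows the logs are in the same condition as when seized...
assert sha256_hex(log) == acquisition_hash

# ...but if the device's clock was wrong, or the entry was forged before
# the examiner ever arrived, the hash still matches. The comparison
# tests integrity, not accuracy.
```

The check answers "are these the same bytes?" with certainty; it cannot even pose the question "did these events happen?"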
The tyranny of computer data lies not in its infallibility, but in the fact that it is perceived as reliable and accurate when it is not. As security professionals, we need to be careful and precise when we speak about or promote the “integrity” of data. It’s only as good as its source.