
Another Boeing Software “Glitch”

How I hate the word “glitch,” which is commonly used
to describe faulty software in press reports, blogs, and the like. In my
opinion, it trivializes serious software errors.

So, when the word “glitch” showed up on the front page
of the January 18-19, 2020 Wall Street Journal, as in “Boeing
Finds Another Software Problem: Glitch adds to string of technical issues
delaying return of 737 MAX to service,” written by Andy Pasztor, I thought,
“Here we go again.”

The Y2K issue was often referred to as a “glitch,” but
in reality, it was a serious, multi-hundred-billion-dollar software issue
that threatened to take down companies, government agencies, critical
infrastructures, nations, and whatever else was running on legacy software that did
not account for the century rollover.

Well, the Boeing problem is not a trivial one either,
and the WSJ article makes clear that it is a severe one, having to do
with booting up the aircraft’s flight-control computers. But that isn’t the
focus of this column. The focus is on the software assurance methodology described
as being used by Boeing. As the article states:

“The software problem occurred as engineers were
loading updated software … into the flight-control computers of a test
aircraft … A software function intended to monitor the power-up process
didn’t operate correctly … resulting in the entire computer system crashing.
Previously, proposed software fixes had been tested primarily in ground-based
simulators, where no power-up problems arose …”

It was when the software was tested in a real-world
aircraft that the issue became painfully apparent.

Quite early in my career, I experienced a somewhat
analogous situation (though without the human-safety risk) when installing the
first digital trader telephone turret in the Eastern U.S. Traders used these
turrets primarily to be able to talk to other traders instantaneously (even
faster than via auto-dial), the idea being that making contact as quickly as
possible favored traders with this capability.

While the turret system had worked perfectly in
the vendor’s lab, our installation kept crashing. For several weeks, the vendor could
not account for the crashes, much to the consternation of our
traders, senior management, my staff and me. Eventually the root cause was
ascertained. It turned out that in the lab the data cables were carefully
installed, ran only short distances, and were noise-free. Our field
installation was bigger and engineered to industry standards, which were not as
precise as those achieved in the lab. When the system that we installed
experienced noise on the line, the software switched to an error-handling
routine that should have kept the system up and running despite intermittent
noise. However, because the vendor’s engineers had never experienced noise in
the lab environment, they never got to test that routine. And (wouldn’t you
just know it?) there was a “bug” in that routine, which caused the system to
crash repeatedly.
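
To make the shape of that failure concrete, here is a minimal, hypothetical sketch in Python, purely for illustration (the actual turret firmware was of course nothing like this): a frame handler with a normal path and a noise-recovery path, where the recovery branch is exactly the code that a noise-free lab never exercises.

```python
import zlib

def handle_frame(frame: bytes, state: dict) -> None:
    """Process one frame from the turret's line interface.

    Hypothetical reconstruction for illustration only, not the vendor's code.
    The point is the shape: a normal path plus a noise-recovery path, and the
    recovery path is the one a noise-free lab never exercises.
    """
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) == received_crc:
        # Normal path: the only branch ever taken on clean, short lab cables.
        state["good_frames"] += 1
    else:
        # Recovery path: drop the corrupted frame, flag a resynchronization,
        # and keep running. In the incident above, the defect lived in a
        # branch like this one, and only field noise ever reached it.
        state["dropped_frames"] += 1
        state["resync_needed"] = True


state = {"good_frames": 0, "dropped_frames": 0, "resync_needed": False}
payload = b"hello turret"
frame = payload + zlib.crc32(payload).to_bytes(4, "big")
handle_frame(frame, state)                                  # clean frame: recovery path never runs
handle_frame(bytes([frame[0] ^ 0xFF]) + frame[1:], state)   # corrupted frame finally reaches it
print(state)  # {'good_frames': 1, 'dropped_frames': 1, 'resync_needed': True}
```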

The lesson here is that you can do all the testing
that you want in the lab, but the ultimate tests are those that take place in
the field. Deployed systems behave differently than they do
in the lab or other well-controlled environments. For security-critical and
safety-critical systems, you have to test under all known potential conditions.
I address many of these issues in greater detail in my book “Engineering Safe
and Secure Software Systems” (Artech House).
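
One way to act on that lesson is to drag field conditions into the test suite: deliberately inject the faults the lab never produces, so that recovery code actually executes before deployment. The sketch below is mine, not from the column; the names, the bit-flipping noise model, and the toy handler are all assumptions made for illustration.

```python
import random
import zlib

def handle_frame(frame: bytes) -> str:
    """Toy handler: accept a frame whose CRC-32 trailer checks out, drop the rest."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return "ok" if zlib.crc32(payload) == received_crc else "dropped"

def with_line_noise(frame: bytes, rng: random.Random, bit_error_rate: float = 0.02) -> bytes:
    """Flip random bits to approximate the noisy cabling the lab never had."""
    corrupted = bytearray(frame)
    for bit in range(len(corrupted) * 8):
        if rng.random() < bit_error_rate:
            corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)

def test_recovery_path_survives_noise() -> None:
    rng = random.Random(2020)  # fixed seed keeps the test repeatable
    payload = b"trader line 07"
    frame = payload + zlib.crc32(payload).to_bytes(4, "big")
    for _ in range(10_000):
        # The check is simply "no crash and a sane verdict": the recovery
        # path must keep the system up, which is exactly what a noise-free
        # lab test can never demonstrate.
        assert handle_frame(with_line_noise(frame, rng)) in ("ok", "dropped")

if __name__ == "__main__":
    test_recovery_path_survives_noise()
    print("recovery path exercised under simulated line noise")
```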

As we increase the number and power of cyber-physical
systems, especially in such areas as autonomous road vehicles, it is ever more
important to test extensively, not only in the lab or in simulated
environments, but also under real-world conditions. This is because you can
never be absolutely sure that test environments truly duplicate the environments
in which the systems will actually operate.

By the way, my RSA Conference presentation in March
2010, with the title “Data Collection and
Analysis Issues with Application Security Metrics,” specifically emphasized the
importance of context when it comes to application security, as does my BlogInfoSec
column “Putting Application Security into Context,” dated January 12, 2015 and
available at https://www.bloginfosec.com/2015/01/12/putting-application-security-into-context/.
Context is everything when it comes to the safe and secure operation of
applications software, and the sooner that reality is understood by software
designers, developers, and testers, the better off we all will be.

Coincidentally, MIT professor Nancy Leveson just
published a must-read article: “Inside Risks: Are You Sure Your Software Will
Not Kill Anyone? Using software to control potentially unsafe systems requires
the use of new software and system engineering approaches.” The article, which
appears in the February 2020 issue of Communications of the ACM,
addresses a number of the situations mentioned above, and others, and gives
several real-world examples in which lab or simulation testing did not
properly account for conditions encountered in the field.

