A Reflection On ForAllSecure’s Journey In Bootstrapping Behavior Testing Technology
Software security is a global challenge that is slated to grow worse. The application attack surface is growing by 111 billion new lines of software code every year, with newly reported zero-day exploits rising from one-per-week in 2015 to one-per-day by 2021, according to the Application Security Report from Cybersecurity Ventures. Mobile alone has one new application released every 13 seconds.
One common approach to addressing software security issues is applying network filters. This is an easy band-aid. However, we don’t think it is the right long-term solution. Network filters applied by solutions like Web Application Firewalls (WAFs) aim to solve symptoms, not the root cause. When there is a new software vulnerability, we believe the software must be fixed. How do we know which software to fix?
First, we must check software. It is imperative that we check all software, not just a few programs or those a developer chooses to submit.
Second, we must use the right tools. Most tools today require source code and are built with developers in mind. These solutions can be effective if developers choose to use them. When security testing tools require source code, end-users are forced to trust the developer to run the tool and fix all problems. This doesn’t allow the IT administrator, the end-user, or the CISO to independently verify the security, safety, and resiliency of the software they buy and use. Shouldn’t they be able to check the software? After all, it’s these users, not the developer, who pay the price for an incident.
I’ve often heard that consumers don’t buy security. Thus, if developers ship exploitable software, no one will know. That’s true today because we do not have the right technology. Once we can check software without a developer, we can give consumers — whether it be a user or enterprise — the facts they need to make informed decisions. That, in turn, we think, will drive a more security-centric software model. As an analogy, imagine you didn’t have crash test data for cars. Then, you likely wouldn’t be compelled to evaluate cars on crash test data. It’s the same in security: if we can give users crash test data for programs, they will be able to make better choices.
We sought to uncover the right solution to address the persistent software security issues that have existed in the market for over two decades. We began our research in a university lab, where a brand new technology was born.
The Mayhem concept was born in my research lab at Carnegie Mellon University, where we explored binary analysis, symbolic execution, and fuzzing. Some of the earliest work we did dates back to 2003, when I was a graduate student. In the last 15 years, we’ve developed new techniques that you’ll find in today’s off-the-shelf code analysis, security analysis, and patching solutions. My co-founders Thanassis Avgerinos and Alexandre Rebert, as well as many other students, spent years publishing their work in tier-1 academic venues.
In academia, our research focused on program verification but with a twist. Typical academic program verification takes in a “safety” property and a program. Then, it tries to verify the program is safe. Although these types of research aim to “verify a program is safe”, they frequently prove the opposite. More often than not, researchers reveal the bugs they found. The distinction is subtle but critical. Decades of research have followed this approach, and we thought it was flawed. Mayhem’s twist is that we check for insecurity, verifying one execution path at a time. The science behind Mayhem comes from two areas: symbolic execution and fuzz testing.
Symbolic execution translates a program into a logical formula, and then reasons about the program’s safety using logic. This original research was significant because it was the first to break the “exponential cliff” barrier. The exponential cliff refers to technical limitations caused by the exponential-size formulas generated by previous techniques. The challenge with previous work was that it would “fork” every time a new program branch was taken. In contrast, Mayhem was able to generate linear-size formulas because it did not require “forking” to build formulas.
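To make the “exponential cliff” concrete, here is a toy sketch in Python. It is not Mayhem’s algorithm, which this post does not describe in detail. It assumes a hypothetical program made of n independent `if b_i: acc += 1` statements and compares two ways of writing that program down as logic: one path condition per path (forking), versus one if-then-else constraint per branch (a merged encoding). The function names and the `ite` notation are illustrative assumptions.

```python
from itertools import product
from typing import List


def forked_path_conditions(num_branches: int) -> List[str]:
    """Forking style: emit one path condition per path through the program.

    Each independent `if` doubles the number of paths, so the result
    has 2**num_branches entries.
    """
    conditions = []
    for outcomes in product([True, False], repeat=num_branches):
        literals = [f"b{i}" if taken else f"!b{i}" for i, taken in enumerate(outcomes)]
        conditions.append(" && ".join(literals))
    return conditions


def merged_formula(num_branches: int) -> List[str]:
    """Merged style: one constraint per branch using an if-then-else term.

    Models `if b_i: acc += 1` as acc_{i+1} == ite(b_i, acc_i + 1, acc_i),
    so the whole program is a conjunction of num_branches constraints.
    """
    return [
        f"acc{i + 1} == ite(b{i}, acc{i} + 1, acc{i})"
        for i in range(num_branches)
    ]


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, len(forked_path_conditions(n)), len(merged_formula(n)))
```

With 16 branches, the forking version produces 65,536 path conditions while the merged version produces 16 constraints. A real engine hands such formulas to a solver; this sketch only counts them.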
This work appeared at Tier 1 academic venues, such as ACM/IEEE International Conference on Software Engineering, the ACM Conference on Computer and Communications Security, and USENIX Security, and won the ICSE “Best Paper” award. The academic work also resulted in four US patents created by the founders and owned by Carnegie Mellon University: US 9,135,405; US 9,183,396; US 9,542,559; US 9,619,375. All patents have been exclusively licensed to ForAllSecure.
In 2014, we had our Mayhem Symbolic Executor analyze over 38,000 programs from scratch and perform over 209 million tests of those programs. Of the 209 million tests, 2.6 million resulted in successful hacking of programs. Those 2.6 million successes were the result of 13,875 previously undiscovered bugs. The only cost was compute time on Amazon. On average, it cost $0.28 for Mayhem to discover and prove each new vulnerability.
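As a rough back-of-the-envelope check on those figures, the short Python sketch below derives a few ratios from the numbers quoted above. Because the post gives the program and test counts as “over” a value and the cost as an average, treat the results as approximations. The constant and function names are my own.

```python
# Figures quoted in the post (the counts are stated as "over" these values).
PROGRAMS = 38_000
TESTS = 209_000_000
BUGS = 13_875
COST_PER_BUG_USD = 0.28  # stated as an average


def summarize() -> dict:
    """Derive rough ratios from the quoted figures."""
    return {
        "tests_per_program": TESTS / PROGRAMS,
        "tests_per_bug": TESTS / BUGS,
        "estimated_total_cost_usd": round(BUGS * COST_PER_BUG_USD, 2),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

That works out to roughly 5,500 tests per program, about 15,000 tests per new bug, and an estimated total bill of around $3,885.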
This work was a proof-of-concept that demonstrated the power of checking software at scale. Not only did we find serious vulnerabilities, but we also found a way to put an expected cost on finding a new bug. Clearly, more work was needed, but we took it as a positive indication.
Although we wrote far less about it in academia, we acknowledge fuzzing is also a critical technique for finding bugs. In a nutshell, fuzzing chooses a new input, runs the program, and observes whether the execution is safe. Typically “safe” is defined as “not crashing”. Fuzzing has found significant bugs. For example, the OSS-Fuzz program at Google has found over 13,000 bugs in Chrome.
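The loop described above (choose an input, run the program, observe whether it crashed) can be sketched in a few lines of Python. This is a minimal illustration, not how Mayhem or OSS-Fuzz is implemented. `toy_target` is a hypothetical stand-in for the program under test, and “crash” is modeled as a raised exception.

```python
import random
from typing import Callable, Optional


def toy_target(data: bytes) -> None:
    """Hypothetical program under test: 'crashes' if the first byte is 'F'."""
    if data and data[0] == ord("F"):
        raise RuntimeError("simulated crash")


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(
    target: Callable[[bytes], None],
    seed_input: bytes,
    max_runs: int,
    rng_seed: int = 0,
) -> Optional[bytes]:
    """Return the first input that makes target raise, or None if none is found."""
    rng = random.Random(rng_seed)  # fixed seed keeps the sketch reproducible
    for _ in range(max_runs):
        candidate = mutate(seed_input, rng)  # choose a new input
        try:
            target(candidate)                # run the program
        except Exception:
            return candidate                 # unsafe execution observed
    return None


if __name__ == "__main__":
    print(fuzz(toy_target, b"AAAA", max_runs=50_000))
```

Production fuzzers add coverage feedback, a growing corpus, and crash triage on top of this loop. The sketch leaves all of that out.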
Mayhem uses both symbolic execution and fuzzing. Fuzzing’s strength is speed: it guesses inputs quickly, allowing users to run hundreds of tests per second. However, it misses the opportunity to deeply reason about each execution. This is where symbolic execution offers value. Symbolic execution does deep reasoning using formal logic. However, it can take seconds or longer on each execution. We found that the key to effective dynamic testing is to use these two techniques together: apply the deep reasoning of symbolic execution on some runs, while continuously fuzzing in the background. To learn more about the synergistic power of symbolic execution and fuzzing, see the “What is Behavior Testing” whitepaper.
The Cyber Grand Challenge (CGC) was the next step in maturing research to commercial impact. I’ve previously written why the CGC mattered to me. I often joke that DARPA must have read our papers and thought, “I don’t believe you. Show me at DEFCON”, because that’s what they did. They asked participants to demonstrate that they could find bugs, and go one step further: fix them. The catch? It must be done without source code or human intervention.
I believe the CGC reflected real-life considerations better than academic research. A typical research paper aims to answer whether the program is secure after patching. In the CGC, our primary criteria were performance and functionality. It is, at times, better to remain vulnerable than deploy a patch that doesn’t meet performance or functionality goals. My experience as a CISO was often similar: security was important conceptually, but the “now” impact of performance and functionality outweighed the “someday” impact of a possible compromise.
Prior to the CGC, patching bugs wasn’t something we had done. Others in academia had done some work, but it wasn’t our area of focus. We read the papers and found the techniques useful.
Our goal was to have less than 5% performance impact on every patch and never lose functionality. With only the binary, how could we measure this? “QA” is a hidden component in security. QA is frequently met with yawns in the security community and is even considered a separate market. In Mayhem, QA is integral. In the CGC, we automatically created regression tests and replayed them against every candidate patch. We measured performance and ensured, to the extent we could, no functionality was lost. I believe one of the reasons we won CGC was our QA. It was that important. Others had great techniques for finding and fixing vulnerabilities. We found Mayhem made much better business decisions, such as not fielding a performance-expensive patch by default. Mayhem made intelligent choices based on quantitative data that was in line with a winning strategy.
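The patch decision described above can be sketched as a simple gate. This is a minimal Python illustration under stated assumptions, not Mayhem’s implementation. Programs are modeled as plain functions rather than binaries, the regression tests are input/output pairs recorded from the original program, and the two cost figures are assumed to be measured elsewhere, for example as run time or instruction counts. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

MAX_OVERHEAD = 0.05  # the "less than 5%" goal stated in the text

Program = Callable[[str], str]
RegressionTest = Tuple[str, str]  # (input, expected output)


def record_regression_tests(original: Program, inputs: Sequence[str]) -> List[RegressionTest]:
    """Record the original program's behavior as (input, expected output) pairs."""
    return [(x, original(x)) for x in inputs]


def passes_regression(candidate: Program, tests: Sequence[RegressionTest]) -> bool:
    """Replay every recorded test against the candidate patch."""
    return all(candidate(x) == expected for x, expected in tests)


def overhead(baseline_cost: float, patched_cost: float) -> float:
    """Relative slowdown of the patched program versus the original."""
    return (patched_cost - baseline_cost) / baseline_cost


def should_field_patch(
    candidate: Program,
    tests: Sequence[RegressionTest],
    baseline_cost: float,
    patched_cost: float,
) -> bool:
    """Field the patch only if no functionality is lost and overhead stays under 5%."""
    return (
        passes_regression(candidate, tests)
        and overhead(baseline_cost, patched_cost) < MAX_OVERHEAD
    )
```

The point of the gate is that a patch is rejected for breaking behavior or for being too slow, even when it removes the vulnerability.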
CGC was completely autonomous — no intervention was permitted in the find, fix, QA, and decision-making cycle of software security. We are proud to have won DARPA’s $60M experiment and mathematical evaluation of autonomous technology. We took our $2M prize money to bootstrap the evolution of Mayhem from a DARPA demonstration to a product.
We received an overwhelming response from enterprises, institutions, and individuals across various industries after our DARPA CGC win. The market need was undeniable, and we began designing our product.
We quickly realized we couldn’t transition the entirety of the Mayhem prototype, as fielded in the CGC, to market. Market validation revealed that the technologies that automatically found defects, created test suites, and performed QA on patches were the most desired. We have prioritized these, with the plan of bringing the rest of the Mayhem prototype to production over time.
When you bring a product to market, all the messy, real-life considerations that are abstracted in academia and DARPA work must be taken into account. For example, in CGC…
- …we didn’t have to worry about ingesting a program. DARPA designed a whole infrastructure for that. As a product, we have to develop easy-to-use ways to ingest new software.
- …we had 7 system calls, which are theoretically representative of real systems. As a product, we have to deal with the complexity of each OS. We are focusing on Linux today, as we believe it will offer the best user experience at launch.
We’re not designing Mayhem as a product in isolation. We have several design partners who committed early on to help realize the vision. The Department of Defense has been a large influencer. They acutely feel the pain from a) having mission-critical systems to secure, and b) having to deal with both having no source code and fitting into DevOps situations. In 2017, we partnered with Defense Innovation Unit, a progressive group within the Department of Defense, to adapt Mayhem’s capabilities into both left-of-ship development processes and right-of-ship testing and validation processes. Our multi-million dollar contract with the Department of Defense became non-dilutive funding for ForAllSecure, continuing our ability to grow while remaining profitable.
In addition to DIU, we are also collaborating with design partners in automotive, Internet of Things (IoT), and high-tech industries to understand how Mayhem is used to secure the software they develop, as well as the software they integrate from third-party suppliers, including partners and open source. Our collaboration with design partners has helped us develop Mayhem into a scalable, enterprise-grade platform with broader architecture support, DevOps integration, and enhanced usability, allowing security, testing, and development teams to bring powerful dynamic security testing into their software lifecycle.
Over the coming year, ForAllSecure will make its solution available to mainstream companies and eventually to end-user consumers. In our first steps, we’re focusing on those with critical systems, as well as those already familiar with behavior testing techniques, such as fuzzing or symbolic execution. Today, the current version of Mayhem is available as a part of our early access program. Contact us at https://forallsecure.com/early-access/ if you’re interested in learning more.
Additionally, we are scaling our engineering and business teams across our offices in Palo Alto CA, Pittsburgh PA, and Washington D.C. to continue evolving Mayhem. We are expanding our team with people passionate about building autonomous software security. Visit our Careers page to explore open roles.
The last two years have been a remarkable journey. I am incredibly proud of what we’ve accomplished for the company and the Mayhem solution. I am grateful to our design partners as well as the ForAllSecure team.
In the next installment of this series, I’ll share more about my vision for ForAllSecure’s future and what’s in store for the next stage of our growth. Stay tuned!
*** This is a Security Bloggers Network syndicated blog from Latest blog posts authored by David Brumley. Read the original post at: https://forallsecure.com/blog/a-reflection-on-forallsecures-journey-in-bootstrapping-behavior-testing-technology