Everything You Ever Wanted To Know About Test-Case Reduction, But Didn’t Know to Ask

Imagine reducing the amount of code and time needed to test software, while at the same time increasing the efficacy of your tests and making your debugging tasks easier—all with minimal human effort. It seems too good to be true, but we’re going to explain how test-case reduction can do all this (and maybe more).

Understanding how reduction works can help with troubleshooting, and makes it easier to figure out an efficient workflow and the best tools to optimize your tests. We’ll explain why test-case reduction is an especially important topic for security engineers to understand, and take a look at DeepState’s state-of-the-art reducer.

Test-case reduction for humans

The most common purpose for test-case reduction is to take a complicated failing test case for a confirmed bug, and turn it into one that is much easier to understand in order to make debugging easier. But you may also want to dismiss low-priority bugs! Having a simple version of such bugs helps quickly identify future duplicate submissions of the same, unimportant problem. For unconfirmed bugs, reducing test cases can be even more critical, because you often can’t tell if a bug is a duplicate (or even a bug at all) until you’ve simplified it. You may discover that the problem is with your specification, test generation tool, or operating system.

Without a reduced test case that can be easily understood and succinctly described, it’s hard to even say that you have a bug. You may have evidence that a software system is behaving unexpectedly, but when the test case is complex enough “a bug” can be a dubious concept. A complex test case is almost analogous to a mathematical proof that only shows that an integer with a certain property must exist; your proof needs to be constructive if you want to do something that requires an actual number.

Suppose you have a very large, randomly-generated HTML file that crashes Chrome. You’ve possibly found an important bug; or maybe you just did something that caused Chrome to run out of memory in an expected way. If your only conclusion is, “When I load this huge file in Chrome, it stops running,” you don’t really know much at all. In this case, you may want to apply a test-case reducer instead of spending time looking at core files and attaching debuggers. If it reduces the file down to a single page of HTML or, better yet, reduces to something as small as a single SELECT tag, you have a shortcut to understanding what is going on.

Test-case reduction has direct applications in cybersecurity, especially when fuzzers are used to craft exploits from crashes generated by random inputs. The simpler the input, the easier it will be to construct an exploit or realize that a bug cannot be exploited. AFL and libFuzzer provide built-in limited test-case reduction, but sometimes you need more than that. Modern test-case reduction tools can simplify this analysis, and are probably essential if you want to produce complex sequences of API calls to find vulnerabilities in a library like TLS, SQLite, or LevelDB. This concept also extends to fuzzing smart contracts, which is why Echidna, Trail of Bits’ smart contract fuzzer, includes a test-case reducer.

Test-case reduction for machines

Test-case reduction does more than just make tests easier for humans to read; it’s useful for core cybersecurity tasks like fuzzing, symbolic execution, and bug triage. Reducers can:

If you’d like to search the literature on test-case reduction, it’s also referred to as minimization, shrinking, and, originally, delta-debugging. All of these names refer to the same process of taking a program input A that does X, and producing a smaller input, B, that also does X.

Using a test-case reducer

Let’s see test-case reduction in action. Bring up the DeepState docker image, and install the testfs repository, which is a DeepState harness for a user-mode ext3-like file system.

> git clone> cd testfs> cmake .> make

Our DeepState harness lets us simulate device resets in the middle of operations, and check for problems caused by interrupting a file system call. It checks that after a reset, the file system can still be mounted. To generate a test case showing that this isn’t always true, we can just use DeepState’s built-in fuzzer:

> mkdir failure> ./Tests --fuzz --output_test_dir failure --exit_on_fail --seed 10

DeepState will report a problem, and save the resulting test case in a file with a unique file name ID and .fail extension. Let’s look at the sequence producing the file system corruption. For brevity, we show only the actual test steps below.

> ./Tests --input_test_file failure/dbb393e55c77bac878ab06a02a022370e33761cb.failTRACE: Tests.cpp(115): STEP 0: tfs_lsr(sb);TRACE: Tests.cpp(140): STEP 1: tfs_stat(sb, ".a.BBA");TRACE: Tests.cpp(146): STEP 2: tfs_cat(sb, "/aAb.BaAb");TRACE: Tests.cpp(140): STEP 3: tfs_stat(sb, "");TRACE: Tests.cpp(115): STEP 4: tfs_lsr(sb);TRACE: Tests.cpp(103): STEP 5: tfs_rmdir(sb, "BB");TRACE: Tests.cpp(146): STEP 6: tfs_cat(sb, "b");TRACE: Tests.cpp(110): STEP 7: tfs_ls(sb);TRACE: Tests.cpp(95): STEP 8: tfs_mkdir(sb, "A");TRACE: Tests.cpp(146): STEP 9: tfs_cat(sb, "a./b");TRACE: Tests.cpp(103): STEP 10: tfs_rmdir(sb, "AaBbBB.A.");TRACE: Tests.cpp(130): STEP 11: tfs_write(sb, "BA/BB/", "yx");TRACE: Tests.cpp(140): STEP 12: tfs_stat(sb, "bba");TRACE: Tests.cpp(155): STEP 13: set_reset_countdown(4);TRACE: Tests.cpp(140): STEP 14: tfs_stat(sb, "/A");TRACE: Tests.cpp(121): STEP 15: tfs_create(sb, "bA");

This output shows the 16 steps taken to reach an assertion violation on line 252 of super.c, when we try to remount the file system after the reset. But are all of these steps necessary? Do we really need to cat the file "a./b" for this to happen? We’ll use DeepState’s reducer to find out.

> deepstate-reduce ./Tests failure/ failure/

You’ll see output like:

Original test has 8192 bytesApplied 75 range conversionsLast byte read: 307Shrinking to ignore unread bytesWriting reduced test with 308 bytes to failure/ #1 0.18 secs / 2 execs / 0.0% reductionStructured deletion reduced test to 304 bytesWriting reduced test with 304 bytes to failure/shrink.fail0.36 secs / 3 execs / 1.3% reduction================================================================================Structured deletion reduced test to 272 bytesWriting reduced test with 272 bytes to failure/shrink.fail0.5 secs / 4 execs / 11.69% reduction================================================================================Structured deletion reduced test to 228 bytesWriting reduced test with 228 bytes to failure/shrink.fail0.6 secs / 5 execs / 25.97% reduction…1-byte chunk removal: PASS FINISHED IN 0.24 SECONDS, RUN: 1.45 secs / 57 execs / 95.78% reduction4-byte chunk removal: PASS FINISHED IN 0.08 SECONDS, RUN: 1.53 secs / 70 execs / 95.78% reduction8-byte chunk removal: PASS FINISHED IN 0.08 SECONDS, RUN: 1.61 secs / 83 execs / 95.78% reduction1-byte reduce and delete: PASS FINISHED IN 0.02 SECONDS, RUN: 1.62 secs / 86 execs / 95.78% reduction4-byte reduce and delete: PASS FINISHED IN 0.01 SECONDS, RUN: 1.64 secs / 88 execs / 95.78% reduction8-byte reduce and delete: PASS FINISHED IN 0.01 SECONDS, RUN: 1.64 secs / 89 execs / 95.78% reductionByte range removal: PASS FINISHED IN 0.31 SECONDS, RUN: 1.96 secs / 141 execs / 95.78% reductionStructured swap: PASS FINISHED IN 0.01 SECONDS, RUN: 1.96 secs / 142 execs / 95.78% reductionByte reduce: PASS FINISHED IN 0.1 SECONDS, RUN: 2.06 secs / 159 execs / 95.78% reduction================================================================================Iteration #2 2.06 secs / 159 execs / 95.78% reductionStructured deletion: PASS FINISHED IN 0.01 SECONDS, RUN: 2.08 secs / 161 execs / 95.78% reductionStructured edge deletion: PASS FINISHED IN 0.01 SECONDS, RUN: 2.09 secs / 163 execs / 95.78% reduction================================================================================Completed 2 iterations: 2.09 secs / 163 execs / 95.78% reductionPadding test with 23 zeroesWriting reduced test with 36 bytes to failure/

After a few seconds, we can run the new, smaller test case:

> ./Tests --input_test_file failure/shrink.failTRACE: Tests.cpp(155): STEP 0: set_reset_countdown(4);TRACE: Tests.cpp(121): STEP 1: tfs_create(sb, "aaaaa");CRITICAL: /home/user/testfs/super.c(252): Assertion (testfs_inode_get_type(in) == I_FILE) || (testfs_inode_get_type(in) == I_DIR) failed in function int testfs_checkfs(struct super_block *, struct bitmap *, struct bitmap *, int)ERROR: Failed: TestFs_FilesDirsERROR: Test case failure/ failed

To “break” testfs under reset interruption, we need only cause a reset at the fourth write to the block when the file "aaaaa" is created. Using a debugger or logging statements to understand this behavior will obviously be a much more pleasant experience than with the original test case.

For this bug, we only needed to give deepstate-reduce:

  • the DeepState harness executable (./Tests),
  • the test case to reduce (, and
  • the new, reduced test case to generate (failure/

But sometimes we’ll need to provide more information. For example, a test case may “change bugs” during reduction if all we require is that the reduced test fails. The reducer can take an additional requirement, in the form of a string or regular expression that must appear in the output, or a required exit code.

Reducing with respect to code coverage

Reducers can also be used in the absence of failure. For example, we might want to take a complicated test case that covers a hard-to-reach line of code—perhaps generated by a fuzzer or symbolic execution tool—and make it easier to follow. A good reducer can do that too, by using a revised definition of what we consider an interesting test case.

First, we’ll generate some passing tests:

> mkdir coverage> ./Tests --fuzz --output_test_dir coverage --timeout 10 --fuzz_save_passing --seed 1

This creates a file, coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass, that produces a file system that mounts, but has inconsistent inode and block freemaps. We could just use --criterion to reduce the test case with respect to that interesting fact alone, but let’s produce a test case that retains all the code coverage of our test.

First, we compile a version of the harness and testfs with code coverage instrumentation enabled:

> clang -c -fprofile-instr-generate -fcoverage-mapping *.c> clang++ -o TestsCov Tests.cpp -fprofile-instr-generate -fcoverage-mapping -ldeepstate *.o

Next we run our test with coverage instrumentation and collect the results:

> rm default.profraw> ./TestsCov --input_test_file coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass> llvm-profdata-6.0 merge -o testscov.profdata default.profraw> llvm-cov-6.0 show ./TestsCov -instr-profile=testscov.profdata *.c Tests.cpp  >& covout

Then we can use the reducer to produce a new test with the same coverage:

> deepstate-reduce python coverage/659042175e31c125dfb1182404526b7c10d53ec8.pass coverage/smallcov.pass --cmdArgs " @@" --exitCriterion 0

Now the reducer runs a Python script to determine how interesting the results are. It will take a lot longer to finish, since preserving coverage is usually harder than preserving a specific failure. Comparing the file system calls made in the two test cases (which, remember, have identical coverage), we can see what this additional overhead has brought us. The original test case does all this…

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "B/Aa/aabA.");TRACE: Tests.cpp(115): STEP 2: tfs_lsr(sb);TRACE: Tests.cpp(121): STEP 3: tfs_create(sb, "AbB/BAb");TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(4);TRACE: Tests.cpp(158): STEP 5: set_reset_countdown() ignored; already setTRACE: Tests.cpp(95): STEP 6: tfs_mkdir(sb, ".BB");TRACE: Tests.cpp(146): STEP 7: tfs_cat(sb, "b./aBa..");TRACE: Tests.cpp(103): STEP 8: tfs_rmdir(sb, "");TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "bbbA/.abA");TRACE: Tests.cpp(95): STEP 10: tfs_mkdir(sb, "..a");TRACE: Tests.cpp(110): STEP 11: tfs_ls(sb);TRACE: Tests.cpp(110): STEP 12: tfs_ls(sb);TRACE: Tests.cpp(95): STEP 13: tfs_mkdir(sb, "A.");TRACE: Tests.cpp(155): STEP 14: set_reset_countdown(5);TRACE: Tests.cpp(158): STEP 15: set_reset_countdown() ignored; already setTRACE: Tests.cpp(140): STEP 16: tfs_stat(sb, "B..Aaab./b");TRACE: Tests.cpp(158): STEP 17: set_reset_countdown() ignored; already setTRACE: Tests.cpp(140): STEP 18: tfs_stat(sb, "/AaA");TRACE: Tests.cpp(146): STEP 19: tfs_cat(sb, ".");

…whereas the reduced test case makes API calls only when required to cover the code, and uses a far less bizarre set of paths:

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "aa");TRACE: Tests.cpp(95): STEP 2: tfs_mkdir(sb, "aaa");TRACE: Tests.cpp(103): STEP 3: tfs_rmdir(sb, "");TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(3);TRACE: Tests.cpp(95): STEP 5: tfs_mkdir(sb, "B/aa/a");TRACE: Tests.cpp(110): STEP 6: tfs_ls(sb);TRACE: Tests.cpp(115): STEP 7: tfs_lsr(sb);TRACE: Tests.cpp(121): STEP 8: tfs_create(sb, "aaaa");TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "a");TRACE: Tests.cpp(146): STEP 10: tfs_cat(sb, "a");TRACE: Tests.cpp(155): STEP 11: set_reset_countdown(1);TRACE: Tests.cpp(158): STEP 12: set_reset_countdown() ignored; already setTRACE: Tests.cpp(146): STEP 13: tfs_cat(sb, "aaa");TRACE: Tests.cpp(95): STEP 14: tfs_mkdir(sb, "");TRACE: Tests.cpp(95): STEP 15: tfs_mkdir(sb, "");TRACE: Tests.cpp(95): STEP 16: tfs_mkdir(sb, "");TRACE: Tests.cpp(95): STEP 17: tfs_mkdir(sb, "");TRACE: Tests.cpp(95): STEP 18: tfs_mkdir(sb, "");TRACE: Tests.cpp(95): STEP 19: tfs_mkdir(sb, "");

Steps 15-19 can be ignored; they simply reflect that our file system tests always perform 20 steps, unless there is a failure before that point. It’s clear now that the multiple calls to cat, ls, and so forth in the original version were redundant. Taking this test, adding assertions about expected return values, and maintaining the resulting high-coverage unit test would be a much more reasonable task than dealing with the chaos produced by the fuzzer alone. It’s also obvious that steps 11 and 12 are not important to include in a unit test, because the reset never happens. You only really need to understand and write asserts for this trace:

TRACE: Tests.cpp(95): STEP 0: tfs_mkdir(sb, "B");TRACE: Tests.cpp(95): STEP 1: tfs_mkdir(sb, "aa");TRACE: Tests.cpp(95): STEP 2: tfs_mkdir(sb, "aaa");TRACE: Tests.cpp(103): STEP 3: tfs_rmdir(sb, "");TRACE: Tests.cpp(155): STEP 4: set_reset_countdown(3);TRACE: Tests.cpp(95): STEP 5: tfs_mkdir(sb, "B/aa/a");TRACE: Tests.cpp(110): STEP 6: tfs_ls(sb);TRACE: Tests.cpp(115): STEP 7: tfs_lsr(sb);TRACE: Tests.cpp(121): STEP 8: tfs_create(sb, "aaaa");TRACE: Tests.cpp(140): STEP 9: tfs_stat(sb, "a");TRACE: Tests.cpp(146): STEP 10: tfs_cat(sb, "a");TRACE: Tests.cpp(146): STEP 13: tfs_cat(sb, "aaa");TRACE: Tests.cpp(95): STEP 14: tfs_mkdir(sb, "");

Coverage isn’t the only property other than failure that we can preserve. Recent work on automatic resource-adaptation for software reduces actual programs, not just their test cases. Reduced programs do less, but use fewer resources and still pass all important tests. We are investigating producing versions of a program that avoid calling insecure APIs only required for optional functionality, thus reducing the attack surface of the system.

How does it do that?

To a user, test-case reduction can appear to be magic. You give a reducer a huge, extremely complex test case, and, after some time, it produces a much smaller test case that achieves the same goal. In many cases, the process doesn’t take very long, and the resulting test case is, as far as you can see, not even a simple chunk of the original test case. How does this happen?

At a high level, almost all approaches to test-case reduction use some variant of an extremely simple algorithm:

Given a test case, T_INIT, to reduce, and a function PRED(test) that checks if a test case is a valid reduction (e.g., if it fails in the same way as T_INIT or covers the same source code), we can reduce T_INIT as follows:

1. Set T_CURR = T_INIT2. Let T_NEXT = a new variation of T_CURR (one that has not yet been tried)If there are no new variations of T_CURR, stop and report T_CURR as the reduced test-case.3. Add T_NEXT to the set of variations that have been tried.4. If PRED(T_NEXT), set T_CURR=T_NEXT.5. Go to 2.

The devil is in the details, particularly those of step two. A completely naive approach to test-case reduction would consider all possible subsets or subsequences of T_CURR as the set of variations. This would be good and thorough except for the minor problem that it would take approximately forever to reduce even fairly small test cases.

There are also some subtle points about the design of the reduction criteria function, PRED, for systems that have more than one bug (which, in reality, includes most complex software). If the definition of “fails in the same way as T_INIT” is too restrictive, we’re unlikely to get much reduction. For example, forcing the exact same failure output might be a bad idea when an assertion includes some input value, or references a position in T_INIT or an address in memory.

On the other hand, if our criteria are too weak, we may run into slippage. Slippage is the annoying phenomenon where a test case for a complex, subtle new bug reduces to a test case for a simple, and likely already known, bug. Avoiding slippage using reduction criteria requires some intuition about bugs and the system under test. More sophisticated approaches to avoiding slippage require modifying the reduction algorithm, and are related to the idea that “test-case reduction is fuzzing,” a topic beyond the scope of this blog post. Here, we’re going to assume you have PRED, and it does what you want it to do.

A good first step: Modified binary search

Rather than considering all possible subsets of a test case as potential shorter versions, we can divide and conquer through a modified binary search, which is the heart of an approach called delta-debugging. First, we try to see if just the first half of our test case satisfies our reduction criterion. If not, we try the second half of the test case. If neither satisfies PRED, what do we do? Well, we can just increase the granularity, and try removing quarters of the original test case, then eighths, and so on, until we’re trying to remove whatever our “atomic” parts are, and we can’t do anything else.

The DeepState reducer

DeepState’s reducer doesn’t use Zeller’s delta-debugging binary search strategy, which is often less effective than simpler greedy approaches. It instead produces variants of a test case by applying a series of passes similar to compiler passes:

  • Structured deletions: DeepState inputs are not always treated as uninterpreted byte buffers. DeepState knows when the bytes read within the context of a single OneOf call begin and end, and tries to delete such blocks accordingly. While the files themselves do not contain any hints of this structure, DeepState actually runs each test case T_CURR and dynamically extracts the structure. Deletions according to OneOf boundaries often correspond to removing a single function call, e.g., reducing foo(); bar(); foo; to foo(); foo();. Using as much structure as possible to make likely meaningful reductions is a key optimization in reduction methods. In addition to DeepState structure, our reducer also knows common delimiters, such as whitespace, quotes, commas, and brackets, used in source code, xml, json, and so forth.
  • Structure edge deletions: This pass tries to remove the boundaries between structured parts of the test, e.g., merging two adjacent code blocks into one.
  • One-, four-, and eight-byte removals: This pass is the first that doesn’t use any structure from the test. It just tries deleting single bytes, then four-byte sequences, then eight-byte sequences (e.g., string components, 32-bit values, and 64-bit values).
  • One-, four-, and eight-byte reduce and deletes: This pass reduces the value of a single byte by one, then tries to delete a sequence of bytes after the reduced byte. This covers the common case where a value specifies the length of some field, or number of times to run a loop, and then the following bytes are the actual values.
  • Byte-range removals: This pass tries removing increasingly larger byte-ranges, skipping over the one, four, and eight special cases that were already tried. DeepState has a user-configurable upper limit (16 by default) on the size of the range. This is the brute-force core of the reducer.
  • Structure swaps: This is the first pass that cannot reduce test length. It tries to swap adjacent structures whenever this makes the test’s bytes more sorted.
  • Byte reductions: This pass takes each byte of a test case and tries to make it as close to zero as possible.
  • Byte pattern search: By default, DeepState doesn’t apply this expensive pass, but when the reducer is asked to aim for maximum reduction, it will search for all two-byte patterns that appear more than once, and try to remove part of the pattern in multiple places at once. The idea is that if two function calls need the same input bytes (e.g., a filename), this can replace both with a smaller matching input. The original byte patterns may be larger than two bytes, but repeated application of the pass can handle the largest patterns.

DeepState repeatedly interleaves these passes, using later passes only when the earlier passes in the list are no longer helping with reduction, until no reduction is possible or the time allowed for reduction has been exceeded. While DeepState’s reducer is best at reducing DeepState tests, the ability to run an arbitrary program to check for interestingness means you can also use it to reduce other kinds of test cases.

Six test-case reducers you should know about

If you’re interested in test-case reduction, you should at least know about these reducers in addition to DeepState:

  • Delta-debugging in Python. I personally used this, Andreas Zeller’s original implementation, to reduce thousands of test cases for the file system on the Curiosity Mars Rover, and it made understanding and debugging even the nastiest bugs discovered by random-testing a flash file system fairly easy.
  • CReduce. This is widely used in compiler testing and was a major influence on the DeepState reducer. It can reduce more than just C, and is a recent entry in a long line of more structure-aware reducers, starting with the HDD (Hierarchical Delta-Debugging) algorithm. The CReduce paper and John Regehr’s blog posts are good resources to learn more about CReduce.
  • Hypothesis. Hypothesis is a popular Python test generation tool informed by David MacIver’s ideas about test reduction. A key insight of MacIver’s is that reducers should order tests in shortlex order, not only considering shorter tests better, but preferring equal-length tests that are “simpler.” DeepState also focuses on shortlex ordering. I haven’t yet tried David’s brand-new tool, ShrinkRay, but given what I know of his shrinker savvy, it’s likely to be very good.
  • QuickCheck. QuickCheck’s success in finding bugs in Haskell programs arguably started the modern renaissance in random testing, which, with DeepState, is beginning to merge with the renaissance in fuzzing. QuickCheck “shrunk” generated inputs to produce small counterexamples, and its successors continue or expand on this tradition.
  • Google’s halfempty. While it wasn’t the first test-case reducer to try to use parallelism, halfempty is the most focused on getting the most out of multiple cores when reducing huge inputs.
  • TSTL. This Python property-based testing system introduced the idea of normalization, which works to make test cases simpler as well as shorter. For instance, normalization will replace large integers with smaller values, and group API calls together to improve readability. Normalization is a formalization of one approach to MacIver’s emphasis on shortlex ordering.

As it turns out, normalization also helps with reduction. Test-case reducers can easily get stuck in local minima, where no simple changes can reduce test size. Normalization can “unstick” things: The simplification often removes obstacles to reduction. Once anything resembling normalization is on the table, including many passes in CReduce, Hypothesis, or DeepState, a “reduced” test case may not overlap with the original test case at all.

TSTL and SmartCheck also use normalization to assist in generalization, an approach that takes a failing test case and distinguishes the essential aspects of the test case (“this function call must have zero for this parameter”) from the accidental aspects (“any number will do here, zero is just a really simple one”) in order to make debugging and understanding even easier. A generalization tool for DeepState is on our to-do list.

While the above reducers are powerful and important, you may want to primarily use DeepState’s own reducer when reducing DeepState tests, because it’s the only one that uses dynamic feedback about structure from DeepState test execution. For extremely large inputs, it may be a good idea to apply halfempty or ShrinkRay first, to take advantage of multiple cores; then you can use DeepState to go the last mile.


Test-case reduction is a powerful tool to have in your testing utility belt, and is constantly operating behind the scenes in fuzzers like AFL and libFuzzer. Knowing all the things that test-case reduction can do for you can improve the effectiveness of your tests and make debugging a much more pleasant task. DeepState includes a state-of-the-art reduction tool, and we encourage you to play with it, using reduction criteria of your own invention.

We’re always developing new tools to make finding and fixing issues easier. Need help with your next project? Contact us!

*** This is a Security Bloggers Network syndicated blog from Trail of Bits Blog authored by Alex Groce. Read the original post at: