Detecting Flaky Tests in CI Pipelines

Posted March 2022 in CI/CD

Written by Serkan Özal, Founder and CTO of Thundra
Flaky tests reduce your confidence in your codebase. When tests fail intermittently, for no discernible reason, they cast doubt on your entire test suite, leaving you wondering which test will be the next to fail unexpectedly.

What’s worse is that due to the nature of these flaky failures, the offending change can be challenging—and expensive—to track down and fix. Instead of reacting to these failures after they’ve occurred, we need to find where the failure was introduced and, from there, identify the root cause.

In this article, we’ll detect flaky tests caused by intermittent failures in a testbed application. We’ll use a method of error diagnosis that helps us get to the problem changeset more quickly, making it easier to identify flaky tests in CI.

Testbed Application Details

To demonstrate the method used to detect flaky tests, I’ve built a small microservice application using the Serverless Framework and AWS Step Functions. This application has the following features:

  • One Lambda function that fetches a list of currently playing movies from The Movie Database.
  • One Lambda function that accepts the response from the first function and extracts the title from each entry, returning a list of film titles (a sketch of this handler follows the list).
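
For context, here is a minimal sketch of what that second function’s handler might look like. The payload shape (a results array of objects with a title field) and the export name extractTitles are assumptions for illustration, not the demo repository’s actual code.

// handler.js (illustrative sketch, not the demo repository's actual code)
// Assumes the first function passes along a payload shaped like
// { results: [{ title: "..." }, ...] }, mirroring The Movie Database response.
module.exports.extractTitles = async (event) => {
    const movies = event.results || [];

    // Keep only the title of each entry
    const titles = movies.map((movie) => movie.title);

    return { titles };
};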

The source code has been configured with CI via GitHub Actions and includes several unit tests describing expected behavior.
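
To give a flavor of those unit tests, here is a hedged sketch of one, written against the hypothetical extractTitles handler above; the file path and the film titles used are illustrative only.

// handler.test.js (illustrative sketch)
const { extractTitles } = require('./handler');

test('returns a flat list of film titles from the upstream payload', async () => {
    const event = { results: [{ title: 'Dune' }, { title: 'The Batman' }] };

    const { titles } = await extractTitles(event);

    expect(titles).toEqual(['Dune', 'The Batman']);
});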

Detecting Flaky Tests

To detect flaky tests, we’ll use Thundra Foresight to monitor and analyze the application’s CI pipeline. Foresight integrates with your application’s GitHub repository and other CI platforms, seamlessly consuming build events and tracking their success or failure.

To integrate with Thundra Foresight, follow these steps:

  1. Sign up for an account
  2. Open the Thundra Foresight product
  3. Follow the provided prompts to create a Foresight project
  4. Connect the project to your source code repository
  5. Connect the project to your CI system, if it’s not already included in your code repository

The entire process can be completed in under 5 minutes, and it requires nothing more than an email to get started.

Once you’ve connected your GitHub repository to the Thundra Foresight platform, the Foresight dashboard will automatically populate with information from your configured CI/CD system. All you need to do is dig in and find the root cause of any problems you’re experiencing.

Intermittent Failures

Intermittent failures can arise from multiple potential sources. The following list, while not exhaustive, represents a good portion of the failure classes you’re likely to see in your test suite:

  • Time-order dependencies: The success of your tests may depend on the order in which they’re run. This leads to highly coupled test suites that are prone to random failures stemming from relatively minor changes.
  • Resource dependencies: If your tests depend on a finite resource, it may be exhausted while the test suite runs. This can cause failures unrelated to any code changes being evaluated.
  • Concurrency issues: If you’re working in a concurrent execution environment, your tests might interact in unexpected ways, as the threads operate on their own control flow paths. Logical concurrency errors, such as objects that are not thread safe, can lead to random failures due to an unpredictable order of execution on the processor.
  • External resource dependencies: If the test depends on a third-party service, this could lead to seemingly random failures when the third party is having availability problems.

Given that the potential problem domain for flaky tests is very wide, there’s never going to be one fix that covers all test cases. Each failure needs to be evaluated and addressed on its own merits. Often, it’s not the flaky test that is problematic, but the influence of surrounding tests.
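
To make that last point concrete, here is a minimal sketch of a test-order dependency: the second test only passes when the first has already run and mutated shared state, so reordering, sharding, or running it in isolation makes it fail.

// Illustrative only: shared mutable state couples these two tests together.
const cache = [];

test('warms the cache', () => {
    cache.push('currently-playing');
    expect(cache).toHaveLength(1);
});

test('reads from the cache', () => {
    // Passes only if the previous test ran first; flaky under reordering
    // or when run on its own with jest -t.
    expect(cache[0]).toBe('currently-playing');
});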

When tests start failing, it’s important to identify when the failure began to occur and not just ask “Why is this test failing?” While this is a valid question, in the instance of flaky tests it can lead you down a rabbit hole, where every step works perfectly but the end result is still a failed test suite.

If you’re able to identify where a failure started, you can narrow down the potential changes that led to the error being introduced. This vastly reduces the scope of root cause investigation, speeding the time to resolution for tests with consistency issues.

At the absolute minimum, this time-focused investigation leads you to a “last known good” version of your application, which you can redeploy in a pinch.

Introducing an Intermittent Failure to Our Testbed

To demonstrate the method to detect flaky tests, we’re going to deliberately add a flaky test. This test will check for the presence of the CI environment and, if found, it’ll have roughly a 1-in-5 (20%) chance of failing. This is to simulate a live flaky test—in this situation we don’t care why the failure occurred, just that there is one.

Below is the basic Jest test to add to handler.test.js:


test('forced failure in CI for Thundra demonstration', () => {
    // Only misbehave when running in the CI environment
    if (process.env.APP_ENV === "CI") {
        // Roughly a 20% chance of failure on any given run
        if (Math.random() < 0.2) {
            expect(false).toBe(true);
        }
    }
    // Otherwise the test passes
    expect(true).toBe(true);
});

As you can see, we just check whether the environment variable APP_ENV is set to “CI”. Simply configure this variable in your CI environment (I’ve configured the demo codebase using GitHub Actions), and you should be able to see the intermittent failure.

This test is a low-effort proxy for a more complex test. If we wanted to expand this into a more robust demonstration of the failure scenario, we could encapsulate a failure case behind a web service that we control and have our test call that instead. There are limitless possibilities, depending on your goals in debugging.
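
As one possible sketch of that approach, the test below calls a hypothetical endpoint we operate that can be toggled to fail on demand; the URL and the availability of Node 18’s global fetch are assumptions.

// Illustrative sketch: the failure mode lives behind a service we control,
// so we can turn it on and off without changing the test itself.
test('fetches now-playing titles from the controllable upstream service', async () => {
    // Hypothetical endpoint; configure it server-side to fail a percentage
    // of requests to reproduce flaky behavior deliberately.
    const response = await fetch('https://flaky-demo.example.com/now-playing');
    expect(response.ok).toBe(true);

    const body = await response.json();
    expect(Array.isArray(body.titles)).toBe(true);
});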

Now, we’ll run the build over and over again until we produce a failure. It’s one of the few times in software where breaking things is the point!

Tracking Down the Issue with Foresight

Navigating to our project in Thundra Foresight, we can immediately see that our build is having issues:

Figure 1: Thundra Foresight repositories

Click on the project to open the detailed view, which shows the results of the last several builds.

Figure 2: Thundra Foresight project dashboard

Using the detail view, find the last known successful build and open the next several runs.

Figure 3: Thundra Foresight project detail dashboard

As you can see, each page provides easy access to all of a changeset’s relevant information, so you can easily jump to the offending code and figure out exactly which test was failing.

Conclusion

Intermittent failures add time and stress to a development team. Due to the many potential sources for a flaky test, there’s no one guaranteed pathway to resolving the issue, even when it’s finally discovered. In these situations, the speed of diagnosis is key to the resolution.

Sign up for Thundra Foresight and let it help you quickly debug flaky tests and regain confidence in your code.

*** This is a Security Bloggers Network syndicated blog from Thundra blog authored by Serkan Özal. Read the original post at: https://blog.thundra.io/detecting-flaky-tests-in-ci-pipelines