The Evolution of Bots Forging CAPTCHAs, a Firsthand Report

A few weeks ago, we announced the release of our own DataDome CAPTCHA, a secure and private alternative to traditional CAPTCHAs. The DataDome CAPTCHA is one of the possible responses triggered when our real-time detection engine identifies malicious bot activity.

In this article, we will explore why and how our CAPTCHA is more private, easier to use, and more secure than traditional CAPTCHAs. You’ll learn how we designed the DataDome CAPTCHA to best balance security and privacy in terms of the types of signals collected and the way they are processed. Finally, we will share the new findings we discovered by launching our CAPTCHA on dozens of e-commerce websites and mobile applications.

Why is the DataDome CAPTCHA more private?

First and foremost, DataDome is NOT an advertising or marketing company. Our business model is to secure websites, mobile applications, and APIs against online fraud and automated threats—not to collect or monetize user data. 

Therefore, we do not collect any personal data. Our solution uses the minimal data necessary to protect online businesses and their end-users against malicious attacks. Any data signals processed by our solution are used for security and anti-fraud purposes only.

Moreover, the data we collect is stored in line with strict security standards. The default data retention period of 30 days is adjustable in our customer dashboard to meet local regulations.

Why does DataDome’s CAPTCHA provide a better UX?

First, it’s important to DataDome that we only show CAPTCHAs to bots. As a result, more than 99.99% of the time, human users never see any CAPTCHA. And even though we show a CAPTCHA to a human less than 0.01% of the time, we still want to minimize the impact on the user experience (UX) for legitimate users.

To solve DataDome’s CAPTCHA, the user must slide a button right to place a puzzle piece into the correct position on an image. It’s straightforward—no language barrier or complex challenge.

DataDome CAPTCHA Screenshot

At DataDome, we also take accessibility seriously. That’s why we also offer an audio-accessible CAPTCHA in which the goal of the challenge is to type a list of numbers as they are played aloud. It’s available in 13 languages (more options than other CAPTCHAs), and we continue to add more as we expand to new parts of the world.

At this point, you may wonder how such a simple challenge could be secure. We’ll explain in more detail below, but here is a simple way to summarize our methodology: 

We designed our CAPTCHA with the understanding that the security of the CAPTCHA should not rely solely on the security of the challenge.

How is DataDome’s CAPTCHA more secure?

Online, there tends to be a tradeoff between usability (UX) and security: Making things more secure tends to make them less usable and less accessible. 

One way to design our new CAPTCHA for optimal security would have been to make the challenge more complex and rely solely on that complexity to secure the CAPTCHA. We could have used 3D shapes to compose complex image recognition problems, or dynamically generated difficult cognitive challenges.

However, this approach would have several drawbacks:

  1. It would make the CAPTCHA less accessible.
  2. It would severely impact UX in the rare case of false positives.
  3. Research by authors from Xidian University and Linköping University, published at USENIX Security 2021, shows that even very complex 3D CAPTCHAs can be solved by advanced neural network architectures.
  4. We have customers all around the world, and we know that not all people think the same. We needed something language independent and easy to understand, no matter your age or culture.

That’s why we decided to adopt a new paradigm when designing our CAPTCHA:

Instead of relying solely on the difficulty of the CAPTCHA challenge (which would negatively affect UX), we decided to add an extra layer of security using invisible signals. 

We leverage our bot detection expertise to collect browser fingerprints, behavioral signals, and reputational signals that are completely invisible to human users. This lets us give end users a straightforward CAPTCHA UX while bot developers face their worst nightmare.
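
As a rough illustration of this paradigm, and not a description of DataDome’s actual logic, the Python sketch below treats a correct answer as necessary but not sufficient. The `CaptchaSubmission` fields and the `accept` function are hypothetical names, and a real engine would weigh many signals rather than three booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptchaSubmission:
    """Hypothetical summary of one CAPTCHA attempt."""
    challenge_correct: bool        # the puzzle piece was placed correctly
    fingerprint_suspicious: bool   # e.g. signs of an automation framework
    behavior_suspicious: bool      # e.g. non-human interaction patterns
    reputation_suspicious: bool    # e.g. poor IP or session reputation


def accept(sub: CaptchaSubmission) -> bool:
    """Accept only if the answer is right AND no invisible signal objects."""
    if not sub.challenge_correct:
        return False
    return not (
        sub.fingerprint_suspicious
        or sub.behavior_suspicious
        or sub.reputation_suspicious
    )
```

Under these assumptions, a bot that solves the puzzle but is flagged by any invisible signal is still rejected.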

We can’t reveal the exact signals leveraged in our CAPTCHA, since that would help attackers. But we can review the types of signals we collect and how they help us detect bots attempting to pass CAPTCHAs.

Fingerprinting Signals:

  • Browser Fingerprints: Collected using JavaScript, browser fingerprints enable us to quickly and accurately catch the popular automation frameworks and headless browsers frequently used to build bots, such as Headless Chrome, Puppeteer, and Selenium (as well as their modified versions and counterparts).
  • Server-Side Fingerprints: We collect several signals on the server side, such as HTTP headers and TLS fingerprints.
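
To make the two kinds of fingerprints more concrete, here is a minimal Python sketch of server-side checks on a fingerprint payload. It uses only widely known, public indicators: the standard `navigator.webdriver` flag, the `HeadlessChrome` token that Headless Chrome has advertised in its default user agent, and a mismatch between the JavaScript-reported and HTTP-reported user agents. The payload keys (`webdriver`, `userAgent`), the `User-Agent` dictionary key, and the function name are assumptions for illustration; these are not the signals used by the DataDome CAPTCHA.

```python
def fingerprint_flags(js_fp: dict, headers: dict) -> list[str]:
    """Return human-readable reasons why a fingerprint looks automated.

    js_fp   -- hypothetical payload collected in the browser with JavaScript
    headers -- HTTP request headers as seen by the server
    """
    flags = []

    # navigator.webdriver is set to true by WebDriver-controlled browsers.
    if js_fp.get("webdriver") is True:
        flags.append("navigator.webdriver is true")

    ua_js = js_fp.get("userAgent", "")
    ua_http = headers.get("User-Agent", "")

    # Unmodified Headless Chrome has advertised itself in the user agent.
    if "HeadlessChrome" in ua_js or "HeadlessChrome" in ua_http:
        flags.append("headless user agent")

    # A spoofed HTTP header that disagrees with the browser-side value.
    if ua_js and ua_http and ua_js != ua_http:
        flags.append("user agent mismatch")

    return flags
```

Modified frameworks hide exactly these obvious indicators, which is why a real detection engine has to rely on many more (and less public) signals.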

Behavioral Signals:

  • Using JavaScript, we collect different behavioral signals related to user interaction with the page in order to detect bots trying to mimic human behavior. This includes information about mouse movements, touch events, scrolls, etc.
  • We integrate anti-replay mechanisms in case bot developers try to simply replay real human interactions.
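
One generic way to resist replayed interactions, shown here only as a sketch and not as DataDome’s mechanism, is to bind each challenge to a single-use nonce and to reject interaction traces that have already been seen. The `ReplayGuard` class, its method names, and the `(x, y, timestamp)` trace format are all hypothetical.

```python
import hashlib
import json
import secrets


class ReplayGuard:
    """Single-use nonces plus exact-duplicate trace detection (in memory)."""

    def __init__(self) -> None:
        self._open_nonces: set[str] = set()
        self._seen_traces: set[str] = set()

    def issue_nonce(self) -> str:
        """Create a nonce to embed in a freshly displayed challenge."""
        nonce = secrets.token_hex(16)
        self._open_nonces.add(nonce)
        return nonce

    def check(self, nonce: str, trace: list[tuple[int, int, int]]) -> bool:
        """Return True only for a fresh nonce with a never-seen trace."""
        if nonce not in self._open_nonces:
            return False  # unknown or already consumed nonce
        self._open_nonces.discard(nonce)

        digest = hashlib.sha256(json.dumps(trace).encode()).hexdigest()
        if digest in self._seen_traces:
            return False  # byte-for-byte replay of an earlier interaction
        self._seen_traces.add(digest)
        return True
```

Exact-duplicate detection is easy to evade by adding noise to a recorded trace, so a sketch like this would only ever be one layer among several.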

Reputational Signals:

  • We leverage several types of reputation signals computed by our real-time bot detection engine, such as IP and session reputation.
  • We leverage our machine learning models to detect advanced residential proxies.
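
A reputation signal can be pictured as a score that rises when an IP or session misbehaves and fades over time. The Python sketch below uses exponential decay with a configurable half-life. This is a generic technique chosen for illustration; the `ReputationTracker` class, its parameters, and the one-hour half-life are assumptions, not a description of how DataDome computes reputation.

```python
class ReputationTracker:
    """Per-key (IP or session) badness score with exponential decay."""

    def __init__(self, half_life_s: float = 3600.0) -> None:
        self._half_life_s = half_life_s
        # key -> (score at last update, timestamp of last update)
        self._state: dict[str, tuple[float, float]] = {}

    def score(self, key: str, now: float) -> float:
        """Current score; halves every `half_life_s` seconds of good behavior."""
        last_score, last_ts = self._state.get(key, (0.0, now))
        return last_score * 0.5 ** ((now - last_ts) / self._half_life_s)

    def record_bad_event(self, key: str, now: float, weight: float = 1.0) -> None:
        """Add `weight` to the decayed score, e.g. after a blocked request."""
        self._state[key] = (self.score(key, now) + weight, now)
```

The decay means that an address shared or reassigned to a legitimate user is not penalized forever for past bot traffic.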

Anti-CAPTCHA Farms:

  • We have designed our CAPTCHA with CAPTCHA farms in mind from the beginning. Because we also protect the platforms themselves against bots, we can correlate information obtained on any protected endpoint with information obtained when displaying our CAPTCHA and when receiving the CAPTCHA response.
  • Our CAPTCHA was purposefully built in a way that makes it extremely difficult to outsource its resolution to a third-party service.
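
The correlation idea can be sketched as a comparison between the context in which a CAPTCHA was displayed and the context in which its answer came back. When a challenge is outsourced to a third party, these contexts tend to differ. The `RequestContext` fields and the `context_mismatches` function below are hypothetical and deliberately simplistic; a mismatch is a signal to weigh, not proof, since an IP address can change legitimately (for example on mobile networks).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical snapshot of who we were talking to at a given step."""
    ip: str
    fingerprint_id: str
    session_id: str


def context_mismatches(displayed: RequestContext, answered: RequestContext) -> list[str]:
    """List the attributes that changed between display and response."""
    fields = ("ip", "fingerprint_id", "session_id")
    return [f for f in fields if getattr(displayed, f) != getattr(answered, f)]
```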

DataDome CAPTCHA vs. Bots in the Wild

As of August 2022, DataDome’s CAPTCHA has been running in production for several weeks with ~30 customers, mostly e-commerce websites and mobile applications. Let’s explore what we observed in the weeks following its deployment on highly targeted websites and mobile applications.

Fewer malicious CAPTCHA-passing attempts from bots

Each time we switched a customer to our CAPTCHA, we observed a significant drop in CAPTCHA-passing attempts coming from bots. That makes sense, given that bot developers need to update their bots (update the CSS selectors, etc.) to properly interact with our new CAPTCHA. However, we noticed that the decrease remained stable even after a month.

CAPTCHA Forging Attempts Over Time

Malicious CAPTCHA forging attempts over time: We observe a significant drop following the activation of the DataDome CAPTCHA.

Our hypothesis, based on the data we collect, is that a majority of bot developers rely on popular open-source projects and off-the-shelf tools (such as CAPTCHA farms) to forge CAPTCHAs. Thus, as long as the easily available tools don’t offer any option to solve the DataDome CAPTCHA, we expect the number of malicious CAPTCHA-passing attempts to remain lower than before.

How fast did bot developers try to forge CAPTCHAs?

It took between 6 hours and ~2 weeks, depending on the website or mobile application.

The fastest attempt to forge the CAPTCHA (6h after implementation) happened on a popular e-commerce platform heavily targeted by distributed scrapers. Six hours after switching to the new CAPTCHA, we detected bots trying to submit CAPTCHA challenges, though they were blocked for several reasons (such as inconsistent browser fingerprints linked to instrumentation frameworks and other bad behaviors). That’s how fast attackers will adapt to try to obtain data. 

The good news is, since DataDome’s primary purpose is to protect websites and mobile apps against fraudulent traffic, we’re used to continuously finding new bot signals and improving our ML models to stay ahead of bots. We’ve been doing it for years to improve our real-time detection engine, and now it will also continue to strengthen our CAPTCHA.

How do bots attempt to forge DataDome CAPTCHAs?

Audio API: First, we observed evidence of a known issue: the tension between accessibility and security. Audio CAPTCHAs are exploited more often than their image-based counterparts, and this was also evident with our CAPTCHA. However, with behavioral and fingerprinting signals, we can still invalidate a forged CAPTCHA, even when the response to the challenge is correct.

Non-Modified Puppeteer: Puppeteer is a popular automation framework for instrumenting (headless) Chrome. It’s no surprise that we encounter it frequently among bots that try to forge our CAPTCHA. Bots use the standard APIs provided by Puppeteer to fake mouse movements and clicks. However, the resulting behavior deviates from that of legitimate users, which, combined with fingerprinting signals, allows us to invalidate CAPTCHAs passed by Puppeteer.
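
To give an intuition for how scripted mouse movements can deviate from human ones, consider a pointer path generated by a standard API call such as Puppeteer’s `mouse.move` with its `steps` option. Assuming the intermediate points are interpolated along a straight line, the path is almost perfectly straight, which hand-drawn paths rarely are. The Python sketch below measures that with a simple straightness ratio. The function names and the `0.999` threshold are hypothetical, and this single feature is far too weak to be used on its own; it is not a signal we are claiming the DataDome CAPTCHA uses.

```python
import math


def straightness(points: list[tuple[float, float]]) -> float:
    """Straight-line distance divided by travelled distance (about 1.0 = straight)."""
    if len(points) < 2:
        return 1.0
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


def looks_scripted(points: list[tuple[float, float]], threshold: float = 0.999) -> bool:
    """Flag paths with several samples that never leave the straight line."""
    return len(points) >= 3 and straightness(points) >= threshold
```

For example, the evenly interpolated path `[(0, 0), (10, 5), (20, 10), (30, 15)]` is flagged, while a path that wobbles on its way to the same end point is not.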

Puppeteer Extra Stealth: Puppeteer extra stealth is a popular bot automation framework that adds a layer of features on top of Puppeteer. Its API is compatible with Puppeteer, but includes features to spoof your fingerprint and simple integrations with CAPTCHA farm APIs, such as 2Captcha. The stealth plugin is popular among bot developers and bots as a service (BaaS). 

As with unmodified Puppeteer, the behavioral and fingerprinting signals collected by our CAPTCHA enable us to invalidate CAPTCHAs passed by Puppeteer extra stealth bots, even if they submit a CAPTCHA with a valid response.

Users With 2Captcha Extension: Our CAPTCHA client-side JavaScript code has also detected the presence of instrumented browsers that use the 2Captcha auto solver browser extension. However, it doesn’t help bots because 2Captcha doesn’t support any integration for our CAPTCHA. It only makes it easier for us to invalidate forged CAPTCHAs.

So far, we don’t see a significant volume of Selenium-based bots attempting to forge the DataDome CAPTCHA.

CAPTCHA-Forging Attempts by Bots Over Time

The graph below shows the evolution of bot forging attempts on DataDome’s CAPTCHA. We see that bots try to adapt more and more over time as we protect more websites and mobile applications with DataDome’s CAPTCHA.

CAPTCHA Forging Evolution

In total, the graph shows more than 1.37M malicious CAPTCHA-passing attempts that were stopped before the bots could go further.

What’s coming next?

We are only at the beginning of the adventure with DataDome CAPTCHA, and we already see the significant improvement it provides to the customers using it, particularly those heavily targeted by advanced CAPTCHA bots. Customers using DataDome’s CAPTCHA operate in a wide range of industries, from e-commerce, transportation, and classified ads to financial institutions.

The launch of DataDome’s CAPTCHA has helped us improve our detection capabilities against distributed CAPTCHA bots, all while safeguarding the end-user experience. That matters because, as we see time and time again, bot developers don’t go on vacation. They continuously adapt their bots to make them stealthier and harder to detect.

Our team at DataDome is adept at the ongoing fight against malicious bot developers. We are continuously working to add new fingerprinting and behavioral signals to our arsenal, as well as developing new ML models to process them. That is why our CAPTCHA and our bot and online fraud protection will continue to be the best and most comprehensive solutions available.

Stay tuned for more insights from our ongoing battle against bad bots.

*** This is a Security Bloggers Network syndicated blog from Blog – DataDome authored by Antoine Vastel, PhD, Head of Research. Read the original post at: https://datadome.co/threat-research/the-evolution-of-bots-forging-captchas-a-firsthand-report/
