
Anatomy of a Distributed Scraping Attack
While a significant number of popular websites use dedicated anti-bot detection software, some online businesses believe they can solve the bot problem on their own. They try to address bots with legacy approaches such as traditional CAPTCHAs, or the IP-based rate limiting and geoblocking available in their WAF.
In this blog post, we focus on a distributed scraping attack that targeted a French marketplace. This kind of attack is quite common, both in terms of its scale and the techniques the attacker used to bypass traditional detection. Popular websites and mobile applications face scraping attacks like this one several times a day.
Anatomy of the Scraping Attack
The attacker was detected using different techniques and signals, such as behavior analysis and residential proxy detection methods.
How did we isolate the attacker’s requests?
We know the attack came from a single attacker because they used malformed URLs, most likely on purpose. The attacker introduced a letter in the middle of a number in a GET parameter that expects only a number (for example, "http://example.com?products=0a12" instead of "products=012").
However, despite the presence of a letter in the GET parameter, the website still parsed the number properly, which means the request returned the content the attacker expected.
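To make this concrete, here is a minimal sketch of how requests with this kind of malformed numeric parameter could be isolated from access logs. The parameter names, regex, and URLs are illustrative assumptions, not the marketplace's actual implementation.

```python
import re
from urllib.parse import urlparse, parse_qs

# GET parameters that should only ever contain digits (illustrative list).
NUMERIC_PARAMS = {"products", "page", "offset"}

# A digit-letter-digit pattern such as "0a12": a letter injected into an
# otherwise numeric value.
MALFORMED_NUMBER = re.compile(r"^\d+[a-zA-Z]+\d+$")

def is_suspicious(url: str) -> bool:
    """Return True if a numeric GET parameter contains injected letters."""
    query = parse_qs(urlparse(url).query)
    for name, values in query.items():
        if name in NUMERIC_PARAMS and any(MALFORMED_NUMBER.match(v) for v in values):
            return True
    return False

# The pattern described in the attack:
print(is_suspicious("http://example.com?products=0a12"))  # True
print(is_suspicious("http://example.com?products=012"))   # False
```

Running such a filter over a week of logs is what makes it possible to attribute all of these requests to a single actor.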
Why was the attacker introducing bogus GET parameters?
Some GET parameters with a high value could be suspicious, for example, if you try to add too many products to a cart at once, or if you visit a page with a high value at the very beginning of a browsing session. Introducing bogus characters into those values can help an attacker slip past naive detection rules keyed on them, while the website's lenient parsing still returns the expected content (see the sketch below).
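As a rough illustration, assume the backend parses numeric parameters leniently by stripping non-digits (as the observed behavior suggests), while an in-house heuristic only fires on clean, high numeric values. The threshold, parameter values, and parsing logic below are assumptions made for the sake of the example.

```python
import re

def lenient_parse(value: str) -> int:
    """Assumed backend behavior: drop non-digits, then parse.
    "20a00" still resolves to 2000, so the page renders as usual."""
    return int(re.sub(r"\D", "", value) or 0)

def naive_high_value_rule(value: str) -> bool:
    """Naive in-house heuristic: flag suspiciously high clean integers.
    It never matches an obfuscated value like "20a00"."""
    return value.isdigit() and int(value) > 1000

obfuscated = "20a00"   # e.g. a high page offset with a letter injected
print(lenient_parse(obfuscated))          # 2000  -> scraper still gets its data
print(naive_high_value_rule(obfuscated))  # False -> heuristic never triggers
print(naive_high_value_rule("2000"))      # True  -> the clean value would be caught
```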
In total, over a week, the attacker made 1.1M search requests to scrape data from the website. The graph below shows the number of malicious bot requests made by the attacker per day.

Number of scraping bot requests linked to our attacker over a week.
What signals and fingerprints set the attacker apart?
The attacker performed search requests with different parameters to explore and gather data from the French real estate market. The attack was heavily distributed using more than 45,000 different IP addresses.
The attacker also randomized the bot fingerprint by rotating through different up-to-date or recent user-agents, such as:
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1900.70
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36
- Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0
- Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15
In contrast to the randomized user-agents, the attacker’s bots used consistent accept-language headers—in this instance, French, as they targeted a French website:
- fr-FR,fr;q=0.9
- fr-FR,fr;q=0.5
The attacker also used HTTP headers consistent both with the type of resource requested (a search request returning HTML content) and with the browser claimed in the forged user-agent, e.g. the following Accept header for Chrome:
text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
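To illustrate why this signal alone did not set the bots apart, here is a minimal sketch of a header-consistency check that the attacker's forged headers would pass. The browser-family detection and expected Accept tokens are simplified assumptions, not an exhaustive ruleset.

```python
# Expected Accept tokens per browser family (illustrative, not exhaustive).
EXPECTED_ACCEPT_TOKENS = {
    # Chromium-based browsers advertise avif/webp and signed-exchange support.
    "chrome": ["image/avif", "image/webp", "application/signed-exchange"],
    # Firefox sends a different Accept string for top-level navigations.
    "firefox": ["text/html", "application/xhtml+xml"],
}

def browser_family(user_agent: str) -> str:
    ua = user_agent.lower()
    if "firefox" in ua:
        return "firefox"
    if "chrome" in ua:
        return "chrome"
    return "other"

def headers_consistent(user_agent: str, accept: str) -> bool:
    """Check that the Accept header matches the claimed browser family."""
    expected = EXPECTED_ACCEPT_TOKENS.get(browser_family(user_agent))
    if expected is None:
        return True  # no opinion for unknown families
    return all(token in accept for token in expected)

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36")
accept = ("text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,"
          "image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7")
print(headers_consistent(ua, accept))  # True: the forged headers line up
```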
Note that while the TLS fingerprints were not fully consistent, they could not be linked to popular HTTP client libraries.
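In practice, this kind of check often boils down to comparing a JA3-style TLS fingerprint against a list of hashes known to belong to common HTTP client libraries. The sketch below assumes such a hash is already computed at the edge; the hash values are placeholders, not real fingerprints.

```python
# Placeholder JA3 hashes standing in for known HTTP client libraries.
KNOWN_HTTP_LIBRARY_JA3 = {
    "00000000000000000000000000000000": "python-requests (placeholder)",
    "11111111111111111111111111111111": "curl (placeholder)",
    "22222222222222222222222222222222": "Go net/http (placeholder)",
}

def classify_tls_fingerprint(ja3_hash: str) -> str:
    """Map a JA3 hash to a known HTTP client library, if any."""
    return KNOWN_HTTP_LIBRARY_JA3.get(ja3_hash, "unknown / browser-like")

# In this attack, the observed fingerprints did not map to any known
# library, so this lookup alone would not have flagged the bots.
print(classify_tls_fingerprint("33333333333333333333333333333333"))
# -> "unknown / browser-like"
```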
The attacker mostly used French residential IPs: ~45k distinct IPs over a week.
The IPs the attacker used are linked to the main French ISPs:
- Orange, the historic French ISP (equivalent to AT&T in the US)
- Free and Free Mobile
- SFR
- Bouygues Telecom
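Grouping the attack IPs by network owner is a straightforward way to surface this pattern. The sketch below assumes an ASN lookup is already available (e.g. from a GeoIP/ASN database); the ASN-to-ISP mapping and sample IPs are illustrative.

```python
from collections import Counter

# Illustrative ASN-to-ISP mapping for the main French residential networks.
FRENCH_RESIDENTIAL_ASNS = {
    3215: "Orange",
    12322: "Free",
    15557: "SFR",
    5410: "Bouygues Telecom",
}

def summarize_attack_ips(ip_to_asn: dict) -> Counter:
    """Count attacker IPs per French residential ISP."""
    counts = Counter()
    for ip, asn in ip_to_asn.items():
        isp = FRENCH_RESIDENTIAL_ASNS.get(asn)
        if isp:
            counts[isp] += 1
    return counts

# Hypothetical sample of attacker IPs already resolved to ASNs.
sample = {"192.0.2.10": 3215, "192.0.2.11": 12322, "192.0.2.12": 15557}
print(summarize_attack_ips(sample))
# Counter({'Orange': 1, 'Free': 1, 'SFR': 1})
```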
Why Traditional In-House Bot Detection Techniques Are Not Enough
In summary, the scraping attack targeted a French marketplace, making 1.1M search requests over a week. The bots used HTTP headers forged to be consistent with the type of resources requested and with the type of browser claimed in the user-agent. The attacker also used French accept-language headers and more than 45k distinct French residential proxy IP addresses. On any given day, each IP involved in the attack made between one and ten requests.
The following traditional techniques used in WAFs or implemented in-house would not have worked against the bots used in this attack.
- Signature-based blocking: The server-side fingerprints looked like those of genuine, up-to-date browsers and evolved over time. Thus, there was no particular, stable signature that could have been used for blocking.
- Geoblocking: Could not be used, as bots leveraged French IPs. (We generally don’t recommend using geoblocking, as this can lead to a lot of false positives.)
- Blocking data center IPs: The attacker used mostly residential IPs, so blocking data center IPs would not have stopped them.
- IP-based rate limiting: Wouldn't have been effective, as each IP made fewer than ten requests per day on average (see the quick calculation after this list).
- Blocking using non-matching user language: Using other contextual information like the user language wouldn't have helped, as the attacker adapted these signals to the country of the targeted website.
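A quick back-of-envelope calculation with the figures reported above shows why per-IP rate limiting never stood a chance. The 100 requests/day threshold is an assumed (and already fairly aggressive) limit, not a recommendation.

```python
# Figures reported earlier in the post.
total_requests = 1_100_000
distinct_ips = 45_000
days = 7

requests_per_ip_per_day = total_requests / distinct_ips / days
print(f"{requests_per_ip_per_day:.1f} requests per IP per day")  # ~3.5

assumed_rate_limit = 100  # assumed requests-per-IP-per-day threshold
print(requests_per_ip_per_day < assumed_rate_limit)  # True: never throttled
```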
A counter-argument could be that using traditional CAPTCHAs would have helped. While that may be true in some cases, it's important to think of the impact in terms of UX. Which of your users wants to be shown a CAPTCHA as soon as they perform their first search? Additionally, traditional CAPTCHAs are very easily bypassed by CAPTCHA farms, where humans solve challenges for bots.
The attack we described above may seem sophisticated, and you may think these kinds of attackers are not targeting your website. But sophisticated techniques have become quite common lately, even for scraping. Several open source packages enable attackers to build more realistic bots, and scraping-bots-as-a-service offerings provide these kinds of features off the shelf through an API.
Conclusion
To protect your websites, mobile apps, and APIs against bad bots—whether it is against scraping attacks, credential stuffing, layer 7 DDoS, or something else—it’s important to acknowledge that attackers have become increasingly sophisticated over the last few years because of all the libraries, software, CAPTCHA farms, and bots as a service they have at their disposal. Using traditional in-house bot detection techniques or WAFs is no longer enough to protect your business and customers. On top of that, traditional techniques like IP-based rate limiting, CAPTCHAs, and geoblocking tend to negatively affect user experience.
Sophisticated attacks require sophisticated defenses. Keeping up with the latest attack vectors is a full-time challenge—one you don’t want your teams to waste time and resources trying to take on. DataDome’s automated bot and fraud detection uses machine learning to improve with every interaction. It blocks bad bots in less than 2 milliseconds without compromising your user experience.
Our BotTester tool can give you a peek into the basic bots reaching your websites, apps, and/or APIs. Or you can spot more sophisticated threats now with a free trial of DataDome.