Tracking AI Crawlers in the Wild: Inside the DataDome Intel Database

by Paige Tester on June 18, 2025

Over the past year, AI agents, and the AI crawlers they rely on, have quietly become part of how people navigate the web. They help users compare prices, summarize content, and even complete transactions, often without any direct interaction. To do that work they use crawlers, scripts, and automation frameworks, and in many cases they operate in ways that are difficult to distinguish from traditional bots, or even from real users.

That’s created new challenges for security, platform, and fraud teams alike. Many organizations have invested heavily in bot management, fraud detection, and access controls. But the arrival of AI agents changes the equation. These tools don’t behave like typical crawlers. They don’t always follow the rules. And they’re being adopted faster than most teams can track.

It’s a problem we’ve been tracking closely. And it’s changing fast.

Not all AI agents are malicious, but all of them need to be understood

There’s a growing spectrum of AI-powered automation in the wild. Some of it is helpful, some of it is neutral, and some of it presents clear risks to business operations, infrastructure, and trust.

We’re seeing increased traffic from named LLM crawlers used to fetch pages for training or live inference, like GPTBot and ClaudeBot. Others are harder to detect: anonymous crawlers behind rotating proxies, embedded shopping assistants, and autonomous agents interacting with APIs or completing tasks on behalf of users. Still others operate as part of commercial scraping tools, using AI to extract structured data while appearing human.
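As a rough illustration of the "named crawler" case, the sketch below checks a request's User-Agent header for the two crawler names mentioned above. The lookup table, the function name, and the sample header strings are assumptions made for this example, not DataDome's detection logic. A User-Agent string is self-declared and easy to spoof, so a match only shows what a client claims to be, and the harder-to-detect agents described above would not match at all.

```python
from typing import Optional

# Hypothetical lookup table: lowercase tokens that named AI crawlers include
# in their User-Agent header. GPTBot and ClaudeBot are the names cited in the
# article; a real deployment would draw on a much larger, maintained list.
NAMED_AI_CRAWLERS = {
    "gptbot": "GPTBot",
    "claudebot": "ClaudeBot",
}


def identify_named_crawler(user_agent: str) -> Optional[str]:
    """Return the crawler name declared in the User-Agent, or None.

    This is a claim check only: the header is client-supplied and can be
    forged, so it cannot prove who actually sent the request.
    """
    ua = user_agent.lower()
    for token, name in NAMED_AI_CRAWLERS.items():
        if token in ua:
            return name
    return None


# Illustrative header values (simplified, not copied from real traffic).
print(identify_named_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(identify_named_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

The second call returns None, which is the interesting case in practice: the absence of a declared crawler name says nothing about whether the request was automated.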

Some of these agents may be operating within platform policies. Many are not. Some exist in a gray area, making them difficult to classify without additional context. And because this kind of traffic often blends in with legitimate sessions, it rarely triggers traditional bot defenses.

That’s the real change: it’s no longer enough to ask “Is this a bot?” Now the question is “What kind of automation is this, and why is it here?”

Traditional threat intel doesn’t help answer those questions

Most threat intelligence sources were built for a different time. They’re great at surfacing malware indicators, infrastructure tied to APTs, or known vulnerabilities being exploited in the wild. But they offer little insight into the kinds of automation most businesses are now dealing with daily.

There’s no feed that tells you what scraping tool just hit your login page, what automation framework is being used to test your checkout, or which AI crawler is bypassing your robots.txt file.

That leaves teams without the context they need to understand, and confidently act on, automation that’s already affecting performance and risk.

That’s why we created DataDome Intel

We believe that visibility should be a given. Defenders should have access to the information they need to make informed decisions, regardless of what tools they’re using, or whether they’re a DataDome customer.

DataDome Intel is a public, continuously growing database of bots, crawlers, spoofing tools, automation frameworks, and AI agents. It’s the most comprehensive resource of its kind available today, and it’s open to everyone.

Each entry includes a plain-English description of the tool or crawler, how it typically behaves, and what it’s used for. We also include guidance on whether it honors robots.txt, and how to block or allow it based on your business needs.
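To make the robots.txt guidance concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module to see what a given robots.txt file asks of a specific crawler. The robots.txt content, the `example.com` URLs, and the `is_allowed` helper are invented for illustration. The check only shows what a compliant crawler would do; it does not stop a crawler that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption for this example): GPTBot is
# disallowed everywhere, ClaudeBot is kept out of /private/, and every
# other agent is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent, url in [
    ("GPTBot", "https://example.com/pricing"),
    ("ClaudeBot", "https://example.com/private/report"),
    ("ClaudeBot", "https://example.com/blog"),
]:
    print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Comparing these answers against the paths a crawler actually requests in your logs is one way to tell whether it is honoring the file.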

We update the database continuously, based on real-world traffic observed across our global customer network. Right now, that includes:

Over 57,000 known crawler user-agents
Dozens of AI and LLM-linked crawlers
Headless browsers, fingerprint evasion tools, CAPTCHA solvers, and more

We’re making this data public to help teams better understand what’s hitting their infrastructure and to promote safe adoption of agentic AI. The only way to separate what’s helpful from what’s harmful is to know what’s out there.

It’s about control, not just blocking

There’s a tendency to treat bots as either good or bad. But in practice, automation is more nuanced. Some crawlers serve accessibility functions, while others gather business intelligence. Some are essential to your SEO, and others may be scraping pricing data, manipulating availability, or attempting payment fraud.

Managing this kind of traffic requires context: knowing what’s behind each request, understanding what it’s doing, and having the right tools to decide how to handle it.

Whether your strategy is to allow, block, challenge, or monetize automation traffic, it starts with visibility. That’s what DataDome Intel is designed to provide.
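One way to picture the four options is as a small policy table that maps a traffic category to an action. The sketch below is hypothetical: the category names, the chosen actions, and the default are assumptions for illustration, and the right mapping depends on each business.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CHALLENGE = "challenge"
    MONETIZE = "monetize"


# Hypothetical policy: category labels and choices are examples only.
POLICY = {
    "search_engine_crawler": Action.ALLOW,
    "ai_training_crawler": Action.MONETIZE,
    "price_scraper": Action.BLOCK,
}


def decide(category: str) -> Action:
    """Look up the action for a category.

    Unrecognized automation is challenged rather than silently allowed
    or blocked; this default is an assumption of the sketch.
    """
    return POLICY.get(category, Action.CHALLENGE)


print(decide("search_engine_crawler").value)
print(decide("unknown_agent").value)
```

The table is only as good as the classification feeding it, which is why visibility comes first.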

The automation landscape is evolving. Shared intelligence needs to keep up

The line between human and automated traffic is no longer clear. As AI agents become more capable, and more common, security, fraud, and platform teams need better tools to navigate this shift.

That includes having access to threat intelligence that reflects today’s automation risks, not just yesterday’s attacks. It also requires shared context across the industry, so we’re not all solving the same problems in isolation, and open access to information that helps defenders understand what they’re dealing with, whether they’re a customer or not.

We built DataDome Intel to help lead that shift. It is a resource that reflects what’s happening on the web right now, and what teams need to stay ahead of it.

You can explore the database here.

*** This is a Security Bloggers Network syndicated blog from DataDome authored by Paige Tester. Read the original post at: https://datadome.co/bot-management-protection/public-ai-crawlers-threat-intel-database/