
AI Agents at the Gate: Understanding & Securing Against LLM Crawlers
The explosion of large language models (LLMs) has triggered a paradigm shift in web traffic and application security. AI crawlers, whether scraping sites for training data, fetching context for live inference, or simulating human behavior as part of autonomous agents, are no longer edge cases. They are shaping the flow of data across the modern web.
At DataDome, we’ve observed this trend firsthand. In the past 30 days alone, our platform detected 976 million requests from OpenAI-identified crawlers, with a staggering 92% of that traffic tied to ChatGPT. LLM crawlers now account for 4.5% of all legitimate bot traffic we see across our customer base—an all-time high, and a clear sign of acceleration.
This research examines the anatomy of LLM crawler activity, how we categorize and identify these bots, and why a nuanced, intent-based defense strategy is essential for any organization exposed to AI-driven automation.
What are LLM crawlers?
LLM crawlers, or AI bots, are automated clients that interact with websites on behalf of large language models. While there’s some overlap with traditional scrapers and indexers, these bots tend to be more specialized, often operating at greater scale and with more sophisticated parsing capabilities.
At a high level, we distinguish between three functional types:
- Training scrapers, which ingest large volumes of public content to improve model performance. These are typically used to build or refine foundation models and may not honor robots.txt or rate limits.
- Prompt-time fetchers, which retrieve real-time data to supplement LLM outputs—think of AI copilots or search assistants querying web pages on-the-fly to answer user prompts.
- Agentic crawlers, which act more like users. These bots are capable of clicking, scrolling, submitting forms, and navigating complex UIs, often as part of a RAG (retrieval-augmented generation) pipeline or testing framework.
The presence of LLM crawlers is redefining what constitutes “normal” bot behavior. Their impact is already being seen in production environments.
Categorizing LLM crawlers: foundation vs. application layers
One of the first distinctions we draw at DataDome is between crawlers used to train foundation models and those used in downstream applications. This helps us evaluate both the scope and the intention behind the traffic.
Crawlers from foundation model providers—OpenAI (GPT-*), Anthropic (Claude), Meta (LLaMA), Google (Gemini), Amazon (Titan), and others—are generally tied to large-scale training efforts. Their behavior is often more systematic, and in some cases, traceable to disclosed IP ranges or user agents.
On the other hand, crawlers from AI application providers—those building assistants, agents, or vertical-specific models on top of foundational LLMs—are more fragmented. Many of these companies lack public documentation, rotate infrastructure dynamically, or rely on third-party data brokers. Their requests may appear opportunistic, and are often harder to fingerprint or attribute.
This ecosystem is growing fast. And because application builders typically monetize by querying or serving content, not by training, they have a strong incentive to crawl aggressively and frequently.
How we identify LLM crawlers at DataDome
We don’t rely on any single indicator to flag LLM bots. Depending on how forthcoming a provider is, we employ different strategies for identification:
If a crawler publishes its IP range, we create a verified bot model that is blocked by default unless the customer chooses otherwise. OpenAI’s GPTBot and Google-Extended fall into this category.
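For illustration, a simplified version of this IP-range check might look like the Python sketch below. It is not DataDome's implementation; the gptbot.json URL and payload shape are assumptions based on OpenAI's published crawler documentation and may change.

```python
import ipaddress
import json
import urllib.request

# Example: OpenAI publishes the egress IP ranges used by GPTBot.
# The URL and payload shape below are assumptions; check the provider's
# documentation for the current location and format.
GPTBOT_RANGES_URL = "https://openai.com/gptbot.json"

def load_published_ranges(url: str) -> list:
    """Download a provider's published CIDR blocks and parse them."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    networks = []
    for prefix in payload.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_verified_crawler_ip(client_ip: str, networks: list) -> bool:
    """True if the request IP falls inside one of the published ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)
```

A request that claims to be GPTBot but originates outside those ranges can then be treated as a spoofed identifier rather than a verified bot.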
If the provider uses a distinct reverse DNS, we can build a similarly verified model based on hostname resolution.
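Hostname-based verification typically works the way it long has for search engine crawlers: reverse-resolve the client IP, check that the returned hostname belongs to the provider's domain, then forward-resolve that hostname to confirm it maps back to the same IP. A minimal sketch, with the allowed suffixes as placeholder assumptions rather than any provider's documented values:

```python
import socket

# Placeholder domain suffixes for illustration; a real deployment would use
# the crawler hostnames documented by each provider.
ALLOWED_SUFFIXES = (".example-crawler.com", ".search.example.net")

def verify_by_reverse_dns(client_ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)  # PTR lookup
    except OSError:
        return False
    if not hostname.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the client IP.
        resolved_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    return client_ip in resolved_ips
```

The forward-confirmation step matters because PTR records are controlled by whoever owns the IP space, so a reverse lookup alone can be forged.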
When only a User-Agent string is available and there is no reliable IP or DNS information to corroborate it, we err on the side of caution. These bots are manually flagged and hard-blocked by default. Because the origin of the requests cannot be reliably verified, no detection model is created for them. This is a strict security measure designed to prevent abuse through spoofed identifiers and ensure consistent protection.
For truly opaque activity, we maintain a catch-all rule for any bot-like User-Agent strings not associated with verified models. This approach allows us to scale our response as new crawlers emerge, without letting unknown actors slip through the cracks.
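A rough sketch of that catch-all logic is shown below; the patterns and verified-bot tokens are illustrative placeholders, not DataDome's actual rules:

```python
import re

# Crawler tokens that already have dedicated verified models (illustrative list).
VERIFIED_UA_TOKENS = {"GPTBot", "Google-Extended", "Applebot-Extended"}

# Generic "bot-like" markers used by the catch-all rule.
BOT_LIKE_PATTERN = re.compile(r"bot|crawler|spider|scraper|fetch", re.IGNORECASE)

def catch_all_decision(user_agent: str) -> str:
    """Flag bot-like User-Agents that are not covered by a verified model."""
    if any(token in user_agent for token in VERIFIED_UA_TOKENS):
        return "handled-by-verified-model"
    if BOT_LIKE_PATTERN.search(user_agent):
        return "block"           # self-declared automation with no verified origin
    return "continue-detection"  # hand off to behavioral analysis
```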
Default response models: no one-size-fits-all
We don’t prescribe a universal policy for how customers should handle LLM traffic. The value or risk of an AI bot depends heavily on context.
Some of our customers see strategic upside in being indexed or referenced by prominent LLMs. For example, e-commerce platforms might benefit from being included in chatbot recommendations or product search results. In those cases, allowing access from a verified LLM application might make sense.
Others are more concerned about IP misuse or data extraction. If an AI bot is training a model on proprietary or monetized content without permission, the case for blocking is strong—especially when there’s no attribution or downstream benefit.
For this reason, every LLM bot model in our platform comes with configurable responses: authorize, block, rate-limit, or challenge. These decisions can be adjusted per customer and updated over time as models or use cases evolve.
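Conceptually, this amounts to a per-customer mapping from bot model to response, with platform defaults as the fallback. The sketch below is a simplified illustration using assumed names, not DataDome's API:

```python
from enum import Enum

class Response(Enum):
    AUTHORIZE = "authorize"
    BLOCK = "block"
    RATE_LIMIT = "rate_limit"
    CHALLENGE = "challenge"

# Hypothetical platform defaults and one customer's overrides.
PLATFORM_DEFAULTS = {"GPTBot": Response.BLOCK, "Google-Extended": Response.AUTHORIZE}
customer_policy = {"GPTBot": Response.RATE_LIMIT}  # e.g. allow limited prompt-time fetching

def decide(bot_model: str) -> Response:
    """Customer overrides win; otherwise fall back to the platform default."""
    return customer_policy.get(bot_model, PLATFORM_DEFAULTS.get(bot_model, Response.CHALLENGE))
```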
Examples from the field
To help clarify how this plays out in practice, here’s how we currently classify and handle some of the most visible LLM crawlers:
| Bot Name | Classification | Identification Method | Default Response |
| --- | --- | --- | --- |
| Google-Extended | LLM Application | IP, User-Agent | Authorize |
| ClaudeBot (Anthropic) | Foundation Model | User-Agent only | No global decision; blocked or authorized on a per-customer basis |
| Applebot-Extended | Foundation + Application | IP, rDNS, User-Agent | Authorize (default); some customers block |
| Meta-ExternalAgent | Foundation + Application | ASN + User-Agent | No global decision; blocked or authorized on a per-customer basis |
| ChatGPT, GPTBot (OpenAI) | Foundation + Application | IP + User-Agent | Hard-block by default; customers can override |
Note: While some bots publish documentation or IPs, others don’t, making verified classification difficult. Anthropic, for example, publishes IPs used for outbound API traffic—not crawler origins—rendering their crawler unverifiable via IP-based methods. Similarly, Meta’s shared ASN is too broad for consistent attribution.
The business case for bot protection, even when allowing AI agents
The idea of “good bot vs. bad bot” no longer maps cleanly to AI traffic. Many AI crawlers are capable of both benign and exploitative behavior depending on how they’re configured or misused.
Take agentic crawlers, for example. In some scenarios, they streamline product searches, fetch up-to-date content, or support user workflows. But the same capabilities also enable more evasive forms of fraud: account takeovers, promo abuse, scraping behind login, or replaying transactional APIs.
In January 2025 alone, DataDome recorded 178.3 million requests from OpenAI crawlers, a 14.5% month-over-month increase. During the launch of OpenAI's Operator agent, request volume surged by 48% in just 48 hours.
Zooming out, LLM crawlers accounted for 4.5% of legitimate bot traffic across our platform last month. And more broadly, 36.7% of all traffic we observed over the past year came from non-browser sources, including API clients, mobile SDKs, and autonomous agents.
In this landscape, allowing AI bots doesn’t mean dropping your guard. Bot protection remains critical—especially to ensure that traffic is doing what it claims, and nothing more.
Secure what matters in the age of AI crawlers
LLM crawlers are no longer a fringe phenomenon. They’re accelerating in volume and sophistication, changing how content is accessed, leveraged, and monetized across the web. While some interactions—like AI-powered search assistants surfacing product pages—can create legitimate value, others result in large-scale scraping, unauthorized model training, and abuse of application logic.
This complexity requires more than basic filtering. DataDome equips security teams with the tools to analyze, assess, and respond to this wave of AI-driven traffic in real time. Our platform uses AI to fight AI—leveraging machine learning models trained on billions of daily requests to detect emerging crawler patterns, distinguish between helpful and harmful automation, and act based on behavioral intent.
We classify LLM traffic not by assumptions, but by how it behaves. Whether a crawler is declared, undeclared, or masquerading as a human, our engine continuously evaluates risk and adapts in milliseconds—blocking threats, authorizing trusted bots, and giving customers full control over their exposure.
As LLMs reshape how digital experiences are delivered and consumed, DataDome ensures your applications, APIs, and content remain protected, performant, and aligned to your business goals.
With bot activity now accounting for more than a third of internet traffic, and LLMs making up an increasing share, relying on legacy rules or basic user-agent filtering isn’t enough. Organizations need adaptive, AI-powered defenses that understand behavior, not just labels.
In addition to detection and protection, we now offer new ways to turn AI traffic into value. Through our recently announced partner ecosystem, DataDome customers can take control of how AI agents access their content—and monetize those interactions on their terms.
If your digital properties are exposed to AI-driven automation, it’s time to go beyond binary allow/block logic and secure access based on intent.
That’s where DataDome thrives. Want to see more? Schedule a demo with us today.