And Now, LLMs Don’t Need Human Intervention to Plan and Execute Large, Complex Attacks
Uh oh. Large language models (LLMs) apparently have the capability to autonomously plan and execute very complex attacks against networks.
So say researchers at Carnegie Mellon University who found that “LLMs — when equipped with structured abstractions and integrated into a hierarchical system of agents — can function not merely as passive tools, but as active, autonomous red team agents capable of coordinating and executing multi-step cyberattacks without detailed human instruction.”
And that’s not so good news for cybersecurity. “This raises the stakes for defenders,” says Margaret Cunningham, director of security & AI strategy and field CTO at Darktrace. “Traditional detection methods won’t scale against systems that plan and adapt.”
Much of the earlier research on LLMs explored how they performed “in simplified capture-the-flag (CTF) environments.” But the Carnegie Mellon team took it to the next level, “evaluating LLMs in realistic enterprise network environments and considering sophisticated, multi-stage attack plans.”
The Carnegie Mellon researchers intended “to understand whether an LLM could perform the high-level planning required for real-world network exploitation, and we were surprised by how well it worked,” the report quoted PhD candidate Brian Singer, who led the project, as saying. “We found that by providing the model with an abstracted ‘mental model’ of network red teaming behavior and available actions, LLMs could effectively plan and initiate autonomous attacks through coordinated execution by sub-agents.”
To get there, the researchers used LLMs capable of reasoning and knowledgeable about security tools. On their own, those LLMs came up empty during the challenges. But once Singer and team taught the LLMs (and some smaller ones) “a mental model and abstraction of security attack orchestration, they showed dramatic improvement.”
In large part, that’s because researchers removed a limiting factor — the requirement that LLMs execute raw shell commands. Instead, their system imbues the LLMs “with higher-level decision-making capabilities while delegating low-level tasks to a combination of LLM and non-LLM agents,” the study said.
“What’s important to understand is that the researchers didn’t train the LLM to be ‘smarter,’” says Cunningham. Instead, they equipped it with “better tools, clearer instructions, and a structured environment, which enabled it to perform the task incredibly effectively.” She compared it to “giving a kid a well-designed science kit instead of a pile of wires,” contending that the research provides “a clear example of how humans are getting even more sophisticated at ‘engineering’ the LLMs to accomplish complex tasks autonomously.”
To test their models, the team turned to the 2017 Equifax data breach, recreating the network environment in which that breach happened, complete with the same vulnerabilities and topology that characterized Equifax’s network. The LLM replicated the breach without human intervention.
“I think the Equifax simulation is a clear signal. When given the right conditions by humans, the system autonomously scanned, exploited, moved laterally, escalated privileges, and exfiltrated data across 48 databases in a realistic enterprise environment,” says Cunningham. “This was a full campaign.”
She does not doubt that, given their existing modular tooling and operational workflows that align with the architecture, “nation-state actors are well-positioned to adopt this.” Cunningham expects integration in the next 6-12 months, with cybercriminals following, “especially as open-source tooling lowers the barrier to entry.”
While Jeremy London, director of engineering, AI and threat analytics at Keeper Security, points out that the same technology lets organizations “run continuous, cost-effective attack simulations – something that was once expensive and rare – making advanced security testing accessible to companies of all sizes,” he notes that “traditional reactive defenses are no longer enough.”
London advocates for security strategies that are adaptive and continuous and “grounded in principles like zero-trust and least privilege to reduce risk and enable faster response.”
He urges organizations to get on the ball and move quickly to get ahead of attackers. “Teams that embrace AI-powered security early will have a real advantage.” That means ramping up the skills needed to work with new AI tools “to keep pace with evolving threats.”
These advances in LLM capabilities will likely compromise supply chain security, already a tough nut to crack, even further. “A single vendor without modern, AI-enhanced defenses can quickly become the weak link that attackers exploit.”
Cunningham is calling for a shift toward behavioral analytics, “specifically models that infer intent from sequences of actions, not just signatures or anomalies,” to reason about “what an attacker is trying to achieve, not just what they’re doing in the moment.”
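To make the idea of inferring intent from a sequence of actions concrete, here is a minimal defensive sketch in Python. It is an illustration only, not anything described in the study or by Darktrace: the stage names loosely mirror the steps Cunningham lists for the Equifax simulation, and the `Event` type, the function names and the threshold are all hypothetical. Real behavioral analytics would rely on far richer models than this ordered-subsequence check.

```python
from dataclasses import dataclass

# Hypothetical stage labels, loosely following the campaign steps quoted above.
# These names are assumptions for illustration, not a real detection taxonomy.
CAMPAIGN_STAGES = [
    "scan",
    "exploit",
    "lateral_movement",
    "privilege_escalation",
    "exfiltration",
]


@dataclass
class Event:
    host: str    # where the action was observed
    action: str  # label assigned by some upstream classifier (assumed to exist)


def campaign_progress(events, stages=CAMPAIGN_STAGES):
    """Count how many stages appear, in order, as a subsequence of the events."""
    reached = 0
    for event in events:
        if reached < len(stages) and event.action == stages[reached]:
            reached += 1
    return reached


def infer_intent(events, threshold=3):
    """Flag a likely campaign once enough ordered stages have been observed."""
    reached = campaign_progress(events)
    if reached >= threshold:
        return f"likely multi-stage campaign ({reached} of {len(CAMPAIGN_STAGES)} stages seen)"
    return "no campaign pattern inferred"


if __name__ == "__main__":
    observed = [
        Event("web-01", "scan"),
        Event("web-01", "login"),             # unrelated noise is skipped
        Event("web-01", "exploit"),
        Event("db-02", "lateral_movement"),
    ]
    print(infer_intent(observed))
```

The point of the sketch is that no single event here looks alarming on its own; it is the ordered progression across hosts that suggests what the actor is trying to achieve.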
To overcome the risks, industry, government and academia must collaborate. But with fractures in alliances becoming more prominent, it is becoming more difficult to have faith in future collaboration. “The real question now,” London says, “is how fast organizations will adopt these tools to build smarter, stronger defenses for the future.”

