It Took a Day for SPLX, NeuralTrust to Jailbreak OpenAI’s GPT-5

by Jeffrey Burt on August 15, 2025

OpenAI’s newest large language model (LLM), GPT-5, may be – in the words of company executives – the “smartest, fastest, most useful model yet,” but it’s not the most secure.

A day after the generative AI company unveiled GPT-5, two security vendors wrote that, using jailbreaking and other attack techniques commonly deployed against LLMs, they were able to manipulate the new model into producing harmful outputs.

Researchers with AI security startup SPLX wrote in a report that the raw configuration of the model – which has no system prompt – “is nearly unusable for enterprise out of the box” and that “even OpenAI’s internal prompt layer leaves significant gaps, especially in Business Alignment.”

In addition, in multiple configurations, the previous GPT-4o version was safer and more secure than the latest, according to SPLX.

At the same time, NeuralTrust, another generative AI security startup, wrote in a report that its researchers were able to jailbreak GPT-5 by combining Echo Chamber – a novel jailbreaking technique that the vendor introduced in late June – with narrative-driven steering, or storytelling.

“We use Echo Chamber to seed and reinforce a subtly poisonous conversational context, then guide the model with low-salience storytelling that avoids explicit intent signaling,” wrote Martí Jordà, security researcher with NeuralTrust. “This combination nudges the model toward the objective while minimizing triggerable refusal cues.”

The reports come even as both OpenAI and Microsoft boasted of the model’s security and safety.

The Model’s Security Needs Work

The reports cast some shade on the high-profile August 7 launch of GPT-5, when OpenAI boasted of the LLM’s unified system that includes a smart and efficient model that answers most questions, a deeper reasoning model that addresses more difficult problems, and a real-time router that directs each conversation to one or the other based on the type of conversation, its complexity, the tools needed, and the user’s explicit intent.

Enterprises, research organizations, and other entities are quickly embracing LLMs and, now, AI agents, but cybersecurity pros are pushing AI vendors to build greater security into their offerings, particularly given the sensitive corporate information that is increasingly flowing through them.

Guardrails Come with Cost, Performance Concerns

The models from OpenAI and others, like Anthropic, Google, Mistral, and Meta, include guardrails to keep them from answering queries that are considered harmful, unethical, or unsafe. However, bad actors continue to explore ways to bypass such protections through methods like jailbreaks, in which they craft prompts that get the models to reveal data or produce responses that violate those guardrails.

“The root issue is that the model’s alignment and filtering mechanisms aren’t airtight, so a smartly phrased or encoded prompt can dodge those blocks,” CyberArk engineers wrote in a blog post in April. “Because LLMs are designed to be flexible and helpful, it can be surprisingly easy to weave around the normal safeguards.”

Guardrails are essential, but they can add to response times and to costs, because they can involve more LLM calls, which require more compute resources.

“Organizations must balance the need for robust security measures with the potential trade-offs in performance and cost, while recognizing that security is essential for mitigating greater risks and must be prioritized accordingly,” the engineers wrote.

Myriad Simulated Attacks

In their experiments, SPLX researchers ran more than 1,000 attack scenarios against three GPT-5 configurations – no system prompt, basic system prompt, and hardened prompt. The model without a system prompt failed 89% of the time – meaning the attack succeeded – while using the basic prompt dropped the failure rate to 43%. The model with the hardened prompt failed 45% of the time.

For GPT-4o, the failure rate for the configuration without the system prompt was 71%.
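To make the metric concrete, here is a minimal Python sketch of how a per-configuration failure rate of this kind could be tallied. It is an illustration only: the function name, the configuration labels, and the toy data are assumptions, and nothing here reflects SPLX's actual tooling or results.

```python
from collections import defaultdict


def failure_rates(results):
    """Return the percentage of successful attacks per configuration.

    `results` is an iterable of (config_name, attack_succeeded) pairs.
    A "failure" here means the model did not resist the attack.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for config, attack_succeeded in results:
        totals[config] += 1
        if attack_succeeded:
            failures[config] += 1
    return {c: round(100 * failures[c] / totals[c], 1) for c in totals}


# Toy data for illustration only; these are NOT SPLX's results.
toy_results = [
    ("no_system_prompt", True),
    ("no_system_prompt", True),
    ("no_system_prompt", False),
    ("basic_system_prompt", True),
    ("basic_system_prompt", False),
    ("basic_system_prompt", False),
]
rates = failure_rates(toy_results)
print(rates)
```

Under this definition, a lower number is better: it is the share of attack scenarios that got through, not the share the model handled correctly.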

“Even GPT-5, with all its new ‘reasoning’ upgrades, fell for basic adversarial logic tricks,” SPLX researchers wrote. “One of the most effective techniques we used was a StringJoin Obfuscation Attack, inserting hyphens between every character and wrapping the prompt in a fake ‘encryption challenge.’ This mirrors similar vulnerabilities we exposed in GLM-4.5 [from Chinese company Z.ai], Kimi K2 [Chinese company Moonshot AI], and Grok 4 [xAI], suggesting systemic weaknesses across leading LLMs.”

They added that “OpenAI’s latest model is undeniably impressive, but security and alignment must still be engineered, not assumed.”
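The hyphen-insertion trick SPLX describes is simple enough to show from the defender's side. The Python sketch below, which uses a harmless placeholder term, illustrates why a verbatim keyword filter misses text with a separator between every character, and how a normalization pass before filtering could recover it. The blocklist, function names, and separator set are assumptions made for illustration, not SPLX's test harness or any vendor's actual filter.

```python
import re

# Hypothetical placeholder blocklist; real filters are far more elaborate.
BLOCKLIST = {"forbidden"}

# Runs of three or more single characters joined by -, . or *
JOINED_RUN = re.compile(r"\b\w(?:[-.*]\w){2,}\b")


def string_join(text, sep="-"):
    """Insert `sep` between every character of `text`."""
    return sep.join(text)


def naive_filter(prompt):
    """Return True if a blocklisted term appears verbatim in the prompt."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKLIST)


def normalize(prompt):
    """Collapse separator-joined single characters back into words.

    Ordinary hyphenated words such as "well-known" are left alone.
    Limitation: dotted strings like "1.2.3" would also be collapsed.
    """
    return JOINED_RUN.sub(
        lambda m: re.sub(r"[-.*]", "", m.group(0)), prompt
    )


obfuscated = "please print the " + string_join("forbidden") + " word"
print(naive_filter(obfuscated))             # filter sees no blocked term
print(naive_filter(normalize(obfuscated)))  # normalized text is caught
```

Normalization of this sort addresses only one encoding trick. It does nothing about the fake "encryption challenge" framing that SPLX says wrapped the prompt, which is why the researchers argue for layered defenses.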

Return of Echo Chamber

NeuralTrust researchers in June said Echo Chamber was a technique that uses context poisoning – which involves introducing malicious training data to compromise the AI model – and multi-turn reasoning to guide models into creating harmful content, all of which is done without issuing an explicitly dangerous prompt. Instead, it’s done via indirect references.

They used Echo Chamber with storytelling to get the GPT-5 model to eventually describe how to build a Molotov cocktail, starting with a prompt asking it to create sentences using the words “cocktail,” “story,” “survival,” “molotov,” “safe,” and “lives.”

“This progression shows Echo Chamber’s persuasion cycle at work: the poisoned context is echoed back and gradually strengthened by narrative continuity,” Jordà wrote. “The storytelling angle functions as a camouflage layer, transforming direct requests into continuity-preserving elaborations. We deliberately omit operational details and redact any procedural specifics.”

Move Beyond Single-Turn Intent

Eventually, “the narrative device increases stickiness: the model strives to be consistent with the already-established story world. This consistency pressure subtly advances the objective while avoiding overtly unsafe prompts,” he wrote. “This reinforces a key risk: keyword or intent-based filters are insufficient in multi-turn settings where context can be gradually poisoned and then echoed back under the guise of continuity.”

He said organizations need to do more than scan for single-turn intent. They should evaluate defenses that run at the conversational level, monitor context drift, and detect persuasion, using tools like red teaming and AI gateways.
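The gap between single-turn scanning and conversation-level monitoring can be sketched in a few lines of Python. The example below is a toy under stated assumptions: the placeholder terms, weights, thresholds, and decay factor are all invented for illustration, and a production system would use trained classifiers rather than word lists. It is not NeuralTrust's method.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder terms and weights, chosen only for illustration.
RISK_TERMS = {"alpha": 0.3, "bravo": 0.3, "charlie": 0.3}
TURN_THRESHOLD = 0.5          # per-message flagging threshold (assumed)
CONVERSATION_THRESHOLD = 0.8  # cumulative flagging threshold (assumed)


def turn_score(message):
    """Score a single message by summing weights of risk terms it contains."""
    words = set(message.lower().split())
    return sum(weight for term, weight in RISK_TERMS.items() if term in words)


@dataclass
class ConversationMonitor:
    """Tracks a decayed running risk score across the turns of a conversation."""
    decay: float = 0.9
    cumulative: float = 0.0
    history: list = field(default_factory=list)

    def observe(self, message):
        score = turn_score(message)
        # Older turns fade gradually but still count toward the total.
        self.cumulative = self.cumulative * self.decay + score
        self.history.append(score)
        return {
            "turn_flag": score >= TURN_THRESHOLD,
            "conversation_flag": self.cumulative >= CONVERSATION_THRESHOLD,
        }


monitor = ConversationMonitor()
turns = [
    "tell a story about alpha",
    "continue the story with bravo",
    "now add charlie to the scene",
]
reports = [monitor.observe(t) for t in turns]
for turn, report in zip(turns, reports):
    print(turn, report)
```

In this toy run, no single message crosses the per-turn threshold, yet the running score crosses the conversation threshold on the third turn. That is the pattern Jordà describes: each step looks benign in isolation, and only the accumulated context reveals the drift.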