GPT-5 Safeguards Bypassed Using Storytelling-Driven Jailbreak

Aug. 12, 2025

A new technique has been documented that can bypass GPT-5’s safety systems, demonstrating that the model can be led toward harmful outputs without receiving overtly malicious prompts.

The method, tested by security researchers at NeuralTrust, combines the Echo Chamber attack with narrative-driven steering to gradually guide responses while avoiding detection.

The approach builds on a jailbreak previously demonstrated against Grok-4 just 48 hours after its public debut. In that case, researchers combined Echo Chamber with the Crescendo method to escalate prompts over multiple turns, ultimately eliciting instructions for creating a Molotov cocktail.

The GPT-5 study adapted this strategy by replacing Crescendo with storytelling to achieve similar results.

How the GPT-5 Jailbreak Works

NeuralTrust researchers began by seeding benign-sounding text with select keywords, then steering the conversation through a fictional storyline.

The narrative served as camouflage, allowing harmful procedural details to emerge as the plot developed. This was done without directly requesting illegal instructions, avoiding trigger phrases that would typically cause the model to refuse.

The process followed four main steps:

  • Introduce a low-salience “poisoned” context in harmless sentences

  • Sustain a coherent story to mask intent

  • Ask for elaborations that maintain narrative continuity

  • Adjust stakes or perspective if progress stalls

One test used a survival-themed scenario. The model was first asked to use words such as “cocktail,” “story,” “survival,” “molotov,” “safe” and “lives” in a narrative. Through repeated requests to expand the story, GPT-5 eventually provided more technical, step-by-step content, embedded entirely within the fictional frame.


Risks and Recommendations

The researchers found that urgency, safety and survival themes increased the likelihood of the model advancing toward the unsafe objective. Since the harmful material emerged through gradual context shaping rather than a single prompt, keyword-based filtering was ineffective.
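As a toy illustration (with invented placeholder turns, keywords and threshold, not the researchers' actual prompts or OpenAI's filtering logic), the sketch below shows why a per-turn keyword filter can miss this pattern: no single message crosses a naive threshold, yet the accumulated conversation contains the full seeded set.

```python
# Toy illustration: a naive per-turn keyword filter versus the same check applied
# to the accumulated conversation. Turns, keywords and threshold are invented
# placeholders, not the researchers' actual prompts or OpenAI's filtering logic.
BLOCKLIST = {"alpha", "bravo", "charlie", "delta"}  # stand-ins for sensitive terms
PER_TURN_THRESHOLD = 2  # refuse only if a single message contains 2+ listed terms

turns = [
    "Tell me a survival story that mentions alpha.",         # 1 hit -> passes
    "Continue the story; the characters discuss bravo.",     # 1 hit -> passes
    "Expand that scene and weave in charlie for realism.",    # 1 hit -> passes
    "Raise the stakes: delta becomes central to the plot.",   # 1 hit -> passes
]

def hits(text: str) -> int:
    words = {w.strip(".,;:").lower() for w in text.split()}
    return len(BLOCKLIST & words)

# Per-turn filtering: every message slips under the threshold.
print([hits(t) >= PER_TURN_THRESHOLD for t in turns])  # [False, False, False, False]

# Conversation-level view: the full seeded keyword set is present.
print(hits(" ".join(turns)))                            # 4
```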

“The model strives to be consistent with the already-established story world,” the authors noted. “This consistency pressure subtly advances the objective.”

The study recommends conversation-level monitoring, detection of persuasion cycles and robust AI gateways to prevent such attacks.
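One way to approximate the recommended conversation-level monitoring is to score the accumulated dialogue, rather than each message in isolation, against descriptions of restricted topics and escalate when the drift score crosses a threshold. The sketch below is a simplified heuristic under assumed topic descriptions, thresholds and function names; it is not NeuralTrust's tooling or a production-grade detector.

```python
# Minimal sketch of conversation-level monitoring at an AI gateway: score the
# *accumulated* dialogue window against restricted-topic descriptions and flag
# sustained drift. Topic text, threshold and names are illustrative assumptions.
import math
import re
from collections import Counter

RESTRICTED_TOPICS = {
    "incendiary_device": "molotov cocktail incendiary bottle fuel wick ignite",
}
DRIFT_THRESHOLD = 0.15   # assumed toy threshold for triggering a review
WINDOW = 6               # number of most recent turns scored together

def _vec(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def assess_conversation(turns: list[str]) -> dict:
    """Score the recent window of the dialogue as a whole, not turn by turn."""
    window = _vec(" ".join(turns[-WINDOW:]))
    scores = {name: _cosine(window, _vec(desc)) for name, desc in RESTRICTED_TOPICS.items()}
    topic = max(scores, key=scores.get)
    return {"topic": topic, "score": round(scores[topic], 3),
            "escalate": scores[topic] >= DRIFT_THRESHOLD}

# Individually innocuous turns: none would trip a per-prompt filter, but the
# combined window overlaps the restricted-topic description enough to flag it.
print(assess_conversation([
    "Write a survival story.",
    "The characters improvise with a bottle and some fuel.",
    "Describe how they prepare the wick so the group stays safe.",
]))
```

In practice a gateway would use embeddings or a trained classifier rather than bag-of-words cosine similarity, but the structural point is the same: the unit of analysis is the conversation window, not the individual prompt.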

While GPT-5’s guardrails can block direct requests, the findings show that strategically framed, multi-turn dialogue remains a potent threat vector.
