Just 48 hours after its public debut, Grok-4 was successfully jailbroken using a newly enhanced attack method.
Researchers from NeuralTrust combined two known strategies, Echo Chamber and Crescendo, to bypass the AI model’s safety systems and elicit harmful responses without issuing any explicitly malicious prompts.
The attack was designed to test whether a state-of-the-art large language model (LLM) could be manipulated into providing illegal instructions.
In this case, the target was to get Grok-4 to reveal step-by-step directions for making a Molotov cocktail, a scenario previously used in the original Crescendo paper.
A Dual-Phase Approach to Jailbreaking
NeuralTrust began by running the Echo Chamber attack, which poisons the model’s conversational context and nudges it toward unsafe behavior.
In the initial trial, the prompts were too direct, triggering Grok-4’s internal safeguards. However, after adjusting the inputs to be more subtle, the team successfully initiated Echo Chamber’s full workflow, including a persuasion cycle designed to gradually shift the model's tone.
Although Echo Chamber alone brought the model closer to the objective, it wasn’t enough to fully break through. That’s when Crescendo was added – a technique that incrementally intensifies a prompt across multiple conversational turns to escalate the model’s response.
With just two additional exchanges, the combined method succeeded in eliciting harmful content, only two days into Grok-4’s deployment.
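To make the multi-turn mechanics concrete, the sketch below is a minimal illustration (not NeuralTrust's tooling) of how a chat-style API carries the full accumulated message history on every call; the chat() function and message format are placeholder assumptions, but the accumulated context they represent is the surface that techniques like Echo Chamber and Crescendo manipulate.

```python
# Minimal sketch: how a multi-turn exchange accumulates context at the API level.
# chat() is a hypothetical stand-in for any chat-completion endpoint.

def chat(messages):
    # Placeholder: a real implementation would send the full message history
    # to an LLM API and return the assistant's reply.
    return f"[model reply given {len(messages)} accumulated messages]"

history = [{"role": "system", "content": "You are a helpful assistant."}]

# Each user turn on its own can look benign; the steering effect comes from
# the accumulated history the model sees on every call.
for user_turn in ["<seed prompt>", "<follow-up referencing the model's prior reply>"]:
    history.append({"role": "user", "content": user_turn})
    reply = chat(history)
    history.append({"role": "assistant", "content": reply})
    print(reply)
```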
Measured Results Across Multiple Scenarios
Following this initial success, the NeuralTrust team tested other prompts involving illegal activities.
They manually selected objectives from the Crescendo paper, including those related to drug synthesis and chemical weapons. The combined method proved effective in several of these cases, including:
- 67% success rate for Molotov cocktail instructions
- 50% for methamphetamine-related prompts
- 30% for toxin-related responses
In one instance, Grok-4 produced a harmful response within a single conversational turn, making the Crescendo phase unnecessary.
New Risks for Multi-Turn LLM Safety
The key insight of this research is that Grok-4 did not need to be explicitly asked to do anything illegal. Instead, the conversation was shaped gradually using carefully engineered prompts.
As the researchers noted, “attacks can bypass intent or keyword-based filtering by exploiting the broader conversational context.”
The study highlights the challenge of defending against subtle, multi-step attacks. While Grok-4 and other LLMs are typically trained to detect and reject harmful prompts, techniques like Echo Chamber and Crescendo exploit the model’s broader dialogue dynamics, often slipping through unnoticed.
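As a hedged illustration of the gap the researchers describe, the sketch below contrasts per-prompt screening with a check that scores the accumulated conversation. The score_text() function is a hypothetical stand-in for whatever safety classifier a deployment uses, not a real API, and the heuristic inside it is purely illustrative.

```python
# Illustrative sketch only: per-turn filtering vs. conversation-level scoring.
# score_text() is a hypothetical placeholder for a safety classifier.

def score_text(text: str) -> float:
    # Placeholder heuristic returning a risk score in [0, 1].
    flagged_terms = ("explosive", "synthesis route")
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.1

def per_turn_filter(turns: list[str], threshold: float = 0.5) -> bool:
    # Checks each message in isolation -- the pattern that subtle
    # multi-turn attacks are designed to slip past.
    return any(score_text(turn) > threshold for turn in turns)

def conversation_filter(turns: list[str], threshold: float = 0.5) -> bool:
    # Scores the concatenated dialogue so the classifier sees the broader
    # conversational context rather than isolated messages.
    return score_text(" ".join(turns)) > threshold

dialogue = ["<benign-looking turn 1>", "<benign-looking turn 2>"]
print(per_turn_filter(dialogue), conversation_filter(dialogue))
```

The design point is simply where the check sits: a per-turn filter never sees the gradual drift across a dialogue, while a conversation-level check at least has access to it, which is the distinction the NeuralTrust findings highlight.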
A successful jailbreak so soon after Grok-4’s release underscores the urgency of advancing LLM safety beyond surface-level filtering, particularly as these systems are increasingly deployed in high-stakes environments.