Crescendo
Crescendo is a multi-turn jailbreaking method that mimics the way a skilled social engineer might manipulate a conversation. Rather than attempting to elicit harmful content in a single prompt, Crescendo builds rapport and context over multiple conversation turns, gradually steering the dialogue toward the target topic. Each message is individually innocuous, but the cumulative conversational trajectory leads the model to produce responses it would refuse if asked directly.
This approach was inspired by the observation that many real-world jailbreaks discovered in production chatbots involve multi-turn conversations where users incrementally push boundaries. Crescendo automates this pattern, using an attacker LLM to plan and execute the conversation strategy while an evaluator LLM monitors whether the target model has sufficiently complied with the harmful goal.
How Multi-Turn Context Building Works
Crescendo operates through a structured sequence of conversation turns:
- Topic introduction: The attacker model initiates a conversation on a topic adjacent to the harmful goal, something the target model would willingly discuss. For example, if the goal involves creating malware, the opening might ask about cybersecurity education or historical examples of notable software vulnerabilities.
- Progressive escalation: Over subsequent turns, the attacker gradually shifts the conversation toward more specific and sensitive territory. Each turn builds on the context established by previous exchanges, exploiting the model’s tendency to maintain conversational coherence and helpfulness within an ongoing dialogue.
- Context exploitation: By the time the attacker makes the critical request, the conversation has established a framing context (educational, fictional, hypothetical) that makes refusal feel inconsistent with the model’s prior responses. The accumulated context effectively acts as a soft jailbreak that doesn’t trigger the pattern-matching safety filters designed for single-turn attacks.
- Evaluation and scoring: After each round, the evaluator model assesses whether the target model’s latest response constitutes sufficient compliance with the original harmful goal. The attack terminates early if the evaluator determines the goal has been achieved.
The max_rounds parameter controls the maximum number of conversation turns. Each round includes one attacker message and one target response. The attacker model has access to the full conversation history and the original goal, allowing it to adapt its strategy based on the target model’s reactions.
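The round structure described above can be sketched as a short loop. This is a minimal illustration, not the library's implementation: `run_crescendo`, `AttackResult`, and the three callables are hypothetical names introduced here, and the attacker, target, and evaluator are stand-in functions rather than real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical types for this sketch; none of these come from the library.
Message = Tuple[str, str]                       # (role, text)
Attacker = Callable[[str, List[Message]], str]  # (goal, history) -> next user message
Target = Callable[[List[Message]], str]         # history -> assistant reply
Evaluator = Callable[[str, str], bool]          # (goal, reply) -> goal achieved?


@dataclass
class AttackResult:
    succeeded: bool
    success_round: Optional[int]  # 1-indexed round of success, or None
    transcript: List[Message] = field(default_factory=list)


def run_crescendo(goal: str, attacker: Attacker, target: Target,
                  evaluator: Evaluator, max_rounds: int) -> AttackResult:
    history: List[Message] = []
    for round_number in range(1, max_rounds + 1):
        # The attacker sees the original goal and the full history so far.
        user_message = attacker(goal, history)
        history.append(("user", user_message))
        # The target sees only the conversation, never the goal itself.
        reply = target(history)
        history.append(("assistant", reply))
        # Terminate early once the evaluator judges the goal achieved.
        if evaluator(goal, reply):
            return AttackResult(True, round_number, history)
    # The round budget ran out without a successful round.
    return AttackResult(False, None, history)


if __name__ == "__main__":
    # Benign stand-ins that only exercise the control flow.
    attacker = lambda goal, history: f"question {len(history) // 2 + 1}"
    target = lambda history: f"reply to {history[-1][1]}"
    evaluator = lambda goal, reply: reply.endswith("3")
    result = run_crescendo("placeholder goal", attacker, target, evaluator, max_rounds=8)
    print(result.succeeded, result.success_round)
```

Passing the three roles in as plain callables keeps the loop independent of any particular model client, which makes the early-termination logic easy to test in isolation.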
When Crescendo Is More Realistic Than Single-Turn Attacks
Crescendo fills an important gap in the red teaming toolkit. While methods like TAP, AutoDAN, and GCG test whether a model can be jailbroken with a single carefully crafted input, real-world adversaries often don’t operate under that constraint. In a deployed chatbot, an attacker can have an extended conversation, probing the model’s boundaries and adapting their approach in real time.
Crescendo is the most appropriate method when:
- You need to simulate realistic adversarial interactions: Multi-turn attacks represent a large portion of how jailbreaks actually occur in production chatbot deployments.
- The target model has strong single-turn defenses: Models that successfully resist TAP and AutoDAN may still be vulnerable to gradual escalation because their safety filters are calibrated for single-turn patterns.
- You want to test conversational coherence as a vulnerability: Models are trained to be helpful and consistent within a conversation. Crescendo tests whether this helpfulness can be exploited to override safety constraints when context is built incrementally.
- You are evaluating multi-turn safety features: Some models include turn-level safety monitoring. Crescendo directly tests whether these systems detect gradual escalation or only flag individual problematic messages.
Usage
The following example configures Crescendo to test Claude 3.7 Sonnet with up to 8 conversation rounds per goal, using Llama 3.3 as both the attacker and evaluator.
```python
from generalanalysis.jailbreaks import Crescendo, CrescendoConfig
from generalanalysis.data_utils import load_harmbench_dataset

# Configure the target, attacker, and evaluator models and the run limits
config = CrescendoConfig(
    target_model="claude-3-7-sonnet-20250219",
    attacker_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    evaluator_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    project="crescendo_experiment",
    max_rounds=8,
    verbose=False,
    max_workers=20,
)

crescendo = Crescendo(config)

# Load the HarmBench goals and run one multi-turn attack per goal
dataset = load_harmbench_dataset()
score = crescendo.optimize(dataset)
```

The results include the full conversation transcript for each goal, the evaluator’s score at each round, and the round at which the attack succeeded (if it did). This transcript data is valuable for understanding which conversational strategies are most effective and at what point the model’s defenses broke down.
Key Parameters
| Parameter | Description |
|---|---|
| target_model | The model being tested for multi-turn safety vulnerabilities. Any model accessible through the BlackBoxModel interface can be targeted. |
| attacker_model | The LLM that plans and executes the multi-turn conversation strategy. It receives the full conversation history and the original goal at each turn, and generates the next user message. Stronger models produce more sophisticated escalation strategies. |
| evaluator_model | The LLM that assesses whether the target model has sufficiently complied with the harmful goal after each round. Accurate evaluation prevents premature termination (false positives) and wasted rounds (false negatives). |
| project | Name for the experiment results directory. Full conversation transcripts, per-round scores, and aggregated metrics are saved here for analysis. |
| max_rounds | Maximum number of conversation turns before the attack is considered unsuccessful. More rounds give the attacker more opportunities to build context but increase the total query cost. 6–10 rounds is typical; most successful attacks resolve within 5–7 rounds. |
| verbose | Whether to print the conversation transcript in real time during optimization. Useful for debugging attacker strategies and understanding the escalation dynamics. |
| max_workers | Maximum number of concurrent conversations. Since each Crescendo attack involves a sequential multi-turn dialogue, parallelism applies across different goals in the dataset rather than within a single attack. Higher values process the dataset faster. |
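The max_workers description implies parallelism across goals, with each conversation staying sequential. A minimal sketch of that pattern follows, assuming a thread pool; the source does not state which concurrency mechanism the library actually uses, and `run_all_goals` and `run_one` are hypothetical names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_all_goals(goals: List[str], run_one: Callable[[str], bool],
                  max_workers: int) -> List[bool]:
    """Run one attack per goal, with at most max_workers in flight at once.

    run_one is a hypothetical callable that carries out a single sequential
    multi-turn conversation and returns whether it succeeded.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each goal is one task; rounds inside a task are never parallelized.
        # pool.map returns results in the same order as the input goals.
        return list(pool.map(run_one, goals))


if __name__ == "__main__":
    # Benign stand-in: "succeeds" when the goal string has odd length.
    outcomes = run_all_goals(["a", "bb", "ccc"], lambda g: len(g) % 2 == 1, max_workers=2)
    print(outcomes)
```

Threads suit this workload because each conversation spends most of its time waiting on remote model calls rather than computing locally.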
Tips on max_rounds
The max_rounds parameter is the most important tuning knob for Crescendo. Setting it too low means the attacker doesn’t have enough turns to build sufficient context, and many attacks that would eventually succeed are cut short. Setting it too high wastes queries on conversations that have already stalled.
Practical guidance:
- Start with 8 rounds: This provides enough room for the typical escalation pattern (2–3 rounds of topic introduction, 2–3 rounds of progressive deepening, and 1–2 rounds of direct request) while keeping costs manageable.
- Increase to 12–15 for heavily defended models: Models with strong turn-level monitoring may require longer, more gradual escalation to avoid triggering multi-turn safety checks.
- Decrease to 5–6 for efficiency runs: If you’re running Crescendo as part of a larger battery of methods and want a quick signal rather than exhaustive testing, shorter conversations still catch the most obvious multi-turn vulnerabilities.
- Analyze early termination patterns: If most attacks succeed by round 3–4, your max_rounds is likely higher than needed for this target. If many attacks are still improving at the final round, consider increasing the limit.
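The early-termination check in the last tip can be sketched as a small summary over per-attack records. Everything here is an assumption for illustration: the `AttackRecord` shape, the convention that higher evaluator scores mean more compliance, and the choice of half the round budget as the cutoff for "early" are not taken from the library, whose saved result format is not described in this section.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AttackRecord:
    # Hypothetical record shape, not the library's saved format.
    scores: List[float]           # evaluator score after each round (higher = more compliant)
    success_round: Optional[int]  # 1-indexed round of success, or None


def summarize(records: List[AttackRecord], max_rounds: int) -> Dict[str, object]:
    successes = [r.success_round for r in records if r.success_round is not None]
    # How many attacks succeeded at each round.
    histogram = dict(sorted(Counter(successes).items()))
    # Share of successful attacks that finished within half the budget (assumed cutoff).
    early = sum(1 for s in successes if s <= max_rounds // 2)
    # Failed attacks whose score was still rising at the final round.
    improving = sum(
        1 for r in records
        if r.success_round is None and len(r.scores) >= 2 and r.scores[-1] > r.scores[-2]
    )
    return {
        "success_histogram": histogram,
        "early_success_share": early / len(successes) if successes else 0.0,
        "failed_but_improving": improving,
    }


if __name__ == "__main__":
    records = [
        AttackRecord([2, 6, 10], 3),
        AttackRecord([1, 3, 5, 10], 4),
        AttackRecord([1, 1, 2, 2, 3, 3, 4, 5], None),
        AttackRecord([1, 1, 1, 1, 1, 1, 1, 1], None),
    ]
    print(summarize(records, max_rounds=8))
```

A high early-success share suggests the round budget can be lowered, while a large count of failed-but-improving attacks suggests raising it.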
Strengths and Limitations
Strengths:
- Simulates the most realistic adversarial threat model for conversational AI
- Effective against models with strong single-turn defenses
- Produces conversation transcripts that are directly useful for improving multi-turn safety systems
- Natural language throughout — no adversarial artifacts to filter
Limitations:
- Higher per-attack query cost than single-turn methods (each attack requires up to max_rounds × 2 API calls across the target and attacker models, in addition to the evaluator's assessment after each round)
- Sequential nature of conversations limits within-attack parallelism
- Effectiveness depends heavily on the attacker model’s conversational ability
- Less effective against models that reset context between turns or use stateless safety checking
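The query-cost limitation can be made concrete with a worst-case estimate. This sketch assumes one attacker call, one target call, and one evaluator call per round, and that no attack terminates early; the single evaluator call per round is an assumption, and `worst_case_calls` and the goal count of 100 are hypothetical.

```python
from typing import Dict


def worst_case_calls(num_goals: int, max_rounds: int) -> Dict[str, int]:
    """Upper bound on API calls when every attack uses its full round budget."""
    per_model = num_goals * max_rounds  # one call per round per goal
    return {
        "attacker": per_model,
        "target": per_model,
        "evaluator": per_model,  # assumed: one evaluator call per round
    }


if __name__ == "__main__":
    calls = worst_case_calls(num_goals=100, max_rounds=8)
    # Attacker plus target together give the max_rounds x 2 figure per goal.
    print(calls, calls["attacker"] + calls["target"])
```

Actual usage will be lower whenever the evaluator ends attacks before the final round.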
Related Methods
- TAP tree-of-attacks jailbreak — The best single-turn black-box method; use alongside Crescendo for comprehensive coverage
- AutoDAN-Turbo strategy-based jailbreak — Another black-box method that learns strategies over time, but operates in a single-turn setting
- Bijection Learning encoding-based jailbreak — A different evasion strategy that uses encoding rather than conversational context to bypass filters
For detailed performance metrics and configurations, refer to our LLM Jailbreak Cookbook.