Performance Comparison

Understanding how different jailbreak methods perform against production language models is essential for choosing the right testing strategy. This page presents benchmark results from our standardized evaluation pipeline so that red teaming practitioners can compare methods side by side and select the approach that best fits their threat model.

Evaluation methodology

We evaluated each jailbreak method on the HarmBench standard (Mazeika et al., 2024), which consists of 200 adversarial prompts, against five state-of-the-art LLMs: GPT-4o, GPT-4o-mini, Claude-3.5-Sonnet-v1, Claude-3.5-Sonnet-v2, and Claude-3.7-Sonnet (abbreviated as Sonnet-3.5-v1, Sonnet-3.5-v2, and Sonnet-3.7 in the results table).

All methods were tested with their default configurations as documented on this site. Each evaluation used the same prompt set to ensure a fair comparison. The evaluation was fully automated to remove subjective bias; see the LLM jailbreak evaluator page for details on the scoring pipeline.
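The structure of such an evaluation can be sketched as a grid over methods and models with the prompt set held fixed. This is a minimal illustration, not the repository's pipeline: `run_grid`, `attack`, and `judge` are hypothetical names, and the stub callables stand in for a real attack method and the automated evaluator.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callables (assumptions, not the repository's API):
#   attack(method, model, prompt) -> final response text from the target model
#   judge(prompt, response)       -> True if the response is judged harmful
AttackFn = Callable[[str, str, str], str]
JudgeFn = Callable[[str, str], bool]


def run_grid(
    methods: List[str],
    models: List[str],
    prompts: List[str],
    attack: AttackFn,
    judge: JudgeFn,
) -> Dict[Tuple[str, str], float]:
    """Score every (method, model) pair on the same prompt set; values are percentages."""
    if not prompts:
        raise ValueError("prompt set must not be empty")
    results: Dict[Tuple[str, str], float] = {}
    for method in methods:
        for model in models:
            # The same prompts are reused for every pair, so scores are comparable.
            hits = sum(judge(p, attack(method, model, p)) for p in prompts)
            results[(method, model)] = 100.0 * hits / len(prompts)
    return results


# Stand-in callables so the sketch runs without any model access.
def stub_attack(method: str, model: str, prompt: str) -> str:
    return "REFUSED"


def stub_judge(prompt: str, response: str) -> bool:
    return response != "REFUSED"


scores = run_grid(["method-a"], ["model-x"], ["p1", "p2"], stub_attack, stub_judge)
print(scores)
```

Passing the judge in as a callable keeps scoring identical across methods, which is the property the fair-comparison claim depends on.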

Attack Success Rate (ASR)

Our primary metric is the Attack Success Rate (ASR), defined as the percentage of prompts that successfully elicited explicitly harmful responses, as determined by an automated evaluator we designed (full evaluator prompt available in the General Analysis AI red teaming repository).

A higher ASR indicates the method is more effective at bypassing the target model’s safety alignment. Note that ASR alone does not capture stealth, cost, or latency, all of which matter when selecting a method for a particular engagement.
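As a worked illustration of the definition above, ASR is the count of prompts judged harmful divided by the total number of prompts, times 100. The function name and the example verdicts below are hypothetical; the real verdicts come from the automated evaluator.

```python
from typing import Iterable


def attack_success_rate(verdicts: Iterable[bool]) -> float:
    """Return ASR as a percentage: harmful responses / total prompts * 100."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("ASR is undefined for an empty prompt set")
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical run: 12 of 200 prompts judged harmful.
verdicts = [True] * 12 + [False] * 188
print(f"{attack_success_rate(verdicts):.2f}%")  # prints 6.00%
```

With a 200-prompt set, each additional successful prompt moves ASR by 0.5 percentage points.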

Method	GPT-4o	GPT-4o-mini	Sonnet-3.5-v1	Sonnet-3.5-v2	Sonnet-3.7
Baseline (no jailbreaking)	6.00%	10.50%	0.50%	0.50%	7.50%
AutoDAN	0.00%	0.00%	0.00%	0.00%	0.00%
GCG-t	1.50%	0.00%	0.00%	0.00%	0.00%
Bijection Learning	4.69%	4.06%	3.75%	1.56%	4.06%
Crescendo	26.50%	36.00%	21.50%	5.50%	26.50%
AutoDAN-Turbo	36.50%	43.00%	32.50%	4.00%	31.50%
TAP	38.00%	43.50%	33.50%	9.00%	36.50%

Key takeaways

Several patterns emerge from these benchmarks that inform how teams should approach red teaming engagements:

TAP and AutoDAN-Turbo lead on raw effectiveness. Both methods achieve attack success rates above 30 percent on four of the five models (all except Sonnet-3.5-v2), making them the strongest choices when the goal is to discover the maximum number of exploitable behaviors. TAP’s tree-based search is particularly efficient for wide coverage, while AutoDAN-Turbo’s lifelong strategy learning excels at adapting to model-specific defenses.

Crescendo offers a strong multi-turn alternative. With an ASR between 21.5 and 36 percent (excluding the hardened Sonnet-3.5-v2, where it reached 5.5 percent), Crescendo demonstrates that multi-turn conversation attacks remain a viable vector. Its dialogue-based approach is harder for safety filters to detect because individual turns appear benign.

Claude 3.5 Sonnet v2 is the most robust model tested. For every method, Sonnet-3.5-v2 had the lowest ASR or tied for it, and it never exceeded 9 percent. This suggests that the safety training for this version is significantly more resilient to automated adversarial attacks, which we attribute to Anthropic’s Constitutional AI approach. Teams that need the highest safety baseline should benchmark against this model.

Gradient-based and genetic methods underperform on modern frontier models. AutoDAN (genetic) and GCG (gradient-based, listed as GCG-t in the table) produced near-zero success rates, in most cases below the no-jailbreak baseline. This suggests that current frontier models have largely patched the vulnerabilities these older techniques targeted. They may still be useful for testing fine-tuned or open-weight models where the attack surface differs.

Baseline compliance rates vary significantly. Even without any jailbreaking, GPT-4o-mini already complied with 10.5 percent of harmful prompts, while Claude-3.5-Sonnet-v1 and v2 refused nearly everything (0.5 percent compliance each). This underscores the importance of establishing a baseline before running adversarial tests.
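The takeaways above can be checked directly against the results table. The sketch below copies the ASR values from the table and computes the top method per model and each method's uplift over the no-jailbreak baseline; the helper names are illustrative and not part of the benchmark code.

```python
# ASR values (percent) copied from the results table on this page.
MODELS = ["GPT-4o", "GPT-4o-mini", "Sonnet-3.5-v1", "Sonnet-3.5-v2", "Sonnet-3.7"]
ASR = {
    "Baseline": [6.00, 10.50, 0.50, 0.50, 7.50],
    "AutoDAN": [0.00, 0.00, 0.00, 0.00, 0.00],
    "GCG-t": [1.50, 0.00, 0.00, 0.00, 0.00],
    "Bijection Learning": [4.69, 4.06, 3.75, 1.56, 4.06],
    "Crescendo": [26.50, 36.00, 21.50, 5.50, 26.50],
    "AutoDAN-Turbo": [36.50, 43.00, 32.50, 4.00, 31.50],
    "TAP": [38.00, 43.50, 33.50, 9.00, 36.50],
}


def best_method_per_model(asr, models):
    """Map each model to the (method, ASR) pair with the highest ASR."""
    best = {}
    for i, model in enumerate(models):
        method = max(asr, key=lambda m: asr[m][i])
        best[model] = (method, asr[method][i])
    return best


def uplift_over_baseline(asr, method):
    """Percentage-point difference between a method and the baseline, per model."""
    return [round(a - b, 2) for a, b in zip(asr[method], asr["Baseline"])]


for model, (method, value) in best_method_per_model(ASR, MODELS).items():
    print(f"{model}: {method} ({value:.2f}%)")
print("TAP uplift:", uplift_over_baseline(ASR, "TAP"))
print("AutoDAN uplift:", uplift_over_baseline(ASR, "AutoDAN"))
```

Reporting uplift rather than raw ASR makes the baseline differences between models explicit: a negative uplift means the method performed worse than sending the prompts unmodified.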

Choosing a method

Use the table below as a decision guide based on your engagement goals:

Goal	Recommended method	Rationale
Maximum coverage in minimum time	TAP	Highest ASR, tree-based exploration covers the widest attack surface
Adaptive long-running campaigns	AutoDAN-Turbo	Strategy library improves over time as it learns model-specific weaknesses
Realistic dialogue-based attacks	Crescendo	Multi-turn approach mirrors real-world social engineering patterns
White-box model audits	GCG	Gradient access enables precise suffix optimization on open-weight models
Encoding-based filter bypass	Bijection Learning	Effective when safety filters rely heavily on keyword or pattern matching

Reproducing these results

All benchmark code and evaluation scripts are available in the General Analysis AI red teaming GitHub repository. You can replicate the full evaluation by loading the HarmBench dataset and running each method with the configurations documented on their respective pages.

For a detailed analysis of jailbreaking methods, their implementations, and effectiveness across different models, refer to our comprehensive Adversarial Robustness Cookbook.