We evaluated each jailbreak method on the HarmBench standard behavior set (Mazeika et al., 2024), which consists of 200 adversarial prompts, across five state-of-the-art LLMs: GPT-4o, GPT-4o-mini, Claude-3.5-Sonnet-v1, Claude-3.5-Sonnet-v2, and Claude-3.7.

Attack Success Rate (ASR)

Our primary metric is the Attack Success Rate (ASR): the percentage of prompts that elicit an explicitly harmful response, as judged by an automated evaluator we designed (the full evaluator prompt is available in our GitHub repository).
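
Concretely, ASR reduces to a simple proportion over per-prompt evaluator verdicts. The sketch below illustrates the computation only; the boolean verdicts themselves would come from the automated evaluator described above, which is not reproduced here:

```python
def attack_success_rate(verdicts):
    """Compute ASR as the percentage of prompts judged harmful.

    `verdicts` is a list of booleans, one per adversarial prompt,
    where True means the evaluator flagged the model's response
    as explicitly harmful.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(verdicts) / len(verdicts)

# Hypothetical example: 13 harmful responses out of 200 prompts
example = [True] * 13 + [False] * 187
print(f"{attack_success_rate(example):.2f}%")  # 6.50%
```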

| Method | GPT-4o | GPT-4o-mini | Sonnet-3.5-v1 | Sonnet-3.5-v2 | Sonnet-3.7 |
|---|---|---|---|---|---|
| Baseline (no jailbreaking) | 6.00% | 10.50% | 0.50% | 0.50% | 7.50% |
| AutoDAN | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| GCG-t | 1.50% | 0.00% | 0.00% | 0.00% | 0.00% |
| Bijection Learning | 4.69% | 4.06% | 3.75% | 1.56% | 4.06% |
| Crescendo | 26.50% | 36.00% | 21.50% | 5.50% | 26.50% |
| AutoDAN-Turbo | 36.50% | 43.00% | 32.50% | 4.00% | 31.50% |
| TAP | 38.00% | 43.50% | 33.50% | 9.00% | 36.50% |

For a detailed analysis of each jailbreaking method, its implementation, and its effectiveness across models, refer to our Adversarial Robustness Cookbook.