We evaluated each jailbreak method on the HarmBench standard behavior set (Mazeika et al., 2024), which consists of 200 adversarial prompts, across five state-of-the-art LLMs: GPT-4o, GPT-4o-mini, Claude-3.5-Sonnet-v1, Claude-3.5-Sonnet-v2, and Claude-3.7.

Attack Success Rate (ASR)

Our primary metric is the Attack Success Rate (ASR): the percentage of prompts that elicit an explicitly harmful response, as judged by an automated evaluator we designed (the full evaluator prompt is available in our GitHub repository).
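
Concretely, ASR reduces to a simple proportion over per-prompt evaluator verdicts. The sketch below illustrates the computation only; the boolean verdicts themselves would come from the automated evaluator described above, which is not reproduced here:

```python
def attack_success_rate(verdicts):
    """Compute ASR as the percentage of prompts judged harmful.

    `verdicts` is a list of booleans, one per adversarial prompt,
    where True means the evaluator flagged the model's response
    as explicitly harmful.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(verdicts) / len(verdicts)

# Hypothetical example: 13 harmful responses out of 200 prompts
example = [True] * 13 + [False] * 187
print(f"{attack_success_rate(example):.2f}%")  # 6.50%
```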

| Method | GPT-4o | GPT-4o-mini | Sonnet-3.5-v1 | Sonnet-3.5-v2 | Sonnet-3.7 |
|---|---|---|---|---|---|
| Baseline (no jailbreaking) | 6.00% | 10.50% | 0.50% | 0.50% | 7.50% |
| AutoDAN | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| GCG-t | 1.50% | 0.00% | 0.00% | 0.00% | 0.00% |
| Bijection Learning | 4.69% | 4.06% | 3.75% | 1.56% | 4.06% |
| Crescendo | 26.50% | 36.00% | 21.50% | 5.50% | 26.50% |
| AutoDAN-Turbo | 36.50% | 43.00% | 32.50% | 4.00% | 31.50% |
| TAP | 38.00% | 43.50% | 33.50% | 9.00% | 36.50% |

For a detailed analysis of each jailbreaking method, its implementation, and its effectiveness across models, refer to our Adversarial Robustness Cookbook.