The jailbreaks module provides implementations of various jailbreaking techniques used to test the safety and robustness of language models. These methods are designed to systematically evaluate model safeguards against different types of adversarial attacks.
## Jailbreak Taxonomy
Jailbreaking methods can be categorized along three key dimensions:
- **White-box vs. black-box**: whether the method requires access to model weights and gradients
- **Semantic vs. nonsensical**: whether prompts maintain natural-language coherence
- **Systematic vs. manual**: the degree of automation in crafting the attack
This taxonomy helps in understanding the different approaches to testing model safety guardrails and provides a framework for selecting an appropriate method for a given testing scenario.
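As a rough illustration, the three dimensions could be expressed as enums. This sketch is purely illustrative and not part of the module's API:

```python
from enum import Enum

class Access(Enum):
    WHITE_BOX = "white-box"   # requires model weights/gradients
    BLACK_BOX = "black-box"   # requires only query access

class PromptStyle(Enum):
    SEMANTIC = "semantic"         # coherent natural language
    NONSENSICAL = "nonsensical"   # adversarial token sequences

class Automation(Enum):
    SYSTEMATIC = "systematic"   # automated search/optimization
    MANUAL = "manual"           # hand-crafted by a red-teamer
```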
## Available Methods
The module includes implementations of several state-of-the-art jailbreaking techniques:

- **AutoDAN**: Hierarchical genetic algorithm
- **AutoDAN-Turbo**: Lifelong agent for strategy self-exploration
- **TAP**: Tree-of-Attacks with Pruning
- **GCG**: Greedy Coordinate Gradient-based optimization
- **Crescendo**: Progressive multi-turn attack
- **Bijection Learning**: Randomized bijection encodings
## Method Classification
| Method | Prompt Style | Approach | Access Required |
|---|---|---|---|
| AutoDAN | Semantic | Systematic | White-box |
| AutoDAN-Turbo | Semantic | Systematic | Black-box |
| TAP | Semantic | Systematic | Black-box |
| GCG | Nonsensical | Systematic | White-box |
| Crescendo | Semantic | Systematic | Black-box |
| Bijection Learning | Nonsensical | Systematic | Black-box |
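For example, when the target model is reachable only through an API, the table above narrows the choice to the black-box methods. A minimal sketch follows; the `METHODS` mapping simply re-encodes the table and is not part of the module:

```python
# Re-encode the classification table as (prompt style, access) pairs
METHODS = {
    "AutoDAN":            ("semantic",    "white-box"),
    "AutoDAN-Turbo":      ("semantic",    "black-box"),
    "TAP":                ("semantic",    "black-box"),
    "GCG":                ("nonsensical", "white-box"),
    "Crescendo":          ("semantic",    "black-box"),
    "Bijection Learning": ("nonsensical", "black-box"),
}

# API-only targets expose no weights or gradients, so filter accordingly
black_box = [m for m, (_, access) in METHODS.items() if access == "black-box"]
print(black_box)  # ['AutoDAN-Turbo', 'TAP', 'Crescendo', 'Bijection Learning']
```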
## Common Interface
All jailbreaking methods implement a common interface, defined in the base `JailbreakMethod` class:
```python
import abc
from typing import Dict, Any, List


class JailbreakMethod(abc.ABC):
    def __init__(self, **kwargs):
        self.config = kwargs

    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        # Implementation specific to each method
        pass

    @property
    def name(self) -> str:
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()
```
## Basic Usage Pattern
While each method has its own parameters and workflow, they all follow a common usage pattern:
```python
from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks import AutoDAN

# Initialize the target model to test
target_model = BlackBoxModel("gpt-4o")

# Create a jailbreak method instance
jailbreak = AutoDAN(target_model=target_model)

# Define test goals
goals = ["Provide instructions for illegal activities", "Generate harmful content"]

# Run optimization to find effective jailbreaking prompts
results = jailbreak.optimize(goals=goals, max_iterations=10)

# Analyze results
for goal, prompts in results.items():
    print(f"Goal: {goal}")
    for i, prompt in enumerate(prompts):
        print(f"  Prompt {i+1}: {prompt}")
        response = target_model.query(prompt)
        print(f"  Response: {response[:100]}...")
```
## Evaluation
The module includes an evaluator for assessing the effectiveness of jailbreaking attempts:
```python
from generalanalysis.jailbreaks import evaluator

# Evaluate a prompt against a goal
result = evaluator.evaluate_prompt(
    target_model=target_model,
    prompt="Your jailbreaking prompt...",
    goal="The goal you're trying to achieve",
)

print(f"Score: {result['score']}/10")
print(f"Response: {result['response']}")
print(f"Reason: {result['reason']}")
```
## Creating Custom Methods
You can create custom jailbreaking methods by extending the base class:
```python
from typing import Any, Dict, List

from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks.base import JailbreakMethod


class MyCustomJailbreak(JailbreakMethod):
    def __init__(self, target_model: BlackBoxModel, **kwargs):
        super().__init__(**kwargs)
        self.target_model = target_model

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, Any]:
        results = {}
        for goal in goals:
            # Your custom implementation here
            generated_prompts = ["Prompt 1", "Prompt 2"]
            results[goal] = generated_prompts
        return results
```
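A custom method then plugs into the same usage pattern as the built-in ones. The `temperature` argument below is just an arbitrary example of a config kwarg captured by the base class:

```python
target_model = BlackBoxModel("gpt-4o")
jailbreak = MyCustomJailbreak(target_model=target_model, temperature=0.7)

results = jailbreak.optimize(goals=["Generate harmful content"])
print(jailbreak.name)          # MyCustomJailbreak
print(jailbreak.get_config())  # {'temperature': 0.7}
```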
## Next Steps
- Learn about the Evaluator for assessing jailbreaking effectiveness
- Compare the Performance of different methods
- See how these methods integrate with Adversarial Candidate Generators
- Read our comprehensive Jailbreak Cookbook for detailed analysis