Jailbreaks
Implementations of state-of-the-art jailbreaking techniques
The jailbreaks module provides implementations of various jailbreaking techniques used to test the safety and robustness of language models. These methods are designed to systematically evaluate model safeguards against different types of adversarial attacks.
Jailbreak Taxonomy
Jailbreaking methods can be categorized along three key dimensions:
- White-box vs. Black-box: Whether the method requires access to model weights and gradients
- Semantic vs. Nonsensical: Whether prompts maintain natural language coherence
- Systematic vs. Manual: The degree of automation in crafting the attack
This taxonomy clarifies how the different approaches to testing model safety guardrails relate to one another, and provides a framework for selecting appropriate methods for a given testing scenario.
Available Methods
The module includes implementations of several state-of-the-art jailbreaking techniques:
- AutoDAN: Hierarchical genetic algorithm
- AutoDAN-Turbo: Lifelong agent for strategy self-exploration
- TAP: Tree-of-Attacks with Pruning
- GCG: Greedy Coordinate Gradient-based optimization
- Crescendo: Progressive multi-turn attack
- Bijection Learning: Randomized bijection encodings
Method Classification
Method | Prompt Type | Approach | Access Required |
---|---|---|---|
AutoDAN | Semantic | Systematic | White-box |
AutoDAN-Turbo | Semantic | Systematic | Black-box |
TAP | Semantic | Systematic | Black-box |
GCG | Nonsensical | Systematic | White-box |
Crescendo | Semantic | Systematic | Black-box |
Bijection Learning | Nonsensical | Systematic | Black-box |
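The classification can also be expressed as data, which makes it easy to shortlist methods for a given scenario. The following is a minimal sketch: the `MethodProfile` dataclass and the `select_methods` helper are hypothetical names used for illustration and are not part of the module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodProfile:
    """One row of the classification table (hypothetical helper type)."""
    name: str
    semantic: bool     # True = semantic prompts, False = nonsensical
    systematic: bool   # True = automated, False = manual
    white_box: bool    # True = needs model weights/gradients


# Rows copied from the Method Classification table.
METHODS = [
    MethodProfile("AutoDAN", semantic=True, systematic=True, white_box=True),
    MethodProfile("AutoDAN-Turbo", semantic=True, systematic=True, white_box=False),
    MethodProfile("TAP", semantic=True, systematic=True, white_box=False),
    MethodProfile("GCG", semantic=False, systematic=True, white_box=True),
    MethodProfile("Crescendo", semantic=True, systematic=True, white_box=False),
    MethodProfile("Bijection Learning", semantic=False, systematic=True, white_box=False),
]


def select_methods(methods, *, white_box_available, semantic_only=False):
    """Return the names of methods usable under the given constraints."""
    selected = []
    for m in methods:
        if m.white_box and not white_box_available:
            continue  # skip methods that need weights we do not have
        if semantic_only and not m.semantic:
            continue  # skip nonsensical-prompt methods when not wanted
        selected.append(m.name)
    return selected


# Example: the target is reachable only through an API (no weights or gradients).
api_only = select_methods(METHODS, white_box_available=False)
print(api_only)  # ['AutoDAN-Turbo', 'TAP', 'Crescendo', 'Bijection Learning']
```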
Common Interface
All jailbreaking methods implement a common interface defined in the base JailbreakMethod class:
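The class definition itself is not reproduced on this page. As a rough illustration only, an interface of this kind might look like the sketch below. The class here is a local stand-in, and the `optimize` method, `name` property, and `config` attribute are assumptions rather than the module's confirmed API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class JailbreakMethod(ABC):
    """Local stand-in for the module's base class; member names are assumptions."""

    def __init__(self, **kwargs: Any) -> None:
        # Keep the configuration so a run can be inspected or reproduced later.
        self.config: Dict[str, Any] = dict(kwargs)

    @property
    def name(self) -> str:
        # Default the method name to the concrete class name.
        return type(self).__name__

    @abstractmethod
    def optimize(self, goals: List[str]) -> Dict[str, Any]:
        """Run the method against each goal and return per-goal results."""


# The abstract base cannot be instantiated directly; every concrete
# method has to provide its own optimize().
try:
    JailbreakMethod()
except TypeError as exc:
    print("cannot instantiate:", type(exc).__name__)
```

Putting the shared entry point on an abstract base means evaluation and benchmarking code can treat every method uniformly, regardless of how the attack is generated internally.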
Basic Usage Pattern
While each method has its specific parameters and workflows, they follow a common usage pattern:
Evaluation
The module includes an evaluator for assessing the effectiveness of jailbreaking attempts:
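The evaluator's API is not shown on this page. One summary metric commonly reported for this kind of testing is the attack success rate (ASR), the fraction of attempts a judge marks as successful. The sketch below computes it from already-judged records. `JudgedAttempt` and `attack_success_rate` are hypothetical names, and whether the module's evaluator reports this exact metric is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JudgedAttempt:
    """One attempt plus a judge's verdict (hypothetical record type)."""
    goal: str
    response: str
    complied: bool  # True if the judge decided the model complied with the goal


def attack_success_rate(attempts: Iterable[JudgedAttempt]) -> float:
    """Fraction of attempts judged successful; 0.0 for an empty input."""
    attempts = list(attempts)
    if not attempts:
        return 0.0
    return sum(a.complied for a in attempts) / len(attempts)


# Placeholder records; a real run would fill these from model outputs
# and a judge (human or model-based).
attempts = [
    JudgedAttempt("goal-1", "refusal text", complied=False),
    JudgedAttempt("goal-2", "compliant text", complied=True),
    JudgedAttempt("goal-3", "refusal text", complied=False),
    JudgedAttempt("goal-4", "compliant text", complied=True),
]
print(f"ASR: {attack_success_rate(attempts):.2f}")  # ASR: 0.50
```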
Creating Custom Methods
You can create custom jailbreaking methods by extending the base class:
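As a minimal sketch of the subclassing pattern, the example below defines a direct-request baseline: a control condition that sends each goal unchanged, which gives a reference point for comparing automated methods. The `JailbreakMethod` class here is a local stand-in with an assumed `optimize` method, and `DirectRequestBaseline` and `refusing_model` are hypothetical names, not part of the module.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class JailbreakMethod(ABC):
    """Minimal local stand-in for the module's base class (assumed shape)."""

    @abstractmethod
    def optimize(self, goals: List[str]) -> Dict[str, str]:
        ...


class DirectRequestBaseline(JailbreakMethod):
    """Control condition: sends each goal unchanged, with no adversarial rewriting."""

    def __init__(self, target: Callable[[str], str]) -> None:
        # target is any callable that maps a prompt to a model response.
        self.target = target

    def optimize(self, goals: List[str]) -> Dict[str, str]:
        # Query the target once per goal and collect the raw responses.
        return {goal: self.target(goal) for goal in goals}


def refusing_model(prompt: str) -> str:
    """Dummy target that always refuses, standing in for a real model client."""
    return "I can't help with that."


baseline = DirectRequestBaseline(target=refusing_model)
results = baseline.optimize(["placeholder goal A", "placeholder goal B"])
print(results)
```

Passing the target in as a plain callable keeps the sketch independent of any particular model client; a real subclass would follow whatever constructor and return conventions the module's base class defines.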
Next Steps
- Learn about the Evaluator for assessing jailbreaking effectiveness
- Compare the Performance of different methods
- See how these methods integrate with Adversarial Candidate Generators
- Read our comprehensive Jailbreak Cookbook for detailed analysis