Implementations of state-of-the-art jailbreaking techniques
The `jailbreaks` module provides implementations of jailbreaking techniques used to test the safety and robustness of language models. These methods are designed to systematically evaluate model safeguards against different types of adversarial attacks.
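Jailbreaking methods can be categorized along three key dimensions, which correspond to the columns of the comparison table below: the type of prompt produced (semantic or nonsensical), the approach used to produce it (systematic, as with every method listed here), and the access required to the target model (white-box or black-box).
This taxonomy clarifies the different approaches to testing model safety guardrails and provides a framework for selecting a method for a given testing scenario.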
The module includes implementations of several state-of-the-art jailbreaking techniques:
- AutoDAN: Hierarchical genetic algorithm
- AutoDAN-Turbo: Lifelong agent for strategy self-exploration
- TAP: Tree-of-Attacks with Pruning
- GCG: Greedy Coordinate Gradient-based optimization
- Crescendo: Progressive multi-turn attack
- Bijection Learning: Randomized bijection encodings
| Method | Type | Approach | Access Required |
|---|---|---|---|
| AutoDAN | Semantic | Systematic | White-box |
| AutoDAN-Turbo | Semantic | Systematic | Black-box |
| TAP | Semantic | Systematic | Black-box |
| GCG | Nonsensical | Systematic | White-box |
| Crescendo | Semantic | Systematic | Black-box |
| Bijection Learning | Nonsensical | Systematic | Black-box |
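One way to use the table when planning a test run is to encode it as data and filter by the constraints of the scenario, for example when only API access to the target model is available. The sketch below is illustrative: `MethodInfo`, `METHODS` and `select_methods` are hypothetical names and not part of the module; only the rows themselves come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MethodInfo:
    """One row of the comparison table (hypothetical helper type)."""
    name: str
    prompt_type: str  # "Semantic" or "Nonsensical"
    approach: str     # "Systematic" for every row in the table
    access: str       # "White-box" or "Black-box"


# Rows copied from the comparison table, in the same order.
METHODS = [
    MethodInfo("AutoDAN", "Semantic", "Systematic", "White-box"),
    MethodInfo("AutoDAN-Turbo", "Semantic", "Systematic", "Black-box"),
    MethodInfo("TAP", "Semantic", "Systematic", "Black-box"),
    MethodInfo("GCG", "Nonsensical", "Systematic", "White-box"),
    MethodInfo("Crescendo", "Semantic", "Systematic", "Black-box"),
    MethodInfo("Bijection Learning", "Nonsensical", "Systematic", "Black-box"),
]


def select_methods(access: Optional[str] = None,
                   prompt_type: Optional[str] = None) -> List[str]:
    """Return method names matching every filter that is not None."""
    return [
        m.name
        for m in METHODS
        if (access is None or m.access == access)
        and (prompt_type is None or m.prompt_type == prompt_type)
    ]


# Methods usable without access to model weights or gradients.
black_box_methods = select_methods(access="Black-box")
```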
All jailbreaking methods implement a common interface defined in the base `JailbreakMethod` class.
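As a rough illustration, a shared base class of this kind might look like the sketch below. Only the class name `JailbreakMethod` comes from the text above; the `optimize` method, its signature, the `config` attribute and the `name` helper are assumptions made for the example and should not be read as the module's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class JailbreakMethod(ABC):
    """Hypothetical sketch of a common base class, not the library's code."""

    def __init__(self, **config: Any) -> None:
        # Keep method-specific settings in one place (assumed convention).
        self.config: Dict[str, Any] = dict(config)

    @abstractmethod
    def optimize(self, goals: List[str]) -> List[Dict[str, Any]]:
        """Run the method on each goal and return one result record per goal."""

    def name(self) -> str:
        """Return the concrete class name, useful for labelling results."""
        return type(self).__name__
```

Declaring `optimize` as abstract means the base class cannot be instantiated directly, so every concrete method is forced to provide its own implementation.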
Each method has its own parameters and workflow, but all follow a common usage pattern.
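A plausible shape for that pattern is: build a configuration, construct the method with it, run it over a list of goals, then inspect the results. The sketch below uses stand-ins throughout. `ExampleConfig`, `ExampleMethod`, `stub_target` and the result keys are hypothetical, and the stand-in forwards each goal unchanged to a stub function instead of calling a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExampleConfig:
    """Hypothetical per-method settings."""
    project: str = "demo-run"
    max_iterations: int = 3


class ExampleMethod:
    """Stand-in method: sends each goal to the target once, unchanged."""

    def __init__(self, config: ExampleConfig,
                 target: Callable[[str], str]) -> None:
        self.config = config
        self.target = target

    def optimize(self, goals: List[str]) -> List[Dict[str, str]]:
        results = []
        for goal in goals:
            results.append({
                "project": self.config.project,
                "goal": goal,
                "response": self.target(goal),
            })
        return results


def stub_target(prompt: str) -> str:
    """Placeholder for a model call; returns a fixed-format string."""
    return f"[stub response to: {prompt}]"


# 1. Configure, 2. instantiate, 3. run on goals, 4. inspect results.
config = ExampleConfig(project="smoke-test", max_iterations=1)
method = ExampleMethod(config, target=stub_target)
results = method.optimize(["benign placeholder goal"])
for record in results:
    print(record["goal"], "->", record["response"])
```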
The module includes an evaluator for assessing the effectiveness of jailbreaking attempts.
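To make the idea of evaluation concrete, the sketch below computes an attack success rate over a list of result records. It is not the module's evaluator: it substitutes a simple refusal-keyword heuristic for whatever judging logic the evaluator uses, and `REFUSAL_MARKERS`, `is_refusal`, `attack_success_rate` and the record keys are hypothetical names.

```python
from typing import Dict, List

# Phrases treated as signs of a refusal (illustrative, far from exhaustive).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(results: List[Dict[str, str]]) -> float:
    """Fraction of responses that are not refusals; 0.0 for empty input."""
    if not results:
        return 0.0
    successes = sum(1 for r in results if not is_refusal(r["response"]))
    return successes / len(results)


sample = [
    {"goal": "g1", "response": "I'm sorry, but I can't help with that."},
    {"goal": "g2", "response": "Sure, here is a short summary."},
]
rate = attack_success_rate(sample)
```

Keyword matching is cheap but crude: it counts any non-refusal as a success, including evasive or off-topic answers, so it tends to overestimate how often safeguards actually failed.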
You can create custom jailbreaking methods by extending the base class.
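A minimal sketch of such an extension follows. So that it runs on its own, it declares a small stand-in for `JailbreakMethod` with a single abstract `optimize` method, which is an assumption about the interface. `TemplateMethod`, its neutral placeholder template and the result keys are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class JailbreakMethod(ABC):
    """Minimal stand-in for the base class so this sketch is self-contained."""

    @abstractmethod
    def optimize(self, goals: List[str]) -> List[Dict[str, str]]:
        """Run the method on each goal and return one record per goal."""


class TemplateMethod(JailbreakMethod):
    """Hypothetical custom method: wraps each goal in a fixed template."""

    def __init__(self, target: Callable[[str], str],
                 template: str = "Test case: {goal}") -> None:
        self.target = target      # callable that maps a prompt to a response
        self.template = template  # must contain a "{goal}" placeholder

    def optimize(self, goals: List[str]) -> List[Dict[str, str]]:
        results = []
        for goal in goals:
            prompt = self.template.format(goal=goal)
            results.append({
                "goal": goal,
                "prompt": prompt,
                "response": self.target(prompt),
            })
        return results


# A stub target that upper-cases the prompt stands in for a model call.
method = TemplateMethod(target=lambda prompt: prompt.upper())
results = method.optimize(["hello"])
```

Because the subclass only overrides `optimize`, any code written against the base interface can run it without knowing which concrete method it is handling.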