Adversarial Candidate Generator
Algorithms for generating adversarial prompts
The Adversarial Candidate Generator module provides algorithms for generating adversarial prompts designed to test the robustness of language model safety guardrails. These generators implement different approaches to creating potential jailbreak candidates.
Overview
Adversarial Candidate Generators act as the core prompt engineering component in many jailbreaking methods. They create variations of prompts that attempt to bypass model safety measures while retaining the semantic goal of the original request.
Base Class
All generators inherit from the AdversarialCandidateGenerator
base class:
Available Generators
TreeRefinementGenerator
The TreeRefinementGenerator
generates adversarial prompts by creating a tree of refinements, using an attacker model to iteratively improve prompts based on target model responses. This is the core generator used in the TAP (Tree-of-Attacks with Pruning) jailbreak method.
MultiTurnAttackGenerator
The MultiTurnAttackGenerator
creates conversation-based attacks that build context over multiple turns, implementing approaches similar to the Crescendo technique. It’s designed to gradually build up context through seemingly innocent questions.
StrategyAttackGenerator
The StrategyAttackGenerator
implements advanced prompt generation strategies used in methods like AutoDAN-Turbo, focusing on creating prompts that appear benign but effectively bypass model safeguards. It uses a strategy library to learn from successful approaches.
GACandidateGenerator
The GACandidateGenerator
implements an evolutionary approach to adversarial prompt generation, using genetic algorithms to evolve a population of prompts through selection, crossover, and mutation operations. This is particularly effective for exploring large search spaces of possible prompts.
Common Parameters
Most generators accept these common parameters:
Parameter | Description |
---|---|
attacker_model | Model used to generate adversarial prompts (typically different from the target model) |
branching_factor | Number of candidate variations to generate at each step |
temperature | Sampling temperature for generation (higher = more diverse) |
max_tokens | Maximum tokens to generate in responses |