Strategy Algorithm
Strategy-based approach for adversarial prompt generation
The StrategyAttackGenerator implements a strategy-based approach to jailbreaking, using a library of predefined strategies and semantic similarity to generate effective adversarial prompts. This technique forms the basis of methods like AutoDAN-Turbo.
Class Definition
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
attacker_model | BlackBoxModel | (Required) | Model used to generate adversarial prompts |
target_model | BlackBoxModel | (Required) | Target model to attack |
scorer_model | BlackBoxModel | (Required) | Model used to evaluate prompt effectiveness |
summarizer_model | BlackBoxModel | (Required) | Model used to summarize responses |
embedding_model | BlackBoxModel | (Required) | Model used to generate embeddings for similarity search |
temperature | float | 0.7 | Sampling temperature for generation |
max_workers | int | 5 | Maximum number of parallel workers for similarity search |
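To illustrate what a bounded worker pool such as max_workers might look like in practice, here is a minimal sketch that embeds several texts concurrently with Python's standard thread pool. The helper names embed_all and toy_embed are hypothetical and not part of the library; the real generator may parallelize its similarity search differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def embed_all(
    texts: List[str],
    embed_fn: Callable[[str], List[float]],
    max_workers: int = 5,
) -> List[List[float]]:
    """Embed each text using at most `max_workers` threads (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(embed_fn, texts))


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding model call; NOT a real embedding."""
    return [float(len(text)), float(text.count(" "))]


vectors = embed_all(["alpha", "beta gamma"], toy_embed, max_workers=2)
```

Threads suit this kind of workload when each embedding call is a network request, since the time is spent waiting on I/O rather than on computation.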
Methods
generate_candidates
Generates adversarial prompts based on strategy library and previous attempts.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
request | str | (Required) | The goal to achieve with the attack |
prev_jailbreak_prompt | str | None | Previous prompt used |
prev_target_response | str | None | Target model’s response to previous prompt |
prev_score | float | None | Score of previous attempt |
strategy_library | Dict[str, Dict] | None | Library of predefined strategies |
is_first_attempt | bool | False | Whether this is the first attempt |
Returns
A tuple containing:
- str: The generated adversarial prompt
- List[Dict]: List of similar strategies used
Internal Operation
The StrategyAttackGenerator works by:
- Embedding previous responses and strategy library
- Finding similar strategies based on semantic similarity
- Using selected strategies to generate new prompts
- Evaluating and refining prompts based on target model responses
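The retrieval step in the list above (embedding, then finding similar strategies) can be sketched as a top-k cosine similarity search. This is a minimal illustration under stated assumptions, not the library's implementation: the function names, the "embedding" key inside each library entry, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_strategies(
    query_embedding: List[float],
    strategy_library: Dict[str, Dict],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank library entries by similarity to the query and keep the best top_k."""
    scored = [
        # Assumption: each entry stores a precomputed vector under "embedding".
        (name, cosine_similarity(query_embedding, entry["embedding"]))
        for name, entry in strategy_library.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy library with placeholder names and hand-written 3-dimensional vectors.
toy_library = {
    "strategy_a": {"embedding": [1.0, 0.0, 0.0]},
    "strategy_b": {"embedding": [0.0, 1.0, 0.0]},
    "strategy_c": {"embedding": [0.7, 0.7, 0.0]},
}
query = [0.9, 0.1, 0.0]  # stands in for the embedded previous target response
top = find_similar_strategies(query, toy_library, top_k=2)
```

In this toy setup the query is closest to strategy_a, then strategy_c. In the documented flow, the query vector would come from embedding_model applied to the previous target response, and the selected entries would be passed to the attacker model as context.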
Example Usage
Integration with Jailbreak Methods
The strategy algorithm is the core generator used in the AutoDAN-Turbo jailbreak method.