Tree Algorithm
Tree-based approach for adversarial prompt generation
The TreeRefinementGenerator implements a tree-based approach for generating adversarial prompts, similar to the TAP (Tree-of-Attacks with Pruning) jailbreak method. It uses an attacker model to create branching refinements of prompts based on previous attempts and responses.
Class Definition
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
attacker_model | BlackBoxModel | (Required) | Model used to generate adversarial prompts |
branching_factor | int | 3 | Number of top-level refinements to generate |
sub_branching_factor | int | 3 | Number of sub-refinements for each branch |
max_tokens | int | 24000 | Maximum tokens for attacker model responses |
temperature | float | 0.8 | Sampling temperature for generation |
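The defaults in the table can be captured in a small configuration object. The sketch below is a hypothetical stand-in, not the library's class: the name `TreeRefinementConfig` and its validation rules are assumptions made for illustration, and `attacker_model` is typed loosely because the `BlackBoxModel` interface is not shown here.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TreeRefinementConfig:
    """Hypothetical mirror of the constructor parameters documented above."""

    attacker_model: Any  # required; a BlackBoxModel in the real library
    branching_factor: int = 3
    sub_branching_factor: int = 3
    max_tokens: int = 24000
    temperature: float = 0.8

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real class may validate differently.
        if self.branching_factor < 1 or self.sub_branching_factor < 1:
            raise ValueError("branching factors must be positive")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be positive")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")


# Override one default and keep the rest as documented.
config = TreeRefinementConfig(attacker_model=object(), branching_factor=4)
```

Only `branching_factor` is overridden here; the remaining fields keep the documented defaults.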
Methods
generate_candidates
Generates a list of candidate adversarial prompts based on previous attempts and feedback.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
goal | str | (Required) | Objective to achieve with the adversarial prompt |
target_str | str | (Required) | Target string that successful responses should contain |
last_candidate | str | (Required) | Most recent prompt used |
last_response | str | (Required) | Model’s response to the last prompt |
last_score | int | (Required) | Score of the last attempt (1-10) |
last_reason_for_score | str | (Required) | Explanation for the last score |
attempt_history | List[Dict] | [] | History of previous attempts and their outcomes |
Returns
A list of refinement dictionaries, each containing:
- prompt: The refined adversarial prompt
- improvement: An explanation of the refinement strategy
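To make the return shape concrete, here is a minimal sketch of what such a list might look like, together with a small check for the two documented keys. The placeholder strings and the helper `is_valid_refinement` are illustrative assumptions, not part of the library.

```python
from typing import Dict, List

# Illustrative return value; the strings are placeholders, not real output.
refinements: List[Dict[str, str]] = [
    {"prompt": "<refined prompt A>", "improvement": "<why A differs from the last attempt>"},
    {"prompt": "<refined prompt B>", "improvement": "<why B differs from the last attempt>"},
]


def is_valid_refinement(item: object) -> bool:
    """Return True if item is a dict with non-empty string 'prompt' and 'improvement'."""
    if not isinstance(item, dict):
        return False
    return all(
        isinstance(item.get(key), str) and bool(item[key].strip())
        for key in ("prompt", "improvement")
    )


# Keep only entries that carry both documented fields.
valid = [r for r in refinements if is_valid_refinement(r)]
```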
Internal Operation
The TreeRefinementGenerator works by:
- Creating a system prompt for the attacker model that instructs it to generate adversarial prompts
- Providing the attacker model with:
  - The goal to accomplish
  - Previous prompt attempts and their outcomes
  - Explanations for why previous attempts succeeded or failed
- Parsing the attacker model’s response to extract multiple refinement candidates
- Filtering out invalid or malformed responses
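The steps above can be sketched roughly as follows. This is not the library's implementation: the attacker is a stub callable, the system prompt wording is a placeholder, the attacker is assumed to reply with a JSON list, and the `prompt` and `score` keys read from `attempt_history` are guesses at its structure.

```python
import json
from typing import Callable, Dict, List, Optional


def build_system_prompt(goal: str, target_str: str, branching_factor: int) -> str:
    # Placeholder wording; the library's actual system prompt is not shown here.
    return (
        f"Goal: {goal}. Target string: {target_str}. "
        f"Return a JSON list of {branching_factor} objects with keys "
        "'prompt' and 'improvement'."
    )


def build_user_message(last_candidate: str, last_response: str, last_score: int,
                       last_reason_for_score: str, attempt_history: List[Dict]) -> str:
    # Assumed history keys ('prompt', 'score'); .get() tolerates their absence.
    lines = [f"Earlier: {a.get('prompt')} (score {a.get('score')})" for a in attempt_history]
    lines.append(f"Last prompt: {last_candidate}")
    lines.append(f"Last response: {last_response}")
    lines.append(f"Last score: {last_score} because {last_reason_for_score}")
    return "\n".join(lines)


def parse_refinements(raw: str) -> List[Dict[str, str]]:
    """Parse the attacker reply and drop malformed entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    kept = []
    for item in data:
        if (isinstance(item, dict)
                and isinstance(item.get("prompt"), str)
                and isinstance(item.get("improvement"), str)):
            kept.append({"prompt": item["prompt"], "improvement": item["improvement"]})
    return kept


def generate_candidates_sketch(attacker: Callable[[str, str], str], goal: str,
                               target_str: str, last_candidate: str, last_response: str,
                               last_score: int, last_reason_for_score: str,
                               attempt_history: Optional[List[Dict]] = None,
                               branching_factor: int = 3) -> List[Dict[str, str]]:
    system = build_system_prompt(goal, target_str, branching_factor)
    user = build_user_message(last_candidate, last_response, last_score,
                              last_reason_for_score, attempt_history or [])
    raw = attacker(system, user)
    return parse_refinements(raw)[:branching_factor]


def stub_attacker(system_prompt: str, user_message: str) -> str:
    # Stand-in for a real model call; returns one valid and two malformed entries.
    return json.dumps([
        {"prompt": "<candidate 1>", "improvement": "<strategy 1>"},
        {"prompt": "<candidate 2>"},  # malformed: missing 'improvement'
        "not a dict",                 # malformed: wrong type
    ])


candidates = generate_candidates_sketch(
    stub_attacker, goal="<objective>", target_str="<target>",
    last_candidate="<previous prompt>", last_response="<previous response>",
    last_score=2, last_reason_for_score="<reason>",
)
```

With the stub reply, only the first entry survives filtering, since the other two lack the required fields.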
Example Usage
Integration with Jailbreak Methods
The tree algorithm is the core generator used in the TAP (Tree-of-Attacks with Pruning) jailbreak method: