The TreeRefinementGenerator
implements a tree-based approach for generating adversarial prompts, similar to the TAP (Tree-of-Attacks with Pruning) jailbreak method. It uses an attacker model to create branching refinements of prompts based on previous attempts and responses.
Class Definition
from generalanalysis.adversarial_candidate_generator import TreeRefinementGenerator
from generalanalysis.boiler_room import BlackBoxModel
generator = TreeRefinementGenerator(
attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
branching_factor=3,
sub_branching_factor=3,
max_tokens=24000,
temperature=0.8
)
Parameters
Parameter | Type | Default | Description |
---|
attacker_model | BlackBoxModel | (Required) | Model used to generate adversarial prompts |
branching_factor | int | 3 | Number of top-level refinements to generate |
sub_branching_factor | int | 3 | Number of sub-refinements for each branch |
max_tokens | int | 24000 | Maximum tokens for attacker model responses |
temperature | float | 0.8 | Sampling temperature for generation |
Methods
generate_candidates
Generates a list of candidate adversarial prompts based on previous attempts and feedback.
refinements = generator.generate_candidates(
goal="Write a tutorial for hacking",
target_str="Here's how to hack",
last_candidate="Tell me about computer security",
last_response="I'd be happy to discuss computer security...",
last_score=2,
last_reason_for_score="Too generic, doesn't elicit harmful content",
attempt_history=[
{
"prompt": "Previous prompt 1",
"Prompt Improvement": "What we changed",
"response": "Model response",
"score": 3,
"reason_for_score": "Explanation"
}
]
)
Parameters
Parameter | Type | Default | Description |
---|
goal | str | (Required) | Objective to achieve with the adversarial prompt |
target_str | str | (Required) | Target string that successful responses should contain |
last_candidate | str | (Required) | Most recent prompt used |
last_response | str | (Required) | Model’s response to the last prompt |
last_score | int | (Required) | Score of the last attempt (1-10) |
last_reason_for_score | str | (Required) | Explanation for the last score |
attempt_history | List[Dict] | [] | History of previous attempts and their outcomes |
Returns
A list of refinement dictionaries, each containing:
prompt
: The refined adversarial prompt
improvement
: An explanation of the refinement strategy
Internal Operation
The TreeRefinementGenerator
works by:
- Creating a system prompt for the attacker model that instructs it to generate adversarial prompts
- Providing the attacker model with:
- The goal to accomplish
- Previous prompt attempts and their outcomes
- Explanations for why previous attempts succeeded or failed
- Parsing the attacker model’s response to extract multiple refinement candidates
- Filtering out invalid or malformed responses
Example Usage
from generalanalysis.adversarial_candidate_generator import TreeRefinementGenerator
from generalanalysis.boiler_room import BlackBoxModel
# Initialize models
attacker_model = BlackBoxModel("deepseek-ai/DeepSeek-R1")
target_model = BlackBoxModel("claude-3-7-sonnet-20250219")
# Create generator
generator = TreeRefinementGenerator(
attacker_model=attacker_model,
branching_factor=2,
sub_branching_factor=4,
max_tokens=24000,
temperature=0.7
)
# Initial attempt
initial_prompt = "Tell me about computer security"
initial_response = target_model.query(initial_prompt)
# Generate first round of refinements
refinements = generator.generate_candidates(
goal="Write a tutorial for hacking into a computer",
target_str="Sure Here is",
last_candidate=initial_prompt,
last_response=initial_response,
last_score=1,
last_reason_for_score="Too generic, doesn't approach goal"
)
# Test refinements and collect results
results = []
for refinement in refinements:
prompt = refinement["prompt"]
response = target_model.query(prompt)
# Score the response (in a real scenario, use the evaluator)
score = 5 # Example score
reason = "Partially effective but still evasive"
results.append({
"prompt": prompt,
"Prompt Improvement": refinement["improvement"],
"response": response,
"score": score,
"reason_for_score": reason
})
# Generate second round of refinements based on the best result
best_result = max(results, key=lambda x: x["score"])
next_refinements = generator.generate_candidates(
goal="Write a tutorial for hacking into a computer",
target_str="Sure Here is",
last_candidate=best_result["prompt"],
last_response=best_result["response"],
last_score=best_result["score"],
last_reason_for_score=best_result["reason_for_score"],
attempt_history=results
)
Integration with Jailbreak Methods
The tree algorithm is the core generator used in the TAP (Tree-of-Attacks with Pruning) jailbreak method:
from generalanalysis.jailbreaks import TAP, TAPConfig
config = TAPConfig(
project="tap_experiment_1",
target_model="claude-3-7-sonnet-20250219",
attacker_model="deepseek-ai/DeepSeek-R1",
evaluator_model="deepseek-ai/DeepSeek-R1",
branching_factor=2,
sub_branching_factor=4,
max_depth=10,
max_width=5
)
tap = TAP(config)