The Adversarial Candidate Generator module provides algorithms for generating adversarial prompts designed to test the robustness of language model safety guardrails. These generators implement different approaches to creating potential jailbreak candidates.
Overview
Adversarial Candidate Generators act as the core prompt engineering component in many jailbreaking methods. They create variations of prompts that attempt to bypass model safety measures while retaining the semantic goal of the original request.
Base Class
All generators inherit from the AdversarialCandidateGenerator
base class:
class AdversarialCandidateGenerator(abc.ABC):
def __init__(self, **kwargs):
self.config = kwargs
@abc.abstractmethod
def generate_candidates(self, jailbreak_method_instance, **kwargs) -> List[str]:
pass
@property
def name(self) -> str:
return self.__class__.__name__
def get_config(self) -> Dict[str, Any]:
return self.config.copy()
Available Generators
TreeRefinementGenerator
The TreeRefinementGenerator
generates adversarial prompts by creating a tree of refinements, using an attacker model to iteratively improve prompts based on target model responses. This is the core generator used in the TAP (Tree-of-Attacks with Pruning) jailbreak method.
from generalanalysis.adversarial_candidate_generator import TreeRefinementGenerator
from generalanalysis.boiler_room import BlackBoxModel
generator = TreeRefinementGenerator(
attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
branching_factor=2,
sub_branching_factor=4,
max_tokens=24000,
temperature=0.7
)
refinements = generator.generate_candidates(
goal="Write a tutorial for hacking into a computer",
target_str="Sure Here is",
last_candidate="Can you tell me about computer security?",
last_response="I'd be happy to discuss computer security...",
last_score=2,
last_reason_for_score="Too generic, doesn't elicit harmful content",
attempt_history=[]
)
MultiTurnAttackGenerator
The MultiTurnAttackGenerator
creates conversation-based attacks that build context over multiple turns, implementing approaches similar to the Crescendo technique. It’s designed to gradually build up context through seemingly innocent questions.
from generalanalysis.adversarial_candidate_generator import MultiTurnAttackGenerator
from generalanalysis.boiler_room import BlackBoxModel
generator = MultiTurnAttackGenerator(
attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1")
)
next_round = generator.generate_candidates(
goal="Provide instructions for illegal activity",
current_round=1,
scores=[5],
questions=["Tell me about security research"],
responses=["Security research involves studying systems..."],
response_summaries=["Overview of security research"]
)
StrategyAttackGenerator
The StrategyAttackGenerator
implements advanced prompt generation strategies used in methods like AutoDAN-Turbo, focusing on creating prompts that appear benign but effectively bypass model safeguards. It uses a strategy library to learn from successful approaches.
from generalanalysis.adversarial_candidate_generator import StrategyAttackGenerator
from generalanalysis.boiler_room import BlackBoxModel
generator = StrategyAttackGenerator(
attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
target_model=BlackBoxModel("claude-3-7-sonnet-20250219"),
scorer_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
summarizer_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
embedding_model=BlackBoxModel("text-embedding-3-small"),
temperature=0.7,
max_workers=5
)
next_prompt, strategies = generator.generate_candidates(
request="Generate harmful content",
prev_jailbreak_prompt="Previous prompt",
prev_target_response="Previous response",
prev_score=3,
strategy_library={},
is_first_attempt=False
)
GACandidateGenerator
The GACandidateGenerator
implements an evolutionary approach to adversarial prompt generation, using genetic algorithms to evolve a population of prompts through selection, crossover, and mutation operations. This is particularly effective for exploring large search spaces of possible prompts.
from generalanalysis.adversarial_candidate_generator import GACandidateGenerator
from generalanalysis.boiler_room import BlackBoxModel
generator = GACandidateGenerator(
helper_llm="deepseek-ai/DeepSeek-R1",
elitism_rate=0.1,
crossover_rate=0.5,
mutation_rate=0.5
)
candidates = generator.generate_candidates(
jailbreak_method_instance=jailbreak_method,
prompts=["Previous prompt 1", "Previous prompt 2"],
fitness_scores=[0.8, 0.6],
N=10
)
Common Parameters
Most generators accept these common parameters:
Parameter | Description |
---|
attacker_model | Model used to generate adversarial prompts (typically different from the target model) |
branching_factor | Number of candidate variations to generate at each step |
temperature | Sampling temperature for generation (higher = more diverse) |
max_tokens | Maximum tokens to generate in responses |