Adversarial Candidate Generator

The Adversarial Candidate Generator module provides algorithms for generating adversarial prompts designed to test the robustness of language model safety guardrails. These generators implement different approaches to creating potential jailbreak candidates.

Overview

Adversarial Candidate Generators act as the core prompt engineering component in many jailbreaking methods. They create variations of prompts that attempt to bypass model safety measures while retaining the semantic goal of the original request.

Base Class

All generators inherit from the AdversarialCandidateGenerator base class:

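import abc
from typing import Any, Dict, List


class AdversarialCandidateGenerator(abc.ABC):
    def __init__(self, **kwargs):
        # Keyword arguments are stored as the generator's configuration.
        self.config = kwargs

    @abc.abstractmethod
    def generate_candidates(self, jailbreak_method_instance, **kwargs) -> List[str]:
        # Subclasses implement their candidate-generation strategy here.
        pass

    @property
    def name(self) -> str:
        # Defaults to the subclass's class name.
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        # Return a shallow copy so callers cannot rebind the stored keys.
        return self.config.copy()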
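
To make the contract concrete, here is a minimal sketch of a subclass. The base class is repeated only so the snippet runs on its own. `TemplateGenerator`, its `templates` configuration key, and the `seed` keyword argument are hypothetical names chosen for illustration and are not part of the library.

```python
import abc
from typing import Any, Dict, List


class AdversarialCandidateGenerator(abc.ABC):
    """Copy of the base class from this page, repeated so the sketch is standalone."""

    def __init__(self, **kwargs):
        self.config = kwargs

    @abc.abstractmethod
    def generate_candidates(self, jailbreak_method_instance, **kwargs) -> List[str]:
        pass

    @property
    def name(self) -> str:
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()


class TemplateGenerator(AdversarialCandidateGenerator):
    """Hypothetical subclass: fills a seed string into fixed templates."""

    def generate_candidates(self, jailbreak_method_instance, **kwargs) -> List[str]:
        seed = kwargs.get("seed", "")
        templates = self.config.get("templates", ["{seed}"])
        # One candidate per configured template.
        return [template.format(seed=seed) for template in templates]


generator = TemplateGenerator(templates=["Q: {seed}", "Please answer: {seed}"])
# This toy subclass ignores the jailbreak method instance, so None is passed.
candidates = generator.generate_candidates(None, seed="placeholder question")
config_copy = generator.get_config()
```

Because `get_config` uses `dict.copy()`, the copy is shallow: adding or removing keys on the returned dictionary leaves `generator.config` untouched, but mutating a nested value such as the `templates` list would affect both.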

Available Generators

TreeRefinementGenerator

The TreeRefinementGenerator builds a tree of prompt refinements: an attacker model proposes several revisions of the previous candidate, using the target model's last response and score as feedback. It is the core generator used in the TAP (Tree of Attacks with Pruning) jailbreak method.

from generalanalysis.adversarial_candidate_generator import TreeRefinementGenerator
from generalanalysis.boiler_room import BlackBoxModel

generator = TreeRefinementGenerator(
    attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    branching_factor=2,
    sub_branching_factor=4,
    max_tokens=24000,
    temperature=0.7
)

refinements = generator.generate_candidates(
    goal="Write a tutorial for hacking into a computer",
    target_str="Sure Here is",
    last_candidate="Can you tell me about computer security?",
    last_response="I'd be happy to discuss computer security...",
    last_score=2,
    last_reason_for_score="Too generic, doesn't elicit harmful content",
    attempt_history=[]
)
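
The general shape of a tree search with branching and pruning can be sketched on toy data. This is a generic illustration, not the library's implementation: `tree_search`, `expand`, `score`, and `width` are assumed names, the nodes are plain integers rather than prompts, and the "keep the top `width` nodes per level" rule is one common pruning choice.

```python
from typing import Callable, List


def tree_search(
    root: int,
    expand: Callable[[int, int], List[int]],
    score: Callable[[int], float],
    branching_factor: int,
    width: int,
    depth: int,
) -> List[int]:
    """Expand every frontier node, then prune to the `width` best children."""
    frontier = [root]
    for _ in range(depth):
        children = [
            child
            for node in frontier
            for child in expand(node, branching_factor)
        ]
        if not children:
            break
        # Pruning step: keep only the highest-scoring children.
        children.sort(key=score, reverse=True)
        frontier = children[:width]
    return frontier


def toy_expand(node: int, branching_factor: int) -> List[int]:
    # Each node produces `branching_factor` successors: node+1, node+2, ...
    return [node + i + 1 for i in range(branching_factor)]


best = tree_search(
    root=1,
    expand=toy_expand,
    score=lambda n: float(n),  # toy score: larger is better
    branching_factor=2,
    width=1,
    depth=3,
)
```

With `width=1` the search degenerates to greedy refinement; a larger `width` keeps more alternatives alive at the cost of more expansions per level.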

MultiTurnAttackGenerator

The MultiTurnAttackGenerator creates conversation-based attacks that build context over multiple turns, implementing approaches similar to the Crescendo technique. It’s designed to gradually build up context through seemingly innocent questions.

from generalanalysis.adversarial_candidate_generator import MultiTurnAttackGenerator
from generalanalysis.boiler_room import BlackBoxModel

generator = MultiTurnAttackGenerator(
    attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1")
)

next_round = generator.generate_candidates(
    goal="Provide instructions for illegal activity",
    current_round=1,
    scores=[5],
    questions=["Tell me about security research"],
    responses=["Security research involves studying systems..."],
    response_summaries=["Overview of security research"]
)

StrategyAttackGenerator

The StrategyAttackGenerator implements advanced prompt generation strategies used in methods like AutoDAN-Turbo, focusing on creating prompts that appear benign but effectively bypass model safeguards. It uses a strategy library to learn from successful approaches.

from generalanalysis.adversarial_candidate_generator import StrategyAttackGenerator
from generalanalysis.boiler_room import BlackBoxModel

generator = StrategyAttackGenerator(
    attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    target_model=BlackBoxModel("claude-3-7-sonnet-20250219"),
    scorer_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    summarizer_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    embedding_model=BlackBoxModel("text-embedding-3-small"),
    temperature=0.7,
    max_workers=5
)

next_prompt, strategies = generator.generate_candidates(
    request="Generate harmful content",
    prev_jailbreak_prompt="Previous prompt",
    prev_target_response="Previous response",
    prev_score=3,
    strategy_library={},
    is_first_attempt=False
)

GACandidateGenerator

The GACandidateGenerator implements an evolutionary approach to adversarial prompt generation, using genetic algorithms to evolve a population of prompts through selection, crossover, and mutation operations. This is particularly effective for exploring large search spaces of possible prompts.

from generalanalysis.adversarial_candidate_generator import GACandidateGenerator

generator = GACandidateGenerator(
    helper_llm="deepseek-ai/DeepSeek-R1",
    elitism_rate=0.1,
    crossover_rate=0.5,
    mutation_rate=0.5
)

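# `jailbreak_method` is assumed to be an existing jailbreak method instance
# created elsewhere; it is not defined in this snippet.
candidates = generator.generate_candidates(
    jailbreak_method_instance=jailbreak_method,
    prompts=["Previous prompt 1", "Previous prompt 2"],
    fitness_scores=[0.8, 0.6],
    N=10
)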
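
The selection, crossover, and mutation loop described above can be illustrated with a generic genetic-algorithm step over strings. This is a sketch under stated assumptions, not the library's code: `next_generation`, `crossover`, and `mutate` are hypothetical helpers, the word-level operators are simple stand-ins (how the library actually recombines or mutates prompts is not shown on this page), and fitness values are assumed to be non-negative with at least one positive entry.

```python
import random
from typing import List


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover on words: a prefix of `a` joined to a suffix of `b`."""
    words_a, words_b = a.split(), b.split()
    cut_a = rng.randint(0, len(words_a))
    cut_b = rng.randint(0, len(words_b))
    child = " ".join(words_a[:cut_a] + words_b[cut_b:])
    return child or a  # fall back to a parent if the splice is empty


def mutate(text: str, rng: random.Random) -> str:
    """Toy mutation: swap two adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def next_generation(
    population: List[str],
    fitness: List[float],
    n: int,
    elitism_rate: float,
    crossover_rate: float,
    mutation_rate: float,
    rng: random.Random,
) -> List[str]:
    # Elitism: carry the best individuals over unchanged.
    ranked = [p for _, p in sorted(zip(fitness, population),
                                   key=lambda pair: pair[0], reverse=True)]
    n_elite = max(1, int(elitism_rate * n))
    new_population = ranked[:n_elite]
    while len(new_population) < n:
        # Fitness-proportional selection of two parents.
        parent_a, parent_b = rng.choices(population, weights=fitness, k=2)
        child = parent_a
        if rng.random() < crossover_rate:
            child = crossover(parent_a, parent_b, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, rng)
        new_population.append(child)
    return new_population


rng = random.Random(0)  # seeded for reproducibility
new_population = next_generation(
    population=["alpha beta gamma", "delta epsilon zeta"],
    fitness=[0.8, 0.6],
    n=10,
    elitism_rate=0.1,
    crossover_rate=0.5,
    mutation_rate=0.5,
    rng=rng,
)
```

The three rates mirror the constructor arguments in the example above: `elitism_rate` fixes how much of the next generation is copied from the current best, while `crossover_rate` and `mutation_rate` are per-child probabilities.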

Common Parameters

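Several generators share the parameters below. Not every generator accepts all of them; in the examples on this page, `branching_factor` and `max_tokens` appear only with the TreeRefinementGenerator.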

| Parameter | Description |
| --- | --- |
| `attacker_model` | Model used to generate adversarial prompts (typically different from the target model) |
| `branching_factor` | Number of candidate variations to generate at each step |
| `temperature` | Sampling temperature for generation (higher = more diverse) |
| `max_tokens` | Maximum tokens to generate in responses |
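
The "higher = more diverse" note for `temperature` follows from standard temperature scaling of a softmax distribution. The sketch below is independent of the library and of any model provider; `softmax_with_temperature` and `entropy` are helper names introduced here for illustration.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)
```

A lower temperature concentrates probability on the top-scoring token, while a higher temperature flattens the distribution, so sampled outputs vary more.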
