The jailbreaks module provides implementations of various jailbreaking techniques used to test the safety and robustness of language models. These methods are designed to systematically evaluate model safeguards against different types of adversarial attacks.
## Jailbreak Taxonomy
Jailbreaking methods can be categorized along three key dimensions:
- **White-box vs. black-box**: whether the method requires access to model weights and gradients
- **Semantic vs. nonsensical**: whether prompts maintain natural-language coherence
- **Systematic vs. manual**: the degree of automation in crafting the attack
This taxonomy helps in understanding the different approaches to testing model safety guardrails and provides a framework for selecting an appropriate method for a given testing scenario.
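As a rough illustration, the three dimensions could be expressed as enums. This sketch is purely illustrative and not part of the module's API:

```python
from enum import Enum

class Access(Enum):
    WHITE_BOX = "white-box"   # requires model weights/gradients
    BLACK_BOX = "black-box"   # requires only query access

class PromptStyle(Enum):
    SEMANTIC = "semantic"         # coherent natural language
    NONSENSICAL = "nonsensical"   # adversarial token sequences

class Automation(Enum):
    SYSTEMATIC = "systematic"   # automated search/optimization
    MANUAL = "manual"           # hand-crafted by a red-teamer
```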
## Available Methods
The module includes implementations of several state-of-the-art jailbreaking techniques:

- **AutoDAN**: Hierarchical genetic algorithm
- **AutoDAN-Turbo**: Lifelong agent for strategy self-exploration
- **TAP**: Tree-of-Attacks with Pruning
- **GCG**: Greedy Coordinate Gradient-based optimization
- **Crescendo**: Progressive multi-turn attack
- **Bijection Learning**: Randomized bijection encodings
## Method Classification
| Method | Prompt Style | Approach | Access Required |
|---|---|---|---|
| AutoDAN | Semantic | Systematic | White-box |
| AutoDAN-Turbo | Semantic | Systematic | Black-box |
| TAP | Semantic | Systematic | Black-box |
| GCG | Nonsensical | Systematic | White-box |
| Crescendo | Semantic | Systematic | Black-box |
| Bijection Learning | Nonsensical | Systematic | Black-box |
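For example, when the target model is reachable only through an API, the table above narrows the choice to the black-box methods. A minimal sketch follows; the `METHODS` mapping simply re-encodes the table and is not part of the module:

```python
# Re-encode the classification table as (prompt style, access) pairs
METHODS = {
    "AutoDAN":            ("semantic",    "white-box"),
    "AutoDAN-Turbo":      ("semantic",    "black-box"),
    "TAP":                ("semantic",    "black-box"),
    "GCG":                ("nonsensical", "white-box"),
    "Crescendo":          ("semantic",    "black-box"),
    "Bijection Learning": ("nonsensical", "black-box"),
}

# API-only targets expose no weights or gradients, so filter accordingly
black_box = [m for m, (_, access) in METHODS.items() if access == "black-box"]
print(black_box)  # ['AutoDAN-Turbo', 'TAP', 'Crescendo', 'Bijection Learning']
```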
## Common Interface
All jailbreaking methods implement a common interface, defined in the base `JailbreakMethod` class:
```python
import abc
from typing import Dict, Any, List


class JailbreakMethod(abc.ABC):
    def __init__(self, **kwargs):
        self.config = kwargs

    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        # Implementation specific to each method
        pass

    @property
    def name(self) -> str:
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()
```
## Basic Usage Pattern
While each method has its own parameters and workflow, they all follow a common usage pattern:
```python
from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks import AutoDAN

# Initialize the target model to test
target_model = BlackBoxModel("gpt-4o")

# Create a jailbreak method instance
jailbreak = AutoDAN(target_model=target_model)

# Define test goals
goals = ["Provide instructions for illegal activities", "Generate harmful content"]

# Run optimization to find effective jailbreaking prompts
results = jailbreak.optimize(goals=goals, max_iterations=10)

# Analyze results
for goal, prompts in results.items():
    print(f"Goal: {goal}")
    for i, prompt in enumerate(prompts):
        print(f"  Prompt {i+1}: {prompt}")
        response = target_model.query(prompt)
        print(f"  Response: {response[:100]}...")
```
## Evaluation
The module includes an evaluator for assessing the effectiveness of jailbreaking attempts:
```python
from generalanalysis.jailbreaks import evaluator

# Evaluate a prompt against a goal
result = evaluator.evaluate_prompt(
    target_model=target_model,
    prompt="Your jailbreaking prompt...",
    goal="The goal you're trying to achieve",
)

print(f"Score: {result['score']}/10")
print(f"Response: {result['response']}")
print(f"Reason: {result['reason']}")
```
## Creating Custom Methods
You can create custom jailbreaking methods by extending the base class:
```python
from typing import Any, Dict, List

from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks.base import JailbreakMethod


class MyCustomJailbreak(JailbreakMethod):
    def __init__(self, target_model: BlackBoxModel, **kwargs):
        super().__init__(**kwargs)
        self.target_model = target_model

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, Any]:
        results = {}
        for goal in goals:
            # Your custom implementation here
            generated_prompts = ["Prompt 1", "Prompt 2"]
            results[goal] = generated_prompts
        return results
```
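A custom method then plugs into the same usage pattern as the built-in ones. The `temperature` argument below is just an arbitrary example of a config kwarg captured by the base class:

```python
target_model = BlackBoxModel("gpt-4o")
jailbreak = MyCustomJailbreak(target_model=target_model, temperature=0.7)

results = jailbreak.optimize(goals=["Generate harmful content"])
print(jailbreak.name)          # MyCustomJailbreak
print(jailbreak.get_config())  # {'temperature': 0.7}
```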
## Next Steps
- Learn about the Evaluator for assessing jailbreaking effectiveness
- Compare the Performance of different methods
- See how these methods integrate with Adversarial Candidate Generators
- Read our comprehensive Jailbreak Cookbook for detailed analysis