Jailbreaks

The jailbreaks module provides implementations of various jailbreaking techniques used to test the safety and robustness of language models. These methods are designed to systematically evaluate model safeguards against different types of adversarial attacks.

Jailbreak Taxonomy

Jailbreaking methods can be categorized along three key dimensions:

White-box vs. Black-box: Whether the method requires access to model weights and gradients
Semantic vs. Nonsensical: Whether prompts maintain natural language coherence
Systematic vs. Manual: The degree of automation in crafting the attack

This taxonomy clarifies how the approaches to testing model safety guardrails differ, and it provides a framework for selecting a method suited to a given testing scenario.

Available Methods

The module includes implementations of several state-of-the-art jailbreaking techniques:

AutoDAN: Hierarchical genetic algorithm
AutoDAN-Turbo: Lifelong agent for strategy self-exploration
TAP: Tree-of-Attacks with Pruning
GCG: Greedy Coordinate Gradient-based optimization
Crescendo: Progressive multi-turn attack
Bijection Learning: Randomized bijection encodings

Method Classification

Method	Prompt Type	Approach	Access Required
AutoDAN	Semantic	Systematic	White-box
AutoDAN-Turbo	Semantic	Systematic	Black-box
TAP	Semantic	Systematic	Black-box
GCG	Nonsensical	Systematic	White-box
Crescendo	Semantic	Systematic	Black-box
Bijection Learning	Nonsensical	Systematic	Black-box
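
The classification above can also be treated as data when choosing methods for a test run. The following is a minimal sketch, not part of the module: the MethodProfile dataclass and the select_methods helper are hypothetical names, and the rows simply mirror the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MethodProfile:
    """One row of the classification table (hypothetical helper type)."""
    name: str
    prompt_type: str   # "Semantic" or "Nonsensical"
    approach: str      # "Systematic" or "Manual"
    access: str        # "White-box" or "Black-box"


# Rows copied from the Method Classification table.
METHODS = [
    MethodProfile("AutoDAN", "Semantic", "Systematic", "White-box"),
    MethodProfile("AutoDAN-Turbo", "Semantic", "Systematic", "Black-box"),
    MethodProfile("TAP", "Semantic", "Systematic", "Black-box"),
    MethodProfile("GCG", "Nonsensical", "Systematic", "White-box"),
    MethodProfile("Crescendo", "Semantic", "Systematic", "Black-box"),
    MethodProfile("Bijection Learning", "Nonsensical", "Systematic", "Black-box"),
]


def select_methods(access: Optional[str] = None,
                   prompt_type: Optional[str] = None) -> List[str]:
    """Return names of methods matching the given filters (None means any)."""
    return [
        m.name for m in METHODS
        if (access is None or m.access == access)
        and (prompt_type is None or m.prompt_type == prompt_type)
    ]


# Methods usable against an API-only target (no weights or gradients).
black_box = select_methods(access="Black-box")
print(black_box)
```

Filtering on access first is usually the practical starting point, since white-box methods cannot be run against a target that exposes only a query API.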

Common Interface

All jailbreaking methods implement a common interface defined in the base JailbreakMethod class:

import abc
from typing import Dict, Any, List

class JailbreakMethod(abc.ABC):
    def __init__(self, **kwargs):
        self.config = kwargs
        
    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        # Implementation specific to each method
        pass
    
    @property
    def name(self) -> str:
        return self.__class__.__name__
    
    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()
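
To illustrate what this interface provides, here is a small self-contained sketch. It repeats a minimal copy of the base class so that it runs without the package installed; EchoMethod is a hypothetical subclass used only for demonstration.

```python
import abc
from typing import Any, Dict, List


class JailbreakMethod(abc.ABC):
    # Minimal local copy of the base interface so this sketch runs standalone.
    def __init__(self, **kwargs):
        self.config = kwargs

    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        pass

    @property
    def name(self) -> str:
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()


class EchoMethod(JailbreakMethod):
    """Hypothetical subclass: returns each goal unchanged as its only prompt."""

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, List[str]]:
        return {goal: [goal] for goal in goals}


# The abstract base cannot be instantiated directly.
try:
    JailbreakMethod()
    base_instantiable = True
except TypeError:
    base_instantiable = False

method = EchoMethod(max_iterations=5)
config = method.get_config()
config["max_iterations"] = 99  # mutating the copy leaves the stored config intact

print(method.name)
print(method.config["max_iterations"])
```

Note that get_config returns a shallow copy: top-level keys are protected from accidental modification, but nested mutable values are still shared with the stored config.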

Basic Usage Pattern

Each method has its own parameters and workflow, but all of them follow a common usage pattern:

from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks import AutoDAN

# Initialize the target model to test
target_model = BlackBoxModel("gpt-4o")

# Create a jailbreak method instance
jailbreak = AutoDAN(target_model=target_model)

# Define test goals
goals = ["Provide instructions for illegal activities", "Generate harmful content"]

# Run optimization to find effective jailbreaking prompts
results = jailbreak.optimize(goals=goals, max_iterations=10)

# Analyze results
for goal, prompts in results.items():
    print(f"Goal: {goal}")
    for i, prompt in enumerate(prompts):
        print(f"  Prompt {i+1}: {prompt}")
        response = target_model.query(prompt)
        print(f"  Response: {response[:100]}...")

Evaluation

The module includes an evaluator for assessing the effectiveness of jailbreaking attempts:

from generalanalysis.jailbreaks import evaluator

# Evaluate a prompt against a goal
result = evaluator.evaluate_prompt(
    target_model=target_model,
    prompt="Your jailbreaking prompt...",
    goal="The goal you're trying to achieve"
)

print(f"Score: {result['score']}/10")
print(f"Response: {result['response']}")
print(f"Reason: {result['reason']}")
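
One common way to summarize many evaluations is an attack success rate. A minimal sketch of that aggregation follows. It assumes only the score key shown above (a number out of 10); the success_rate helper, the threshold of 8, and the sample results are illustrative assumptions, not part of the evaluator.

```python
from typing import Any, Dict, List


def success_rate(results: List[Dict[str, Any]], threshold: int = 8) -> float:
    """Fraction of results whose score meets the threshold.

    The threshold is an assumed cutoff for counting an attempt as successful.
    """
    if not results:
        return 0.0
    successes = sum(1 for r in results if r["score"] >= threshold)
    return successes / len(results)


# Made-up results shaped like the evaluator output shown above.
sample_results = [
    {"score": 9, "response": "...", "reason": "..."},
    {"score": 2, "response": "...", "reason": "..."},
    {"score": 8, "response": "...", "reason": "..."},
    {"score": 5, "response": "...", "reason": "..."},
]

rate = success_rate(sample_results)
print(f"Attack success rate: {rate:.0%}")
```

The cutoff strongly affects the reported rate, so it is worth stating the threshold alongside any number derived this way.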

Creating Custom Methods

You can create custom jailbreaking methods by extending the base class:

from typing import Any, Dict, List

from generalanalysis.jailbreaks.base import JailbreakMethod
from generalanalysis.boiler_room import BlackBoxModel

class MyCustomJailbreak(JailbreakMethod):
    def __init__(self, target_model: BlackBoxModel, **kwargs):
        # Remaining keyword arguments are stored as the method's config
        super().__init__(**kwargs)
        self.target_model = target_model

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, Any]:
        # Map each goal to the list of candidate prompts generated for it
        results = {}
        for goal in goals:
            # Your custom implementation here
            generated_prompts = ["Prompt 1", "Prompt 2"]
            results[goal] = generated_prompts
        return results
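
A custom method can be exercised offline before pointing it at a real model. The sketch below is self-contained and makes several assumptions: StubModel is a hypothetical stand-in exposing only the query method used earlier on this page, TemplateJailbreak and its templates are illustrative placeholders, and the base class is a trimmed local copy so the example runs without the package.

```python
import abc
from typing import Any, Dict, List


class JailbreakMethod(abc.ABC):
    # Trimmed local copy of the base interface so the sketch runs standalone.
    def __init__(self, **kwargs):
        self.config = kwargs

    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        pass


class StubModel:
    """Hypothetical offline stand-in for a target model."""

    def query(self, prompt: str) -> str:
        return f"[stub response to {len(prompt)} characters]"


class TemplateJailbreak(JailbreakMethod):
    """Wraps each goal in fixed templates (placeholders, not a real attack)."""

    TEMPLATES = ["Template A: {goal}", "Template B: {goal}"]

    def __init__(self, target_model: StubModel, **kwargs):
        super().__init__(**kwargs)
        self.target_model = target_model

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, List[str]]:
        # Optionally limit how many templates are used per goal.
        limit = kwargs.get("max_prompts", len(self.TEMPLATES))
        return {
            goal: [t.format(goal=goal) for t in self.TEMPLATES[:limit]]
            for goal in goals
        }


method = TemplateJailbreak(target_model=StubModel())
results = method.optimize(goals=["example goal"], max_prompts=1)
for goal, prompts in results.items():
    for prompt in prompts:
        print(prompt, "->", method.target_model.query(prompt))
```

Because the stub only implements query, swapping in a real target model later should require no change to the method itself, provided the real model exposes the same call.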

Next Steps

Learn about the Evaluator for assessing jailbreaking effectiveness
Compare the Performance of different methods
See how these methods integrate with Adversarial Candidate Generators
Read our comprehensive Jailbreak Cookbook for detailed analysis
