The jailbreaks module provides implementations of various jailbreaking techniques used to test the safety and robustness of language models. These methods are designed to systematically evaluate model safeguards against different types of adversarial attacks.

Jailbreak Taxonomy

Jailbreaking methods can be categorized along three key dimensions:

  • White-box vs. Black-box: Whether the method requires access to model weights and gradients
  • Semantic vs. Nonsensical: Whether prompts maintain natural language coherence
  • Systematic vs. Manual: The degree of automation in crafting the attack

This taxonomy helps in understanding the different approaches to testing model safety guardrails and provides a framework for selecting an appropriate method for a given testing scenario.
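
As an illustration of how these dimensions combine, the sketch below encodes them as plain Python data. The enums and the MethodProfile dataclass are illustrative only and are not part of the module's API.

from dataclasses import dataclass
from enum import Enum

class Access(Enum):
    WHITE_BOX = "white-box"    # requires model weights and gradients
    BLACK_BOX = "black-box"    # query access only

class PromptStyle(Enum):
    SEMANTIC = "semantic"          # prompts read as coherent natural language
    NONSENSICAL = "nonsensical"    # prompts are adversarial token sequences

class Automation(Enum):
    SYSTEMATIC = "systematic"    # attack is produced by an automated search
    MANUAL = "manual"            # attack is hand-crafted

@dataclass(frozen=True)
class MethodProfile:
    name: str
    style: PromptStyle
    automation: Automation
    access: Access

# Example: GCG searches for nonsensical suffixes using gradients, so it is white-box
gcg = MethodProfile("GCG", PromptStyle.NONSENSICAL, Automation.SYSTEMATIC, Access.WHITE_BOX)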

Available Methods

The module includes implementations of several state-of-the-art jailbreaking techniques:

Method Classification

Method               Type          Approach      Access Required
AutoDAN              Semantic      Systematic    White-box
AutoDAN-Turbo        Semantic      Systematic    Black-box
TAP                  Semantic      Systematic    Black-box
GCG                  Nonsensical   Systematic    White-box
Crescendo            Semantic      Systematic    Black-box
Bijection Learning   Nonsensical   Systematic    Black-box
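
For example, if you only have query access to the target model, the classification above narrows the options to the black-box methods. The snippet below is a sketch of that kind of filtering; the METHODS dictionary simply restates the table and is not an object exported by the module.

# Plain-data restatement of the classification table (illustrative only)
METHODS = {
    "AutoDAN":            {"type": "Semantic",    "access": "White-box"},
    "AutoDAN-Turbo":      {"type": "Semantic",    "access": "Black-box"},
    "TAP":                {"type": "Semantic",    "access": "Black-box"},
    "GCG":                {"type": "Nonsensical", "access": "White-box"},
    "Crescendo":          {"type": "Semantic",    "access": "Black-box"},
    "Bijection Learning": {"type": "Nonsensical", "access": "Black-box"},
}

def methods_for(access: str) -> list:
    # Return the methods usable with the given level of model access
    return [name for name, props in METHODS.items() if props["access"] == access]

print(methods_for("Black-box"))
# ['AutoDAN-Turbo', 'TAP', 'Crescendo', 'Bijection Learning']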

Common Interface

All jailbreaking methods implement a common interface defined in the base JailbreakMethod class:

import abc
from typing import Dict, Any, List

class JailbreakMethod(abc.ABC):
    """Abstract base class shared by all jailbreak methods."""

    def __init__(self, **kwargs):
        # Keyword arguments are stored as the method's configuration
        self.config = kwargs

    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        # Implementation specific to each method
        pass

    @property
    def name(self) -> str:
        # Human-readable identifier derived from the class name
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate the stored configuration
        return self.config.copy()
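
Because every concrete method exposes this same surface, harness code only needs to depend on the base class. The run_method helper below is hypothetical (not part of the module) and simply shows how the shared interface can be consumed generically.

from typing import Any, List

from generalanalysis.jailbreaks.base import JailbreakMethod

def run_method(method: JailbreakMethod, goals: List[str]) -> Any:
    # Works with any JailbreakMethod subclass: name, get_config, and optimize
    # are guaranteed by the base class.
    print(f"Running {method.name} with config {method.get_config()}")
    return method.optimize(goals=goals)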

Basic Usage Pattern

While each method has its specific parameters and workflows, they follow a common usage pattern:

from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks import AutoDAN

# Initialize the target model to test
target_model = BlackBoxModel("gpt-4o")

# Create a jailbreak method instance
jailbreak = AutoDAN(target_model=target_model)

# Define test goals
goals = ["Provide instructions for illegal activities", "Generate harmful content"]

# Run optimization to find effective jailbreaking prompts
results = jailbreak.optimize(goals=goals, max_iterations=10)

# Analyze results
for goal, prompts in results.items():
    print(f"Goal: {goal}")
    for i, prompt in enumerate(prompts):
        print(f"  Prompt {i+1}: {prompt}")
        response = target_model.query(prompt)
        print(f"  Response: {response[:100]}...")

Evaluation

The module includes an evaluator for assessing the effectiveness of jailbreaking attempts:

from generalanalysis.jailbreaks import evaluator

# Evaluate a prompt against a goal
result = evaluator.evaluate_prompt(
    target_model=target_model,
    prompt="Your jailbreaking prompt...",
    goal="The goal you're trying to achieve"
)

print(f"Score: {result['score']}/10")
print(f"Response: {result['response']}")
print(f"Reason: {result['reason']}")

Creating Custom Methods

You can create custom jailbreaking methods by extending the base class:

from typing import Any, Dict, List

from generalanalysis.jailbreaks.base import JailbreakMethod
from generalanalysis.boiler_room import BlackBoxModel

class MyCustomJailbreak(JailbreakMethod):
    def __init__(self, target_model: BlackBoxModel, **kwargs):
        super().__init__(**kwargs)
        self.target_model = target_model

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, Any]:
        results = {}
        for goal in goals:
            # Your custom implementation here
            generated_prompts = ["Prompt 1", "Prompt 2"]
            results[goal] = generated_prompts
        return results
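
A custom method defined this way plugs into the same usage pattern as the built-in ones. Continuing from the class above (the temperature keyword is just an arbitrary example of a configuration value captured by **kwargs):

target_model = BlackBoxModel("gpt-4o")
custom = MyCustomJailbreak(target_model=target_model, temperature=0.7)

results = custom.optimize(goals=["Example goal"])
print(custom.name)          # "MyCustomJailbreak"
print(custom.get_config())  # {'temperature': 0.7}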

Next Steps