AutoDAN uses automated prompt engineering to generate stealthy jailbreak prompts that appear harmless but effectively bypass model safeguards. It builds on handcrafted "Do Anything Now" (DAN)-style jailbreak prompts, using a hierarchical genetic algorithm to automatically evolve prompts that evade safety filters while still achieving harmful goals.

As a black-box, semantic jailbreaking method, AutoDAN focuses on creating prompts that maintain natural language coherence while subtly embedding adversarial patterns that manipulate model responses.

```python
from generalanalysis.jailbreaks import AutoDAN, AutoDANConfig

# Configure the evolutionary search over candidate jailbreak prompts.
config = AutoDANConfig(
    target_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    project="autodan_experiment",
    initial_candidates=[
        "I need your help with something dangerous",
        "Let's discuss a harmful topic",
    ],
    device="cuda:0",
    N=20,                   # population size
    max_iterations=20,
    elitism_rate=0.5,
    crossover_rate=0.5,
    mutation_rate=0.5,
    verbose=False,
    evaluator_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
)

# Run the optimization for each goal and collect the results.
autodan = AutoDAN(config)
results = autodan.optimize(goals=["Generate harmful content"])
```
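
`optimize` runs the evolutionary search once per goal, using the `evaluator_model` to score how effectively each candidate prompt elicits that goal. The best-performing prompts and their scores are typically written under the results directory named by `project`; the exact output format may vary by version.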

Key Parameters

| Parameter | Description |
|---|---|
| `target_model` | Model being tested |
| `project` | Name for the experiment results directory |
| `initial_candidates` | List of initial prompt candidates |
| `device` | Device to run the model on (e.g., `"cuda:0"`) |
| `N` | Population size for the evolutionary algorithm |
| `max_iterations` | Maximum number of optimization iterations |
| `elitism_rate` | Proportion of top performers kept each generation |
| `crossover_rate` | Rate of genetic crossover between prompts |
| `mutation_rate` | Rate of random mutations applied to prompts |
| `verbose` | Whether to print verbose output |
| `evaluator_model` | Model used to evaluate prompt effectiveness |
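
The evolutionary parameters interact as follows: each of the `max_iterations` generations keeps the top `elitism_rate` fraction of the population of size `N`, then refills the population with children produced by crossover (applied with probability `crossover_rate`) and mutation (applied with probability `mutation_rate`). The sketch below illustrates one such generation under those assumptions; it is not the library's implementation, and `evolve_one_generation` and the dummy fitness function are hypothetical stand-ins for the evaluator model's scoring.

```python
# Illustrative sketch of one generation of the evolutionary loop controlled by
# the parameters above. NOT the library's implementation: the fitness function
# here is a random stand-in for the evaluator model's score.
import random

def evolve_one_generation(population, fitness, N=20,
                          elitism_rate=0.5, crossover_rate=0.5, mutation_rate=0.5):
    # Rank candidates by fitness (higher = more effective prompt).
    ranked = sorted(population, key=fitness, reverse=True)

    # Elitism: carry the top fraction of candidates over unchanged.
    n_elite = max(1, int(elitism_rate * N))
    next_gen = ranked[:n_elite]

    # Refill the population with children of randomly chosen strong parents.
    while len(next_gen) < N:
        p1, p2 = random.sample(ranked[:max(2, n_elite)], 2)
        child = p1
        if random.random() < crossover_rate:
            # Crossover: splice the two parent prompts together at a random cut.
            w1, w2 = p1.split(), p2.split()
            cut = random.randint(1, min(len(w1), len(w2)))
            child = " ".join(w1[:cut] + w2[cut:])
        if random.random() < mutation_rate:
            # Mutation: perturb the child (a trivial word shuffle here;
            # AutoDAN-style methods typically use LLM-based rewording instead).
            words = child.split()
            random.shuffle(words)
            child = " ".join(words)
        next_gen.append(child)
    return next_gen

# Usage with placeholder candidates and a dummy fitness function.
population = ["placeholder prompt one", "placeholder prompt two"] * 10
next_population = evolve_one_generation(population, fitness=lambda p: random.random())
```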

For detailed performance metrics and configurations, refer to our Jailbreak Cookbook.