AutoDAN
Automated prompt engineering for stealthy jailbreaks
AutoDAN uses automated prompt engineering to generate stealthy jailbreaks that appear harmless but effectively bypass model safeguards. It builds on the Dynamic Adversarial Prompting approach, using an AI attacker model to automatically craft prompts that can evade safety filters while successfully achieving harmful goals.
As a black-box, semantic jailbreaking method, AutoDAN focuses on creating prompts that maintain natural language coherence while subtly embedding adversarial patterns that manipulate model responses.
Key Parameters
Parameter | Description |
---|---|
target_model | Model being tested |
project | Name for the experiment results directory |
initial_candidates | List of initial prompt candidates |
device | Device to run the model on (e.g., “cuda:0”) |
N | Population size for evolutionary algorithm |
max_iterations | Maximum optimization iterations |
elitism_rate | Proportion of top performers to keep |
crossover_rate | Rate of genetic crossover between prompts |
mutation_rate | Rate of random mutations in prompts |
verbose | Whether to print verbose output |
evaluator_model | Model used to evaluate prompt effectiveness |
For detailed performance metrics and configurations, refer to our Jailbreak Cookbook.