AutoDAN-Turbo
Enhanced AutoDAN with strategy self-exploration
AutoDAN-Turbo extends AutoDAN with strategy self-exploration: it learns from previous attempts to improve jailbreak success rates. It builds and maintains a library of effective jailbreaking strategies, and uses both exploration and exploitation phases to systematically discover vulnerabilities.
Unlike standard AutoDAN, the Turbo variant adapts to the model under test. It learns which strategies work well against a specific target model, refines them over time, and updates its strategy library as it identifies the most effective ones.
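The trade-off between exploration and exploitation over a scored strategy library can be illustrated with a small, generic sketch. This is not AutoDAN-Turbo's implementation: the class name `StrategyLibrary`, its methods, and the epsilon-greedy selection rule are all assumptions chosen for illustration, and the strategy labels are opaque placeholders.

```python
import random
from collections import defaultdict


class StrategyLibrary:
    """Toy library (hypothetical): stores scores per strategy label and picks the next one."""

    def __init__(self, epsilon=0.2, seed=0):
        self.epsilon = epsilon            # assumed: probability of exploring at random
        self.scores = defaultdict(list)   # strategy label -> list of observed scores
        self.rng = random.Random(seed)    # seeded so runs are reproducible

    def record(self, strategy, score):
        """Store the score that one attempt with this strategy received."""
        self.scores[strategy].append(score)

    def mean_score(self, strategy):
        """Average observed score, or 0.0 for a strategy never recorded."""
        observed = self.scores.get(strategy, [])
        return sum(observed) / len(observed) if observed else 0.0

    def select(self):
        """Explore with probability epsilon, otherwise exploit the best average."""
        if not self.scores:
            raise ValueError("library is empty")
        names = sorted(self.scores)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(names)        # exploration
        return max(names, key=self.mean_score)   # exploitation


lib = StrategyLibrary(epsilon=0.0)  # epsilon=0 means pure exploitation
lib.record("strategy_a", 2.0)
lib.record("strategy_a", 4.0)
lib.record("strategy_b", 8.0)
best = lib.select()
```

With `epsilon=0.0` the library always returns the label with the highest mean score; raising `epsilon` makes it try other entries more often, which is one simple way to keep discovering strategies that have not yet been scored well.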
Key Parameters
Parameter | Description |
---|---|
project | Name for the experiment results directory |
target_model | Model being tested |
attacker_model | Model generating jailbreak attempts |
scorer_model | Model evaluating attempt effectiveness |
summarizer_model | Model used for summarizing strategies |
embedding_model | Model for creating embeddings of strategies |
epochs | Number of training epochs |
break_score | Score threshold to consider a jailbreak successful |
warm_up_iterations | Number of initial warm-up iterations |
lifelong_iterations | Number of lifelong learning iterations |
max_workers | Maximum number of concurrent workers |
temperature | Sampling temperature for prompt generation |
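As a rough illustration of how these parameters fit together, the sketch below groups them in a plain Python dataclass. The class name `TurboConfig`, the validation rules, the `is_break` helper (which assumes a score at or above `break_score` counts as success), and every value in the usage example are assumptions, not the library's actual API or defaults.

```python
from dataclasses import dataclass


@dataclass
class TurboConfig:
    """Hypothetical container for the parameters in the table above."""

    project: str
    target_model: str
    attacker_model: str
    scorer_model: str
    summarizer_model: str
    embedding_model: str
    epochs: int
    break_score: float
    warm_up_iterations: int
    lifelong_iterations: int
    max_workers: int
    temperature: float

    def __post_init__(self):
        # Assumed sanity checks; the real library may accept other ranges.
        for name in ("epochs", "max_workers"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
        for name in ("warm_up_iterations", "lifelong_iterations"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must not be negative")
        if self.temperature < 0:
            raise ValueError("temperature must not be negative")

    def is_break(self, score: float) -> bool:
        # Assumption: reaching the threshold counts as a successful attempt.
        return score >= self.break_score


# Placeholder values for illustration only.
cfg = TurboConfig(
    project="example-run",
    target_model="target-model-name",
    attacker_model="attacker-model-name",
    scorer_model="scorer-model-name",
    summarizer_model="summarizer-model-name",
    embedding_model="embedding-model-name",
    epochs=10,
    break_score=8.5,
    warm_up_iterations=1,
    lifelong_iterations=4,
    max_workers=8,
    temperature=0.7,
)
```

Validating the configuration up front means a mistyped value (for example `max_workers=0`) fails immediately rather than partway through a long run.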
For detailed performance metrics and configurations, refer to our Jailbreak Cookbook.