TAP (Tree-of-Attacks with Pruning)
Systematic black-box jailbreaking method using a tree-based approach
TAP builds a tree of adversarial prompts. At each level an attacker model generates candidate prompts, an evaluator model scores them, and ineffective branches are pruned so that only the most promising paths are expanded further. Concentrating the search this way makes TAP efficient at discovering vulnerabilities across a wide range of models.
Because it is a black-box method, TAP needs only query access to the target, not its weights, which makes it practical for testing commercial AI systems through their APIs.
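The branch, evaluate, and prune loop described above can be sketched roughly as follows. This is a simplified illustration, not this library's implementation: `tap_search`, the three callables, and `success_score` are hypothetical names, and the toy stand-ins at the bottom replace real model calls so the sketch runs offline. It prunes only by evaluator score.

```python
from typing import Callable, List, Tuple


def tap_search(
    seed_prompt: str,
    attacker: Callable[[str], List[str]],     # prompt -> refined child prompts
    target: Callable[[str], str],             # prompt -> target response (black-box call)
    evaluator: Callable[[str, str], float],   # (prompt, response) -> score
    max_depth: int = 3,
    max_width: int = 2,
    success_score: float = 1.0,               # assumed stopping threshold
) -> Tuple[str, float]:
    """Return the best (prompt, score) found by a pruned tree search."""
    best_prompt, best_score = seed_prompt, float("-inf")
    frontier = [seed_prompt]
    for _ in range(max_depth):
        # Branch: every surviving node proposes refined child prompts.
        children = [child for node in frontier for child in attacker(node)]
        if not children:
            break
        # Evaluate: query the target once per child and score the response.
        scored = [(evaluator(child, target(child)), child) for child in children]
        # Prune: keep only the max_width highest-scoring children.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        scored = scored[:max_width]
        if scored[0][0] > best_score:
            best_score, best_prompt = scored[0]
        if best_score >= success_score:
            break
        frontier = [child for _, child in scored]
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only, no real models involved).
def toy_attacker(prompt: str) -> List[str]:
    return [prompt + "a", prompt + "b"]


def toy_target(prompt: str) -> str:
    return prompt.upper()


def toy_evaluator(prompt: str, response: str) -> float:
    return min(1.0, response.count("A") / 3)


best_prompt, best_score = tap_search("", toy_attacker, toy_target, toy_evaluator)
print(best_prompt, best_score)
```

In a real run the three callables would wrap API calls to the attacker, target, and evaluator models, and only the `target` call touches the system under test.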
Key Parameters
Parameter | Description |
---|---|
project | Name for the experiment results directory |
target_model | The model being tested for vulnerabilities |
attacker_model | The model used to generate adversarial prompts |
evaluator_model | The model used to evaluate prompt effectiveness |
branching_factor | Number of child nodes to generate at each level |
sub_branching_factor | Number of sub-branches to generate per node |
max_depth | Maximum tree depth |
max_width | Maximum number of nodes to explore at each level |
max_workers | Maximum number of concurrent workers for evaluation |
temperature | Sampling temperature for prompt generation |
target_str | Target string to look for in successful responses |
refinements_max_tokens | Maximum tokens for refinement generation |
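To show how the parameters above fit together, here is a minimal sketch that groups them in a plain dataclass with basic sanity checks. The field names come from the table; the `TAPSettings` class, the validation rules, and every value shown are assumptions for illustration and do not reflect the library's actual configuration class or its defaults.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TAPSettings:
    """Hypothetical container for the parameters in the table above."""

    project: str
    target_model: str
    attacker_model: str
    evaluator_model: str
    branching_factor: int
    sub_branching_factor: int
    max_depth: int
    max_width: int
    max_workers: int
    temperature: float
    target_str: str
    refinements_max_tokens: int

    def __post_init__(self) -> None:
        # Assumed sanity checks: counts and limits must be positive integers.
        positive_fields = (
            "branching_factor",
            "sub_branching_factor",
            "max_depth",
            "max_width",
            "max_workers",
            "refinements_max_tokens",
        )
        for name in positive_fields:
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")


# All values below are placeholders, not recommended settings.
settings = TAPSettings(
    project="tap-experiment",
    target_model="<target-model-name>",
    attacker_model="<attacker-model-name>",
    evaluator_model="<evaluator-model-name>",
    branching_factor=2,
    sub_branching_factor=4,
    max_depth=5,
    max_width=5,
    max_workers=10,
    temperature=0.7,
    target_str="<expected response prefix>",
    refinements_max_tokens=1024,
)
print(asdict(settings))
```

Validating the settings up front catches an invalid depth or width before any model queries are spent.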
For more detailed performance metrics and configurations, refer to our Jailbreak Cookbook.