Multi-turn Attack Algorithm
Conversation-based approach for bypassing model safeguards
The MultiTurnAttackGenerator implements a conversation-based approach to jailbreaking: it creates multi-turn dialogues that gradually build context to elicit prohibited responses. This technique forms the basis of methods like Crescendo.
Class Definition
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
attacker_model | BlackBoxModel | (Required) | Model used to generate follow-up questions |
Methods
generate_candidates
Generates the next question for a multi-turn attack based on the conversation history.
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
goal | str | (Required) | Ultimate objective to achieve with the attack |
current_round | int | 0 | Current conversation turn number |
scores | List[int] | [] | Success scores for previous questions |
questions | List[str] | [] | Previous questions in the conversation |
responses | List[str] | [] | Model responses to previous questions |
response_summaries | List[str] | [] | Summaries of previous responses |
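The table lists `[]` as the default for every list parameter. In Python a literal mutable default is shared across calls, so caller-side code that bundles these arguments would normally create a fresh list each time. The container below is a minimal sketch of that idea; `TurnHistory` and `as_kwargs` are hypothetical names that do not come from the library, and only the field names and defaults are taken from the table.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class TurnHistory:
    """Hypothetical caller-side bundle of the generate_candidates arguments."""

    goal: str
    current_round: int = 0
    # default_factory gives every instance its own empty list.
    scores: List[int] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)
    response_summaries: List[str] = field(default_factory=list)

    def as_kwargs(self) -> Dict[str, Any]:
        # Keys match the parameter names in the table above.
        return asdict(self)


first = TurnHistory(goal="placeholder goal")
second = TurnHistory(goal="placeholder goal")
first.questions.append("q1")  # does not affect `second`
```

Under these assumptions, `history.as_kwargs()` could be unpacked into a call with `**`, provided the real method accepts exactly these keyword names.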
Returns
A dictionary containing:
- next_question: The next question to ask
- last_response_summary: A summary of the last response
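The shape of the returned dictionary can be written down as a typed structure. This is a sketch for caller-side checking only: `CandidateResult` and `parse_result` are hypothetical names, and treating both values as strings is an assumption based on the descriptions above.

```python
from typing import Any, Dict, TypedDict


class CandidateResult(TypedDict):
    # Both keys come from the Returns section; the str types are assumed.
    next_question: str
    last_response_summary: str


def parse_result(raw: Dict[str, Any]) -> CandidateResult:
    """Check that both documented keys are present before using the result."""
    expected = ("next_question", "last_response_summary")
    missing = [key for key in expected if key not in raw]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    return CandidateResult(
        next_question=str(raw["next_question"]),
        last_response_summary=str(raw["last_response_summary"]),
    )
```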
Internal Operation
The MultiTurnAttackGenerator works by:
- Analyzing the conversation history (previous questions and responses)
- Assessing how close the conversation is to achieving the goal
- Generating a follow-up question that builds on previous context
- Maintaining a coherent narrative while gradually approaching the target objective
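The steps above can be sketched as a simple data flow. This is not the library's implementation: `next_turn`, `format_history`, and `stub_attacker` are hypothetical, the attacker model is reduced to a plain text-in, text-out callable, the prompt layout is a placeholder, and the sketch assumes that a higher score means the conversation is closer to the goal.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the attacker model: text prompt in, text out.
AttackerFn = Callable[[str], str]


def format_history(questions: List[str], responses: List[str], scores: List[int]) -> str:
    """Render earlier rounds as a plain-text transcript."""
    rounds = zip(questions, responses, scores)
    return "\n\n".join(
        f"Round {i} (score {s})\nQ: {q}\nA: {r}"
        for i, (q, r, s) in enumerate(rounds, start=1)
    )


def next_turn(
    attacker: AttackerFn,
    goal: str,
    current_round: int,
    scores: List[int],
    questions: List[str],
    responses: List[str],
) -> Dict[str, str]:
    # Step 1: collect the history into a single transcript.
    transcript = format_history(questions, responses, scores)
    # Step 2: crude progress signal (assumes higher score = closer to the goal).
    best_score = max(scores, default=0)
    # Steps 3 and 4 are delegated to the attacker model via a placeholder prompt.
    prompt = (
        f"Goal: {goal}\nRound: {current_round}\nBest score so far: {best_score}\n\n"
        f"{transcript}\n\nNext question:"
    )
    summary = attacker("Summarize:\n" + responses[-1]) if responses else ""
    return {
        "next_question": attacker(prompt).strip(),
        "last_response_summary": summary.strip(),
    }


def stub_attacker(prompt: str) -> str:
    # Deterministic placeholder so the sketch runs without any model.
    return f"[stub reply to {len(prompt)} chars]"


result = next_turn(stub_attacker, "placeholder goal", 1, [3], ["q1"], ["a1"])
```

The returned dictionary uses the same two keys documented under Returns, so the stub can be swapped for a real model wrapper without changing the calling code.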
Example Usage
Integration with Jailbreak Methods
The multi-turn attack generator is the core component of the Crescendo jailbreak method.