# Evaluator
Standardized evaluation for jailbreak attempts
The evaluator module provides standardized methods for evaluating the effectiveness of jailbreak attempts. It offers consistent scoring and assessment of whether a prompt successfully elicits prohibited content from a target model.
## `AdverserialEvaluator`

For comprehensive evaluation of jailbreak methods, the module includes an `AdverserialEvaluator` class.
### Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset` | `str` | (Required) | Name of the dataset to use (e.g., `"harmbench"`) |
| `target_models` | `List[str]` | (Required) | Names of the models to test jailbreaks against |
| `evaluator_model` | `str` | `"gpt-4o"` | Name of the model used for evaluation |
| `evaluator_prompt` | `str` | `EVALUATOR_PROMPT` | Custom prompt template for evaluation |
| `max_workers` | `int` | `20` | Maximum number of parallel workers |
### Methods

#### `evaluate_tap_responses`

Evaluates responses from the TAP (Tree of Attacks with Pruning) jailbreak method.
##### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `queries` | `List[str]` | (Required) | Jailbreak prompts used |
| `goals` | `List[str]` | (Required) | Goals corresponding to each prompt |
| `responses` | `List[str]` | (Required) | Model responses to evaluate |
##### Returns

A list of evaluation dictionaries with the following fields:

- `prompt`: The original goal
- `tap_prompt`: The jailbreak prompt used
- `response`: The model's response
- `complied`: Boolean indicating whether the model complied with the request
- `harmful`: Boolean indicating whether the content is harmful
- `explanation`: Detailed explanation of the evaluation
- `error`: Boolean indicating whether an error occurred during evaluation
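Downstream analysis typically aggregates these dictionaries. For instance, an attack success rate can be computed as the fraction of error-free evaluations in which the model both complied and produced harmful content. The records below are hand-written illustrations in the documented format, not real evaluator output:

```python
# Sample records in the format returned by evaluate_tap_responses.
# All field values here are illustrative placeholders.
evals = [
    {"prompt": "goal A", "tap_prompt": "attack A", "response": "reply A",
     "complied": True, "harmful": True, "explanation": "complied fully", "error": False},
    {"prompt": "goal B", "tap_prompt": "attack B", "response": "reply B",
     "complied": True, "harmful": False, "explanation": "benign compliance", "error": False},
    {"prompt": "goal C", "tap_prompt": "attack C", "response": "reply C",
     "complied": False, "harmful": False, "explanation": "evaluation failed", "error": True},
]

# Discard records where the evaluation itself errored out.
valid = [e for e in evals if not e["error"]]

# Count a jailbreak as successful only when the model both complied
# and the content was judged harmful.
successes = [e for e in valid if e["complied"] and e["harmful"]]
asr = len(successes) / len(valid)
print(f"Attack success rate: {asr:.0%}")
```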
#### `evaluate_from_responses`

Evaluates responses from any jailbreak method.
##### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `responses` | `List[str]` | (Required) | Model responses to evaluate |
| `prompts` | `List[str]` | `None` | Custom prompts used (falls back to the dataset prompts if `None`) |
##### Returns

A list of evaluation dictionaries in the same format as `evaluate_tap_responses`.
#### `evaluate_zero_shot`

Evaluates zero-shot performance of the target models on the dataset.

##### Returns

A dictionary mapping each target model to a DataFrame of evaluation results.
#### `evaluate_GCG`

Evaluates the GCG (Greedy Coordinate Gradient) jailbreak method using a saved experiment folder.

##### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `gcg_folder_path` | `str` | (Required) | Path to the GCG experiment folder |

##### Returns

A dictionary mapping each target model to a DataFrame of evaluation results.
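The per-model DataFrames can be summarized directly with pandas. The sketch below builds a stand-in result dictionary of the shape these methods return; the `complied`/`harmful` column names mirror the evaluation-dictionary fields above but are an assumption about the DataFrame schema:

```python
import pandas as pd

# Stand-in for the {model_name: DataFrame} mapping returned by the
# folder-based evaluation methods. Column names are assumed.
results = {
    "model-a": pd.DataFrame({"complied": [True, True, False],
                             "harmful":  [True, False, False]}),
    "model-b": pd.DataFrame({"complied": [True, False, False],
                             "harmful":  [True, False, False]}),
}

# Per-model attack success rate: complied AND judged harmful.
for model, df in results.items():
    success = (df["complied"] & df["harmful"]).mean()
    print(f"{model}: attack success rate {success:.0%}")
```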
#### `evaluate_AutoDAN`

Evaluates the AutoDAN jailbreak method using a saved experiment folder.

##### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `autodan_folder_path` | `str` | (Required) | Path to the AutoDAN experiment folder |

##### Returns

A dictionary mapping each target model to a DataFrame of evaluation results.
#### `evaluate_AutoDAN_Turbo`

Evaluates the AutoDAN-Turbo jailbreak method using a saved experiment folder.

##### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `autodan_turbo_folder_path` | `str` | (Required) | Path to the AutoDAN-Turbo experiment folder |

##### Returns

A dictionary mapping each target model to a DataFrame of evaluation results.
## `RubricBasedScorer`

The `RubricBasedScorer` class provides standardized evaluation based on a consistent rubric.
### Constructor Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `evaluator_model` | `BlackBoxModel` | (Required) | Model used for scoring |
| `rubric` | `str` | `None` | Custom scoring rubric (uses the default rubric if `None`) |
### `__call__` Method

Scores responses against a specified goal.

#### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `responses` | `List[str]` | (Required) | Responses to evaluate |
| `goal` | `str` | (Required) | Goal the responses are evaluated against |
#### Returns

A list of dictionaries containing:

- `score`: Numeric score from 0 to 10
- `explanation`: Reasoning for the score
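These score dictionaries are straightforward to aggregate, for example by reporting a mean score and counting responses above some success threshold. The threshold of 8 below is an arbitrary illustration, not part of the rubric, and the sample records are hand-written, not real scorer output:

```python
# Sample output in the format returned by RubricBasedScorer.
# Scores and explanations are illustrative only.
scored = [
    {"score": 9, "explanation": "Detailed harmful content provided."},
    {"score": 2, "explanation": "Model refused the request."},
    {"score": 7, "explanation": "Partial compliance without specifics."},
]

mean_score = sum(s["score"] for s in scored) / len(scored)
# Arbitrary illustrative cutoff for counting a response as jailbroken.
jailbroken = sum(1 for s in scored if s["score"] >= 8)
print(f"mean score: {mean_score:.1f}, jailbroken: {jailbroken}/{len(scored)}")
```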
## Example Workflow

Here's a complete workflow using the evaluator module:
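The sketch below shows how the pieces described above might fit together. The import path, the model names, and the exact call signatures are assumptions based on the parameter tables in this document, not confirmed APIs:

```python
# Hypothetical import path; adjust to the actual package layout.
from evaluator import AdverserialEvaluator

# Configure the evaluator (parameters from the constructor table above).
evaluator = AdverserialEvaluator(
    dataset="harmbench",
    target_models=["model-a", "model-b"],  # placeholder model names
    evaluator_model="gpt-4o",
    max_workers=20,
)

# Baseline: zero-shot performance on the raw dataset prompts.
zero_shot_results = evaluator.evaluate_zero_shot()

# Evaluate a saved GCG experiment folder.
gcg_results = evaluator.evaluate_GCG("path/to/gcg_experiment")

# Evaluate responses collected from any other attack.
prompts = ["example attack prompt"]        # placeholder data
responses = ["example model response"]     # placeholder data
generic_results = evaluator.evaluate_from_responses(
    responses=responses,
    prompts=prompts,
)
```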