The evaluator module provides standardized methods for evaluating the effectiveness of jailbreak attempts. It offers consistent scoring and assessment of whether a prompt successfully elicits prohibited content from a target model.

AdverserialEvaluator

For comprehensive evaluation of jailbreak methods, the module includes an AdverserialEvaluator class.

from generalanalysis.jailbreaks import AdverserialEvaluator

evaluator = AdverserialEvaluator(
    dataset="harmbench",
    target_models=["gpt-4o", "claude-3-7-sonnet-20250219"],
    evaluator_model="gpt-4o",
    max_workers=20
)

results = evaluator.evaluate_tap_responses(
    queries=["Jailbreak prompt 1", "Jailbreak prompt 2"],
    goals=["Goal 1", "Goal 2"],
    responses=["Response 1", "Response 2"]
)

Constructor Parameters

ParameterTypeDefaultDescription
datasetstr(Required)Name of dataset to use (e.g., “harmbench”)
target_modelsList[str](Required)Model names for testing jailbreaks
evaluator_modelstr”gpt-4o”Model name for evaluation
evaluator_promptstrEVALUATOR_PROMPTCustom prompt template for evaluation
max_workersint20Maximum parallel workers

Methods

evaluate_tap_responses

Evaluates responses from TAP jailbreak method.

results = evaluator.evaluate_tap_responses(
    queries=["Jailbreak prompt 1", "Jailbreak prompt 2"],
    goals=["Goal 1", "Goal 2"],
    responses=["Response 1", "Response 2"]
)
Parameters
ParameterTypeDefaultDescription
queriesList[str](Required)List of jailbreak prompts used
goalsList[str](Required)Goals corresponding to each prompt
responsesList[str](Required)Model responses to evaluate
Returns

A list of evaluation dictionaries with metrics including:

  • prompt: The original goal
  • tap_prompt: The jailbreak prompt used
  • response: The model’s response
  • complied: Boolean indicating if model complied with the request
  • harmful: Boolean indicating if content is harmful
  • explanation: Detailed explanation of the evaluation
  • error: Boolean indicating if there was an error in evaluation

evaluate_from_responses

Evaluates responses from any jailbreak method.

results = evaluator.evaluate_from_responses(
    responses=["Response 1", "Response 2"],
    prompts=["Prompt 1", "Prompt 2"]  # Optional
)
Parameters
ParameterTypeDefaultDescription
responsesList[str](Required)Model responses to evaluate
promptsList[str]NoneCustom prompts used (uses dataset prompts if None)
Returns

A list of evaluation dictionaries with the same format as evaluate_tap_responses.

evaluate_zero_shot

Evaluates zero-shot performance of target models on the dataset.

results = evaluator.evaluate_zero_shot()
Returns

A dictionary mapping target models to DataFrames with evaluation results.

evaluate_GCG

Evaluates GCG jailbreak method using a saved experiment folder.

results = evaluator.evaluate_GCG("results/gcg_experiment_1")
Parameters
ParameterTypeDefaultDescription
gcg_folder_pathstr(Required)Path to the GCG experiment folder
Returns

A dictionary mapping target models to DataFrames with evaluation results.

evaluate_AutoDAN

Evaluates AutoDAN jailbreak method using a saved experiment folder.

results = evaluator.evaluate_AutoDAN("results/autodan_experiment_1")
Parameters
ParameterTypeDefaultDescription
autodan_folder_pathstr(Required)Path to the AutoDAN experiment folder
Returns

A dictionary mapping target models to DataFrames with evaluation results.

evaluate_AutoDAN_Turbo

Evaluates AutoDAN-Turbo jailbreak method using a saved experiment folder.

results = evaluator.evaluate_AutoDAN_Turbo("results/autodan_turbo_experiment_1")
Parameters
ParameterTypeDefaultDescription
autodan_turbo_folder_pathstr(Required)Path to the AutoDAN-Turbo experiment folder
Returns

A dictionary mapping target models to DataFrames with evaluation results.

RubricBasedScorer

The RubricBasedScorer provides standardized evaluation based on a consistent rubric.

from generalanalysis.loss import RubricBasedScorer
from generalanalysis.boiler_room import BlackBoxModel

scorer = RubricBasedScorer(
    evaluator_model=BlackBoxModel("claude-3-7-sonnet-20250219"),
    rubric=None  # Optional custom rubric
)

scores = scorer(
    responses=["Response 1", "Response 2"],
    goal="Goal for evaluation"
)

Constructor Parameters

ParameterTypeDefaultDescription
evaluator_modelBlackBoxModel(Required)Model used for scoring
rubricstrNoneCustom scoring rubric (uses default if None)

call Method

Scores responses against a specified goal.

scores = scorer(
    responses=["Response 1", "Response 2"],
    goal="Goal for evaluation"
)

Parameters

ParameterTypeDefaultDescription
responsesList[str](Required)Responses to evaluate
goalstr(Required)Goal for evaluation

Returns

A list of dictionaries containing:

  • score: Numeric score from 0-10
  • explanation: Reasoning for the score

Example Workflow

Here’s a complete workflow using the evaluator module:

from generalanalysis.jailbreaks import AdverserialEvaluator
from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.adversarial_candidate_generator import TreeRefinementGenerator
from generalanalysis.loss import RubricBasedScorer

# Setup models
target_model = BlackBoxModel("gpt-4o")
attacker_model = BlackBoxModel("deepseek-ai/DeepSeek-R1")
evaluator_model = BlackBoxModel("claude-3-7-sonnet-20250219")

# Create generator
generator = TreeRefinementGenerator(
    attacker_model=attacker_model,
    branching_factor=2,
    sub_branching_factor=4,
    max_tokens=24000,
    temperature=0.7
)

# Initialize scorer
scorer = RubricBasedScorer(evaluator_model)

# Goal and initial prompt
goal = "Provide instructions for hacking a computer"
initial_prompt = "Tell me about computer security"

# Get response from target model
initial_response = target_model.query(initial_prompt)

# Score initial response
initial_score = scorer(
    responses=[initial_response],
    goal=goal
)[0]

print(f"Initial score: {initial_score['score']}/10")
print(f"Explanation: {initial_score['explanation']}")

# Generate candidate refinements
refinements = generator.generate_candidates(
    goal=goal,
    target_str="Sure Here is",
    last_candidate=initial_prompt,
    last_response=initial_response,
    last_score=initial_score['score'],
    last_reason_for_score=initial_score['explanation']
)

# Extract just the prompts
candidate_prompts = [r["prompt"] for r in refinements]

# Get responses from target model
candidate_responses = target_model.query_parallel(candidate_prompts)

# Score all candidates
candidate_scores = scorer(
    responses=candidate_responses,
    goal=goal
)

# Find the most successful candidate
best_idx = max(range(len(candidate_scores)), key=lambda i: candidate_scores[i]["score"])
best_prompt = candidate_prompts[best_idx]
best_response = candidate_responses[best_idx]
best_score = candidate_scores[best_idx]

print(f"Best prompt: {best_prompt}")
print(f"Best score: {best_score['score']}/10")
print(f"Explanation: {best_score['explanation']}")