Boiler Room Introduction

The Boiler Room module provides a unified interface to language models from multiple providers. It abstracts away the differences between model APIs so that querying, batching, and error handling behave consistently. Key features include parallel querying, text embeddings, and system prompt management.

Why a Unified Model Abstraction Matters

AI red teaming workflows involve querying multiple language models in different roles—attacker models that generate adversarial prompts, target models under evaluation, evaluator models that score responses, and embedding models that enable semantic search. Each of these models might come from a different provider (OpenAI, Anthropic, Together.ai, or a local Hugging Face checkpoint), each with its own authentication scheme, request format, rate limits, and error behavior.

Without a unified abstraction, every component in the pipeline would need provider-specific code. The adversarial generator would need separate implementations for targeting GPT-4o vs. Claude vs. Llama. The evaluator would need to handle different response formats. Swapping one model for another would require changes throughout the codebase.

The Boiler Room solves this by providing two classes—BlackBoxModel and WhiteBoxModel—that expose identical high-level methods regardless of the underlying provider. You initialize a model with its name, and the Boiler Room automatically routes requests to the correct API, handles authentication, manages retries, and normalizes responses. This means you can change "gpt-4o" to "claude-3-7-sonnet-20250219" in a single line and the rest of your pipeline works unchanged.
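
As a rough illustration of what name-based routing can look like, the sketch below maps a model name to a provider by prefix. It is not the Boiler Room's actual implementation: `PROVIDER_PREFIXES` and `resolve_provider` are hypothetical names, and the prefix rules are assumptions inferred only from the model names listed on this page.

```python
# Hypothetical sketch of name-based provider routing; not the library's code.
PROVIDER_PREFIXES = {
    "openai": ("gpt-", "o1", "o3", "text-embedding-"),
    "anthropic": ("claude-",),
}


def resolve_provider(model_name: str) -> str:
    """Guess the provider from a model name (illustrative rules only)."""
    name = model_name.lower()
    for provider, prefixes in PROVIDER_PREFIXES.items():
        # str.startswith accepts a tuple of candidate prefixes.
        if name.startswith(prefixes):
            return provider
    # Assumption: "org/model" identifiers are served through Together.ai.
    if "/" in model_name:
        return "together"
    raise ValueError(f"Unknown model name: {model_name!r}")


for name in ("gpt-4o", "claude-3-7-sonnet-20250219", "databricks/dbrx-instruct"):
    print(name, "->", resolve_provider(name))
```

With a lookup like this in one place, swapping models only changes the string passed in; the calling code never branches on the provider.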

Reproducible Red Teaming

Reproducibility is essential for meaningful safety evaluations. When you report that a model is vulnerable to a particular attack, other researchers need to verify that finding. The Boiler Room contributes to reproducibility by:

Standardizing model interfaces so that the same evaluation script can run against any supported model without modification.
Logging query details including prompts, responses, and any errors encountered, providing a complete audit trail.
Supporting near-deterministic behavior through configurable temperature settings: setting temperature=0 makes decoding greedy for models that support it, which minimizes run-to-run variation, although hosted APIs can still return slightly different outputs.
Applying consistent retry behavior so that transient API failures do not produce inconsistent results across evaluation runs.
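
To make the audit-trail idea concrete, here is a minimal sketch of a wrapper that records the prompt, response, and any error for each query. The Boiler Room's own log format is not shown on this page, so `logged_query`, the record fields, and the `fake_model` stand-in are all hypothetical.

```python
import json
import time
from typing import Callable


def logged_query(query_fn: Callable[[str], str], prompt: str, log: list) -> str:
    """Call query_fn(prompt) and append an audit record to log."""
    record = {"timestamp": time.time(), "prompt": prompt, "response": None, "error": None}
    try:
        record["response"] = query_fn(prompt)
        return record["response"]
    except Exception as exc:
        record["error"] = repr(exc)
        raise  # re-raise so the failure stays visible to the caller
    finally:
        # Runs on both success and failure, so every call is recorded.
        log.append(record)


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return prompt.upper()


audit_log: list = []
answer = logged_query(fake_model, "hello", audit_log)
print(json.dumps(audit_log[0], indent=2))
```

Writing each record as one JSON line to a file would give a log that another researcher can replay against the same prompts.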

Overview

Boiler Room offers two primary model classes:

BlackBoxModel: For interacting with API-based models like OpenAI GPT models, Anthropic Claude models, and Together.ai hosted models.
WhiteBoxModel: For loading and interacting with locally-hosted models using Hugging Face Transformers.

Both model classes provide standardized methods for generating text, handling errors, and managing retries consistently.

The distinction between black-box and white-box reflects the level of access you have to the model. Black-box models are accessed only through their API—you can send prompts and receive text responses, but you cannot inspect internal weights, gradients, or hidden states. White-box models are loaded locally, giving you full access to model internals. This distinction matters for red teaming because some attack methods (like GCG) require gradient access and can only work with white-box models, while others (like TAP or Crescendo) work purely through the text interface and can target any model.
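
The access distinction can be expressed as a simple compatibility check. The sketch below encodes only the gradient requirements stated in the paragraph above; `AttackMethod` and `usable_attacks` are hypothetical helpers for illustration, not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackMethod:
    name: str
    needs_gradients: bool


# Gradient requirements as described in the paragraph above.
ATTACKS = [
    AttackMethod("GCG", needs_gradients=True),
    AttackMethod("TAP", needs_gradients=False),
    AttackMethod("Crescendo", needs_gradients=False),
]


def usable_attacks(white_box_access: bool) -> list[str]:
    """Return the attack names that can run with the given level of access."""
    return [a.name for a in ATTACKS if white_box_access or not a.needs_gradients]


print(usable_attacks(white_box_access=False))
print(usable_attacks(white_box_access=True))
```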

Supported Models

The Boiler Room supports a wide range of models including:

OpenAI Models

GPT-4o, GPT-4o-mini, GPT-4-turbo
GPT-4.5-preview-2025-02-27
GPT-4-0125-preview, GPT-4-0613
GPT-3.5-turbo
o1, o1-mini, o3-mini, o3-mini-2025-01-31
Text embedding models (text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002)

Anthropic Models

claude-3-7-sonnet-20250219
claude-3-5-sonnet-20241022, claude-3-5-sonnet-20240620
claude-3-5-haiku-20241022
claude-3-sonnet-20240229

Together.ai Hosted Models

meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
meta-llama/Llama-3.3-70B-Instruct-Turbo
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
mistralai/Mistral-Small-24B-Instruct-2501
mistralai/Mixtral-8x22B-Instruct-v0.1
deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Distill-Llama-70B
databricks/dbrx-instruct
Qwen/Qwen2.5-7B-Instruct-Turbo
google/gemma-2-27b-it

Provider-Specific Considerations

Each provider has characteristics that affect red teaming workflows:

OpenAI models generally have the highest rate limits and fastest response times, making them suitable as attacker or evaluator models in high-throughput evaluations. Their embedding models are used by the strategy algorithm for semantic similarity search.
Anthropic models tend to have more conservative safety training, making them common targets for red teaming evaluations. Rate limits are lower than OpenAI, so plan accordingly when using Claude models as targets in batch evaluations.
Together.ai provides access to open-weight models (Llama, Mistral, DeepSeek) through an API interface. These models are useful as attacker or evaluator models because they are often less restrictive than proprietary models, making them effective at generating creative adversarial prompts. They also tend to be more cost-effective for high-volume evaluations.
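
The semantic similarity search mentioned for the embedding models usually means ranking stored vectors by cosine similarity to a query vector. The sketch below shows that idea with toy three-dimensional vectors in place of real embeddings. Whether the strategy algorithm uses cosine similarity specifically is an assumption, and `cosine_similarity`, `most_similar`, and the corpus labels are illustrative names only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], corpus: dict[str, list[float]]) -> str:
    """Return the key whose vector is closest in direction to the query."""
    return max(corpus, key=lambda key: cosine_similarity(query, corpus[key]))


# Toy vectors standing in for real embeddings (hypothetical labels).
corpus = {
    "roleplay": [1.0, 0.0, 0.0],
    "encoding": [0.0, 1.0, 0.0],
    "multi-turn": [0.0, 0.0, 1.0],
}
print(most_similar([0.9, 0.1, 0.0], corpus))
```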

Basic Usage


from generalanalysis.boiler_room import BlackBoxModel
 
# Initialize a model
model = BlackBoxModel("gpt-4o")
 
# Simple query
response = model.query("Explain quantum computing in simple terms")
 
# Query with system prompt
response = model.query(
    prompt="Write a tutorial on quantum computing",
    system_prompt="You are a quantum physics expert who explains concepts simply."
)
 
# Generate embeddings
embeddings = model.embed("This is a text to encode into vector space")

Parallel Querying

For batch processing or efficiency, you can query models in parallel:


from generalanalysis.boiler_room import BlackBoxModel
 
model = BlackBoxModel("claude-3-7-sonnet-20250219")
 
prompts = [
    "What is machine learning?",
    "How do neural networks work?",
    "Explain deep learning in simple terms"
]
 
# Process all prompts in parallel
responses = model.query_parallel(
    prompts=prompts,
    max_threads=10,
    temperature=0.5
)
 
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}\nResponse: {response}\n")

Parallel querying is particularly important for red teaming workflows that evaluate many goals or generate large candidate populations. The genetic algorithm generator, for example, needs to evaluate an entire population of prompts each generation. Using query_parallel with an appropriate max_threads value can reduce evaluation time from hours to minutes.

Be aware that different providers impose different rate limits. If you encounter frequent retry errors, reduce max_threads or increase retry_delay on the model instance. Together.ai models generally tolerate higher parallelism than OpenAI or Anthropic models.
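
For intuition about what max_threads trades off, the sketch below runs a simulated query over a thread pool using the standard library. It is not the implementation of query_parallel, which this page does not show; `fake_query` and `query_all` are hypothetical stand-ins, and the 50 ms sleep simulates network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(prompt: str) -> str:
    time.sleep(0.05)  # simulate network latency
    return f"echo: {prompt}"


def query_all(prompts: list[str], max_threads: int = 10) -> list[str]:
    """Run fake_query over prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(fake_query, prompts))


prompts = [f"prompt {i}" for i in range(20)]
start = time.perf_counter()
responses = query_all(prompts, max_threads=10)
elapsed = time.perf_counter() - start
print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Because the work is I/O-bound, more threads shorten wall-clock time until the provider's rate limit becomes the bottleneck, which is why lowering max_threads is the first remedy for frequent retries.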

Working with Locally-Hosted Models

For gradient-based techniques or direct model access, use the WhiteBoxModel:


from generalanalysis.boiler_room import WhiteBoxModel
import torch
 
model = WhiteBoxModel(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    device="cuda",
    dtype=torch.bfloat16
)
 
# Generate with the model's chat template
responses = model.generate_with_chat_template(
    prompts=["Explain neural networks"],
    max_new_tokens=200,
    temperature=0.7
)
 
print(responses[0])

Error Handling

The module retries transient API failures automatically. Wrap queries in a try/except block to handle failures that persist after those retries:


from generalanalysis.boiler_room import BlackBoxModel

model = BlackBoxModel("gpt-4o")
try:
    response = model.query("Tell me about quantum computing")
except Exception as e:
    print(f"Error querying model: {e}")

Both BlackBoxModel and WhiteBoxModel implement automatic retry logic for transient failures (network timeouts, rate limit errors, server errors). The max_retries and retry_delay parameters on BlackBoxModel control this behavior. For long-running evaluations, generous retry settings (5 retries with 10-second delays) help the pipeline recover from temporary API disruptions without manual intervention.

When a query fails after all retries are exhausted, the model raises an exception rather than returning an empty or malformed response. This fail-loud behavior ensures that downstream components (evaluators, generators) do not silently consume bad data.
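
The retry-then-fail-loud pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `query_with_retries` and `flaky_model` are hypothetical, and the sketch assumes max_retries counts retries after the first attempt, which this page does not specify.

```python
import time


def query_with_retries(query_fn, prompt, max_retries=5, retry_delay=10.0):
    """Try query_fn(prompt) up to max_retries + 1 times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return query_fn(prompt)
        except Exception:
            if attempt == max_retries:
                raise  # fail loud: never return an empty placeholder
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky_model(prompt):
    # Stand-in that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return f"ok: {prompt}"


# retry_delay=0.0 keeps the demonstration fast.
result = query_with_retries(flaky_model, "ping", max_retries=5, retry_delay=0.0)
print(result, "after", calls["count"], "calls")
```

This sketch retries on every exception for brevity. A production wrapper would more likely retry only errors known to be transient, such as timeouts and rate-limit responses, and re-raise everything else immediately.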

Next Steps

Learn about the BlackBoxModel API wrapper for API-based interactions
Explore the WhiteBoxModel local inference wrapper for local model interactions
See how to use these models with adversarial prompt generators