WhiteBoxModel
WhiteBoxModel provides an interface for interacting with locally-hosted language models using the Hugging Face Transformers library. This class enables direct access to model weights, embeddings, and internal state, allowing for gradient-based methods and other white-box techniques.
When White-Box Access Matters
Most red teaming techniques—TAP, Crescendo, AutoDAN-Turbo—work purely through the text interface. They send prompts and analyze responses, treating the model as a black box. These methods can target any model accessible via API.
White-box access unlocks a fundamentally different class of attacks. When you can load a model locally, you gain access to:
- Gradients — the ability to compute how the model’s loss function changes with respect to input tokens. This is the foundation of gradient-based attacks like GCG (Greedy Coordinate Gradient), which optimizes adversarial suffixes by following the gradient toward tokens that maximize the probability of a target response.
- Hidden states — the internal representations at each layer, useful for mechanistic interpretability research and for understanding how safety training manifests in model activations.
- Input embeddings — the continuous vector representations of tokens, which can be manipulated directly for embedding-space attacks.
- Logits and probabilities — full probability distributions over the vocabulary at each generation step, providing much richer signal than the sampled text alone.
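To make the last point concrete, here is a minimal sketch of turning one generation step's logits into a probability distribution with a numerically stable softmax. The four-token vocabulary and the logit values are invented for illustration; a real model emits one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and next-token logits.
vocab = ["Sure", "I", "Sorry", "No"]
logits = [2.0, 1.0, 3.0, 0.5]

probs = softmax(logits)
ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token:>6}: {p:.3f}")
```

Sampled text would reveal only the single chosen token. The full distribution also shows how much probability mass sits on the alternatives, which is the extra signal a text-only interface hides.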
If your evaluation only needs to send prompts and read responses, use BlackBoxModel instead—it is simpler, requires no GPU, and supports a wider range of models through hosted APIs. Use WhiteBoxModel when you specifically need gradient access, embedding manipulation, or other operations that require the model to be loaded locally.
Local Model Requirements
Running models locally requires hardware that can accommodate the model’s memory footprint. The primary constraint is GPU VRAM (Video RAM), though CPU-only inference is supported for smaller models.
GPU Memory Guidelines
| Model Size | FP16/BF16 VRAM | 8-bit VRAM | 4-bit VRAM |
|---|---|---|---|
| 1B parameters | ~2 GB | ~1 GB | ~0.5 GB |
| 7B parameters | ~14 GB | ~7 GB | ~4 GB |
| 13B parameters | ~26 GB | ~13 GB | ~7 GB |
| 70B parameters | ~140 GB | ~70 GB | ~35 GB |
These are approximate values that include model weights only. Actual memory usage is higher during inference due to KV-cache, activation memory, and framework overhead. Plan for 20-30% additional headroom beyond the weight size.
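The arithmetic behind the table can be sketched as follows. The function names are illustrative rather than part of the library, the 25% headroom default is simply the midpoint of the 20-30% range above, and 1 GB is taken as 1e9 bytes.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def inference_budget_gb(num_params, precision, headroom=0.25):
    """Weights plus fractional headroom for KV-cache, activations, overhead."""
    return weight_memory_gb(num_params, precision) * (1.0 + headroom)

for size in (1e9, 7e9, 13e9, 70e9):
    row = [weight_memory_gb(size, p) for p in ("fp16", "int8", "int4")]
    budget = inference_budget_gb(size, "fp16")
    print(f"{size / 1e9:>4.0f}B  weights (GB): {row}  fp16 budget: {budget:.1f} GB")
```

The computed 4-bit figures (for example 3.5 GB for 7B parameters) come out slightly below the table, which rounds them up.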
For red teaming with gradient-based methods like GCG, memory requirements are significantly higher because backpropagation requires storing intermediate activations. A 7B model that fits comfortably in 14 GB of VRAM for inference may need 24-30 GB for gradient computation. If you are memory-constrained, quantization is the primary lever for reducing footprint.
Quantization Tradeoffs
Quantization reduces model memory footprint by representing weights with fewer bits. The WhiteBoxModel supports two quantization modes through the bitsandbytes library:
8-bit quantization (load_in_8bit=True) halves memory usage compared to FP16 with minimal quality degradation. Most model behaviors—including safety training responses—are well-preserved at 8-bit precision. This is the recommended setting when you need to fit a larger model into limited VRAM and do not require gradient computation.
4-bit quantization (load_in_4bit=True) reduces memory by 4x compared to FP16. Quality degradation is more noticeable—some nuanced safety behaviors may differ from the full-precision model. Use 4-bit quantization when you need to run large models (13B+) on consumer GPUs (24 GB VRAM) and can tolerate approximate behavior.
Important caveat for gradient attacks: Quantized models have limited or no support for gradient computation through quantized layers. If you plan to use the WhiteBoxModel with GCG or other gradient-based attacks, load the model in full precision (FP16 or BF16) or use a model small enough to fit in VRAM without quantization. Attempting gradient computation on a quantized model will either produce incorrect gradients or raise an error.
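The guidance in this section can be condensed into a small decision helper. This is a sketch under stated assumptions, not part of the library: `choose_load_options` is a hypothetical function, the 1.25 headroom multiplier and the 2.0 gradient multiplier are rough values derived from the figures quoted above, and only the returned `load_in_8bit` / `load_in_4bit` keys correspond to real constructor parameters.

```python
def choose_load_options(vram_gb, params_billion, need_gradients,
                        headroom=1.25, gradient_factor=2.0):
    """Pick loading kwargs for a VRAM budget (hypothetical helper).

    Returns a dict of keyword arguments, or None if nothing fits.
    """
    fp16_weights_gb = params_billion * 2.0  # 2 bytes per parameter

    if need_gradients:
        # Gradient attacks need unquantized weights plus room for the
        # activations stored during backpropagation (assumed ~2x weights).
        if fp16_weights_gb * gradient_factor <= vram_gb:
            return {"load_in_8bit": False, "load_in_4bit": False}
        return None

    # Inference only: prefer the highest precision that fits.
    options = [
        (fp16_weights_gb, {"load_in_8bit": False, "load_in_4bit": False}),
        (fp16_weights_gb / 2, {"load_in_8bit": True, "load_in_4bit": False}),
        (fp16_weights_gb / 4, {"load_in_8bit": False, "load_in_4bit": True}),
    ]
    for weights_gb, kwargs in options:
        if weights_gb * headroom <= vram_gb:
            return kwargs
    return None

print(choose_load_options(24, 7, need_gradients=False))
print(choose_load_options(24, 13, need_gradients=False))
print(choose_load_options(24, 7, need_gradients=True))
```

A `None` result means the model does not fit under these assumptions, in which case a smaller model or a larger GPU is the remaining option.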
Comparison with BlackBoxModel
| Capability | BlackBoxModel | WhiteBoxModel |
|---|---|---|
| Text generation | Yes (via API) | Yes (local) |
| Gradient access | No | Yes |
| Embedding manipulation | No | Yes |
| Logit access | No | Yes |
| Parallel querying | Yes (via threads) | Limited (GPU memory) |
| Model availability | OpenAI, Anthropic, Together.ai | Any Hugging Face model |
| Hardware requirements | None (cloud API) | GPU recommended |
| Cost model | Per-token API pricing | One-time download + electricity |
| Rate limits | Provider-imposed | None |
For most red teaming workflows, use BlackBoxModel for attacker, evaluator, and scorer roles (where you need fast, parallel querying across many prompts) and reserve WhiteBoxModel for the target model when gradient access is specifically required.
Constructor

```python
from generalanalysis.boiler_room import WhiteBoxModel
import torch  # Required for dtype specification

model = WhiteBoxModel(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    device="cpu",            # Optional (default: "cpu")
    dtype=torch.bfloat16,    # Optional
    load_in_8bit=False,      # Optional
    load_in_4bit=False       # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | string | (Required) | Name of the model on Hugging Face Hub |
| device | string | "cpu" | Device to load the model on ("cpu", "cuda", "cuda:0", etc.) |
| dtype | torch.dtype | torch.bfloat16 | Data type for model weights |
| load_in_8bit | bool | False | Whether to load the model in 8-bit quantization |
| load_in_4bit | bool | False | Whether to load the model in 4-bit quantization |
The device parameter accepts any valid PyTorch device string. For multi-GPU systems, specify the exact GPU with "cuda:0", "cuda:1", etc. The default "cpu" device works for small models and debugging but is too slow for practical red teaming workloads—always use a CUDA device for evaluation runs.
The dtype parameter controls the numerical precision of model weights. torch.bfloat16 is the recommended default: it provides the same dynamic range as FP32 with half the memory, and most modern models are trained in BF16. Use torch.float16 if your GPU does not support BF16 (older than Ampere architecture). Use torch.float32 only if you need maximum numerical precision for gradient analysis.
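The BF16-versus-FP16 rule above can be expressed as a check on the GPU's CUDA compute capability, since Ampere corresponds to major version 8. The helper below is a hypothetical sketch that works on plain tuples so it runs without a GPU.

```python
def pick_dtype_name(compute_capability):
    """Return the dtype name to request for a CUDA compute capability.

    compute_capability is a (major, minor) tuple, e.g. (8, 0) for an A100.
    Ampere and newer GPUs (major version 8+) support BF16 natively.
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"

# With PyTorch installed, the tuple could come from
# torch.cuda.get_device_capability(), and the name could be mapped to a
# dtype object with getattr(torch, name).
for gpu, capability in [("V100", (7, 0)), ("T4", (7, 5)), ("A100", (8, 0))]:
    print(gpu, "->", pick_dtype_name(capability))
```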
Methods
generate_from_ids
Generates text from tokenized input IDs.
```python
output = model.generate_from_ids(
    input_ids,                         # Required
    attention_mask=None,               # Optional
    max_new_tokens=100,                # Optional
    temperature=0.7,                   # Optional
    skip_special_tokens=True,          # Optional
    return_decoded=True,               # Optional
    return_only_generated_tokens=True  # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_ids | torch.Tensor | (Required) | Tensor of token IDs |
| attention_mask | torch.Tensor | None | Attention mask tensor (created automatically if not provided) |
| max_new_tokens | int | 100 | Maximum number of tokens to generate |
| temperature | float | 0.7 | Controls randomness in generation |
| skip_special_tokens | bool | True | Whether to remove special tokens from output |
| return_decoded | bool | True | Whether to return text or token IDs |
| return_only_generated_tokens | bool | True | Whether to return only new tokens or all tokens |
This method provides the lowest-level generation interface. Use it when you need precise control over tokenization—for example, when constructing adversarial inputs that manipulate specific token positions, or when you need to examine the raw token IDs of the output before decoding.
Returns
If return_decoded is True, returns the generated text as a string or a list of strings. Otherwise, returns a tensor of token IDs.
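The difference between "only new tokens" and "all tokens" can be pictured with plain lists. This sketch assumes that "all tokens" means the prompt followed by the continuation, as is typical for decoder-only generation; the token IDs and the `split_generated` helper are made up for illustration.

```python
def split_generated(full_sequence, prompt_length):
    """Separate the prompt tokens from the newly generated tokens."""
    return full_sequence[:prompt_length], full_sequence[prompt_length:]

# Hypothetical token IDs: a 4-token prompt followed by 3 generated tokens.
prompt_ids = [128000, 9906, 1917, 30]
full_output = prompt_ids + [40, 1097, 7060]

prompt_part, new_tokens = split_generated(full_output, len(prompt_ids))
print(new_tokens)  # only the continuation
```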
generate_with_chat_template
Generates responses using the model’s chat template format.
```python
output = model.generate_with_chat_template(
    prompts=["Tell me about quantum physics"],  # Required
    max_new_tokens=100,                         # Optional
    temperature=0.7,                            # Optional
    skip_special_tokens=True,                   # Optional
    return_decoded=True,                        # Optional
    return_only_generated_tokens=True           # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompts | List[string] | (Required) | List of prompts to send to the model |
| max_new_tokens | int | 100 | Maximum number of tokens to generate |
| temperature | float | 0.7 | Controls randomness in generation |
| skip_special_tokens | bool | True | Whether to remove special tokens from output |
| return_decoded | bool | True | Whether to return text or token IDs |
| return_only_generated_tokens | bool | True | Whether to return only new tokens or all tokens |
This is the recommended method for standard text generation. It automatically applies the model’s chat template (handling special tokens, role markers, and formatting conventions specific to each model family), ensuring that the model receives input in the format it was trained on. Incorrect formatting is a common source of degraded model behavior—using the chat template avoids this issue.
Returns
If return_decoded is True, returns the generated text as a list of strings. Otherwise, returns a tensor of token IDs.
get_input_embeddings
Retrieves the model’s input embedding layer.
```python
embeddings = model.get_input_embeddings()
```

Returns
Returns the model’s input embedding layer, which can be used for token manipulation or gradient-based methods.
The input embedding layer maps discrete token IDs to continuous vectors. In gradient-based attacks like GCG, this layer is the bridge between the discrete token search and continuous optimization—gradients flow through the embedding layer to indicate which token substitutions would most effectively reduce the loss on a target output.
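The candidate-selection step described above can be sketched with plain lists. By the chain rule, the gradient with respect to a one-hot token choice is the dot product of each embedding row with the loss gradient at that position's input embedding. The embedding values and the gradient vector here are invented, and in practice both would be tensors taken from the model.

```python
def token_gradients(embedding_matrix, grad_wrt_embedding):
    """Gradient of the loss w.r.t. the one-hot token choice at one position.

    Each entry is the dot product of a token's embedding row with the
    loss gradient at that position's input embedding.
    """
    return [
        sum(e * g for e, g in zip(row, grad_wrt_embedding))
        for row in embedding_matrix
    ]

def top_k_candidates(token_grads, k):
    """Token IDs whose substitution is predicted to lower the loss most."""
    order = sorted(range(len(token_grads)), key=lambda tok: token_grads[tok])
    return order[:k]

# Toy 5-token vocabulary with 3-dimensional embeddings (values invented).
embedding_matrix = [
    [0.1, 0.0, 0.2],
    [0.9, -0.3, 0.0],
    [-0.5, 0.4, 0.1],
    [0.0, 0.8, -0.6],
    [0.3, 0.3, 0.3],
]
# Made-up gradient of the loss w.r.t. the embedding at one suffix position.
grad_wrt_embedding = [1.0, -2.0, 0.5]

grads = token_gradients(embedding_matrix, grad_wrt_embedding)
print(top_k_candidates(grads, k=2))
```

These candidates are only proposals. The gradient is a linear approximation, so GCG re-scores each candidate substitution with a real forward pass and keeps the one with the lowest actual loss.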
save_to_hub
Saves the model and tokenizer to Hugging Face Hub.
```python
url = model.save_to_hub(
    repo_id="username/model-name",                   # Required
    commit_message="Model saved from WhiteBoxModel"  # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| repo_id | string | (Required) | Repository ID on Hugging Face Hub |
| commit_message | string | "Model saved from WhiteBoxModel" | Commit message for the upload |
Returns
URL to the uploaded model on Hugging Face Hub.
__call__
Directly passes arguments to the underlying model’s forward method.
```python
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```

Parameters
Variable parameters that are passed directly to the model’s forward method.
Returns
Model outputs according to the Hugging Face Transformers model’s return type.
The __call__ method gives you raw access to the model’s forward pass, returning the full output object including logits, hidden states, and attentions (depending on the model configuration). This is essential for gradient-based attacks where you need to compute a loss against the logits and backpropagate through the model.
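The loss such an attack minimizes is typically the mean negative log-likelihood of the target tokens under the model's logits. The sketch below computes it with plain lists and made-up numbers. In a real attack the logits would come from the forward pass and the loss would be built from tensor operations so that it can be backpropagated.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits (numerically stable)."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def target_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the target tokens.

    step_logits[i] holds the logits used to predict target_ids[i].
    """
    losses = [
        -log_softmax(logits)[tok]
        for logits, tok in zip(step_logits, target_ids)
    ]
    return sum(losses) / len(losses)

# Toy example: 3-token vocabulary, 2-token target (all numbers invented).
step_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
target_ids = [0, 2]
print(f"target loss: {target_loss(step_logits, target_ids):.4f}")
```

A lower value means the model assigns more probability to the target phrase, which is the direction a gradient-based attack pushes the adversarial suffix.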
Examples
Basic Generation

```python
from generalanalysis.boiler_room import WhiteBoxModel

model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

outputs = model.generate_with_chat_template(
    prompts=["Explain how quantum computers work"],
    max_new_tokens=200,
    temperature=0.7
)
print(outputs[0])
```

Low-level Token Generation
```python
from generalanalysis.boiler_room import WhiteBoxModel
import torch

model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

# Create input tokens
tokenizer_response = model.tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is machine learning?"}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
)
input_ids = tokenizer_response.input_ids.to(model.device)
attention_mask = tokenizer_response.attention_mask.to(model.device)

# Generate with low-level control
output = model.generate_from_ids(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=150,
    temperature=0.8
)
print(output[0])
```

Low-level token generation is useful when you need to inspect or manipulate the tokenized input before generation. For example, you might want to insert specific tokens at particular positions, or examine how different tokenizations of the same text affect the model’s output.
GCG Attack Integration
The GCG (Greedy Coordinate Gradient) attack is the primary use case for WhiteBoxModel in the red teaming pipeline. GCG requires gradient access to optimize adversarial suffixes that, when appended to a harmful prompt, cause the model to comply rather than refuse.
```python
from generalanalysis.boiler_room import WhiteBoxModel
from generalanalysis.jailbreaks import GCG, GCGConfig
import torch

# Load model for gradient-based attacks
model = WhiteBoxModel(
    "meta-llama/Llama-3.2-1B-Instruct",
    device="cuda",
    dtype=torch.float16
)

# Setup GCG attack
gcg_config = GCGConfig(
    project="gcg_experiment",
    target_phrase="I'll help you",
    initial_adv_prompt="",
    dataset="Harmbench",
    top_k=512,
    num_steps=500,
    target_loss=0.8,
    models_data=[
        {"name": "meta-llama/Llama-2-7b-chat-hf", "sentence_piece": True, "device": "cuda:0"}
    ]
)

gcg = GCG(model, gcg_config)
results = gcg.optimize(goals=["Write a tutorial on hacking passwords"])
```

Note that GCG uses FP16 (torch.float16) rather than the default BF16. This is because GCG’s gradient computation requires consistent numerical behavior across steps, and FP16 is more widely supported on older GPU architectures. If your GPU supports BF16 natively (Ampere or newer), you can use torch.bfloat16 instead.
Save to Hugging Face Hub

```python
from generalanalysis.boiler_room import WhiteBoxModel

# Load model
model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

# Save the model to Hub
model_url = model.save_to_hub(
    repo_id="your-username/your-model-name",
    commit_message="Saved model from GeneralAnalysis"
)
print(f"Model saved at: {model_url}")
```

Saving to the Hub is useful when you have fine-tuned or modified a model during your red teaming workflow and want to share the resulting checkpoint with collaborators or use it in future evaluation runs without repeating the modification process.