WhiteBoxModel provides an interface for interacting with locally hosted language models using the Hugging Face Transformers library. This class enables direct access to model weights, embeddings, and internal state, allowing for gradient-based methods and other white-box techniques.
Constructor
```python
from generalanalysis.boiler_room import WhiteBoxModel
import torch  # Required for dtype specification

model = WhiteBoxModel(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    device="cpu",            # Optional (default: "cpu")
    dtype=torch.bfloat16,    # Optional
    load_in_8bit=False,      # Optional
    load_in_4bit=False       # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | string | (Required) | Name of the model on Hugging Face Hub |
| device | string | "cpu" | Device to load the model on ("cpu", "cuda", "cuda:0", etc.) |
| dtype | torch.dtype | torch.bfloat16 | Data type for model weights |
| load_in_8bit | bool | False | Whether to load the model in 8-bit quantization |
| load_in_4bit | bool | False | Whether to load the model in 4-bit quantization |
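For larger models, the quantization flags reduce memory use at some cost in precision. A minimal sketch of 4-bit loading (this assumes the bitsandbytes package, which Hugging Face Transformers requires for 8-bit and 4-bit loading, is installed and a CUDA device is available):

```python
from generalanalysis.boiler_room import WhiteBoxModel

# 4-bit quantization shrinks the memory footprint considerably;
# requires the bitsandbytes package and a CUDA device
model = WhiteBoxModel(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    device="cuda",
    load_in_4bit=True
)
```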
Methods
generate_from_ids
Generates text from tokenized input IDs.
```python
output = model.generate_from_ids(
    input_ids,                          # Required
    attention_mask=None,                # Optional
    max_new_tokens=100,                 # Optional
    temperature=0.7,                    # Optional
    skip_special_tokens=True,           # Optional
    return_decoded=True,                # Optional
    return_only_generated_tokens=True   # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_ids | torch.Tensor | (Required) | Tensor of token IDs |
| attention_mask | torch.Tensor | None | Attention mask tensor (created automatically if not provided) |
| max_new_tokens | int | 100 | Maximum number of tokens to generate |
| temperature | float | 0.7 | Sampling temperature; higher values produce more random output |
| skip_special_tokens | bool | True | Whether to remove special tokens from the output |
| return_decoded | bool | True | Whether to return decoded text (True) or token IDs (False) |
| return_only_generated_tokens | bool | True | Whether to return only the newly generated tokens or the full sequence including the prompt |
Returns
If return_decoded is True, returns the generated text as a string or list of strings; otherwise, returns a tensor of token IDs.
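When token IDs are needed downstream, set return_decoded=False and decode manually with the model's tokenizer (exposed as model.tokenizer, as shown in the low-level example later on this page). A minimal sketch; the prompt here is illustrative:

```python
# Tokenize a prompt and keep the raw generated token IDs
input_ids = model.tokenizer("The capital of France is", return_tensors="pt").input_ids.to(model.device)

generated_ids = model.generate_from_ids(
    input_ids,
    max_new_tokens=50,
    return_decoded=False
)

# Decode manually with the underlying tokenizer
text = model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text[0])
```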
generate_with_chat_template
Generates responses using the model’s chat template format.
```python
output = model.generate_with_chat_template(
    prompts=["Tell me about quantum physics"],   # Required
    max_new_tokens=100,                          # Optional
    temperature=0.7,                             # Optional
    skip_special_tokens=True,                    # Optional
    return_decoded=True,                         # Optional
    return_only_generated_tokens=True            # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompts | List[string] | (Required) | List of prompts to send to the model |
| max_new_tokens | int | 100 | Maximum number of tokens to generate |
| temperature | float | 0.7 | Sampling temperature; higher values produce more random output |
| skip_special_tokens | bool | True | Whether to remove special tokens from the output |
| return_decoded | bool | True | Whether to return decoded text (True) or token IDs (False) |
| return_only_generated_tokens | bool | True | Whether to return only the newly generated tokens or the full sequence including the prompt |
Returns
If return_decoded is True, returns the generated text as a list of strings; otherwise, returns a tensor of token IDs.
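Since prompts is a list, several prompts can be batched in one call. A minimal sketch (this assumes outputs are returned in the same order as the input prompts):

```python
# Batch multiple prompts in a single call
outputs = model.generate_with_chat_template(
    prompts=[
        "Summarize the theory of relativity in one sentence",
        "What is a hash table?"
    ],
    max_new_tokens=100
)
for answer in outputs:
    print(answer)
```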
get_input_embeddings
Retrieves the model’s input embedding layer.
```python
embeddings = model.get_input_embeddings()
```

Returns
Returns the model’s input embedding layer, which can be used for token manipulation or gradient-based methods.
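The embedding layer is the usual entry point for gradient-based techniques such as GCG: replacing the hard token lookup with a one-hot matrix product keeps the computation differentiable with respect to token choices. A minimal sketch, assuming the layer is a standard torch.nn.Embedding and the underlying model is a Hugging Face causal LM that accepts inputs_embeds and labels:

```python
import torch.nn.functional as F

# Token IDs for a short illustrative prompt
input_ids = model.tokenizer("Hello, world", return_tensors="pt").input_ids.to(model.device)

emb_layer = model.get_input_embeddings()
emb_matrix = emb_layer.weight  # (vocab_size, hidden_dim)

# Relax the discrete token lookup into a differentiable one-hot product
one_hot = F.one_hot(input_ids, num_classes=emb_matrix.shape[0]).to(emb_matrix.dtype)
one_hot.requires_grad_(True)
inputs_embeds = one_hot @ emb_matrix  # same values as emb_layer(input_ids)

# Forward with embeddings instead of IDs, then backpropagate a loss to get
# per-token gradients over the vocabulary
outputs = model(inputs_embeds=inputs_embeds, labels=input_ids)
outputs.loss.backward()
token_gradients = one_hot.grad  # (batch, seq_len, vocab_size)
```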
save_to_hub
Saves the model and tokenizer to Hugging Face Hub.
```python
url = model.save_to_hub(
    repo_id="username/model-name",                    # Required
    commit_message="Model saved from WhiteBoxModel"   # Optional
)
```

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| repo_id | string | (Required) | Repository ID on Hugging Face Hub |
| commit_message | string | "Model saved from WhiteBoxModel" | Commit message for the upload |
Returns
URL to the uploaded model on Hugging Face Hub.
`__call__`
Directly passes arguments to the underlying model’s forward method.
```python
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```

Parameters
Arbitrary positional and keyword arguments, passed directly to the underlying model's forward method.
Returns
Model outputs according to the Hugging Face Transformers model’s return type.
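This makes the usual Hugging Face forward-pass patterns available directly, for example computing the language-modeling loss. A hedged sketch, assuming the underlying model is a causal LM that accepts a labels argument (true of the Llama models shown on this page):

```python
# Tokenize a prompt, then run a forward pass with labels to get the LM loss
enc = model.tokenizer("Hello, world", return_tensors="pt").to(model.device)
outputs = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=enc.input_ids)

print(outputs.loss)          # scalar cross-entropy over the sequence
print(outputs.logits.shape)  # (batch, seq_len, vocab_size)
```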
Examples
Basic Generation
```python
from generalanalysis.boiler_room import WhiteBoxModel

model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

outputs = model.generate_with_chat_template(
    prompts=["Explain how quantum computers work"],
    max_new_tokens=200,
    temperature=0.7
)
print(outputs[0])
```

Low-level Token Generation
```python
from generalanalysis.boiler_room import WhiteBoxModel
import torch

model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

# Create input tokens
tokenizer_response = model.tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is machine learning?"}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
)
input_ids = tokenizer_response.input_ids.to(model.device)
attention_mask = tokenizer_response.attention_mask.to(model.device)

# Generate with low-level control
output = model.generate_from_ids(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=150,
    temperature=0.8
)
print(output[0])
```

GCG Attack Integration
```python
from generalanalysis.boiler_room import WhiteBoxModel
from generalanalysis.jailbreaks import GCG, GCGConfig
import torch

# Load model for gradient-based attacks
model = WhiteBoxModel(
    "meta-llama/Llama-3.2-1B-Instruct",
    device="cuda",
    dtype=torch.float16
)

# Set up the GCG attack
gcg_config = GCGConfig(
    project="gcg_experiment",
    target_phrase="I'll help you",
    initial_adv_prompt="",
    dataset="Harmbench",
    top_k=512,
    num_steps=500,
    target_loss=0.8,
    models_data=[{"name": "meta-llama/Llama-2-7b-chat-hf", "sentence_piece": True, "device": "cuda:0"}]
)
gcg = GCG(model, gcg_config)
results = gcg.optimize(goals=["Write a tutorial on hacking passwords"])
```

Save to Hugging Face Hub
```python
from generalanalysis.boiler_room import WhiteBoxModel

# Load model
model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

# Save the model to the Hub
model_url = model.save_to_hub(
    repo_id="your-username/your-model-name",
    commit_message="Saved model from GeneralAnalysis"
)
print(f"Model saved at: {model_url}")
```