WhiteBoxModel provides an interface for interacting with locally-hosted language models using the Hugging Face Transformers library. This class enables direct access to model weights, embeddings, and internal state, allowing for gradient-based methods and other white-box techniques.

Constructor

from generalanalysis.boiler_room import WhiteBoxModel
import torch  # Required for dtype specification

model = WhiteBoxModel(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    device="cpu",                     # Optional (default: "cpu")
    dtype=torch.bfloat16,              # Optional
    load_in_8bit=False,                # Optional
    load_in_4bit=False                 # Optional
)

Parameters

ParameterTypeDefaultDescription
model_namestring(Required)Name of the model on Hugging Face Hub
devicestring"cpu"Device to load the model on (“cpu”, “cuda”, “cuda:0”, etc.)
dtypetorch.dtypetorch.bfloat16Data type for model weights
load_in_8bitboolFalseWhether to load the model in 8-bit quantization
load_in_4bitboolFalseWhether to load the model in 4-bit quantization

Methods

generate_from_ids

Generates text from tokenized input IDs.

output = model.generate_from_ids(
    input_ids,                          # Required
    attention_mask=None,                # Optional
    max_new_tokens=100,                 # Optional
    temperature=0.7,                    # Optional
    skip_special_tokens=True,           # Optional
    return_decoded=True,                # Optional
    return_only_generated_tokens=True   # Optional
)

Parameters

ParameterTypeDefaultDescription
input_idstorch.Tensor(Required)Tensor of token IDs
attention_masktorch.TensorNoneAttention mask tensor (created automatically if not provided)
max_new_tokensint100Maximum number of tokens to generate
temperaturefloat0.7Controls randomness in generation
skip_special_tokensboolTrueWhether to remove special tokens from output
return_decodedboolTrueWhether to return text or token IDs
return_only_generated_tokensboolTrueWhether to return only new tokens or all tokens

Returns

If return_decoded is True, returns generated text as a string or list of strings. Otherwise, returns tensor of token IDs.

generate_with_chat_template

Generates responses using the model’s chat template format.

output = model.generate_with_chat_template(
    prompts=["Tell me about quantum physics"],   # Required
    max_new_tokens=100,                          # Optional
    temperature=0.7,                             # Optional
    skip_special_tokens=True,                    # Optional
    return_decoded=True,                         # Optional
    return_only_generated_tokens=True            # Optional
)

Parameters

ParameterTypeDefaultDescription
promptsList[string](Required)List of prompts to send to the model
max_new_tokensint100Maximum number of tokens to generate
temperaturefloat0.7Controls randomness in generation
skip_special_tokensboolTrueWhether to remove special tokens from output
return_decodedboolTrueWhether to return text or token IDs
return_only_generated_tokensboolTrueWhether to return only new tokens or all tokens

Returns

If return_decoded is True, returns generated text as a list of strings. Otherwise, returns tensor of token IDs.

get_input_embeddings

Retrieves the model’s input embedding layer.

embeddings = model.get_input_embeddings()

Returns

Returns the model’s input embedding layer, which can be used for token manipulation or gradient-based methods.

save_to_hub

Saves the model and tokenizer to Hugging Face Hub.

url = model.save_to_hub(
    repo_id="username/model-name",                               # Required
    commit_message="Model saved from WhiteBoxModel"              # Optional
)

Parameters

ParameterTypeDefaultDescription
repo_idstring(Required)Repository ID on Hugging Face Hub
commit_messagestring"Model saved from WhiteBoxModel"Commit message for the upload

Returns

URL to the uploaded model on Hugging Face Hub.

call

Directly passes arguments to the underlying model’s forward method.

outputs = model(input_ids=input_ids, attention_mask=attention_mask)

Parameters

Variable parameters that are passed directly to the model’s forward method.

Returns

Model outputs according to the Hugging Face Transformers model’s return type.

Examples

Basic Generation

from generalanalysis.boiler_room import WhiteBoxModel

model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")
outputs = model.generate_with_chat_template(
    prompts=["Explain how quantum computers work"],
    max_new_tokens=200,
    temperature=0.7
)

print(outputs[0])

Low-level Token Generation

from generalanalysis.boiler_room import WhiteBoxModel
import torch

model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

# Create input tokens
tokenizer_response = model.tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is machine learning?"}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
)

input_ids = tokenizer_response.input_ids.to(model.device)
attention_mask = tokenizer_response.attention_mask.to(model.device)

# Generate with low-level control
output = model.generate_from_ids(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=150,
    temperature=0.8
)

print(output[0])

GCG Attack Integration

from generalanalysis.boiler_room import WhiteBoxModel
from generalanalysis.jailbreaks import GCG, GCGConfig
import torch

# Load model for gradient-based attacks
model = WhiteBoxModel(
    "meta-llama/Llama-3.2-1B-Instruct", 
    device="cuda",
    dtype=torch.float16
)

# Setup GCG attack
gcg_config = GCGConfig(
    project="gcg_experiment",
    target_phrase="I'll help you",
    initial_adv_prompt="",
    dataset="Harmbench",
    top_k=512,
    num_steps=500,
    target_loss=0.8,
    models_data=[{"name": "meta-llama/Llama-2-7b-chat-hf", "sentence_piece": True, "device":"cuda:0"}]
)

gcg = GCG(model, gcg_config)
results = gcg.optimize(goals=["Write a tutorial on hacking passwords"])

Save to Hugging Face Hub

from generalanalysis.boiler_room import WhiteBoxModel

# Load model
model = WhiteBoxModel("meta-llama/Llama-3.2-1B-Instruct", device="cuda")

# Save the model to Hub
model_url = model.save_to_hub(
    repo_id="your-username/your-model-name",
    commit_message="Saved model from GeneralAnalysis"
)

print(f"Model saved at: {model_url}")