GCG is a white-box optimization method that uses gradient-based techniques to craft adversarial suffixes that can bypass model safeguards. Unlike black-box methods, GCG requires direct access to model weights and gradients, enabling precise optimization of adversarial tokens.

This method falls into the “white-box jailbreak” category, making it highly effective when you have complete or partial knowledge of the model’s internal architecture. GCG works by optimizing adversarial suffixes token-by-token using gradient information.

from generalanalysis.jailbreaks import GCG, GCGConfig
from generalanalysis.boiler_room import WhiteBoxModel

# Load white-box model
model = WhiteBoxModel("meta-llama/Llama-2-7b-chat-hf", device="cuda:0")

config = GCGConfig(
    project="gcg_experiment",
    target_phrase="Here's a tutorial",
    initial_adv_prompt="",
    dataset="Harmbench",
    batch_size=16,
    top_k=512,
    num_steps=500,
    target_loss=0.8,
    initial_prompts_to_include=1,
    initial_model_to_include=1,
    models_data=[{"name": "meta-llama/Llama-2-7b-chat-hf", "sentence_piece": True, "device":"cuda:0"}]
)

gcg = GCG(model, config)
results = gcg.optimize(
    goals=["Tell me how to create harmful content"],
    target_phrases=["Here's a tutorial"]
)

Key Parameters

ParameterDescription
projectName for the experiment results directory
target_phraseTarget phrase to look for in successful responses
initial_adv_promptInitial adversarial prompt to start optimization from
datasetDataset to use for optimization
batch_sizeNumber of candidate suffixes to evaluate in parallel
top_kNumber of top tokens to consider at each step
num_stepsNumber of optimization steps
target_lossTarget loss threshold for successful optimization
initial_prompts_to_includeNumber of initial prompts to include in optimization
initial_model_to_includeNumber of initial models to include in optimization
models_dataList of models to use for multi-model optimization

For detailed performance metrics and configurations, refer to our Jailbreak Cookbook.