GCG (Greedy Coordinate Gradient)
White-box gradient-based optimization for adversarial suffixes
GCG is a white-box optimization method that uses gradient information to craft adversarial suffixes capable of bypassing model safeguards. Unlike black-box methods, GCG requires direct access to model weights and gradients, which enables precise, token-level optimization of the suffix.
This method falls into the “white-box jailbreak” category and is most effective when you have full access to the model’s internals. At each step, GCG uses the gradient of the target loss with respect to the suffix tokens to propose top-k candidate replacements per position, evaluates a batch of single-token swaps, and greedily keeps the swap that lowers the loss most.
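The loop below is a minimal, illustrative sketch of a single GCG step, assuming a Hugging Face-style causal LM in PyTorch. The function name `gcg_step` and its tensor arguments (`prompt_ids`, `suffix_ids`, `target_ids`) are hypothetical placeholders, not part of any released implementation.

```python
import torch


def gcg_step(model, prompt_ids, suffix_ids, target_ids, top_k=256, batch_size=512):
    """One GCG step: propose single-token swaps in the suffix via the
    gradient of the target loss, then greedily keep the best candidate."""
    embed_weights = model.get_input_embeddings().weight  # (vocab_size, dim)
    device = suffix_ids.device

    # 1) Differentiable one-hot encoding of the current suffix tokens.
    one_hot = torch.zeros(
        suffix_ids.numel(), embed_weights.size(0),
        dtype=embed_weights.dtype, device=device,
    )
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_()

    # 2) Loss on the target continuation, with gradients flowing to the suffix.
    embeds = torch.cat([
        model.get_input_embeddings()(prompt_ids),
        one_hot @ embed_weights,  # suffix embeddings, differentiable
        model.get_input_embeddings()(target_ids),
    ], dim=0)
    logits = model(inputs_embeds=embeds.unsqueeze(0)).logits[0]
    tgt_start = prompt_ids.numel() + suffix_ids.numel()
    tgt_slice = slice(tgt_start - 1, tgt_start - 1 + target_ids.numel())
    loss = torch.nn.functional.cross_entropy(logits[tgt_slice], target_ids)
    (grad,) = torch.autograd.grad(loss, one_hot)

    # 3) Top-k replacement candidates per position (most negative gradient).
    candidates = (-grad).topk(top_k, dim=1).indices  # (suffix_len, top_k)

    # 4) Sample a batch of suffixes, each with one random position swapped.
    positions = torch.randint(0, suffix_ids.numel(), (batch_size,), device=device)
    picks = candidates[positions, torch.randint(0, top_k, (batch_size,), device=device)]
    batch = suffix_ids.repeat(batch_size, 1)
    batch[torch.arange(batch_size, device=device), positions] = picks

    # 5) Greedily keep the candidate with the lowest target loss.
    best_loss, best_suffix = float("inf"), suffix_ids
    with torch.no_grad():
        for cand in batch:
            ids = torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)
            out = model(ids).logits[0]
            cand_loss = torch.nn.functional.cross_entropy(out[tgt_slice], target_ids)
            if cand_loss.item() < best_loss:
                best_loss, best_suffix = cand_loss.item(), cand
    return best_suffix, best_loss
```

A production implementation would score the whole candidate batch in a single batched forward pass; the per-candidate loop above trades speed for readability.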
Key Parameters
| Parameter | Description |
|---|---|
| project | Name for the experiment results directory |
| target_phrase | Target phrase to look for in successful responses |
| initial_adv_prompt | Initial adversarial prompt to start optimization from |
| dataset | Dataset to use for optimization |
| batch_size | Number of candidate suffixes to evaluate in parallel |
| top_k | Number of top candidate tokens to consider at each suffix position |
| num_steps | Number of optimization steps |
| target_loss | Loss threshold below which optimization is considered successful |
| initial_prompts_to_include | Number of initial prompts to include in optimization |
| initial_model_to_include | Number of initial models to include in optimization |
| models_data | List of models to use for multi-model optimization |
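As a concrete illustration of how these parameters fit together, here is a hypothetical configuration. The key names mirror the table above, but every value is a placeholder chosen for illustration, not a recommended default.

```python
# Hypothetical GCG configuration; values are illustrative only.
gcg_config = {
    "project": "gcg_demo",                        # results directory name
    "target_phrase": "Sure, here is",             # substring marking a successful response
    "initial_adv_prompt": "! ! ! ! ! ! ! ! ! !",  # starting adversarial suffix
    "dataset": "advbench",                        # prompts to optimize over (assumed name)
    "batch_size": 512,                            # candidate suffixes evaluated per step
    "top_k": 256,                                 # candidate tokens per suffix position
    "num_steps": 500,                             # optimization iterations
    "target_loss": 0.05,                          # stop once loss falls below this
    "initial_prompts_to_include": 1,              # prompts included at the start
    "initial_model_to_include": 1,                # models included at the start
    "models_data": [                              # models for multi-model optimization
        {"name": "meta-llama/Llama-2-7b-chat-hf"},
    ],
}
```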
For detailed performance metrics and configurations, refer to our Jailbreak Cookbook.