Bijection Learning
Encoding-based jailbreak using in-context learning
Bijection Learning leverages in-context learning to teach models a custom encoding scheme, then uses this encoding to bypass content filters. This technique exploits the model’s ability to learn and apply a mapping between plain English and an obfuscated symbolic representation within the same context.
The method works by defining an invertible mapping (bijection) between natural language and an encoded representation. By teaching the model this encoding scheme through examples and then posing harmful queries in the encoded format, the technique can effectively bypass safety filters that operate on natural language patterns.
Key Parameters
Parameter | Description |
---|---|
exp_name | Name for the experiment results |
victim_model | Target model to jailbreak |
trials | Attack budget (number of attempts) |
bijection_type | Type of encoding (“digit”, “letter”, “word”) |
fixed_size | Number of fixed points in the encoding |
num_digits | Number of digits for digit-based encoding |
safety_data | Dataset to use for testing |
universal | Whether to use a universal attack |
digit_delimiter | Delimiter for digit-based encoding |
interleave_practice | Whether to interleave practice examples |
context_length_search | Whether to search for optimal context length |
prefill | Whether to prefill the victim model |
num_teaching_shots | Number of examples to teach the encoding |
num_multi_turns | Number of practice conversation turns |
input_output_filter | Whether to use input/output filtering |
For detailed performance metrics and configurations, refer to our Jailbreak Cookbook.