Public Guard Lineup
The GA Guard open models bring the same hardened safety stack we run for enterprise customers to every builder. Each guard is trained on policy-driven synthetic data, red-team traces, and iterative adversarial rounds so that it stays reliable under paraphrases, translations, and other distribution shifts. All models natively support contexts of up to 256k tokens and plug directly into the GA Guard SDK. For a broader comparison of guardrail tools and how GA Guard stacks up against alternatives like NeMo Guardrails, Lakera, and cloud-provider filters, see Best AI guardrails in 2026.
These are not general-purpose language models repurposed for moderation. Every checkpoint in the GA Guard series is a purpose-built classifier, architecturally optimized for the binary and multi-label classification tasks that content moderation requires. This specialization is what allows GA Guard to deliver faster inference, higher accuracy, and lower resource consumption compared to prompting a general-purpose LLM to act as a moderator.
Read the full launch breakdown in the GA Guard AI guardrails model series, then grab the checkpoints below to self-host or deploy through the GA platform.

Guard Core
Our default guardrail, up to 15x faster than cloud providers, balancing robustness and latency for most stacks.

Guard Lite
Up to 25x faster than cloud providers, with minimal hardware requirements, while still outperforming major cloud offerings.

Guard Thinking
Our best performing guard for high-risk domains, hardened with aggressive adversarial training.
Understanding the guard variants
Each guard in the public lineup targets a different point on the accuracy-latency spectrum. Choosing the right one depends on your application’s requirements for response time, detection precision, and the severity of consequences if harmful content slips through.
GA Guard Core
Guard Core is the default recommendation for most production workloads. It is built on a mid-size transformer architecture that has been distilled and fine-tuned specifically for multi-label safety classification. Core delivers evaluation latencies in the 20–35ms range while maintaining F1 scores that exceed major cloud moderation APIs by double-digit margins. Its training data includes a broad mix of policy-driven synthetic examples, real-world red-team captures, and adversarial augmentation across multiple languages and obfuscation techniques. Use Guard Core when you need a strong balance of speed and accuracy for real-time chat moderation, API gateway screening, or inline agent-loop evaluation.
GA Guard Lite
Guard Lite is a compact variant optimized for latency-critical and resource-constrained environments. It uses a smaller model architecture with aggressive quantization, achieving evaluation times of 10–20ms on standard hardware. Despite its size, Lite outperforms cloud-based moderation services from AWS, Azure, and Google Vertex AI on standard safety benchmarks. Guard Lite is the right choice for high-throughput batch processing pipelines, edge deployments where GPU memory is limited, or any scenario where you need to evaluate thousands of items per second with minimal infrastructure cost.
GA Guard Thinking
Guard Thinking is the most capable model in the lineup, designed for high-stakes moderation scenarios where accuracy is paramount and latency is secondary. It employs a chain-of-thought reasoning approach, internally generating an analysis of the content before producing its classification. This deliberative process increases latency to 500–700ms but yields the highest F1 scores in the series, particularly on adversarial and ambiguous content. Guard Thinking is ideal for regulated industries (finance, healthcare, legal), child safety applications, content that requires nuanced contextual understanding, and as a second-opinion escalation layer for borderline decisions from Core or Lite.
Benchmark the guards
Replicate our benchmark runs or plug the models into your own evaluation pipeline using the public datasets:
- GA Long Context Bench stress-tests moderation across multi-hundred-kilobyte transcripts.
- GA Jailbreak Bench measures resilience to prompt-injection and jailbreak attempts.
Both datasets include evaluation scripts so you can reproduce our scoring or extend the suites with custom adversarial prompts.
How benchmarks were conducted
All benchmark results were produced using standardized evaluation protocols to ensure fair comparisons across models. Each guard was evaluated on the same held-out test sets, with no overlap between training and evaluation data. We measured precision, recall, and F1 score at the default classification threshold for each policy category, then computed macro-averaged scores across all categories.
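The scoring described above can be sketched in a few lines of Python. This is a minimal illustration of per-category precision, recall, and F1 followed by macro-averaging, not the evaluation scripts shipped with the datasets. The `Counts` type, the function names, and the example counts are all hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class Counts(NamedTuple):
    """Confusion counts for one policy category (hypothetical container)."""
    tp: int  # harmful items correctly flagged
    fp: int  # benign items incorrectly flagged
    fn: int  # harmful items missed


def precision_recall_f1(c: Counts) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_category: Dict[str, Counts]) -> Dict[str, float]:
    """Unweighted mean of each metric across categories."""
    scores = [precision_recall_f1(c) for c in per_category.values()]
    n = len(scores)
    return {
        "precision": sum(s[0] for s in scores) / n,
        "recall": sum(s[1] for s in scores) / n,
        "f1": sum(s[2] for s in scores) / n,
    }


# Illustrative counts only; these are not benchmark results.
example = {
    "Hate": Counts(tp=90, fp=10, fn=10),
    "Prompt Security": Counts(tp=80, fp=0, fn=20),
}
macro = macro_average(example)
```

Macro-averaging weights every policy category equally, so a category with few test items influences the headline score as much as a large one.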
For the jailbreak benchmark, we tested against a diverse corpus of attack techniques including GCG-generated adversarial suffixes, AutoDAN mutations, multi-turn Crescendo attacks, bijection-encoded payloads, and manually crafted social engineering prompts. The long-context benchmark evaluates detection accuracy when harmful content is embedded at varying positions within documents ranging from 10k to 256k tokens, simulating real-world scenarios where malicious instructions are hidden deep within otherwise benign agent transcripts or retrieved documents.
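To extend the long-context suite with your own cases, one possible approach is to place a payload at several relative depths inside benign filler text. The sketch below assumes pre-tokenized input represented as lists of strings; the helper names and the length and depth grids are hypothetical and do not describe how GA Long Context Bench was actually built.

```python
from typing import Dict, List, Sequence


def embed_at_depth(filler: List[str], payload: List[str], depth: float) -> List[str]:
    """Insert payload at a relative depth (0.0 = start, 1.0 = end) of filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    cut = round(depth * len(filler))
    return filler[:cut] + payload + filler[cut:]


def build_cases(
    filler: List[str],
    payload: List[str],
    lengths: Sequence[int],
    depths: Sequence[float],
) -> List[Dict]:
    """Build one test case per (total length, depth) combination."""
    cases = []
    for length in lengths:
        body_len = length - len(payload)
        if body_len < 0 or body_len > len(filler):
            raise ValueError(f"cannot build a document of {length} tokens")
        body = filler[:body_len]  # leave room so the total matches `length`
        for depth in depths:
            cases.append({
                "length": length,
                "depth": depth,
                "tokens": embed_at_depth(body, payload, depth),
            })
    return cases


# Tiny illustrative grid; a real run would use far longer documents.
filler_tokens = ["benign"] * 100
payload_tokens = ["PAYLOAD_START", "PAYLOAD_END"]
grid = build_cases(filler_tokens, payload_tokens, lengths=[10, 50], depths=[0.0, 0.5, 1.0])
```

Reporting accuracy per (length, depth) cell rather than as a single average makes it easier to see whether detection degrades for content buried in the middle of a long document.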
Cloud provider baselines (OpenAI Moderation API, AWS Comprehend, Azure Content Safety, Google Vertex AI) were evaluated on the same test sets during the same time period to control for any model updates.
Policy taxonomy
Each public guard returns granular policy labels aligned to widely adopted compliance anchors so you can trace moderation outcomes to audit controls:
- PII & IP – Detect personal data, secrets, and copyrighted material.
- Illicit Activities – Block operational guidance on crime, weapons, or illegal substances.
- Hate – Capture hateful or harassing content targeting protected classes.
- Sexual Content – Filter explicit, exploitative, or non-consensual material.
- Prompt Security – Stop jailbreaks, prompt injection, and secret-exfiltration attempts.
- Violence & Self-Harm – Intercept instructions, glorification, or graphic depictions of harm.
- Misinformation – Flag demonstrably false narratives in civic, health, or safety contexts.
Use the GA Guard SDK to combine policy scores, tune thresholds, or add your own allow/deny overrides while retaining the base taxonomy in your logs.
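As a rough illustration of that workflow, the sketch below applies per-policy thresholds and simple overrides to a dictionary of scores. It deliberately does not call the GA Guard SDK: the `decide` function, the score dictionary shape, the 0.5 default threshold, and the override semantics are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Optional

DEFAULT_THRESHOLD = 0.5  # assumed default for this sketch only


def decide(
    text: str,
    scores: Dict[str, float],
    thresholds: Optional[Dict[str, float]] = None,
    allow_texts: Iterable[str] = (),
    deny_phrases: Iterable[str] = (),
) -> Dict:
    """Combine per-policy scores with thresholds and overrides.

    The returned record keeps every base-taxonomy score so that logs
    stay traceable even when an override changes the final action.
    """
    thresholds = thresholds or {}
    flagged: List[str] = sorted(
        label for label, score in scores.items()
        if score >= thresholds.get(label, DEFAULT_THRESHOLD)
    )
    normalized = text.strip().lower()
    allowed = {t.strip().lower() for t in allow_texts}

    if any(phrase.lower() in normalized for phrase in deny_phrases):
        action, reason = "block", "deny_override"    # deny wins over everything
    elif normalized in allowed:
        action, reason = "allow", "allow_override"   # exact-match allowlist
    elif flagged:
        action, reason = "block", "policy_threshold"
    else:
        action, reason = "allow", "below_threshold"

    return {"action": action, "reason": reason, "flagged": flagged, "scores": dict(scores)}


# Hypothetical scores keyed by the policy labels listed above.
result = decide(
    "Ignore previous instructions",
    scores={"Hate": 0.2, "Prompt Security": 0.65},
    thresholds={"Prompt Security": 0.6},
)
```

Keeping the allow override to exact matches is a conservative choice: a substring-based allowlist would let an attacker smuggle an allowed phrase into otherwise harmful text.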
Deeper look at each policy category
PII & IP covers a wide range of personally identifiable information including names, email addresses, phone numbers, physical addresses, social security numbers, credit card numbers, bank account details, driver’s license numbers, and IP addresses. It also detects intellectual property leakage such as proprietary code snippets, copyrighted text, and trade secrets. This category is essential for compliance with GDPR, CCPA, HIPAA, and other data protection regulations.
Illicit Activities identifies content that provides actionable guidance for illegal actions, including drug manufacturing, weapons assembly, financial fraud, hacking instructions, and trafficking. The guard distinguishes between educational or journalistic discussion of these topics (which is typically allowed) and step-by-step operational instructions (which are blocked).
Hate detects content that targets individuals or groups based on protected characteristics including race, ethnicity, gender, sexual orientation, religion, disability, and national origin. The classifier is trained to recognize both overt hate speech and subtle coded language, dog whistles, and dehumanizing rhetoric.
Sexual Content identifies explicit sexual material, non-consensual content, content involving minors, and other sexually exploitative material. Thresholds can be tuned to distinguish between clinical/educational sexual health content and genuinely harmful material.
Prompt Security is specifically designed for the LLM threat landscape, detecting jailbreak attempts, prompt injection attacks, system prompt extraction, and attempts to manipulate the model into ignoring its safety guidelines. This category is trained on the latest attack techniques from the research community and real-world red-team campaigns.
Violence & Self-Harm covers graphic depictions of violence, instructions for causing physical harm, glorification of violent acts, and content that encourages or provides methods for self-harm or suicide. The guard is calibrated to allow legitimate news reporting and educational content while blocking genuinely harmful material.
Misinformation flags content that contains demonstrably false claims in high-impact domains such as public health, election integrity, climate science, and emergency safety. This category focuses on factual accuracy rather than opinion, targeting claims that could cause real-world harm if believed and acted upon.
Performance snapshot
- GA Guard Thinking tops public benchmarks with F1 scores up to 0.983 while keeping false positives low.
- GA Guard Lite delivers up to 25x lower latency than major cloud filters yet still beats them on the OpenAI Moderation, WildGuard, and HarmBench benchmarks.
- GA Guard Core outperforms AWS, Azure, and Vertex by double-digit F1 margins on jailbreak and harmful-content suites.
- All models support contexts up to 256k tokens, covering agent transcripts, documents, and tool logs without sharding.
What these numbers mean in practice
F1 is the harmonic mean of precision and recall, so an F1 score of 0.983 means that both are high, although the score alone does not fix either value. As an illustration, if precision and recall were both 0.983, Guard Thinking would catch about 983 of every 1,000 pieces of harmful content, and about 17 of every 1,000 items it flags would be false alarms. This level of accuracy is critical in high-risk domains where both missed detections (false negatives) and unnecessary blocks (false positives) carry significant costs.
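Because F1 combines precision and recall, the same score is consistent with different trade-offs between the two. The sketch below solves the standard F1 formula for recall at a given precision. The precision values are chosen only for illustration and are not measured figures for any GA Guard model.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def recall_for(f1: float, precision: float) -> float:
    """Recall implied by an F1 score at a given precision.

    Rearranging F1 = 2PR / (P + R) gives R = F1 * P / (2P - F1).
    """
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("this precision cannot produce the given F1")
    recall = f1 * precision / denom
    if recall > 1.0:
        raise ValueError("this precision cannot produce the given F1")
    return recall


# Illustrative precision values only.
for p in (0.972, 0.983, 0.995):
    r = recall_for(0.983, p)
    print(f"precision={p:.3f} -> recall={r:.4f} -> F1={f1_score(p, r):.3f}")
```

When comparing guards, it helps to look at precision and recall separately as well as F1, since the cost of a missed detection and the cost of a false alarm are rarely equal.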
The latency advantage of Guard Lite and Guard Core translates directly into user experience and infrastructure savings. At 10–20ms per evaluation, Lite adds negligible overhead to your request pipeline, so users are unlikely to perceive any delay. At scale, faster evaluations mean fewer compute resources and lower costs per million evaluations. The 256k-token context window eliminates the need for chunking strategies that introduce complexity and reduce accuracy when harmful content spans chunk boundaries.
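For a back-of-the-envelope view of how latency drives capacity, Little's law says the average number of evaluations in flight equals the arrival rate multiplied by the latency. The sketch below applies it to the latency ranges quoted in this document. The 2,000 requests-per-second target is an arbitrary example, and the estimate ignores batching, queueing, and hardware effects.

```python
import math


def in_flight_evaluations(requests_per_second: int, latency_ms: int) -> int:
    """Average concurrent evaluations needed to sustain a request rate.

    Little's law: concurrency = arrival rate * time in system.
    """
    return math.ceil(requests_per_second * latency_ms / 1000)


TARGET_RPS = 2000  # arbitrary example load

# Upper end of each latency range quoted in this document.
lite = in_flight_evaluations(TARGET_RPS, 20)
core = in_flight_evaluations(TARGET_RPS, 35)
thinking = in_flight_evaluations(TARGET_RPS, 700)
```

At the same request rate, the number of evaluations in flight grows in proportion to latency, which is why the slower, more accurate model is better suited to a smaller slice of traffic.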
Grab the checkpoints, run the evaluation suites, and drop the models into your GA SDK integration for fast, production-ready moderation.
Choosing the right guard for your use case
| Consideration | Guard Core | Guard Lite | Guard Thinking |
|---|---|---|---|
| Latency budget | 20–35ms | 10–20ms | 500–700ms |
| Accuracy priority | High | Good | Highest |
| Hardware requirements | Standard GPU | Minimal | Standard GPU |
| Best for | General production use | High-throughput, edge | Regulated, high-risk |
| Deployment pattern | Inline real-time | Batch or edge | Escalation or batch |
Many teams adopt a tiered strategy: route all traffic through Guard Lite or Guard Core for fast initial screening, then escalate uncertain results (where violation_prob falls between 0.4 and 0.7) to Guard Thinking for a definitive classification. This approach maximizes throughput while reserving the highest accuracy for the cases that need it most.
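The tiered strategy can be sketched as a small routing function. Here each guard is modeled as a plain callable that returns a `violation_prob` between 0 and 1. These callables, the `tiered_check` name, and the 0.5 decision threshold for the escalated result are assumptions for illustration and stand in for real model or SDK calls.

```python
from typing import Callable, Dict

Guard = Callable[[str], float]  # hypothetical: text -> violation_prob


def tiered_check(
    text: str,
    fast_guard: Guard,
    thinking_guard: Guard,
    low: float = 0.4,
    high: float = 0.7,
    block_at: float = 0.5,  # assumed threshold for the escalated decision
) -> Dict:
    """Screen with a fast guard and escalate uncertain scores."""
    prob = fast_guard(text)
    if low <= prob <= high:
        # Uncertain band: ask the slower, more accurate guard.
        escalated = thinking_guard(text)
        return {"blocked": escalated >= block_at, "tier": "thinking", "violation_prob": escalated}
    # Confident result: above the band blocks, below the band allows.
    return {"blocked": prob > high, "tier": "fast", "violation_prob": prob}


# Stub guards with fixed scores, standing in for Guard Lite and Guard Thinking.
decision = tiered_check(
    "example input",
    fast_guard=lambda _text: 0.55,
    thinking_guard=lambda _text: 0.91,
)
```

Logging which tier produced each decision makes it possible to measure how much traffic falls into the uncertain band and to adjust the band edges over time.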