Runtime moderation and enforcement for AI
OmniGuard inspects what your AI reads and writes, then enforces your policy in real time. One multimodal model covers text, image, and audio across 50+ languages. Run it as a managed API or self-host it on your own GPUs.
OmniGuard is the enforcement product in the Realm Labs platform. Prism observes your AI in production; OmniGuard acts on it. This guide is everything you need to integrate and deploy OmniGuard yourself, whether you call our endpoint or run the container inside your own boundary.
A note on names. OmniGuard is the product. In deployment artifacts and API responses you will see the runtime referred to as realmguard: the Triton model name, the download script, and the reference client all use it. They are the same engine.
Choose how you deploy
OmniGuard ships in two forms. Both expose the same moderation API, so the integration you write today works whichever path you pick, and you can move from one to the other without rewriting your client.
Two deployment models
Same API surface · pick by where your data needs to live
OmniGuard API
Realm-hosted endpoint. Send content, get a verdict. No infrastructure to run.
- Fastest path to a working integration
- No GPUs, drivers, or Triton to manage
- Realm handles model updates and scaling
- Best for getting started and standard traffic
Self-hosted (Docker / Triton)
Download the model and run it on your own GPUs, in your VPC or fully air-gapped.
- No content ever leaves your boundary
- Runs on NVIDIA Triton on your H100s
- You control versioning and capacity
- Best for regulated data and on-prem stacks
Which one is right for you
| | OmniGuard API | Self-hosted |
|---|---|---|
| Where content is processed | Realm-managed endpoint | Inside your own VPC or on-prem cluster |
| Infrastructure you run | None | NVIDIA Triton on your GPUs |
| Data residency | Realm region | Your boundary, air-gap supported |
| Model updates | Managed for you | You pull versioned artifacts |
| Time to first call | Minutes | 15 to 20 minutes to download, then deploy |
| Best fit | Teams that want speed and no ops | Regulated, sensitive, or high-volume workloads |
Start managed, move on-prem later. Because the request and response contract is identical, many teams prototype against the API and switch to self-hosted for production without touching their integration code.
Quickstart
Moderate your first request in a few lines. Pick the tab that matches your deployment.
1. Send text for moderation
Post the content you want checked. The response tells you whether it was flagged and why.
curl -X POST https://api.realmlabs.ai/v1/moderations/text \
  -H "Authorization: Bearer $REALM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "How do I reset a user password safely?"}'
2. Read the verdict
{
  "flagged": false,
  "score": 0.018,
  "subcategory_flags": {},
  "latency_ms": 21.4
}
Image and audio use the same pattern at /v1/moderations/image and /v1/moderations/audio. See the REST moderation API for full request and response shapes.
Need an API key? Book a Demo and your Realm team will provision one for your account.
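The same call can be made from Python. The sketch below uses only the standard library and mirrors the curl example above: the URL, the Bearer header, and the `text` field come from this guide, while the helper names `build_text_request` and `moderate_text` are hypothetical and not part of any Realm SDK.

```python
import json
import urllib.request

API_URL = "https://api.realmlabs.ai/v1/moderations/text"  # from the curl example


def build_text_request(text, api_key=None, url=API_URL):
    """Build (but do not send) a POST request for text moderation."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # the self-hosted examples in this guide send no key
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def moderate_text(text, api_key=None, url=API_URL, timeout=10.0):
    """Send the request and return the parsed JSON verdict as a dict."""
    request = build_text_request(text, api_key, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `moderate_text("How do I reset a user password safely?", api_key=os.environ["REALM_API_KEY"])` should return a dictionary with the fields shown in the verdict above. Pointing `url` at `http://localhost:9000/v1/moderations/text` targets a self-hosted client instead.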
1. Download, launch, serve
Pull the model, start Triton, and bring up the moderation endpoints. The full walkthrough lives in Self-hosted deployment; here is the short version.
# 1. download the model artifacts (~15 min)
export HF_TOKEN="<your-provisioned-token>"
chmod +x download_realmguard.sh && ./download_realmguard.sh

# 2. launch Triton with the realmguard model
docker run --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 \
  -v $(pwd)/models:/models nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver --model-repository=/models \
  --model-control-mode=explicit --load-model=realmguard

# 3. in a second terminal, start the moderation client (opens /v1/moderations/* on :9000)
python realmguard_api.py
2. Validate
curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How do I reset a user password safely?"}'
Self-hosting needs an NVIDIA GPU (H100 recommended) with driver 550+. Your account team provisions the download token and repository. See prerequisites.
Modalities
OmniGuard moderates three content types through a single model with a shared backbone. You select the modality per request.
| Modality | Field | What it checks |
|---|---|---|
| Text | text | Plain text: prompts, responses, captions, transcripts |
| Image | image | Base64-encoded image bytes, with an optional text caption for context |
| Audio | audio | Base64-encoded audio bytes, with an optional transcript for context |
One model for all three keeps deployment simple: a single artifact, a single endpoint, and consistent verdicts across whatever a user types, uploads, or says.
Input preprocessing
Before inference, OmniGuard runs every image and audio input through a standardized preprocessing pipeline. The steps normalize each input into a consistent representation regardless of its original format, resolution, sample rate, or encoding. Text needs no preprocessing of this kind and is moderated directly.
Preprocessing is deterministic. Identical content produces identical features. When the same logical file returns different moderation outcomes across submissions, the difference almost always comes from the source file itself (its encoding, compression artifacts, metadata, or decoder behavior), not from OmniGuard.
Image preprocessing
Every submitted image runs through three steps.
1. Decode and RGB normalization
The image is decoded into a standard 3-channel RGB representation. A few decoding behaviors are worth knowing, because they are the usual reason two "identical" images score differently.
2. Aspect-ratio preserving resize
If the longest edge is larger than 512 pixels, the image is downscaled with bicubic interpolation so the longest edge becomes exactly 512 pixels, preserving aspect ratio. If the longest edge is already 512 pixels or smaller, no resizing happens. The step is fully deterministic: identical input pixels at identical dimensions always produce identical output.
3. Model processor
The normalized RGB image is passed to the model's native image processor, which handles tensor conversion, pixel normalization, tiling and patch generation, and model-specific feature preparation. OmniGuard does not modify or override these model-native steps.
For reproducible image results: use a lossless format (PNG preferred), bake orientation into the pixels, avoid repeated JPEG re-encoding, and strip unnecessary metadata before upload.
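To predict the size an image will have after step 2, the resize rule can be sketched as a small function. The 512-pixel limit and the no-upscaling behavior come from the description above. How the shorter edge is rounded is not stated in this guide, so rounding to the nearest pixel is an assumption here, and `resized_dimensions` is a hypothetical helper name.

```python
MAX_EDGE = 512  # longest-edge limit stated in the resize step


def resized_dimensions(width, height, max_edge=MAX_EDGE):
    """Return (width, height) after the aspect-ratio preserving downscale."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough: no resize happens
    scale = max_edge / longest
    # Assumption: the shorter edge is rounded to the nearest pixel, minimum 1.
    new_w = max_edge if width == longest else max(1, round(width * scale))
    new_h = max_edge if height == longest else max(1, round(height * scale))
    return new_w, new_h
```

Under these assumptions a 1024 x 768 image becomes 512 x 384, and a 512 x 300 image is left untouched.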
Audio preprocessing
OmniGuard supports common audio containers and codecs, including MP3, WAV, FLAC, OGG, and M4A. Additional formats may work depending on the decoder stack. Every submitted file runs through four steps.
1. Decode, resample, and downmix
Audio is decoded into a waveform, resampled to 16 kHz regardless of the original sample rate, and downmixed to mono.
2. Length normalization
OmniGuard analyzes audio in a fixed 300-second (5-minute) window. Clips shorter than 300 seconds are padded with silence; clips longer than 300 seconds are truncated to the first 300 seconds. This is deterministic and gives every input an identical duration.
3. Log-mel spectrogram
The 16 kHz mono waveform is converted into a log-mel spectrogram with the parameters below, using standard log-magnitude scaling and normalization. No dithering or artificial noise is added, so identical waveforms always produce identical features.
| Parameter | Value |
|---|---|
| Mel-frequency bins | 128 |
| Analysis window | 25 ms |
| Hop size | 10 ms |
| Sample rate | 16 kHz, mono |
4. Model processing
The spectrogram features are passed directly to the moderation model. OmniGuard does not modify or override the model's native feature processing beyond the steps described here.
For reproducible audio results: submit lossless audio when you can, ideally 16 kHz mono WAV or FLAC. Lossy formats like MP3, AAC, and OGG permanently alter the waveform during compression, and different decoders can introduce small numerical differences. Converting WAV to MP3 and back will not reproduce the original waveform exactly, even when it sounds identical.
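The length-normalization arithmetic from step 2 can be sketched in plain Python. The 16 kHz rate, the 300-second window, and the 25 ms / 10 ms spectrogram parameters come from this section. Where the silence is added is not specified, so padding at the end of the clip is an assumption, and `normalize_length` is a hypothetical helper name.

```python
SAMPLE_RATE_HZ = 16_000  # resample target from step 1
WINDOW_SECONDS = 300     # fixed analysis window from step 2
TARGET_SAMPLES = SAMPLE_RATE_HZ * WINDOW_SECONDS  # 4,800,000 samples

WIN_SAMPLES = SAMPLE_RATE_HZ * 25 // 1000  # 25 ms analysis window = 400 samples
HOP_SAMPLES = SAMPLE_RATE_HZ * 10 // 1000  # 10 ms hop = 160 samples


def normalize_length(samples, target=TARGET_SAMPLES):
    """Truncate to the first `target` samples, or pad with zeros (silence).

    Assumption: silence is appended after the audio, not before it.
    """
    if len(samples) >= target:
        return list(samples[:target])
    return list(samples) + [0.0] * (target - len(samples))
```

A practical consequence of the fixed window is that anything after the first five minutes of a long recording is never scored, so longer audio has to be split on the client side before submission.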
Request types
A request type tells OmniGuard which moderation workflow to run. It is set with the REQUEST_TYPE field and defaults to prompt when omitted.
| Request type | Moderates | Typical use |
|---|---|---|
| prompt | User-provided content | Input safety, file uploads, voice assistants |
| response | An LLM response, in the context of its prompt | Output filtering, inline generation blocking |
| multiturn | A full chat conversation | Persistent chat and agentic workflows |
| version | Nothing; returns deployed artifact hashes | Deployment verification and audit |
Prompt moderation is single-shot. Response moderation builds an internal system → user → assistant conversation and scores the final turn with a dedicated response-safety probe. Multi-turn moderation evaluates an entire history and is covered in Multi-turn moderation.
Reading a verdict
Every moderation request returns the same contract, so your handling code stays consistent across modalities and request types.
| Output | Type | Shape | Meaning |
|---|---|---|---|
| SCORE | FP32 | [1] | Moderation probability for the content |
| FLAGGED | BOOL | [1] | Threshold decision: did it cross your policy line |
| SUBCATEGORY_SCORES | FP32 | [13] | Per-category probabilities across the taxonomy |
| SUBCATEGORY_FLAGS | BOOL | [13] | Per-category threshold decisions |
| CACHE_USAGE | INT32 | [8] | KV-cache statistics for multi-turn requests |
Subcategory outputs populate only when a request is flagged and the relevant subcategory probe is loaded, so a clean request stays cheap. Use SCORE when you want to set your own threshold, and FLAGGED when you want OmniGuard's default decision.
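The choice between SCORE and FLAGGED can be sketched as a single decision function. It reads the lowercase `score` and `flagged` fields shown in the REST quickstart verdict. The helper name `decide` and the use of `>=` at the threshold are assumptions, not documented behavior.

```python
def decide(verdict, threshold=None):
    """Return True when the content should be blocked.

    With no threshold, defer to the service's own `flagged` decision.
    With a threshold, compare `score` against the caller's policy line.
    """
    if threshold is None:
        return bool(verdict["flagged"])
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a probability in [0, 1]")
    # Assumption: a score equal to the threshold counts as a violation.
    return float(verdict["score"]) >= threshold
```

A stricter surface, such as a product for minors, could pass a lower threshold than the default decision uses, while an internal tool could pass a higher one.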
Moderation taxonomy
OmniGuard classifies content against the MLCommons safety taxonomy. The 13 categories map directly to the index positions in SUBCATEGORY_SCORES and SUBCATEGORY_FLAGS.
What OmniGuard catches
Beyond category classification, OmniGuard is tuned for four jobs that matter most in production AI.
These align to frameworks your auditors already recognize, including the EU AI Act, the OWASP Top 10 for LLM applications, and FINRA. See Compliance and security.
REST moderation API
The reference client exposes one endpoint per modality. This is the simplest way to integrate, and it is identical whether you call the managed API or your own self-hosted client.
Text
curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How to kill a child process in Linux?"}'
A genuinely benign request like this should come back clean. Good moderation reads intent, not keywords.
Image
curl -X POST http://localhost:9000/v1/moderations/image \
  -F "file=@/path/to/image.png"
Audio
curl -X POST http://localhost:9000/v1/moderations/audio \
  -F "file=@/path/to/audio.wav"
Processing .m4a audio needs the ffmpeg binary on the host running the client.
Unified inference API
Under the REST client sits a single Triton endpoint that multiplexes every moderation workflow. You will use this directly when you want batching, multi-turn caching, or tight control over the tensor contract. Two routing fields select the workflow.
REQUEST_TYPE: prompt, response, multiturn, version
MODALITY: text, image, audio
Input schema
All inputs are optional at the tensor level and validated dynamically based on the selected request type.
| Input | Type | Shape | Description |
|---|---|---|---|
| REQUEST_TYPE | STRING | [1] | prompt, response, multiturn, version |
| MODALITY | STRING | [1] | text, image, audio |
| TEXT | STRING | [1] | Text payload or caption / transcript |
| IMAGE | STRING | [1] | Base64-encoded image bytes |
| AUDIO | STRING | [1] | Base64-encoded audio bytes |
| RESPONSE_TEXT | STRING | [1] | Assistant response to evaluate |
| MESSAGES | STRING | [-1, 2] | Conversation turns as [role, content] |
| SESSION_ID | STRING | [1] | KV-cache session identifier |
| CACHE_WRITE | BOOL | [1] | Enables or disables cache writes |
Output schema
Every request returns the consistent verdict contract described in Reading a verdict: SCORE, FLAGGED, SUBCATEGORY_SCORES[13], SUBCATEGORY_FLAGS[13], and CACHE_USAGE[8].
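To make the tensor contract concrete, here is a minimal sketch of a JSON request body for Triton's HTTP inference protocol (the KServe v2 protocol, where string tensors use the `BYTES` datatype). The tensor names come from the input schema above. The `[1]` shape assumes the model is served without an extra batch dimension, which the deployed config.pbtxt would need to confirm, and the helper names are hypothetical.

```python
import json

MODEL_NAME = "realmguard"  # Triton model name used throughout this guide


def string_tensor(name, value):
    """One string input; assumes shape [1] with no extra batch dimension."""
    return {"name": name, "shape": [1], "datatype": "BYTES", "data": [value]}


def build_prompt_text_request(text):
    """JSON body for a single-shot text prompt check."""
    if not text.strip():
        raise ValueError("TEXT must not be blank")
    return {
        "inputs": [
            string_tensor("REQUEST_TYPE", "prompt"),
            string_tensor("MODALITY", "text"),
            string_tensor("TEXT", text),
        ],
        # Ask only for the outputs this caller needs.
        "outputs": [{"name": "SCORE"}, {"name": "FLAGGED"}],
    }


def infer_url(host="localhost", port=8000, model=MODEL_NAME):
    """Inference path defined by the KServe v2 protocol that Triton serves."""
    return f"http://{host}:{port}/v2/models/{model}/infer"
```

POSTing `json.dumps(build_prompt_text_request("..."))` to `infer_url()` is the raw equivalent of what the reference client does for a text request. The `tritonclient` package wraps the same protocol when you prefer typed tensors or gRPC.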
Request type details
Prompt moderation
Set REQUEST_TYPE="prompt" for single-shot moderation of user-provided content. For text, supply MODALITY="text" and TEXT. For image, supply MODALITY="image" and IMAGE, with optional TEXT as a caption. For audio, supply MODALITY="audio" and AUDIO, with optional TEXT as a transcript.
Response moderation
Set REQUEST_TYPE="response" and provide RESPONSE_TEXT. OmniGuard internally constructs a system → user(prompt) → assistant(response) conversation and scores the final evaluation turn with a dedicated response-safety probe. Use it for AI output filtering, response governance, inline generation blocking, and agent or tool response validation.
Multi-turn moderation & KV cache
Set REQUEST_TYPE="multiturn" and pass the full conversation in MESSAGES. The moderation mode follows the final turn: if it is a user turn, OmniGuard applies prompt-style moderation; if it is an assistant turn, it applies response-style moderation.
[ ["system", "..."], ["user", "..."], ["assistant", "..."] ]
Supported roles are system, user, and assistant.
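A client can catch malformed histories before they reach the server. The sketch below checks the [role, content] structure and the three supported roles, then reports which moderation style the final turn implies. `moderation_mode` is a hypothetical helper. Rejecting a trailing system turn is an assumption, since this guide does not say how that case is handled.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def moderation_mode(messages):
    """Validate [role, content] pairs; return the mode implied by the last turn."""
    if not messages:
        raise ValueError("MESSAGES must contain at least one turn")
    for turn in messages:
        if len(turn) != 2:
            raise ValueError("each turn must be a [role, content] pair")
        role, content = turn
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        if not content.strip():
            raise ValueError("message content must not be blank")
    final_role = messages[-1][0]
    if final_role == "user":
        return "prompt"      # prompt-style moderation
    if final_role == "assistant":
        return "response"    # response-style moderation
    # Assumption: a history ending on a system turn is treated as invalid.
    raise ValueError("final turn must be a user or assistant turn")
```

These checks mirror the validation failures listed under Error handling (invalid conversation structure, invalid role, empty content), so a request that passes them locally is less likely to be rejected by the server.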
Stateful KV cache
Long conversations would otherwise re-encode the full history on every check. OmniGuard avoids that with an optional two-tier KV cache keyed by SESSION_ID.
| Cache tier | Location |
|---|---|
| L1 | GPU |
| L2 | CPU |
Always submit the full conversation history. OmniGuard determines incremental cache reuse from the SESSION_ID automatically. The CACHE_USAGE output reports message cache hits and misses, token cache hits and misses, cache writes, and L1 / L2 evictions.
Recommended pattern. For persistent chat and agentic workflows, send the whole history with a stable SESSION_ID and let the cache do the work. You get the safety of full-context evaluation without paying to re-encode it each turn.
Streaming moderation
Models generate responses token by token, but most moderation systems only see a finished output. That gap lets unsafe content reach a user before anything checks it. RealmGuard Stream closes the gap by inspecting partial output as it is generated, so you can truncate or block before the user sees a violation.
How the stream flows
The client opens a stream and sends newly generated text as it arrives. The server processes chunks asynchronously and emits verdicts on a configurable cadence. Each verdict reports the highest sequence number it has fully evaluated, so you always know the exact safe boundary.
Request fields (client to server)
Response fields (server to client)
Example exchange
The client streams output chunks. The server's verdict escalates as more text is processed, moving from clean, to caution, to flagged, then issues a final verdict once the stream closes.
// processed the opening, looks clean
{ "processed_until_seq": 2, "flagged": false, "needs_caution": false, "score": 0.61, "latency_ms": 14.2, "is_final": false }

// more text in, risk rising
{ "processed_until_seq": 4, "flagged": false, "needs_caution": true, "score": 0.67, "latency_ms": 13.9, "is_final": false }

// violation detected before it finished generating
{ "processed_until_seq": 5, "flagged": true, "needs_caution": false, "score": 0.94, "latency_ms": 15.1, "is_final": false }

// final verdict after end_of_stream
{ "processed_until_seq": 6, "flagged": true, "needs_caution": false, "score": 0.94, "latency_ms": 82.6, "is_final": true }
Cadence and the safety boundary
Verdicts are emitted on a server-defined or client-configured cadence, for example every N milliseconds or N chunks. The server may lag behind generation; this is expected. The authoritative safety boundary is always processed_until_seq: only text up to that sequence has been evaluated. When end_of_stream=true arrives, the server finishes any buffered text and emits a final result with is_final=true.
Treat any stream error as terminal. Moderation decisions remain authoritative only up to the last acknowledged processed_until_seq. Retrying means starting a new stream. Error codes are listed in Error handling.
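One way to honor the safety boundary on the client is to hold generated text back until a verdict covers it. The sketch below is one conservative policy, not the SDK's behavior: it releases chunks only up to `processed_until_seq`, and stops releasing for good once a verdict is flagged. The class name is hypothetical, and `needs_caution` is deliberately ignored to keep the example short.

```python
class SafeBoundaryBuffer:
    """Hold generated chunks until a verdict has evaluated them."""

    def __init__(self):
        self._pending = {}   # seq -> text not yet covered by a verdict
        self._last_seq = -1  # highest seq accepted so far
        self.blocked = False

    def add_chunk(self, seq, text):
        """Buffer one generated chunk; seq must strictly increase."""
        if seq <= self._last_seq:
            raise ValueError("seq must be strictly increasing")
        self._last_seq = seq
        self._pending[seq] = text

    def on_verdict(self, verdict):
        """Return text that is now safe to show, or '' if nothing is."""
        if self.blocked:
            return ""
        if verdict["flagged"]:
            self.blocked = True    # stop releasing and drop unshown text
            self._pending.clear()
            return ""
        boundary = verdict["processed_until_seq"]
        ready = sorted(s for s in self._pending if s <= boundary)
        return "".join(self._pending.pop(s) for s in ready)
```

The trade-off is latency: the user sees text only after the server has caught up, so the verdict cadence sets the display delay. A more permissive policy would show text immediately and retract it on a flagged verdict, which is faster but can briefly expose a violation.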
Version fingerprinting
Send REQUEST_TYPE="version" to get SHA-256 hashes of the deployed artifacts. This is how you verify exactly what is running, prove integrity, and track model versions for audit.
Returned artifacts may include the runtime model code, probe weights, subcategory classifiers, and configuration artifacts. In a regulated environment, capture these hashes alongside your deployment records so you can demonstrate which version produced a given verdict.
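For audit, the reported hashes can be compared against the ones captured at deployment time. This guide does not show the shape of the version response, so the sketch below assumes it has already been parsed into a mapping of artifact name to SHA-256 hex string. Both that shape and the helper name `diff_artifact_hashes` are hypothetical.

```python
def diff_artifact_hashes(recorded, reported):
    """Compare two {artifact_name: sha256_hex} mappings.

    Returns the artifact names that changed, disappeared, or newly appeared.
    Hex digests are compared case-insensitively.
    """
    changed = sorted(
        name for name in recorded
        if name in reported and recorded[name].lower() != reported[name].lower()
    )
    missing = sorted(name for name in recorded if name not in reported)
    added = sorted(name for name in reported if name not in recorded)
    return {"changed": changed, "missing": missing, "added": added}
```

An empty result in all three lists means the running deployment matches the recorded one. Anything else is worth investigating before trusting verdicts from that deployment.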
Error handling
Failures are isolated per request inside a Triton batch. One bad request returns a structured error without affecting its neighbors.
Common validation failures
| Error | Cause |
|---|---|
| Missing modality payload | Missing TEXT, IMAGE, or AUDIO |
| Invalid conversation structure | Malformed MESSAGES |
| Invalid role | Unsupported role value |
| Empty content | Blank text or message content |
| Unsupported audio type | Unsupported file extension |
| Sequence overflow | Input exceeds the model context window |
Streaming gRPC status codes
| Status | When it is returned |
|---|---|
INVALID_ARGUMENT | Malformed request stream: non-monotonic or duplicate seq, empty text_chunk, or invalid configuration |
FAILED_PRECONDITION | Inconsistent stream state, such as end of stream before any content or a stream reused after a terminal response |
RESOURCE_EXHAUSTED | Rate or concurrent-stream limits reached, input buffering over bounds, or sustained moderation lag beyond safety thresholds |
UNAVAILABLE | Moderation backend not ready, restarting, or shedding load |
INTERNAL | Unexpected server-side failure during inference or serialization |
CANCELLED | Client terminated the stream early or disconnected |
Partial results emitted before an error remain valid up to the last reported processed_until_seq.
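Because every stream error is terminal, a client needs a policy for what to do next. The sketch below is one possible policy, not a documented one: it treats backend availability and capacity errors as worth a fresh stream after a delay, and request-shape errors as bugs to fix first. The grouping of status codes and both helper names are assumptions.

```python
import random

# Assumption: which statuses are worth retrying is a client policy choice.
RETRY_WITH_NEW_STREAM = {"UNAVAILABLE", "RESOURCE_EXHAUSTED", "INTERNAL"}
FIX_REQUEST_FIRST = {"INVALID_ARGUMENT", "FAILED_PRECONDITION"}


def next_action(status):
    """Map a gRPC status name to a coarse client action."""
    if status in RETRY_WITH_NEW_STREAM:
        return "retry_new_stream"  # open a new stream; never reuse the old one
    if status in FIX_REQUEST_FIRST:
        return "fix_request"       # retrying the same input would fail again
    return "stop"                  # CANCELLED and unknown statuses


def backoff_seconds(attempt, base=0.5, cap=8.0):
    """Exponential backoff with full jitter for attempt 0, 1, 2, ..."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Whatever the policy, text past the last acknowledged `processed_until_seq` has not been evaluated, so a retry should resubmit it on the new stream and not assume it was checked.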
Self-hosted deployment
This is the full walkthrough for running OmniGuard inside your own boundary on NVIDIA Triton. You download the model from the repository provisioned for your account, make it available to a Triton server, and bring up the moderation endpoints. Plan on about 15 to 20 minutes for the initial download.
Prerequisites
Triton image: nvcr.io/nvidia/tritonserver:24.11-py3 or newer
OmniGuard ships in two inference modes. Mode 1 uses Flash Attention 2 and is recommended for faster inference. Mode 2 runs without Flash Attention 2. The download includes both variants.
Air-gapped by design. Once the artifacts are inside your environment, OmniGuard runs with no outbound dependency. The only access it assumes is the GPU through the NVIDIA driver. Nothing about the content you moderate leaves your boundary.
1 · Download the model
Set your provisioned token in the environment, then run the download script. It pulls both model variants, with and without Flash Attention 2, and assembles the Triton model named realmguard.
Set your token
export HF_TOKEN="<your-provisioned-token>"
Run the download script
chmod +x download_realmguard.sh && ./download_realmguard.sh
The script places the model files inside a new models directory. It uses huggingface-cli when available and falls back to wget otherwise.
The extracted model carries its own config.pbtxt, model.py, weights, and Python backend environment. It runs as an independent Triton model. Keep the files as-is; any pipeline that copies them must not alter their contents, or Triton may fail to load the model.
2 · Launch Triton
If you already run Triton, drop the realmguard model into your model repository and restart. To start a fresh server with Docker:
docker run --rm --gpus all \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver \
  --model-repository=/models \
  --model-control-mode=explicit \
  --load-model=realmguard
Triton exposes inference on port 8000 (HTTP) and port 8001 (gRPC). A newer Triton image works too. The model uses a custom model.py with its own execution environment on the Python backend, and assumes only GPU access through the NVIDIA driver.
3 · Run the moderation client
The reference client, realmguard_api.py, implements the Triton client and opens the per-modality moderation endpoints over HTTP. It has minimal dependencies and can run on any machine, including CPU-only, as long as it can reach the Triton HTTP port.
tritonclient[all] fastapi uvicorn python-multipart
# opens /v1/moderations/{text,image,audio}
python realmguard_api.py
The endpoints map one to one with the REST moderation API, so anything you build against the managed API works unchanged against your local client.
4 · Validate
Confirm the full path end to end with a request per modality.
# text
curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How to kill a child process?"}'

# image
curl -X POST http://localhost:9000/v1/moderations/image \
  -F "file=@/path/to/image.png"

# audio
curl -X POST http://localhost:9000/v1/moderations/audio \
  -F "file=@/path/to/audio.wav"
A clean response on each confirms Triton is serving the model and the client is wired up correctly.
Streaming on-prem
Response streaming deploys the same way: a Triton-packaged model with a custom Python backend. The package includes config.pbtxt, model.py, the model weights, and the backend environment, and runs as an independent Triton model. Install the listed Python dependencies in your Triton environment, then use the Realm SDK or the API to open the streaming endpoint.
import asyncio
import json

import websockets


async def test():
    # Adjust the URL to match your deployment.
    async with websockets.connect('ws://localhost:8000/v1/moderations/response-stream') as ws:
        # First chunk: carries the original prompt alongside the generated text.
        await ws.send(json.dumps({
            'stream_id': 'test-123',
            'seq': 0,
            'prompt': 'Hello',
            'response_chunk': 'Hi there',
            'end_of_stream': False
        }))
        print(await ws.recv())

        # Close the stream; the server then emits its final verdict.
        await ws.send(json.dumps({
            'stream_id': 'test-123',
            'seq': 1,
            'response_chunk': '',
            'end_of_stream': True
        }))
        print(await ws.recv())


asyncio.run(test())
You can also drive the stream interactively with tools like websocat or wscat. Adjust the endpoint and port to match your deployment. Streaming requires the original input prompt to be available alongside the generated output.
Operational notes
Processing .m4a files needs the ffmpeg binary on the client host.
Performance & benchmarks
The numbers below are measured on H100 with Flash Attention 2. One model serves every modality, so you are not stitching together separate classifiers with different latency profiles.
Latency
| Workload | Latency |
|---|---|
| Text, under 100 tokens | 19 ms |
| Text, under 256 tokens | 27 ms |
| Text, under 512 tokens | 31 ms |
| Image moderation | ~140 to 170 ms |
| Audio moderation | ~100 to 130 ms |
Text moderation sits near 32 ms median and under 50 ms at p95. Image and audio scale predictably with input size.
Text moderation accuracy
F1 score by dataset, OmniGuard against two widely used open guard models.
| Dataset | OmniGuard | LlamaGuard 4 12B | Qwen3Guard 8B |
|---|---|---|---|
| PolyGuardPrompts | 85.8% | 65.6% | 87.7% |
| OpenAI Moderation | 77.83% | 74.0% | 79.80% |
| LMSYS / ToxicChat | 73.4% | 37.3% | 68.3% |
| Nemotron v3 | 83.42% | 67.8% | 84.9% |
OmniGuard averages roughly 80% F1 across these sets, about 19 points above LlamaGuard 4 and on par with Qwen3Guard, at a fraction of their latency.
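The averages behind that summary can be reproduced from the table. The short script below takes an unweighted mean of the four per-dataset F1 scores for each model. Treating every dataset equally is an assumption about how the summary figure was derived.

```python
# F1 scores copied from the table above, in row order.
F1 = {
    "OmniGuard":        [85.8, 77.83, 73.4, 83.42],
    "LlamaGuard 4 12B": [65.6, 74.0, 37.3, 67.8],
    "Qwen3Guard 8B":    [87.7, 79.80, 68.3, 84.9],
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


means = {model: mean(scores) for model, scores in F1.items()}
gap_vs_llamaguard = means["OmniGuard"] - means["LlamaGuard 4 12B"]
gap_vs_qwen3guard = means["OmniGuard"] - means["Qwen3Guard 8B"]

for model, value in means.items():
    print(f"{model}: {value:.1f}")
print(f"gap vs LlamaGuard 4: {gap_vs_llamaguard:.1f} points")
print(f"gap vs Qwen3Guard: {gap_vs_qwen3guard:.1f} points")
```

The means come out at about 80.1 for OmniGuard, 61.2 for LlamaGuard 4, and 80.2 for Qwen3Guard. That matches the stated gap of roughly 19 points over LlamaGuard 4 and a difference of less than 0.1 points from Qwen3Guard.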
Audio and image accuracy
| Audio dataset | F1 |
|---|---|
| Omnibench | 83.74% |
| AudioTrust | 74.17% |
| Image dataset | F1 |
|---|---|
| Custom dataset | 83.33% |
| UnsafeBench | 61.06% |
Prompt injection
| Dataset | Result |
|---|---|
| AgentDojo | 100% recall |
| Promptfoo | 95.98% recall |
| safeguard / prompt_injections | 99.39% F1 |
| Clean traffic (Salesforce / wikitext) | 0.10% false positive rate |
Near-total recall on injection attacks with a 0.10% false-positive rate on clean traffic, so detection does not come at the cost of blocking legitimate use.
Multilingual
OmniGuard classifies safety across 50+ languages rather than translating to English first. On the PolyGuard multilingual set it holds an overall prompt-moderation F1 near 0.85, with English, German, Spanish, Italian, and Russian among the strongest. Localization keeps accuracy high on lower-resource languages where English-only guards drop off sharply.
Compliance & security
OmniGuard is built to enforce Responsible AI policy and to give your auditors something concrete to point at.
Realm Labs is SOC 2 aligned. For details on certifications, data handling, and an architecture review for your environment, talk to your account team.
Ready to put OmniGuard in front of your AI?
Whether you start on the managed API or self-host on day one, your Realm team will provision access, share the deployment artifacts, and review the integration with you.
Questions on deployment, schemas, or benchmarks? Reach out at support@realmlabs.ai or through your Realm account team.