OmniGuard · Product documentation

Runtime moderation and enforcement for AI

OmniGuard inspects what your AI reads and writes, then enforces your policy in real time. One multimodal model covers text, image, and audio across 50+ languages. Run it as a managed API or self-host it on your own GPUs.

  • ~32 ms: median text moderation on H100
  • 3 modalities: text, image, and audio in one model
  • 50+ languages: localized safety classification
  • 13 categories: MLCommons safety taxonomy

OmniGuard is the enforcement product in the Realm Labs platform. Prism observes your AI in production; OmniGuard acts on it. This guide is everything you need to integrate and deploy OmniGuard yourself, whether you call our endpoint or run the container inside your own boundary.

A note on names. OmniGuard is the product. In deployment artifacts and API responses you will see the runtime referred to as realmguard: the Triton model name, the download script, and the reference client all use it. They are the same engine.

Choose how you deploy

OmniGuard ships in two forms. Both expose the same moderation API, so the integration you write today works whichever path you pick, and you can move from one to the other without rewriting your client.

Two deployment models

Same API surface · pick by where your data needs to live

Option A: OmniGuard API

Realm-hosted endpoint. Send content, get a verdict. No infrastructure to run.

  • Fastest path to a working integration
  • No GPUs, drivers, or Triton to manage
  • Realm handles model updates and scaling
  • Best for getting started and standard traffic

Option B: Self-hosted (Docker / Triton)

Download the model and run it on your own GPUs, in your VPC or fully air-gapped.

  • No content ever leaves your boundary
  • Runs on NVIDIA Triton on your H100s
  • You control versioning and capacity
  • Best for regulated data and on-prem stacks

Which one is right for you

Consideration | OmniGuard API | Self-hosted
Where content is processed | Realm-managed endpoint | Inside your own VPC or on-prem cluster
Infrastructure you run | None | NVIDIA Triton on your GPUs
Data residency | Realm region | Your boundary, air-gap supported
Model updates | Managed for you | You pull versioned artifacts
Time to first call | Minutes | 15 to 20 minutes to download, then deploy
Best fit | Teams that want speed and no ops | Regulated, sensitive, or high-volume workloads

Start managed, move on-prem later. Because the request and response contract is identical, many teams prototype against the API and switch to self-hosted for production without touching their integration code.

Quickstart

Moderate your first request in a few lines. Follow the steps that match your deployment.

OmniGuard API

1. Send text for moderation

Post the content you want checked. The response tells you whether it was flagged and why.

cURL
curl -X POST https://api.realmlabs.ai/v1/moderations/text \
  -H "Authorization: Bearer $REALM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "How do I reset a user password safely?"}'

2. Read the verdict

JSON
{
  "flagged": false,
  "score": 0.018,
  "subcategory_flags": {},
  "latency_ms": 21.4
}
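The same call can be made from Python. The sketch below uses only the standard library and mirrors the cURL request and the JSON verdict shown above. The helper names (`build_text_request`, `moderate_text`, `is_allowed`) are illustrative and not part of any Realm SDK, and treating a missing `flagged` field as "not allowed" is an assumption about how you may want to fail closed.

```python
import json
import urllib.request

API_BASE = "https://api.realmlabs.ai"  # managed endpoint from the cURL example


def build_text_request(text, api_key=None, base_url=API_BASE):
    """Build (but do not send) a POST to /v1/moderations/text."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # the self-hosted example sends no Authorization header
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/moderations/text",
        data=body,
        headers=headers,
        method="POST",
    )


def moderate_text(text, api_key=None, base_url=API_BASE, timeout=10.0):
    """Send the request and return the decoded verdict dictionary."""
    request = build_text_request(text, api_key, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def is_allowed(verdict):
    """Fail closed: allow only when `flagged` is explicitly False (assumed policy)."""
    return verdict.get("flagged") is False
```

For a self-hosted client, pass `base_url="http://localhost:9000"` and omit the key.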

Image and audio use the same pattern at /v1/moderations/image and /v1/moderations/audio. See the REST moderation API for full request and response shapes.

Need an API key? Book a Demo and your Realm team will provision one for your account.

Self-hosted

1. Download, launch, serve

Pull the model, start Triton, and bring up the moderation endpoints. The full walkthrough lives in Self-hosted deployment; here is the short version.

bash
# 1. download the model artifacts (~15 min)
export HF_TOKEN="<your-provisioned-token>"
chmod +x download_realmguard.sh && ./download_realmguard.sh

# 2. launch Triton with the realmguard model
#    (--load-model only takes effect with --model-control-mode=explicit)
docker run --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 \
  -v "$(pwd)/models:/models" nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver --model-repository=/models \
    --model-control-mode=explicit --load-model=realmguard

# 3. start the moderation client (opens /v1/moderations/* on :9000)
python realmguard_api.py

2. Validate

bash
curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How do I reset a user password safely?"}'

Self-hosting needs an NVIDIA GPU (H100 recommended) with driver 550+. Your account team provisions the download token and repository. See prerequisites.


Modalities

OmniGuard moderates three content types through a single model with a shared backbone. You select the modality per request.

Modality | Field | What it checks
Text | text | Plain text: prompts, responses, captions, transcripts
Image | image | Base64-encoded image bytes, with an optional text caption for context
Audio | audio | Base64-encoded audio bytes, with an optional transcript for context

One model for all three keeps deployment simple: a single artifact, a single endpoint, and consistent verdicts across whatever a user types, uploads, or says.

Input preprocessing

Before inference, OmniGuard runs every image and audio input through a standardized preprocessing pipeline. The steps normalize each input into a consistent representation regardless of its original format, resolution, sample rate, or encoding. Text needs no preprocessing of this kind and is moderated directly.

Preprocessing is deterministic. Identical content produces identical features. When the same logical file returns different moderation outcomes across submissions, the difference almost always comes from the source file itself (its encoding, compression artifacts, metadata, or decoder behavior), not from OmniGuard.

Image preprocessing

Every submitted image runs through three steps.

1. Decode and RGB normalization

The image is decoded into a standard 3-channel RGB representation. A few decoding behaviors are worth knowing, because they are the usual reason two "identical" images score differently.

Lossy compression
JPEG images are decoded exactly as stored. Because JPEG uses lossy techniques like DCT quantization and chroma subsampling, decoded pixels are generally not identical to a lossless source. Converting a PNG to JPEG, or re-saving a JPEG several times, shifts pixel values.
Transparency (alpha)
The alpha channel is removed with no compositing against a background. Transparent pixels keep their stored RGB values rather than being flattened against white, black, or any other color.
ICC color profiles
Embedded color profiles are not applied. Pixels are interpreted directly as stored, so two files with identical visual content but different profiles can decode to different RGB values.
EXIF orientation
Orientation metadata is not applied. Rotations and flips recorded in EXIF are ignored and the raw stored pixel arrangement is used. Bake orientation into the pixels before upload for consistent behavior.

2. Aspect-ratio preserving resize

If the longest edge is larger than 512 pixels, the image is downscaled with bicubic interpolation so the longest edge becomes exactly 512 pixels, preserving aspect ratio. If the longest edge is already 512 pixels or smaller, no resizing happens. The step is fully deterministic: identical input pixels at identical dimensions always produce identical output.
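To predict the dimensions an image will have after this step, the rule can be sketched as below. The function name is hypothetical, and rounding the shorter edge to the nearest pixel is an assumption: the text above only fixes the longest edge at exactly 512.

```python
MAX_EDGE = 512  # longest-edge limit stated above


def resized_dimensions(width, height, max_edge=MAX_EDGE):
    """Return (width, height) after the aspect-ratio preserving resize.

    Rounding of the shorter edge is an assumption made for illustration.
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough: no resize
    scale = max_edge / longest
    if width >= height:
        return max_edge, max(1, round(height * scale))
    return max(1, round(width * scale)), max_edge


print(resized_dimensions(1024, 768))  # (512, 384)
print(resized_dimensions(512, 300))   # (512, 300), unchanged
```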

3. Model processor

The normalized RGB image is passed to the model's native image processor, which handles tensor conversion, pixel normalization, tiling and patch generation, and model-specific feature preparation. OmniGuard does not modify or override these model-native steps.

For reproducible image results: use a lossless format (PNG preferred), bake orientation into the pixels, avoid repeated JPEG re-encoding, and strip unnecessary metadata before upload.

Audio preprocessing

OmniGuard supports common audio containers and codecs, including MP3, WAV, FLAC, OGG, and M4A. Additional formats may work depending on the decoder stack. Every submitted file runs through four steps.

1. Decode, resample, and downmix

Audio is decoded into a waveform, resampled to 16 kHz regardless of the original sample rate, and downmixed to mono.

2. Length normalization

OmniGuard analyzes audio in a fixed 300-second (5-minute) window. Clips shorter than 300 seconds are padded with silence; clips longer than 300 seconds are truncated to the first 300 seconds. This is deterministic and gives every input an identical duration.
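As a rough illustration of this step, the sketch below pads or truncates a list of samples to a fixed window. Appending the silence at the end of the clip is an assumption, since the text above does not say where padding is placed, and `normalize_length` is a hypothetical name.

```python
SAMPLE_RATE_HZ = 16_000  # resample target from step 1
WINDOW_SECONDS = 300     # fixed analysis window


def normalize_length(samples, sample_rate=SAMPLE_RATE_HZ, seconds=WINDOW_SECONDS):
    """Pad with silence (zeros, appended at the end) or truncate to the window."""
    target = sample_rate * seconds
    if len(samples) >= target:
        return list(samples[:target])  # keep only the first `target` samples
    return list(samples) + [0.0] * (target - len(samples))


print(SAMPLE_RATE_HZ * WINDOW_SECONDS)  # 4800000 samples in the full window
```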

3. Log-mel spectrogram

The 16 kHz mono waveform is converted into a log-mel spectrogram with the parameters below, using standard log-magnitude scaling and normalization. No dithering or artificial noise is added, so identical waveforms always produce identical features.

Parameter | Value
Mel-frequency bins | 128
Analysis window | 25 ms
Hop size | 10 ms
Sample rate | 16 kHz, mono
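These parameters translate into sample counts as follows. The frame count assumes no padding is added at the clip edges, which the text does not specify, so treat the last number as an estimate rather than the model's exact feature length.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW_MS = 25
HOP_MS = 10
CLIP_SECONDS = 300

window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # samples per analysis window
hop_samples = SAMPLE_RATE_HZ * HOP_MS // 1000        # samples between frame starts
total_samples = SAMPLE_RATE_HZ * CLIP_SECONDS        # samples in the 300 s window


def frame_count(n_samples, window, hop):
    """Frames that fit when no padding is added at the clip edges (assumed)."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


frames = frame_count(total_samples, window_samples, hop_samples)
print(window_samples, hop_samples, frames)  # 400 160 29998
```

Under that assumption, each clip becomes roughly 30,000 frames of 128 mel bins.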

4. Model processing

The spectrogram features are passed directly to the moderation model. OmniGuard does not modify or override the model's native feature processing beyond the steps described here.

For reproducible audio results: submit lossless audio when you can, ideally 16 kHz mono WAV or FLAC. Lossy formats like MP3, AAC, and OGG permanently alter the waveform during compression, and different decoders can introduce small numerical differences. Converting WAV to MP3 and back will not reproduce the original waveform exactly, even when it sounds identical.

Request types

A request type tells OmniGuard which moderation workflow to run. It is set with the REQUEST_TYPE field and defaults to prompt when omitted.

Request type | Moderates | Typical use
prompt | User-provided content | Input safety, file uploads, voice assistants
response | An LLM response, in the context of its prompt | Output filtering, inline generation blocking
multiturn | A full chat conversation | Persistent chat and agentic workflows
version | Nothing; returns deployed artifact hashes | Deployment verification and audit

Prompt moderation is single-shot. Response moderation builds an internal system → user → assistant conversation and scores the final turn with a dedicated response-safety probe. Multi-turn moderation evaluates an entire history and is covered in Multi-turn moderation.

Reading a verdict

Every moderation request returns the same contract, so your handling code stays consistent across modalities and request types.

Output | Type | Shape | Meaning
SCORE | FP32 | [1] | Moderation probability for the content
FLAGGED | BOOL | [1] | Threshold decision: did it cross your policy line
SUBCATEGORY_SCORES | FP32 | [13] | Per-category probabilities across the taxonomy
SUBCATEGORY_FLAGS | BOOL | [13] | Per-category threshold decisions
CACHE_USAGE | INT32 | [8] | KV-cache statistics for multi-turn requests

Subcategory outputs populate only when a request is flagged and the relevant subcategory probe is loaded, so a clean request stays cheap. Use SCORE when you want to set your own threshold, and FLAGGED when you want OmniGuard's default decision.
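The choice between SCORE and FLAGGED can be sketched as a small helper. It reads the lowercase `score` and `flagged` keys from the REST verdict shown in the Quickstart. The function name and the `>=` comparison are assumptions for illustration.

```python
def should_block(verdict, threshold=None):
    """Return True when the content should be blocked.

    With no threshold, defer to OmniGuard's default FLAGGED decision;
    otherwise compare SCORE against the caller's own threshold.
    """
    if threshold is None:
        return bool(verdict["flagged"])
    return verdict["score"] >= threshold


clean = {"flagged": False, "score": 0.018}
print(should_block(clean))                  # False, default decision
print(should_block(clean, threshold=0.01))  # True, stricter custom threshold
```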

Moderation taxonomy

OmniGuard classifies content against the MLCommons safety taxonomy. The 13 categories map directly to the index positions in SUBCATEGORY_SCORES and SUBCATEGORY_FLAGS.

Index | Category
0 | Violent Crimes
1 | Non-Violent Crimes
2 | Sex-Related Crimes
3 | Child Sexual Exploitation
4 | Defamation
5 | Specialized Advice
6 | Indiscriminate Weapons (CBRNE)
7 | Hate
8 | Privacy
9 | Intellectual Property
10 | Sexual Content
11 | Suicide & Self-Harm
12 | Cybersecurity
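Because the index positions are fixed, turning SUBCATEGORY_FLAGS into readable names is a simple lookup. The sketch below assumes you already have the 13 flags as a list of booleans; `flagged_categories` is a hypothetical helper name.

```python
CATEGORIES = [
    "Violent Crimes",
    "Non-Violent Crimes",
    "Sex-Related Crimes",
    "Child Sexual Exploitation",
    "Defamation",
    "Specialized Advice",
    "Indiscriminate Weapons (CBRNE)",
    "Hate",
    "Privacy",
    "Intellectual Property",
    "Sexual Content",
    "Suicide & Self-Harm",
    "Cybersecurity",
]


def flagged_categories(subcategory_flags):
    """Map a 13-element flag list to the names of the flagged categories."""
    if len(subcategory_flags) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} flags, got {len(subcategory_flags)}")
    return [name for name, hit in zip(CATEGORIES, subcategory_flags) if hit]


flags = [False] * 13
flags[7] = True
flags[12] = True
print(flagged_categories(flags))  # ['Hate', 'Cybersecurity']
```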

What OmniGuard catches

Beyond category classification, OmniGuard is tuned for four jobs that matter most in production AI.

Safety moderation
Universal harmfulness detection across the 13-category taxonomy, for prompts and responses.
Prompt injection
Detection of jailbreaks and injection attempts, including agentic tool-use attacks, with a very low false-positive rate on clean traffic.
CBRN
Detection and moderation of chemical, biological, radiological, and nuclear content.
PII
Classification of personally identifiable information in text, so you can redact or block before it leaks.

These align to frameworks your auditors already recognize, including the EU AI Act, the OWASP Top 10 for LLM applications, and FINRA. See Compliance and security.


REST moderation API

The reference client exposes one endpoint per modality. This is the simplest way to integrate, and it is identical whether you call the managed API or your own self-hosted client.

POST /v1/moderations/text
POST /v1/moderations/image
POST /v1/moderations/audio

Text

cURL
curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How to kill a child process in Linux?"}'

A genuinely benign request like this should come back clean. Good moderation reads intent, not keywords.

Image

cURL
curl -X POST http://localhost:9000/v1/moderations/image \
  -F "file=@/path/to/image.png"

Audio

cURL
curl -X POST http://localhost:9000/v1/moderations/audio \
  -F "file=@/path/to/audio.wav"

Processing .m4a audio needs the ffmpeg binary on the host running the client.

Unified inference API

Under the REST client sits a single Triton endpoint that multiplexes every moderation workflow. You will use this directly when you want batching, multi-turn caching, or tight control over the tensor contract. Two routing fields select the workflow.

REQUEST_TYPE
Selects the moderation workflow: prompt, response, multiturn, version
MODALITY
Selects the content modality: text, image, audio

Input schema

All inputs are optional at the tensor level and validated dynamically based on the selected request type.

Input | Type | Shape | Description
REQUEST_TYPE | STRING | [1] | prompt, response, multiturn, version
MODALITY | STRING | [1] | text, image, audio
TEXT | STRING | [1] | Text payload or caption / transcript
IMAGE | STRING | [1] | Base64-encoded image bytes
AUDIO | STRING | [1] | Base64-encoded audio bytes
RESPONSE_TEXT | STRING | [1] | Assistant response to evaluate
MESSAGES | STRING | [-1, 2] | Conversation turns as [role, content]
SESSION_ID | STRING | [1] | KV-cache session identifier
CACHE_WRITE | BOOL | [1] | Enables or disables cache writes

Output schema

Every request returns the consistent verdict contract described in Reading a verdict: SCORE, FLAGGED, SUBCATEGORY_SCORES[13], SUBCATEGORY_FLAGS[13], and CACHE_USAGE[8].

Request type details

Prompt moderation

Set REQUEST_TYPE="prompt" for single-shot moderation of user-provided content. For text, supply MODALITY="text" and TEXT. For image, supply MODALITY="image" and IMAGE, with optional TEXT as a caption. For audio, supply MODALITY="audio" and AUDIO, with optional TEXT as a transcript.
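If you call Triton directly over HTTP, a prompt request might be assembled as below. This is a sketch under assumptions: it follows Triton's standard HTTP inference protocol, where STRING tensors use the `BYTES` datatype and requests go to `/v2/models/<model>/infer`. The `[1]` shapes are copied from the input schema. If your deployment's model is configured with a batch dimension, the shapes may need to be `[1, 1]`. The helper names are hypothetical.

```python
import base64

MODEL_NAME = "realmguard"
INFER_PATH = f"/v2/models/{MODEL_NAME}/infer"  # standard Triton HTTP inference route


def string_tensor(name, value):
    """One-element STRING tensor; the HTTP protocol names this datatype BYTES."""
    return {"name": name, "shape": [1], "datatype": "BYTES", "data": [value]}


def prompt_payload(modality, text=None, media_bytes=None):
    """Build the JSON body for REQUEST_TYPE="prompt" in the given modality."""
    if modality not in ("text", "image", "audio"):
        raise ValueError(f"unsupported modality: {modality!r}")
    inputs = [
        string_tensor("REQUEST_TYPE", "prompt"),
        string_tensor("MODALITY", modality),
    ]
    if modality == "text":
        if not text:
            raise ValueError("TEXT is required for text moderation")
    else:
        if not media_bytes:
            raise ValueError(f"{modality.upper()} payload is required")
        encoded = base64.b64encode(media_bytes).decode("ascii")
        inputs.append(string_tensor(modality.upper(), encoded))
    if text:  # payload for text, optional caption or transcript otherwise
        inputs.append(string_tensor("TEXT", text))
    return {"inputs": inputs}
```

You would POST the returned dictionary as JSON to `INFER_PATH` on the Triton HTTP port.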

Response moderation

Set REQUEST_TYPE="response" and provide RESPONSE_TEXT. OmniGuard internally constructs a system → user(prompt) → assistant(response) conversation and scores the final evaluation turn with a dedicated response-safety probe. Use it for AI output filtering, response governance, inline generation blocking, and agent or tool response validation.

Multi-turn moderation & KV cache

Set REQUEST_TYPE="multiturn" and pass the full conversation in MESSAGES. The moderation mode follows the final turn: if it is a user turn, OmniGuard applies prompt-style moderation; if it is an assistant turn, it applies response-style moderation.

conversation format
[
  ["system", "..."],
  ["user", "..."],
  ["assistant", "..."]
]

Supported roles are system, user, and assistant.
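A client-side check of the conversation before it is sent can catch the role and empty-content failures early. The sketch below also builds a MESSAGES tensor with shape `[n, 2]`. The flattened row-major `data` layout and the `BYTES` datatype follow Triton's HTTP protocol and are assumptions about how you call the model. Both function names are hypothetical.

```python
SUPPORTED_ROLES = {"system", "user", "assistant"}


def messages_tensor(messages):
    """Validate [role, content] pairs and build the MESSAGES input tensor."""
    if not messages:
        raise ValueError("MESSAGES must contain at least one turn")
    flat = []
    for role, content in messages:
        if role not in SUPPORTED_ROLES:
            raise ValueError(f"invalid role: {role!r}")
        if not content.strip():
            raise ValueError("empty message content")
        flat.extend([role, content])  # row-major: role, content, role, content, ...
    return {
        "name": "MESSAGES",
        "shape": [len(messages), 2],
        "datatype": "BYTES",
        "data": flat,
    }


def moderation_mode(messages):
    """Mode implied by the final turn; None when the text above does not say."""
    last_role = messages[-1][0]
    if last_role == "user":
        return "prompt"
    if last_role == "assistant":
        return "response"
    return None
```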

Stateful KV cache

Long conversations would otherwise re-encode the full history on every check. OmniGuard avoids that with an optional two-tier KV cache keyed by SESSION_ID.

Cache tier | Location
L1 | GPU
L2 | CPU

Always submit the full conversation history. OmniGuard determines incremental cache reuse from the SESSION_ID automatically. The CACHE_USAGE output reports message cache hits and misses, token cache hits and misses, cache writes, and L1 / L2 evictions.

Recommended pattern. For persistent chat and agentic workflows, send the whole history with a stable SESSION_ID and let the cache do the work. You get the safety of full-context evaluation without paying to re-encode it each turn.

Streaming moderation

Models generate responses token by token, but most moderation systems only see a finished output. That gap lets unsafe content reach a user before anything checks it. RealmGuard Stream closes the gap by inspecting partial output as it is generated, so you can truncate or block before the user sees a violation.

Protocol
gRPC, bidirectional streaming (Triton native ModelStreamInfer)
Session
One gRPC stream per generation session
Version
v1-alpha
Scope
Moderates model output only; does not modify or generate text

How the stream flows

The client opens a stream and sends newly generated text as it arrives. The server processes chunks asynchronously and emits verdicts on a configurable cadence. Each verdict reports the highest sequence number it has fully evaluated, so you always know the exact safe boundary.

Request fields (client to server)

stream_id
Optional client-defined identifier for correlation
seq
Required, monotonically increasing sequence number for ordering
text_chunk
Required, the newly generated text since the last message
end_of_stream
Optional bool, signals no further text will be sent
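The request fields above can be sketched as a small generator. Starting `seq` at 1 is an assumption, as is marking the last non-empty chunk with `end_of_stream` rather than sending a separate empty message. The second choice reflects the error table, which lists an empty `text_chunk` as an INVALID_ARGUMENT cause. For simplicity the sketch takes a complete list of chunks, whereas a live client would build one message at a time as text is generated.

```python
def chunk_messages(chunks, stream_id=None, first_seq=1):
    """Yield one request dict per non-empty chunk, marking the last as final."""
    pending = [chunk for chunk in chunks if chunk]  # skip empty chunks
    for offset, chunk in enumerate(pending):
        message = {"seq": first_seq + offset, "text_chunk": chunk}
        if stream_id is not None:
            message["stream_id"] = stream_id
        if offset == len(pending) - 1:
            message["end_of_stream"] = True
        yield message


for message in chunk_messages(["Hi", "", " there"], stream_id="demo"):
    print(message)
```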

Response fields (server to client)

processed_until_seq
Highest sequence number fully processed and evaluated
flagged
Unsafe content detected within the processed prefix
needs_caution
Elevated risk, not yet a hard violation
score
Optional confidence score for the processed prefix
latency_ms
Processing latency for the evaluated prefix
is_final
True on the definitive result after end of stream

Example exchange

The client streams output chunks. The server's verdict escalates as more text is processed, moving from clean, to caution, to flagged, then issues a final verdict once the stream closes.

server → client (verdicts on cadence)
// processed the opening, looks clean
{ "processed_until_seq": 2, "flagged": false, "needs_caution": false, "score": 0.61, "latency_ms": 14.2, "is_final": false }

// more text in, risk rising
{ "processed_until_seq": 4, "flagged": false, "needs_caution": true, "score": 0.67, "latency_ms": 13.9, "is_final": false }

// violation detected before it finished generating
{ "processed_until_seq": 5, "flagged": true, "needs_caution": false, "score": 0.94, "latency_ms": 15.1, "is_final": false }

// final verdict after end_of_stream
{ "processed_until_seq": 6, "flagged": true, "needs_caution": false, "score": 0.94, "latency_ms": 82.6, "is_final": true }

Cadence and the safety boundary

Verdicts are emitted on a server-defined or client-configured cadence, for example every N milliseconds or N chunks. The server may lag behind generation; this is expected. The authoritative safety boundary is always processed_until_seq: only text up to that sequence has been evaluated. When end_of_stream=true arrives, the server finishes any buffered text and emits a final result with is_final=true.

Treat any stream error as terminal. Moderation decisions remain authoritative only up to the last acknowledged processed_until_seq. Retrying means starting a new stream. Error codes are listed in Error handling.
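One way to honor the safety boundary on the client is to hold generated text until a verdict covers it. The sketch below is a minimal illustration, not the Realm SDK. The class name is hypothetical, and discarding all unreleased text once a verdict is flagged is an assumed policy.

```python
class SafeBoundaryBuffer:
    """Hold generated chunks until a verdict's processed_until_seq covers them."""

    def __init__(self):
        self._pending = {}   # seq -> text not yet cleared by a verdict
        self.blocked = False

    def add_chunk(self, seq, text):
        """Record a chunk that was sent for moderation but not yet shown."""
        self._pending[seq] = text

    def apply_verdict(self, verdict):
        """Return the text that is now safe to show, in seq order."""
        if self.blocked:
            return ""
        if verdict["flagged"]:
            self.blocked = True      # assumed policy: stop and drop the rest
            self._pending.clear()
            return ""
        boundary = verdict["processed_until_seq"]
        ready = sorted(seq for seq in self._pending if seq <= boundary)
        return "".join(self._pending.pop(seq) for seq in ready)


buffer = SafeBoundaryBuffer()
for seq, text in [(1, "Sure, "), (2, "here is "), (3, "how...")]:
    buffer.add_chunk(seq, text)
print(repr(buffer.apply_verdict({"processed_until_seq": 2, "flagged": False})))  # 'Sure, here is '
print(repr(buffer.apply_verdict({"processed_until_seq": 3, "flagged": True})))   # ''
```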

Version fingerprinting

Send REQUEST_TYPE="version" to get SHA-256 hashes of the deployed artifacts. This is how you verify exactly what is running, prove integrity, and track model versions for audit.

Returned artifacts may include the runtime model code, probe weights, subcategory classifiers, and configuration artifacts. In a regulated environment, capture these hashes alongside your deployment records so you can demonstrate which version produced a given verdict.
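For an audit record, you may want to compare the reported hashes with the files on disk. The sketch below assumes the version response can be reduced to a mapping of artifact name to hex digest. That shape is hypothetical, as are the function names; only the SHA-256 algorithm comes from the text above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(reported, local_paths):
    """Return artifact names whose local hash differs from the reported one.

    reported:    {artifact_name: hex_digest}  (assumed response shape)
    local_paths: {artifact_name: path_on_disk}
    """
    return sorted(
        name
        for name, path in local_paths.items()
        if reported.get(name, "").lower() != sha256_of_file(path)
    )
```

An empty list from `find_mismatches` means every local artifact matches what the server reports.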

Error handling

Failures are isolated per request inside a Triton batch. One bad request returns a structured error without affecting its neighbors.

Common validation failures

Error | Cause
Missing modality payload | Missing TEXT, IMAGE, or AUDIO
Invalid conversation structure | Malformed MESSAGES
Invalid role | Unsupported role value
Empty content | Blank text or message content
Unsupported audio type | Unsupported file extension
Sequence overflow | Input exceeds the model context window

Streaming gRPC status codes

Status | When it is returned
INVALID_ARGUMENT | Malformed request stream: non-monotonic or duplicate seq, empty text_chunk, or invalid configuration
FAILED_PRECONDITION | Inconsistent stream state, such as end of stream before any content or a stream reused after a terminal response
RESOURCE_EXHAUSTED | Rate or concurrent-stream limits reached, input buffering over bounds, or sustained moderation lag beyond safety thresholds
UNAVAILABLE | Moderation backend not ready, restarting, or shedding load
INTERNAL | Unexpected server-side failure during inference or serialization
CANCELLED | Client terminated the stream early or disconnected

Partial results emitted before an error remain valid up to the last reported processed_until_seq.
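Since every stream error is terminal, a retry always means opening a new stream. Which statuses are worth retrying is a client decision. The sketch below treats UNAVAILABLE and RESOURCE_EXHAUSTED as transient and everything else as final. That classification, the backoff values, and the function name are assumptions, not documented OmniGuard behavior.

```python
# Statuses that suggest a transient condition (assumed classification).
RETRY_WITH_NEW_STREAM = {"UNAVAILABLE", "RESOURCE_EXHAUSTED"}


def next_action(status, attempt, max_attempts=3, base_delay=0.5):
    """Return ("retry", delay_seconds) or ("fail", None) for a stream error.

    `attempt` counts retries already made, starting at 0. The delay doubles
    on each attempt (exponential backoff).
    """
    if status in RETRY_WITH_NEW_STREAM and attempt < max_attempts:
        return "retry", base_delay * (2 ** attempt)
    return "fail", None


print(next_action("UNAVAILABLE", attempt=0))       # ('retry', 0.5)
print(next_action("INVALID_ARGUMENT", attempt=0))  # ('fail', None)
```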


Self-hosted deployment

This is the full walkthrough for running OmniGuard inside your own boundary on NVIDIA Triton. You download the model from the repository provisioned for your account, make it available to a Triton server, and bring up the moderation endpoints. Plan on about 15 to 20 minutes for the initial download.

Prerequisites

GPU
NVIDIA GPU with driver 550+. H100 recommended.
Server
NVIDIA Triton Inference Server (Docker image nvcr.io/nvidia/tritonserver:24.11-py3 or newer)
Client host
Any machine with network access to the Triton HTTP port. The client itself can run on CPU, including a Mac.
Credentials
The download token and repository provisioned for your account

OmniGuard ships in two inference modes. Mode 1 uses Flash Attention 2 and is recommended for faster inference. Mode 2 runs without Flash Attention 2. The download includes both variants.

Air-gapped by design. Once the artifacts are inside your environment, OmniGuard runs with no outbound dependency. The only access it assumes is the GPU through the NVIDIA driver. Nothing about the content you moderate leaves your boundary.

1 · Download the model

Set your provisioned token in the environment, then run the download script. It pulls both model variants, with and without Flash Attention 2, and assembles the Triton model named realmguard.

a. Set your token

bash
export HF_TOKEN="<your-provisioned-token>"

b. Run the download script

bash
chmod +x download_realmguard.sh && ./download_realmguard.sh

The script places the model files inside a new models directory. It uses huggingface-cli when available and falls back to wget otherwise.

The extracted model carries its own config.pbtxt, model.py, weights, and Python backend environment. It runs as an independent Triton model. Keep the files as-is; any pipeline that copies them must not alter their contents, or Triton may fail to load the model.

2 · Launch Triton

If you already run Triton, drop the realmguard model into your model repository and restart. To start a fresh server with Docker:

bash
docker run --rm --gpus all \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver \
    --model-repository=/models \
    --model-control-mode=explicit \
    --load-model=realmguard

Triton exposes inference on port 8000 (HTTP) and port 8001 (gRPC); port 8002 serves metrics. A newer Triton image works too. The model uses a custom model.py with its own execution environment on the Python backend, and assumes only GPU access through the NVIDIA driver.

3 · Run the moderation client

The reference client, realmguard_api.py, implements the Triton client and opens the per-modality moderation endpoints over HTTP. It has minimal dependencies and can run on any machine, including CPU-only, as long as it can reach the Triton HTTP port.

requirements.txt
tritonclient[all]
fastapi
uvicorn
python-multipart

bash
# install the client dependencies listed above
pip install -r requirements.txt

# opens /v1/moderations/{text,image,audio}
python realmguard_api.py

The endpoints map one to one with the REST moderation API, so anything you build against the managed API works unchanged against your local client.

4 · Validate

Confirm the full path end to end with a request per modality.

bash
# text
curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How to kill a child process?"}'

# image
curl -X POST http://localhost:9000/v1/moderations/image \
  -F "file=@/path/to/image.png"

# audio
curl -X POST http://localhost:9000/v1/moderations/audio \
  -F "file=@/path/to/audio.wav"

A clean response on each confirms Triton is serving the model and the client is wired up correctly.

Streaming on-prem

Response streaming deploys the same way: a Triton-packaged model with a custom Python backend. The package includes config.pbtxt, model.py, the model weights, and the backend environment, and runs as an independent Triton model. Install the listed Python dependencies in your Triton environment, then use the Realm SDK or the API to open the streaming endpoint.

python · quick stream test
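import asyncio
import json

import websockets  # third-party: pip install websockets

# Adjust the host, port, and path to match your deployment.
STREAM_URL = 'ws://localhost:8000/v1/moderations/response-stream'


async def test():
    async with websockets.connect(STREAM_URL) as ws:
        # First message: the original prompt plus the first generated chunk.
        await ws.send(json.dumps({
            'stream_id': 'test-123', 'seq': 0,
            'prompt': 'Hello', 'response_chunk': 'Hi there',
            'end_of_stream': False
        }))
        print(await ws.recv())  # wait for a verdict

        # Final message: close the stream, then read the closing verdict.
        await ws.send(json.dumps({
            'stream_id': 'test-123', 'seq': 1,
            'response_chunk': '', 'end_of_stream': True
        }))
        print(await ws.recv())

asyncio.run(test())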

You can also drive the stream interactively with tools like websocat or wscat. Adjust the endpoint and port to match your deployment. Streaming requires the original input prompt to be available alongside the generated output.

Operational notes

Audio codecs
Processing .m4a files needs the ffmpeg binary on the client host.
Client placement
The client only needs the Triton HTTP port (default 8000). Run it wherever is convenient, including on a laptop during testing.
Artifact integrity
Keep the extracted model files unchanged. Pipelines should present them as-is.
Audit
Use version fingerprinting to record exactly which artifacts are live.

Performance & benchmarks

The numbers below are measured on H100 with Flash Attention 2. One model serves every modality, so you are not stitching together separate classifiers with different latency profiles.

Latency

Workload | Latency
Text, under 100 tokens | 19 ms
Text, under 256 tokens | 27 ms
Text, under 512 tokens | 31 ms
Image moderation | ~140 to 170 ms
Audio moderation | ~100 to 130 ms

Text moderation sits near 32 ms median and under 50 ms at p95. Image and audio scale predictably with input size.

Text moderation accuracy

F1 score by dataset, OmniGuard against two widely used open guard models.

Dataset | OmniGuard | LlamaGuard 4 12B | Qwen3Guard 8B
PolyGuardPrompts | 85.8% | 65.6% | 87.7%
OpenAI Moderation | 77.83% | 74.0% | 79.80%
LMSYS / ToxicChat | 73.4% | 37.3% | 68.3%
Nemotron v3 | 83.42% | 67.8% | 84.9%

OmniGuard averages roughly 80% F1 across these sets, about 19 points above LlamaGuard 4 and on par with Qwen3Guard, at a fraction of their latency.
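The averages quoted here can be reproduced from the table with a few lines of arithmetic. The sketch takes an unweighted mean over the four datasets, which is an assumption about how the summary figure was computed.

```python
# F1 scores (%) in table order: PolyGuardPrompts, OpenAI Moderation,
# LMSYS / ToxicChat, Nemotron v3.
F1_SCORES = {
    "OmniGuard": [85.8, 77.83, 73.4, 83.42],
    "LlamaGuard 4 12B": [65.6, 74.0, 37.3, 67.8],
    "Qwen3Guard 8B": [87.7, 79.80, 68.3, 84.9],
}

means = {model: sum(scores) / len(scores) for model, scores in F1_SCORES.items()}
gap_vs_llamaguard = means["OmniGuard"] - means["LlamaGuard 4 12B"]

for model, mean in means.items():
    print(f"{model}: {mean:.1f}")
print(f"OmniGuard minus LlamaGuard 4: {gap_vs_llamaguard:.1f} points")
```

This gives about 80.1 for OmniGuard, 61.2 for LlamaGuard 4, and 80.2 for Qwen3Guard, a gap of roughly 18.9 points over LlamaGuard 4.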

Audio and image accuracy

Audio dataset | F1
Omnibench | 83.74%
AudioTrust | 74.17%

Image dataset | F1
Custom dataset | 83.33%
UnsafeBench | 61.06%

Prompt injection

Dataset | Result
AgentDojo | 100% recall
Promptfoo | 95.98% recall
safeguard / prompt_injections | 99.39% F1
Clean traffic (Salesforce / wikitext) | 0.10% false positive rate

The result is near-total recall on injection attacks with a 0.10% false-positive rate on clean traffic, so detection does not come at the cost of blocking legitimate use.

Multilingual

OmniGuard classifies safety across 50+ languages rather than translating to English first. On the PolyGuard multilingual set it holds an overall prompt-moderation F1 near 0.85, with English, German, Spanish, Italian, and Russian among the strongest. Localization keeps accuracy high on lower-resource languages where English-only guards drop off sharply.

Compliance & security

OmniGuard is built to enforce Responsible AI policy and to give your auditors something concrete to point at.

Regulatory alignment
Detection mapped to the EU AI Act, the OWASP Top 10 for LLM applications, and FINRA. Categories follow the MLCommons safety taxonomy.
Custom policy
Tune thresholds and category behavior to your organization's rules rather than a fixed global setting.
Data residency
Self-host in your VPC, your on-prem cluster, or fully air-gapped. With self-hosting, no content leaves your boundary.
Auditability
SHA-256 artifact fingerprinting records exactly which model version produced each verdict.
Real-time enforcement
Beyond detection, OmniGuard supports inline actions: block, redact, and reroute, on prompts and on responses as they generate.

Realm Labs is SOC 2 aligned. For details on certifications, data handling, and an architecture review for your environment, talk to your account team.

Ready to put OmniGuard in front of your AI?

Whether you start on the managed API or self-host on day one, your Realm team will provision access, share the deployment artifacts, and review the integration with you.

Questions on deployment, schemas, or benchmarks? Reach out at support@realmlabs.ai or through your Realm account team.