OmniGuard · Product documentation

Runtime moderation and enforcement for AI

OmniGuard inspects what your AI reads and writes, then enforces your policy in real time. One multimodal model covers text, image, and audio across 50+ languages. Run it as a managed API or self-host it on your own GPUs.

~32 ms	Median text moderation on H100
3 modalities	Text, image, audio in one model
50+ languages	Localized safety classification
13 categories	MLCommons safety taxonomy

OmniGuard is the enforcement product in the Realm Labs platform. Prism observes your AI in production; OmniGuard acts on it. This guide is everything you need to integrate and deploy OmniGuard yourself, whether you call our endpoint or run the container inside your own boundary.

A note on names. OmniGuard is the product. In deployment artifacts and API responses you will see the runtime referred to as realmguard: the Triton model name, the download script, and the reference client all use it. They are the same engine.

Choose how you deploy

OmniGuard ships in two forms. Both expose the same moderation API, so the integration you write today works whichever path you pick, and you can move from one to the other without rewriting your client.

Two deployment models

Same API surface · pick by where your data needs to live

Option A

OmniGuard API

Realm-hosted endpoint. Send content, get a verdict. No infrastructure to run.

Fastest path to a working integration
No GPUs, drivers, or Triton to manage
Realm handles model updates and scaling
Best for getting started and standard traffic

API quickstart · managed

Option B

Self-hosted (Docker / Triton)

Download the model and run it on your own GPUs, in your VPC or fully air-gapped.

No content ever leaves your boundary
Runs on NVIDIA Triton on your H100s
You control versioning and capacity
Best for regulated data and on-prem stacks

Self-host guide · on-prem · air-gap

Which one is right for you

	OmniGuard API	Self-hosted
Where content is processed	Realm-managed endpoint	Inside your own VPC or on-prem cluster
Infrastructure you run	None	NVIDIA Triton on your GPUs
Data residency	Realm region	Your boundary, air-gap supported
Model updates	Managed for you	You pull versioned artifacts
Time to first call	Minutes	15 to 20 minutes to download, then deploy
Best fit	Teams that want speed and no ops	Regulated, sensitive, or high-volume workloads

Start managed, move on-prem later. Because the request and response contract is identical, many teams prototype against the API and switch to self-hosted for production without touching their integration code.

Quickstart

Moderate your first request in a few lines. Pick the path that matches your deployment: the managed API steps come first, followed by the self-hosted steps.

1. Send text for moderation

Post the content you want checked. The response tells you whether it was flagged and why.

cURL

curl -X POST https://api.realmlabs.ai/v1/moderations/text \
  -H "Authorization: Bearer $REALM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "How do I reset a user password safely?"}'

2. Read the verdict

JSON

{
  "flagged": false,
  "score": 0.018,
  "subcategory_flags": {},
  "latency_ms": 21.4
}

Image and audio use the same pattern at /v1/moderations/image and /v1/moderations/audio. See the REST moderation API for full request and response shapes.

Need an API key? Book a Demo and your Realm team will provision one for your account.
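The same call can be made from Python. The sketch below is a minimal illustration that uses only the standard library. The endpoint, headers, and body come from the cURL example above; the helper names `build_text_request` and `moderate_text` and the 10-second timeout are assumptions made for this example, not part of an official SDK.

```python
import json
import urllib.request

API_BASE = "https://api.realmlabs.ai"  # base URL taken from the cURL example


def build_text_request(text, api_key, base_url=API_BASE):
    """Build (but do not send) a POST to /v1/moderations/text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/moderations/text",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def moderate_text(text, api_key, timeout=10.0):
    """Send the request and return the verdict as a dict."""
    request = build_text_request(text, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a provisioned key in the `REALM_API_KEY` environment variable, `moderate_text("How do I reset a user password safely?", os.environ["REALM_API_KEY"])` should return a dict with the `flagged` and `score` fields shown in the verdict above.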

1. Download, launch, serve

Pull the model, start Triton, and bring up the moderation endpoints. The full walkthrough lives in Self-hosted deployment; here is the short version.

bash

# 1. download the model artifacts (~15 to 20 min)
export HF_TOKEN="<your-provisioned-token>"
chmod +x download_realmguard.sh && ./download_realmguard.sh

# 2. launch Triton with the realmguard model
# (--load-model only takes effect with --model-control-mode=explicit)
docker run --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 \
  -v "$(pwd)/models:/models" nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver --model-repository=/models \
    --model-control-mode=explicit --load-model=realmguard

# 3. start the moderation client (opens /v1/moderations/* on :9000)
python realmguard_api.py

2. Validate

bash

curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How do I reset a user password safely?"}'

Self-hosting needs an NVIDIA GPU (H100 recommended) with driver 550+. Your account team provisions the download token and repository. See prerequisites.

Modalities

OmniGuard moderates three content types through a single model with a shared backbone. You select the modality per request.

Modality	Field	What it checks
Text	`text`	Plain text: prompts, responses, captions, transcripts
Image	`image`	Base64-encoded image bytes, with an optional text caption for context
Audio	`audio`	Base64-encoded audio bytes, with an optional transcript for context

One model for all three keeps deployment simple: a single artifact, a single endpoint, and consistent verdicts across whatever a user types, uploads, or says.

Input preprocessing

Before inference, OmniGuard runs every image and audio input through a standardized preprocessing pipeline. The steps normalize each input into a consistent representation regardless of its original format, resolution, sample rate, or encoding. Text needs no preprocessing of this kind and is moderated directly.

Preprocessing is deterministic. Identical content produces identical features. When the same logical file returns different moderation outcomes across submissions, the difference almost always comes from the source file itself (its encoding, compression artifacts, metadata, or decoder behavior), not from OmniGuard.

Image preprocessing

Every submitted image runs through three steps.

1. Decode and RGB normalization

The image is decoded into a standard 3-channel RGB representation. A few decoding behaviors are worth knowing, because they are the usual reason two "identical" images score differently.

Lossy compression

JPEG images are decoded exactly as stored. Because JPEG uses lossy techniques like DCT quantization and chroma subsampling, decoded pixels are generally not identical to a lossless source. Converting a PNG to JPEG, or re-saving a JPEG several times, shifts pixel values.

Transparency (alpha)

The alpha channel is removed with no compositing against a background. Transparent pixels keep their stored RGB values rather than being flattened against white, black, or any other color.

ICC color profiles

Embedded color profiles are not applied. Pixels are interpreted directly as stored, so two files with identical visual content but different profiles can decode to different RGB values.

EXIF orientation

Orientation metadata is not applied. Rotations and flips recorded in EXIF are ignored and the raw stored pixel arrangement is used. Bake orientation into the pixels before upload for consistent behavior.

2. Aspect-ratio preserving resize

If the longest edge is larger than 512 pixels, the image is downscaled with bicubic interpolation so the longest edge becomes exactly 512 pixels, preserving aspect ratio. If the longest edge is already 512 pixels or smaller, no resizing happens. The step is fully deterministic: identical input pixels at identical dimensions always produce identical output.
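To predict the dimensions an image will have after this step, the rule can be sketched as a small function. This is an illustration only: the function name is hypothetical, and rounding the shorter edge to the nearest pixel is an assumption, since the rounding behavior is not specified here.

```python
def resized_dimensions(width: int, height: int, max_edge: int = 512) -> tuple[int, int]:
    """Return (width, height) after the longest-edge downscale rule."""
    longest = max(width, height)
    if longest <= max_edge:
        # Already small enough: the image is left untouched.
        return width, height
    scale = max_edge / longest
    if width >= height:
        # Landscape or square: width becomes max_edge.
        # Assumption: the shorter edge is rounded to the nearest pixel.
        return max_edge, max(1, round(height * scale))
    # Portrait: height becomes max_edge.
    return max(1, round(width * scale)), max_edge


print(resized_dimensions(1024, 768))  # (512, 384)
print(resized_dimensions(400, 300))   # (400, 300), no resize
```

Downscaling to these dimensions yourself before upload is optional; it mainly reduces payload size, and your own resampler may not match the bicubic output pixel for pixel.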

3. Model processor

The normalized RGB image is passed to the model's native image processor, which handles tensor conversion, pixel normalization, tiling and patch generation, and model-specific feature preparation. OmniGuard does not modify or override these model-native steps.

For reproducible image results: use a lossless format (PNG preferred), bake orientation into the pixels, avoid repeated JPEG re-encoding, and strip unnecessary metadata before upload.

Audio preprocessing

OmniGuard supports common audio containers and codecs, including MP3, WAV, FLAC, OGG, and M4A. Additional formats may work depending on the decoder stack. Every submitted file runs through four steps.

1. Decode, resample, and downmix

Audio is decoded into a waveform, resampled to 16 kHz regardless of the original sample rate, and downmixed to mono.

2. Length normalization

OmniGuard analyzes audio in a fixed 300-second (5-minute) window. Clips shorter than 300 seconds are padded with silence; clips longer than 300 seconds are truncated to the first 300 seconds. This is deterministic and gives every input an identical duration.

3. Log-mel spectrogram

The 16 kHz mono waveform is converted into a log-mel spectrogram with the parameters below, using standard log-magnitude scaling and normalization. No dithering or artificial noise is added, so identical waveforms always produce identical features.

Parameter	Value
Mel-frequency bins	128
Analysis window	25 ms
Hop size	10 ms
Sample rate	16 kHz, mono
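The parameters above translate into sample counts as follows. This is a rough sketch for sizing and sanity checks, not OmniGuard's internal code: the names are hypothetical, placing the silence padding at the end of the clip is an assumption, and the frame count is an approximation because the exact number depends on the framing convention.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW_MS = 25
HOP_MS = 10
CLIP_SECONDS = 300

WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # 400 samples per window
HOP_SAMPLES = SAMPLE_RATE_HZ * HOP_MS // 1000        # 160 samples per hop
TARGET_SAMPLES = SAMPLE_RATE_HZ * CLIP_SECONDS       # 4,800,000 samples


def normalize_length(samples: list[float], target: int = TARGET_SAMPLES) -> list[float]:
    """Truncate to the first `target` samples, or pad with silence (zeros)."""
    if len(samples) >= target:
        return samples[:target]
    # Assumption: padding is appended after the audio.
    return samples + [0.0] * (target - len(samples))


# Roughly one spectrogram frame per hop over the fixed window.
APPROX_FRAMES = TARGET_SAMPLES // HOP_SAMPLES  # 30,000
print(WINDOW_SAMPLES, HOP_SAMPLES, TARGET_SAMPLES, APPROX_FRAMES)
```

The practical consequence is that anything after the first 300 seconds is never seen by the model, so longer recordings need to be split client-side and submitted as separate requests.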

4. Model processing

The spectrogram features are passed directly to the moderation model. OmniGuard does not modify or override the model's native feature processing beyond the steps described here.

For reproducible audio results: submit lossless audio when you can, ideally 16 kHz mono WAV or FLAC. Lossy formats like MP3, AAC, and OGG permanently alter the waveform during compression, and different decoders can introduce small numerical differences. Converting WAV to MP3 and back will not reproduce the original waveform exactly, even when it sounds identical.

Request types

A request type tells OmniGuard which moderation workflow to run. It is set with the REQUEST_TYPE field and defaults to prompt when omitted.

Request type	Moderates	Typical use
`prompt`	User-provided content	Input safety, file uploads, voice assistants
`response`	An LLM response, in the context of its prompt	Output filtering, inline generation blocking
`multiturn`	A full chat conversation	Persistent chat and agentic workflows
`version`	Nothing; returns deployed artifact hashes	Deployment verification and audit

Prompt moderation is single-shot. Response moderation builds an internal system → user → assistant conversation and scores the final turn with a dedicated response-safety probe. Multi-turn moderation evaluates an entire history and is covered in Multi-turn moderation.

Reading a verdict

Every moderation request returns the same contract, so your handling code stays consistent across modalities and request types.

Output	Type	Shape	Meaning
`SCORE`	FP32	[1]	Moderation probability for the content
`FLAGGED`	BOOL	[1]	Threshold decision: did it cross your policy line
`SUBCATEGORY_SCORES`	FP32	[13]	Per-category probabilities across the taxonomy
`SUBCATEGORY_FLAGS`	BOOL	[13]	Per-category threshold decisions
`CACHE_USAGE`	INT32	[8]	KV-cache statistics for multi-turn requests

Subcategory outputs populate only when a request is flagged and the relevant subcategory probe is loaded, so a clean request stays cheap. Use SCORE when you want to set your own threshold, and FLAGGED when you want OmniGuard's default decision.
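The choice between SCORE and FLAGGED can be sketched as a single decision helper. The example below works on the JSON verdict returned by the REST endpoints (`flagged`, `score`); the function name and the use of `>=` for the comparison are assumptions made for illustration.

```python
from typing import Optional


def is_blocked(verdict: dict, threshold: Optional[float] = None) -> bool:
    """Decide whether to block, using a custom threshold when one is given."""
    if threshold is None:
        # No custom policy: defer to OmniGuard's default decision.
        return bool(verdict["flagged"])
    # Custom policy: compare the raw probability against your own line.
    return float(verdict["score"]) >= threshold


verdict = {"flagged": False, "score": 0.018, "subcategory_flags": {}}
print(is_blocked(verdict))                  # False
print(is_blocked(verdict, threshold=0.01))  # True, a stricter custom policy
```

A lower threshold trades more false positives for fewer misses, so it is worth validating any custom value against a sample of your own traffic.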

Moderation taxonomy

OmniGuard classifies content against the MLCommons safety taxonomy. The 13 categories map directly to the index positions in SUBCATEGORY_SCORES and SUBCATEGORY_FLAGS.

Index	Category
0	Violent Crimes
1	Non-Violent Crimes
2	Sex-Related Crimes
3	Child Sexual Exploitation
4	Defamation
5	Specialized Advice
6	Indiscriminate Weapons (CBRNE)
7	Hate
8	Privacy
9	Intellectual Property
10	Sexual Content
11	Suicide & Self-Harm
12	Cybersecurity
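Because the outputs are positional, client code usually needs a lookup from index to category name. A minimal sketch, with the list copied from the taxonomy above and hypothetical helper names:

```python
CATEGORIES = [
    "Violent Crimes",
    "Non-Violent Crimes",
    "Sex-Related Crimes",
    "Child Sexual Exploitation",
    "Defamation",
    "Specialized Advice",
    "Indiscriminate Weapons (CBRNE)",
    "Hate",
    "Privacy",
    "Intellectual Property",
    "Sexual Content",
    "Suicide & Self-Harm",
    "Cybersecurity",
]


def flagged_categories(flags: list[bool]) -> list[str]:
    """Map a SUBCATEGORY_FLAGS vector of length 13 to category names."""
    if len(flags) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} flags, got {len(flags)}")
    return [name for name, hit in zip(CATEGORIES, flags) if hit]


def top_categories(scores: list[float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring categories from SUBCATEGORY_SCORES."""
    if len(scores) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} scores, got {len(scores)}")
    ranked = sorted(zip(CATEGORIES, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


flags = [False] * 13
flags[7] = flags[12] = True
print(flagged_categories(flags))  # ['Hate', 'Cybersecurity']
```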

What OmniGuard catches

Beyond category classification, OmniGuard is tuned for four jobs that matter most in production AI.

Safety moderation

Universal harmfulness detection across the 13-category taxonomy, for prompts and responses.

Prompt injection

Detection of jailbreaks and injection attempts, including agentic tool-use attacks, with a 0.10% false-positive rate on the clean-traffic benchmark reported under Performance & benchmarks.

CBRN

Detection and moderation of chemical, biological, radiological, and nuclear content.

PII

Classification of personally identifiable information in text, so you can redact or block before it leaks.

These align to frameworks your auditors already recognize, including the EU AI Act, the OWASP Top 10 for LLM applications, and FINRA. See Compliance & security.

REST moderation API

The reference client exposes one endpoint per modality. This is the simplest way to integrate, and it is identical whether you call the managed API or your own self-hosted client.

POST /v1/moderations/text

POST /v1/moderations/image

POST /v1/moderations/audio

Text

cURL

curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How to kill a child process in Linux?"}'

A genuinely benign request like this should come back clean. Good moderation reads intent, not keywords.

Image

cURL

curl -X POST http://localhost:9000/v1/moderations/image \
  -F "file=@/path/to/image.png"

Audio

cURL

curl -X POST http://localhost:9000/v1/moderations/audio \
  -F "file=@/path/to/audio.wav"

Processing .m4a audio needs the ffmpeg binary on the host running the client.
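The image and audio uploads above are multipart form posts with a single `file` field. If you prefer not to add an HTTP library, the standard library is enough. The sketch below is one way to build that request; the helper names are hypothetical, and the content type is guessed from the file extension.

```python
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(base_url: str, modality: str, path: str) -> urllib.request.Request:
    """Build a POST to /v1/moderations/{image,audio} for a local file."""
    with open(path, "rb") as handle:
        data = handle.read()
    # "file" is the form field name used in the cURL examples.
    body, content_type = build_multipart("file", os.path.basename(path), data)
    return urllib.request.Request(
        url=f"{base_url}/v1/moderations/{modality}",
        data=body,
        method="POST",
        headers={"Content-Type": content_type},
    )
```

Passing the result of `build_upload_request("http://localhost:9000", "image", "/path/to/image.png")` to `urllib.request.urlopen` sends the same request as the cURL example.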

Unified inference API

Under the REST client sits a single Triton endpoint that multiplexes every moderation workflow. You will use this directly when you want batching, multi-turn caching, or tight control over the tensor contract. Two routing fields select the workflow.

REQUEST_TYPE

Selects the moderation workflow: prompt, response, multiturn, version

MODALITY

Selects the content modality: text, image, audio

Input schema

All inputs are optional at the tensor level and validated dynamically based on the selected request type.

Input	Type	Shape	Description
`REQUEST_TYPE`	STRING	[1]	prompt, response, multiturn, version
`MODALITY`	STRING	[1]	text, image, audio
`TEXT`	STRING	[1]	Text payload or caption / transcript
`IMAGE`	STRING	[1]	Base64-encoded image bytes
`AUDIO`	STRING	[1]	Base64-encoded audio bytes
`RESPONSE_TEXT`	STRING	[1]	Assistant response to evaluate
`MESSAGES`	STRING	[-1, 2]	Conversation turns as [role, content]
`SESSION_ID`	STRING	[1]	KV-cache session identifier
`CACHE_WRITE`	BOOL	[1]	Enables or disables cache writes

Output schema

Every request returns the consistent verdict contract described in Reading a verdict: SCORE, FLAGGED, SUBCATEGORY_SCORES[13], SUBCATEGORY_FLAGS[13], and CACHE_USAGE[8].

Request type details

Prompt moderation

Set REQUEST_TYPE="prompt" for single-shot moderation of user-provided content. For text, supply MODALITY="text" and TEXT. For image, supply MODALITY="image" and IMAGE, with optional TEXT as a caption. For audio, supply MODALITY="audio" and AUDIO, with optional TEXT as a transcript.
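As an illustration, a prompt request can be expressed as a JSON body for Triton's standard HTTP inference endpoint (`POST /v2/models/realmguard/infer`), where string tensors use the `BYTES` datatype. This is a sketch under assumptions: the helper names are hypothetical, the `[1]` shapes follow the input schema table, and a deployment whose config.pbtxt enables batching may expect an extra leading batch dimension.

```python
import base64
import json
import urllib.request

MODEL_NAME = "realmguard"
OUTPUT_NAMES = ["SCORE", "FLAGGED", "SUBCATEGORY_SCORES", "SUBCATEGORY_FLAGS", "CACHE_USAGE"]


def string_input(name: str, value: str) -> dict:
    """One STRING tensor of shape [1] in Triton's JSON request format."""
    return {"name": name, "shape": [1], "datatype": "BYTES", "data": [value]}


def build_prompt_request(modality: str, payload: bytes = None, text: str = None) -> dict:
    """Build the JSON body for REQUEST_TYPE="prompt"."""
    inputs = [string_input("REQUEST_TYPE", "prompt"), string_input("MODALITY", modality)]
    if modality == "text":
        if not text:
            raise ValueError("text modality requires TEXT")
    elif modality in ("image", "audio"):
        if payload is None:
            raise ValueError(f"{modality} modality requires raw bytes")
        encoded = base64.b64encode(payload).decode("ascii")
        inputs.append(string_input(modality.upper(), encoded))  # IMAGE or AUDIO
    else:
        raise ValueError(f"unsupported modality: {modality}")
    if text:
        # Payload for text; optional caption or transcript otherwise.
        inputs.append(string_input("TEXT", text))
    return {"inputs": inputs, "outputs": [{"name": n} for n in OUTPUT_NAMES]}


def infer(body: dict, triton_url: str = "http://localhost:8000", timeout: float = 30.0) -> dict:
    """POST the body to Triton and return the decoded JSON response."""
    request = urllib.request.Request(
        url=f"{triton_url}/v2/models/{MODEL_NAME}/infer",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For production use, the `tritonclient` package listed in the reference client's requirements offers the same call with typed tensors and gRPC support.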

Response moderation

Set REQUEST_TYPE="response" and provide RESPONSE_TEXT. OmniGuard internally constructs a system → user(prompt) → assistant(response) conversation and scores the final evaluation turn with a dedicated response-safety probe. Use it for AI output filtering, response governance, inline generation blocking, and agent or tool response validation.

Multi-turn moderation & KV cache

Set REQUEST_TYPE="multiturn" and pass the full conversation in MESSAGES. The moderation mode follows the final turn: if it is a user turn, OmniGuard applies prompt-style moderation; if it is an assistant turn, it applies response-style moderation.

conversation format

[
  ["system", "..."],
  ["user", "..."],
  ["assistant", "..."]
]

Supported roles are system, user, and assistant.

Stateful KV cache

Long conversations would otherwise re-encode the full history on every check. OmniGuard avoids that with an optional two-tier KV cache keyed by SESSION_ID.

Cache tier	Location
L1	GPU
L2	CPU

Always submit the full conversation history. OmniGuard determines incremental cache reuse from the SESSION_ID automatically. The CACHE_USAGE output reports message cache hits and misses, token cache hits and misses, cache writes, and L1 / L2 evictions.

Recommended pattern. For persistent chat and agentic workflows, send the whole history with a stable SESSION_ID and let the cache do the work. You get the safety of full-context evaluation without paying to re-encode it each turn.
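The pattern can be sketched as a request builder that always sends the full history and attaches a stable session identifier. The JSON layout follows Triton's standard HTTP inference format; the function name is hypothetical, the `MESSAGES` tensor is flattened row by row to match its `[-1, 2]` shape, and sending only these four inputs for a multi-turn request is an assumption.

```python
SUPPORTED_ROLES = {"system", "user", "assistant"}


def build_multiturn_request(messages, session_id=None, cache_write=True) -> dict:
    """Build a REQUEST_TYPE="multiturn" body from [role, content] pairs."""
    for role, content in messages:
        if role not in SUPPORTED_ROLES:
            raise ValueError(f"invalid role: {role!r}")
        if not content.strip():
            raise ValueError("empty message content")
    # Row-major flattening of the [n, 2] tensor: role, content, role, content, ...
    flat = [item for turn in messages for item in turn]
    inputs = [
        {"name": "REQUEST_TYPE", "shape": [1], "datatype": "BYTES", "data": ["multiturn"]},
        {"name": "MESSAGES", "shape": [len(messages), 2], "datatype": "BYTES", "data": flat},
    ]
    if session_id is not None:
        # A stable SESSION_ID lets the server reuse cached history across turns.
        inputs.append({"name": "SESSION_ID", "shape": [1], "datatype": "BYTES", "data": [session_id]})
        inputs.append({"name": "CACHE_WRITE", "shape": [1], "datatype": "BOOL", "data": [cache_write]})
    return {"inputs": inputs}


history = [
    ["system", "You are a helpful assistant."],
    ["user", "How do I reset a user password safely?"],
    ["assistant", "Use a time-limited, single-use reset link."],
]
body = build_multiturn_request(history, session_id="chat-42")
print(body["inputs"][1]["shape"])  # [3, 2]
```

Because the final turn here is an assistant turn, this request would be scored with response-style moderation, as described above.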

Streaming moderation

Models generate responses token by token, but most moderation systems only see a finished output. That gap lets unsafe content reach a user before anything checks it. RealmGuard Stream closes the gap by inspecting partial output as it is generated, so you can truncate or block before the user sees a violation.

Protocol

gRPC, bidirectional streaming (Triton native ModelStreamInfer)

Session

One gRPC stream per generation session

Version

v1-alpha

Scope

Moderates model output only; does not modify or generate text

How the stream flows

The client opens a stream and sends newly generated text as it arrives. The server processes chunks asynchronously and emits verdicts on a configurable cadence. Each verdict reports the highest sequence number it has fully evaluated, so you always know the exact safe boundary.

Request fields (client to server)

stream_id

Optional client-defined identifier for correlation

seq

Required, monotonically increasing sequence number for ordering

text_chunk

Required, the newly generated text since the last message

end_of_stream

Optional bool, signals no further text will be sent

Response fields (server to client)

processed_until_seq

Highest sequence number fully processed and evaluated

flagged

Unsafe content detected within the processed prefix

needs_caution

Elevated risk, not yet a hard violation

score

Optional confidence score for the processed prefix

latency_ms

Processing latency for the evaluated prefix

is_final

True on the definitive result after end of stream

Example exchange

The client streams output chunks. The server's verdict escalates as more text is processed, moving from clean, to caution, to flagged, then issues a final verdict once the stream closes.

server → client (verdicts on cadence)

// processed the opening, looks clean
{ "processed_until_seq": 2, "flagged": false, "needs_caution": false, "score": 0.61, "latency_ms": 14.2, "is_final": false }

// more text in, risk rising
{ "processed_until_seq": 4, "flagged": false, "needs_caution": true, "score": 0.67, "latency_ms": 13.9, "is_final": false }

// violation detected before it finished generating
{ "processed_until_seq": 5, "flagged": true, "needs_caution": false, "score": 0.94, "latency_ms": 15.1, "is_final": false }

// final verdict after end_of_stream
{ "processed_until_seq": 6, "flagged": true, "needs_caution": false, "score": 0.94, "latency_ms": 82.6, "is_final": true }

Cadence and the safety boundary

Verdicts are emitted on a server-defined or client-configured cadence, for example every N milliseconds or N chunks. The server may lag behind generation; this is expected. The authoritative safety boundary is always processed_until_seq: only text up to that sequence has been evaluated. When end_of_stream=true arrives, the server finishes any buffered text and emits a final result with is_final=true.

Treat any stream error as terminal. Moderation decisions remain authoritative only up to the last acknowledged processed_until_seq. Retrying means starting a new stream. Error codes are listed in Error handling.
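On the client side, honoring the safety boundary means holding generated text until a verdict covers it. The sketch below shows one possible buffering policy; the class and its `hold_on_caution` option are hypothetical and not part of the protocol, and only the verdict field names come from the tables above.

```python
class SafeBoundaryBuffer:
    """Release generated text only up to the last evaluated sequence number."""

    def __init__(self, hold_on_caution: bool = False):
        self._pending = {}  # seq -> text chunk not yet shown to the user
        self.hold_on_caution = hold_on_caution
        self.blocked = False

    def add_chunk(self, seq: int, text: str) -> None:
        """Record a chunk as it is sent to the moderation stream."""
        self._pending[seq] = text

    def on_verdict(self, verdict: dict) -> str:
        """Return the text that is now safe to show; empty once blocked."""
        if self.blocked:
            return ""
        if verdict["flagged"]:
            # Stop releasing anything further, including unevaluated chunks.
            self.blocked = True
            return ""
        if self.hold_on_caution and verdict.get("needs_caution", False):
            return ""
        boundary = verdict["processed_until_seq"]
        ready = sorted(seq for seq in self._pending if seq <= boundary)
        return "".join(self._pending.pop(seq) for seq in ready)


buffer = SafeBoundaryBuffer()
for seq, chunk in enumerate(["a", "b", "c", "d", "e", "f"], start=1):
    buffer.add_chunk(seq, chunk)
print(buffer.on_verdict({"processed_until_seq": 2, "flagged": False, "needs_caution": False}))  # ab
print(buffer.on_verdict({"processed_until_seq": 4, "flagged": False, "needs_caution": True}))   # cd
print(repr(buffer.on_verdict({"processed_until_seq": 5, "flagged": True})))                     # ''
```

Text released before a flagged verdict has already reached the user, so a stricter policy (for example `hold_on_caution=True`, or a slower release cadence) trades latency for a smaller exposure window.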

Version fingerprinting

Send REQUEST_TYPE="version" to get SHA-256 hashes of the deployed artifacts. This is how you verify exactly what is running, prove integrity, and track model versions for audit.

Returned artifacts may include the runtime model code, probe weights, subcategory classifiers, and configuration artifacts. In a regulated environment, capture these hashes alongside your deployment records so you can demonstrate which version produced a given verdict.
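For your own deployment records, it can also help to fingerprint the files in your model repository at deploy time. The sketch below hashes every file under a directory with SHA-256. It is an illustration with hypothetical function names, and it assumes nothing about how the `version` request computes its hashes, so treat a mismatch between the two as a prompt to investigate rather than as proof of tampering.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def fingerprint_tree(root) -> dict[str, str]:
    """Map each file under root (as a relative POSIX path) to its SHA-256."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Calling `fingerprint_tree("models/realmguard")` after the download step and storing the result next to the deployment ticket gives you a baseline to compare against later.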

Error handling

Failures are isolated per request inside a Triton batch. One bad request returns a structured error without affecting its neighbors.

Common validation failures

Error	Cause
Missing modality payload	Missing `TEXT`, `IMAGE`, or `AUDIO`
Invalid conversation structure	Malformed `MESSAGES`
Invalid role	Unsupported role value
Empty content	Blank text or message content
Unsupported audio type	Unsupported file extension
Sequence overflow	Input exceeds the model context window

Streaming gRPC status codes

Status	When it is returned
`INVALID_ARGUMENT`	Malformed request stream: non-monotonic or duplicate `seq`, empty `text_chunk`, or invalid configuration
`FAILED_PRECONDITION`	Inconsistent stream state, such as end of stream before any content or a stream reused after a terminal response
`RESOURCE_EXHAUSTED`	Rate or concurrent-stream limits reached, input buffering over bounds, or sustained moderation lag beyond safety thresholds
`UNAVAILABLE`	Moderation backend not ready, restarting, or shedding load
`INTERNAL`	Unexpected server-side failure during inference or serialization
`CANCELLED`	Client terminated the stream early or disconnected

Partial results emitted before an error remain valid up to the last reported processed_until_seq.
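One way to act on these status codes is a small policy table in the client. The grouping below is a suggested starting point, not a documented contract: which codes are worth retrying, the action names, and the backoff parameters are all assumptions to adapt to your own traffic.

```python
import random

# Transient conditions: open a new stream after backing off.
RETRY_WITH_NEW_STREAM = {"UNAVAILABLE", "RESOURCE_EXHAUSTED"}
# Client-side mistakes: retrying the same request will fail again.
FIX_REQUEST = {"INVALID_ARGUMENT", "FAILED_PRECONDITION"}


def next_action(status: str) -> str:
    """Map a gRPC status name to a client action."""
    if status in RETRY_WITH_NEW_STREAM:
        return "retry_new_stream"
    if status in FIX_REQUEST:
        return "fix_request"
    if status == "CANCELLED":
        return "stop"
    return "escalate"  # INTERNAL and anything unexpected


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with jitter, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)


print(next_action("UNAVAILABLE"))  # retry_new_stream
print(next_action("INTERNAL"))     # escalate
```

Whatever the action, text beyond the last reported processed_until_seq should be treated as unmoderated until a new stream has evaluated it.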

Self-hosted deployment

This is the full walkthrough for running OmniGuard inside your own boundary on NVIDIA Triton. You download the model from the repository provisioned for your account, make it available to a Triton server, and bring up the moderation endpoints. Plan on about 15 to 20 minutes for the initial download.

Prerequisites

GPU

NVIDIA GPU with driver 550+. H100 recommended.

Server

NVIDIA Triton Inference Server (Docker image nvcr.io/nvidia/tritonserver:24.11-py3 or newer)

Client host

Any machine with network access to the Triton HTTP port. The client itself can run on CPU, including a Mac.

Credentials

The download token and repository provisioned for your account

OmniGuard ships in two inference modes. Mode 1 uses Flash Attention 2 and is recommended for faster inference. Mode 2 runs without Flash Attention 2. The download includes both variants.

Air-gapped by design. Once the artifacts are inside your environment, OmniGuard runs with no outbound dependency. The only access it assumes is the GPU through the NVIDIA driver. Nothing about the content you moderate leaves your boundary.

1 · Download the model

Set your provisioned token in the environment, then run the download script. It pulls both model variants, with and without Flash Attention 2, and assembles the Triton model named realmguard.

Set your token

bash

export HF_TOKEN="<your-provisioned-token>"

Run the download script

bash

chmod +x download_realmguard.sh && ./download_realmguard.sh

The script places the model files inside a new models directory. It uses huggingface-cli when available and falls back to wget otherwise.

The extracted model carries its own config.pbtxt, model.py, weights, and Python backend environment. It runs as an independent Triton model. Keep the files as-is; any pipeline that copies them must not alter their contents, or Triton may fail to load the model.

2 · Launch Triton

If you already run Triton, drop the realmguard model into your model repository and restart. To start a fresh server with Docker:

bash

docker run --rm --gpus all \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v "$(pwd)/models:/models" \
  nvcr.io/nvidia/tritonserver:24.11-py3 \
  tritonserver \
    --model-repository=/models \
    --model-control-mode=explicit \
    --load-model=realmguard

Triton exposes inference on port 8000 (HTTP) and port 8001 (gRPC), and Prometheus metrics on port 8002. A newer Triton image works too. The model uses a custom model.py with its own execution environment on the Python backend, and assumes only GPU access through the NVIDIA driver.
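Before starting the client, it is worth confirming that the server and the model are ready. Triton's HTTP API provides readiness endpoints for both. The sketch below polls them with the standard library; the helper names and the default base URL are assumptions for a local Docker setup.

```python
import urllib.error
import urllib.request


def readiness_urls(base_url: str = "http://localhost:8000", model: str = "realmguard") -> list[str]:
    """Return Triton's server-level and model-level readiness URLs."""
    base = base_url.rstrip("/")
    return [f"{base}/v2/health/ready", f"{base}/v2/models/{model}/ready"]


def is_ready(url: str, timeout: float = 5.0) -> bool:
    """True when the endpoint answers 200; False on any error or timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def all_ready(base_url: str = "http://localhost:8000") -> bool:
    """True when both the server and the realmguard model report ready."""
    return all(is_ready(url) for url in readiness_urls(base_url))
```

A deployment script can call `all_ready()` in a loop with a short sleep until it returns True, then start the moderation client.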

3 · Run the moderation client

The reference client, realmguard_api.py, implements the Triton client and opens the per-modality moderation endpoints over HTTP. It has minimal dependencies and can run on any machine, including CPU-only, as long as it can reach the Triton HTTP port.

requirements.txt

tritonclient[all]
fastapi
uvicorn
python-multipart

bash

# opens /v1/moderations/{text,image,audio}
python realmguard_api.py

The endpoints map one to one with the REST moderation API, so anything you build against the managed API works unchanged against your local client.

4 · Validate

Confirm the full path end to end with a request per modality.

bash

# text
curl -X POST http://localhost:9000/v1/moderations/text \
  -H "Content-Type: application/json" \
  -d '{"text": "How to kill a child process?"}'

# image
curl -X POST http://localhost:9000/v1/moderations/image \
  -F "file=@/path/to/image.png"

# audio
curl -X POST http://localhost:9000/v1/moderations/audio \
  -F "file=@/path/to/audio.wav"

A clean response on each confirms Triton is serving the model and the client is wired up correctly.

Streaming on-prem

Response streaming deploys the same way: a Triton-packaged model with a custom Python backend. The package includes config.pbtxt, model.py, the model weights, and the backend environment, and runs as an independent Triton model. Install the listed Python dependencies in your Triton environment, then use the Realm SDK or the API to open the streaming endpoint.

python · quick stream test

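import asyncio
import json

import websockets  # third-party: pip install websockets

# Adjust host, port, and path to match your deployment.
STREAM_URL = "ws://localhost:8000/v1/moderations/response-stream"

async def test():
    async with websockets.connect(STREAM_URL) as ws:
        # First message: the original prompt plus the first generated chunk.
        await ws.send(json.dumps({
            "stream_id": "test-123", "seq": 0,
            "prompt": "Hello", "response_chunk": "Hi there",
            "end_of_stream": False,
        }))
        print(await ws.recv())  # print the server's reply

        # Last message: no new text, only the end-of-stream signal.
        await ws.send(json.dumps({
            "stream_id": "test-123", "seq": 1,
            "response_chunk": "", "end_of_stream": True,
        }))
        print(await ws.recv())  # print the server's reply

asyncio.run(test())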

You can also drive the stream interactively with tools like websocat or wscat. Adjust the endpoint and port to match your deployment. Streaming requires the original input prompt to be available alongside the generated output.

Operational notes

Audio codecs

Processing .m4a files needs the ffmpeg binary on the client host.

Client placement

The client only needs the Triton HTTP port (default 8000). Run it wherever is convenient, including on a laptop during testing.

Artifact integrity

Keep the extracted model files unchanged. Pipelines should present them as-is.

Audit

Use version fingerprinting to record exactly which artifacts are live.

Performance & benchmarks

The numbers below are measured on H100 with Flash Attention 2. One model serves every modality, so you are not stitching together separate classifiers with different latency profiles.

Latency

Workload	Latency
Text, under 100 tokens	19 ms
Text, under 256 tokens	27 ms
Text, under 512 tokens	31 ms
Image moderation	~140 to 170 ms
Audio moderation	~100 to 130 ms

Text moderation sits near 32 ms median and under 50 ms at p95. Image and audio scale predictably with input size.

Text moderation accuracy

F1 score by dataset, OmniGuard against two widely used open guard models.

Dataset	OmniGuard	LlamaGuard 4 12B	Qwen3Guard 8B
PolyGuardPrompts	85.8%	65.6%	87.7%
OpenAI Moderation	77.83%	74.0%	79.80%
LMSYS / ToxicChat	73.4%	37.3%	68.3%
Nemotron v3	83.42%	67.8%	84.9%

OmniGuard averages roughly 80% F1 across these sets, about 19 points above LlamaGuard 4 and on par with Qwen3Guard, at a fraction of their latency.
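The averages quoted above can be reproduced from the table with a few lines of arithmetic. The values are copied from the table; only the variable names are new.

```python
from statistics import mean

# F1 (%) in table order: PolyGuardPrompts, OpenAI Moderation,
# LMSYS / ToxicChat, Nemotron v3.
F1 = {
    "OmniGuard": [85.8, 77.83, 73.4, 83.42],
    "LlamaGuard 4 12B": [65.6, 74.0, 37.3, 67.8],
    "Qwen3Guard 8B": [87.7, 79.80, 68.3, 84.9],
}

averages = {name: mean(scores) for name, scores in F1.items()}
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
# OmniGuard: 80.1
# LlamaGuard 4 12B: 61.2
# Qwen3Guard 8B: 80.2

gap = averages["OmniGuard"] - averages["LlamaGuard 4 12B"]
print(f"Gap to LlamaGuard 4: {gap:.1f} points")  # 18.9
```

This is an unweighted mean across the four datasets, so each dataset counts equally regardless of its size.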

Audio and image accuracy

Audio dataset	F1
Omnibench	83.74%
AudioTrust	74.17%

Image dataset	F1
Custom dataset	83.33%
UnsafeBench	61.06%

Prompt injection

Dataset	Result
AgentDojo	100% recall
Promptfoo	95.98% recall
safeguard / prompt_injections	99.39% F1
Clean traffic (Salesforce / wikitext)	0.10% false positive rate

Near-total recall on injection attacks with a 0.10% false-positive rate on clean traffic, so detection does not come at the cost of blocking legitimate use.

Multilingual

OmniGuard classifies safety across 50+ languages rather than translating to English first. On the PolyGuard multilingual set it holds an overall prompt-moderation F1 near 0.85, with English, German, Spanish, Italian, and Russian among the strongest. Localization keeps accuracy high on lower-resource languages where English-only guards drop off sharply.

Compliance & security

OmniGuard is built to enforce Responsible AI policy and to give your auditors something concrete to point at.

Regulatory alignment

Detection mapped to the EU AI Act, the OWASP Top 10 for LLM applications, and FINRA. Categories follow the MLCommons safety taxonomy.

Custom policy

Tune thresholds and category behavior to your organization's rules rather than a fixed global setting.

Data residency

Self-host in your VPC, your on-prem cluster, or fully air-gapped. With self-hosting, no content leaves your boundary.

Auditability

SHA-256 artifact fingerprinting records exactly which model version produced each verdict.

Real-time enforcement

Beyond detection, OmniGuard supports inline actions: block, redact, and reroute, on prompts and on responses as they generate.

Realm Labs is SOC 2 aligned. For details on certifications, data handling, and an architecture review for your environment, talk to your account team.

Ready to put OmniGuard in front of your AI?

Whether you start on the managed API or self-host on day one, your Realm team will provision access, share the deployment artifacts, and review the integration with you.


Questions on deployment, schemas, or benchmarks? Reach out at support@realmlabs.ai or through your Realm account team.