# realizar

Production LLM Inference Engine for the Sovereign AI Stack
- What is realizar?
- Installation
- Usage
- Features
- Benchmarks
- Quality
- Sovereign AI Stack
- Documentation
- Contributing
- License
## What is realizar?

realizar is a pure Rust LLM inference engine. It loads models in APR v2, GGUF, and SafeTensors formats, runs transformer inference with quantized kernels (Q4_K through Q8_0), and serves predictions over an OpenAI-compatible REST API.
Key design decisions:
- Row-major mandate -- All tensors are row-major internally. GGUF column-major data is transposed at import by aprender. This matches PyTorch/SafeTensors layout and simplifies kernel implementations.
- Pure Rust CUDA -- GPU acceleration via trueno-gpu generates PTX directly from Rust. No nvcc, no LLVM, no C++ dependencies.
- Cost-based dispatch -- Backend selection (GPU/SIMD/scalar) uses a 5x PCIe cost model to avoid GPU overhead on small workloads.
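To make the cost model concrete, here is a minimal, hypothetical sketch of cost-based dispatch. The `Backend` enum, the thresholds, and the 100x crossover factor are illustrative assumptions, not realizar's actual implementation:

```rust
// Hypothetical sketch of cost-based backend dispatch. The constants
// and names are illustrative assumptions, not realizar's API.

#[derive(Debug, PartialEq)]
enum Backend {
    Gpu,
    Simd,
    Scalar,
}

/// Pick a backend for an (m x k) * (k x n) matmul.
fn select_backend(m: usize, k: usize, n: usize) -> Backend {
    let flops = 2 * m * k * n;               // multiply-add count
    let bytes = 4 * (m * k + k * n + m * n); // f32 bytes crossing PCIe
    let transfer_cost = 5 * bytes;           // the 5x PCIe cost model

    if flops > 100 * transfer_cost {
        Backend::Gpu // compute dwarfs transfer: worth the round-trip
    } else if m * n >= 256 {
        Backend::Simd // enough elements to fill vector lanes
    } else {
        Backend::Scalar // tiny workload: skip all dispatch overhead
    }
}

fn main() {
    assert_eq!(select_backend(4096, 4096, 4096), Backend::Gpu);
    assert_eq!(select_backend(64, 64, 64), Backend::Simd);
    assert_eq!(select_backend(4, 4, 4), Backend::Scalar);
}
```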
## Installation

```sh
cargo install realizar
```

Add to your `Cargo.toml`:

```toml
[dependencies]
realizar = "0.8"
```

## Usage

```sh
# Start demo server
realizar serve --demo --port 8080

# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```

```sh
# Chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'

# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
```

```rust
use realizar::chat_template::{auto_detect_template, ChatMessage};

let template = auto_detect_template("Qwen2-0.5B-Instruct");
let messages = vec![
ChatMessage::system("You are a helpful assistant."),
ChatMessage::user("Hello!"),
];
let formatted = template.format_conversation(&messages)?;
```

Use the `X-Trace-Level` header for inference debugging:

```sh
# Brick-level: token-by-token timing
curl -H "X-Trace-Level: brick" -X POST http://localhost:8080/v1/chat/completions ...
# Layer-level: per-layer timing breakdown
curl -H "X-Trace-Level: layer" -X POST http://localhost:8080/v1/chat/completions ...| Format | Description |
|---|---|
| APR v2 | Native format with LZ4/ZSTD compression, zero-copy loading, Int4/Int8 quantization |
| GGUF | llama.cpp-compatible quantized models |
| SafeTensors | HuggingFace full-precision format |
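As a small illustration of how these containers differ on disk, the sketch below sniffs a file's leading bytes. `looks_like_gguf` is a hypothetical helper, not part of realizar; it relies only on the public GGUF magic and SafeTensors header layout:

```rust
// Hypothetical helper (not realizar's API): identify a model container
// by its leading bytes. GGUF files start with the ASCII magic "GGUF";
// SafeTensors files start with a little-endian u64 giving the length
// of the JSON header that follows.

use std::fs::File;
use std::io::Read;

fn looks_like_gguf(path: &str) -> std::io::Result<bool> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(&magic == b"GGUF")
}
```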
### GPU Kernels

| Kernel | Purpose |
|---|---|
| `GemmKernel` | Matrix multiplication (naive, tiled, tensor core) |
| `AttentionKernel` | FlashAttention-style tiled attention |
| `SoftmaxKernel` | Numerically stable with warp shuffle |
| `LayerNormKernel` | Fused layer normalization |
| `QuantizeKernel` | Q4_K dequantization fused with matmul |
| `Q5KKernel` | Q5_K dequantization |
| `Q6KKernel` | Q6_K dequantization |
### Quantization

Q4_0, Q8_0, Q4_K, Q5_K, and Q6_K -- SIMD-accelerated on AVX2, AVX-512, and NEON. GPU dequantization is fused with matrix operations to avoid memory round-trips.
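For intuition, here is a minimal sketch of the simplest scheme, Q8_0, based on the public GGUF block layout (32 values per block with one shared scale). It illustrates the arithmetic only and is not realizar's internal code:

```rust
// Sketch of Q8_0 dequantization (public GGUF block layout, not
// realizar's internals): each block stores one scale and 32 bytes.

const QK8_0: usize = 32;

struct BlockQ8_0 {
    d: f32,          // per-block scale (stored as f16 on disk)
    qs: [i8; QK8_0], // 32 signed quantized values
}

fn dequantize_q8_0(blocks: &[BlockQ8_0], out: &mut Vec<f32>) {
    for block in blocks {
        for &q in block.qs.iter() {
            out.push(block.d * f32::from(q)); // value = scale * quant
        }
    }
}
```

Fusing this loop into the matmul kernel, rather than materializing `out`, is what avoids the memory round-trip.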
### Generation

Autoregressive decoding with a persistent key-value cache. Supports grouped-query attention (GQA) for models like Qwen2.5 and Llama 3.
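In GQA, consecutive query heads share one cached KV head, shrinking the KV cache by the group factor. A minimal sketch of the mapping (an illustration, not realizar's API):

```rust
// Illustrative GQA head mapping: with n_heads query heads and
// n_kv_heads KV heads, each group of n_heads / n_kv_heads query
// heads reads the same cached K/V tensors.

fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    let group_size = n_heads / n_kv_heads;
    q_head / group_size
}

fn main() {
    // Qwen2.5-1.5B-style shape: 12 query heads sharing 2 KV heads
    for q in 0..12 {
        println!("query head {q} -> kv head {}", kv_head_for(q, 12, 2));
    }
}
```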
### Chat Templates

Automatic template detection from model metadata:
| Format | Models | System Prompt |
|---|---|---|
| ChatML | Qwen2, Yi, OpenHermes | Yes |
| Llama2 | TinyLlama, Vicuna, LLaMA 2 | Yes |
| Mistral | Mistral-7B, Mixtral | No |
| Phi | Phi-2, Phi-3 | Yes |
| Alpaca | Alpaca, Guanaco | Yes |
| Raw | Fallback | Passthrough |
| Custom | Any (Jinja2) | Configurable |
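For example, ChatML (used by Qwen2) wraps each turn in `<|im_start|>`/`<|im_end|>` markers, so the `format_conversation` call shown earlier yields roughly:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```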
### Feature Flags

| Flag | Description |
|---|---|
| `default` | server + cli + gpu |
| `cuda` | NVIDIA CUDA support (pure Rust PTX, no nvcc) |
| `minimal` | Core inference only |
| `bench-http` | External server benchmarking |
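The flags compose with standard cargo feature syntax, for example:

```sh
# Default features (server + cli + gpu)
cargo build --release

# Add NVIDIA CUDA support
cargo build --release --features cuda

# Smallest build: core inference only
cargo build --release --no-default-features --features minimal
```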
## Benchmarks

### LLM Inference

| Model | Size | Format | Backend | Throughput |
|---|---|---|---|---|
| Qwen2.5-Coder Q4_K_M | 1.5B | APR | RTX 4090 (CUDA) | 240 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | RTX 4090 (CUDA) | 276 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | llama.cpp CUDA | 256 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | Ollama CUDA | 228 tok/s |
realizar delivers 8--21% faster inference than llama.cpp and Ollama via pure Rust CUDA PTX generation.
### Classical ML Inference

| Model | Parameters | Latency | Throughput |
|---|---|---|---|
| Iris | 131 | 103 ns | 9.6M inferences/sec |
| MNIST | 103K | 73 µs | 13.6K inferences/sec |
| Large NN | 1M | 410 µs | 2.4K inferences/sec |
Methodology follows Hoefler & Belli (SC'15): coefficient-of-variation-based stopping, with warmup iterations discarded.
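A sketch of what CV-based stopping means in practice (an illustration of the cited method, not realizar's benchmark harness):

```rust
// Coefficient-of-variation stopping rule in the style of
// Hoefler & Belli (SC'15): after discarding warmup runs, keep
// sampling until the relative spread of timings is small enough.

fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    var.sqrt() / mean // stddev relative to the mean
}

fn should_stop(samples: &[f64], min_samples: usize, cv_threshold: f64) -> bool {
    // require enough samples for a meaningful variance estimate
    samples.len() >= min_samples && coefficient_of_variation(samples) < cv_threshold
}
```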
## Quality

- 15,000+ tests across unit, integration, and property-based suites
- 95%+ line coverage via cargo-llvm-cov
- Zero clippy warnings with `-D warnings`
- Mutation testing via cargo-mutants
- Provable contracts -- 1,725 bindings with AllImplemented policy
## Sovereign AI Stack

realizar is the inference layer of the PAIML Sovereign AI Stack:
| Layer | Crate | Purpose |
|---|---|---|
| Compute | trueno | SIMD/GPU primitives (AVX2/AVX-512/NEON, wgpu) |
| ML | aprender | ML algorithms, APR v2 format |
| Training | entrenar | Autograd, LoRA/QLoRA, quantization |
| Inference | realizar | LLM inference, GPU kernels, model serving |
| Speech | whisper-apr | Pure Rust Whisper ASR |
| Distribution | repartir | Distributed compute (CPU/GPU/Remote) |
| Registry | pacha | Model registry with Ed25519 signatures |
| Orchestration | batuta | Stack coordination and CLI |
## Documentation

- API docs: docs.rs/realizar
- Repository: github.com/paiml/realizar
- Cookbook: github.com/paiml/apr-cookbook
## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines, or open an issue to discuss your idea first.
## License

MIT -- Pragmatic AI Labs