# realizar

Production LLM Inference Engine for the Sovereign AI Stack
- What is realizar?
- Installation
- Usage
- Features
- Benchmarks
- Quality
- Sovereign AI Stack
- Documentation
- Contributing
- License
## What is realizar?

realizar is a pure Rust LLM inference engine. It loads models in APR v2, GGUF, and SafeTensors formats, runs transformer inference with quantized kernels (Q4_K through Q8_0), and serves predictions over an OpenAI-compatible REST API.
Key design decisions:
- Row-major mandate -- All tensors are row-major internally. GGUF column-major data is transposed at import by aprender. This matches PyTorch/SafeTensors layout and simplifies kernel implementations.
- Pure Rust CUDA -- GPU acceleration via trueno-gpu generates PTX directly from Rust. No nvcc, no LLVM, no C++ dependencies.
- Cost-based dispatch -- Backend selection (GPU/SIMD/scalar) uses a 5x PCIe cost model to avoid GPU overhead on small workloads.
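To make the cost model concrete, here is a minimal, hypothetical sketch of cost-based dispatch. The `Backend` enum, the thresholds, and the 100x crossover factor are illustrative assumptions, not realizar's actual implementation:

```rust
// Hypothetical sketch of cost-based backend dispatch. The constants
// and names are illustrative assumptions, not realizar's API.

#[derive(Debug, PartialEq)]
enum Backend {
    Gpu,
    Simd,
    Scalar,
}

/// Pick a backend for an (m x k) * (k x n) matmul.
fn select_backend(m: usize, k: usize, n: usize) -> Backend {
    let flops = 2 * m * k * n;               // multiply-add count
    let bytes = 4 * (m * k + k * n + m * n); // f32 bytes crossing PCIe
    let transfer_cost = 5 * bytes;           // the 5x PCIe cost model

    if flops > 100 * transfer_cost {
        Backend::Gpu // compute dwarfs transfer: worth the round-trip
    } else if m * n >= 256 {
        Backend::Simd // enough elements to fill vector lanes
    } else {
        Backend::Scalar // tiny workload: skip all dispatch overhead
    }
}

fn main() {
    assert_eq!(select_backend(4096, 4096, 4096), Backend::Gpu);
    assert_eq!(select_backend(64, 64, 64), Backend::Simd);
    assert_eq!(select_backend(4, 4, 4), Backend::Scalar);
}
```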
## Installation

```sh
cargo install realizar
```

Add to your `Cargo.toml`:

```toml
[dependencies]
realizar = "0.8"
```

## Usage

```sh
# Start demo server
realizar serve --demo --port 8080

# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```

```sh
# Chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'

# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
```

```rust
use realizar::chat_template::{auto_detect_template, ChatMessage};

let template = auto_detect_template("Qwen2-0.5B-Instruct");
let messages = vec![
ChatMessage::system("You are a helpful assistant."),
ChatMessage::user("Hello!"),
];
let formatted = template.format_conversation(&messages)?;
```

Use the `X-Trace-Level` header for inference debugging:

```sh
# Brick-level: token-by-token timing
curl -H "X-Trace-Level: brick" -X POST http://localhost:8080/v1/chat/completions ...
# Layer-level: per-layer timing breakdown
curl -H "X-Trace-Level: layer" -X POST http://localhost:8080/v1/chat/completions ...| Format | Description |
|---|---|
| APR v2 | Native format with LZ4/ZSTD compression, zero-copy loading, Int4/Int8 quantization |
| GGUF | llama.cpp-compatible quantized models |
| SafeTensors | HuggingFace full-precision format |
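As a small illustration of how these containers differ on disk, the sketch below sniffs a file's leading bytes. `looks_like_gguf` is a hypothetical helper, not part of realizar; it relies only on the public GGUF magic and SafeTensors header layout:

```rust
// Hypothetical helper (not realizar's API): identify a model container
// by its leading bytes. GGUF files start with the ASCII magic "GGUF";
// SafeTensors files start with a little-endian u64 giving the length
// of the JSON header that follows.

use std::fs::File;
use std::io::Read;

fn looks_like_gguf(path: &str) -> std::io::Result<bool> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(&magic == b"GGUF")
}
```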
### GPU Kernels

| Kernel | Purpose |
|---|---|
| `GemmKernel` | Matrix multiplication (naive, tiled, tensor core) |
| `AttentionKernel` | FlashAttention-style tiled attention |
| `SoftmaxKernel` | Numerically stable with warp shuffle |
| `LayerNormKernel` | Fused layer normalization |
| `QuantizeKernel` | Q4_K dequantization fused with matmul |
| `Q5KKernel` | Q5_K dequantization |
| `Q6KKernel` | Q6_K dequantization |
### Quantization

Q4_0, Q8_0, Q4_K, Q5_K, and Q6_K -- SIMD-accelerated on AVX2, AVX-512, and NEON. GPU dequantization is fused with matrix operations to avoid memory round-trips.
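For intuition, here is a minimal sketch of the simplest scheme, Q8_0, based on the public GGUF block layout (32 values per block with one shared scale). It illustrates the arithmetic only and is not realizar's internal code:

```rust
// Sketch of Q8_0 dequantization (public GGUF block layout, not
// realizar's internals): each block stores one scale and 32 bytes.

const QK8_0: usize = 32;

struct BlockQ8_0 {
    d: f32,          // per-block scale (stored as f16 on disk)
    qs: [i8; QK8_0], // 32 signed quantized values
}

fn dequantize_q8_0(blocks: &[BlockQ8_0], out: &mut Vec<f32>) {
    for block in blocks {
        for &q in block.qs.iter() {
            out.push(block.d * f32::from(q)); // value = scale * quant
        }
    }
}
```

Fusing this loop into the matmul kernel, rather than materializing `out`, is what avoids the memory round-trip.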
### Generation

Autoregressive decoding with a persistent key-value cache. Supports grouped-query attention (GQA) for models like Qwen2.5 and Llama 3.
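In GQA, consecutive query heads share one cached KV head, shrinking the KV cache by the group factor. A minimal sketch of the mapping (an illustration, not realizar's API):

```rust
// Illustrative GQA head mapping: with n_heads query heads and
// n_kv_heads KV heads, each group of n_heads / n_kv_heads query
// heads reads the same cached K/V tensors.

fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    let group_size = n_heads / n_kv_heads;
    q_head / group_size
}

fn main() {
    // Qwen2.5-1.5B-style shape: 12 query heads sharing 2 KV heads
    for q in 0..12 {
        println!("query head {q} -> kv head {}", kv_head_for(q, 12, 2));
    }
}
```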
### Chat Templates

Automatic template detection from model metadata:
| Format | Models | System Prompt |
|---|---|---|
| ChatML | Qwen2, Yi, OpenHermes | Yes |
| Llama2 | TinyLlama, Vicuna, LLaMA 2 | Yes |
| Mistral | Mistral-7B, Mixtral | No |
| Phi | Phi-2, Phi-3 | Yes |
| Alpaca | Alpaca, Guanaco | Yes |
| Raw | Fallback | Passthrough |
| Custom | Any (Jinja2) | Configurable |
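For example, ChatML (used by Qwen2) wraps each turn in `<|im_start|>`/`<|im_end|>` markers, so the `format_conversation` call shown earlier yields roughly:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```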
### Feature Flags

| Flag | Description |
|---|---|
| `default` | server + cli + gpu |
| `cuda` | NVIDIA CUDA support (pure Rust PTX, no nvcc) |
| `minimal` | Core inference only |
| `bench-http` | External server benchmarking |
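The flags compose with standard cargo feature syntax, for example:

```sh
# Default features (server + cli + gpu)
cargo build --release

# Add NVIDIA CUDA support
cargo build --release --features cuda

# Smallest build: core inference only
cargo build --release --no-default-features --features minimal
```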
## Benchmarks

### LLM Inference

| Model | Size | Format | Backend | Throughput |
|---|---|---|---|---|
| Qwen2.5-Coder Q4_K_M | 1.5B | APR | RTX 4090 (CUDA) | 240 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | RTX 4090 (CUDA) | 276 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | llama.cpp CUDA | 256 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | Ollama CUDA | 228 tok/s |
realizar delivers 8--21% faster inference than llama.cpp and Ollama via pure Rust CUDA PTX generation.
### Classical ML Inference

| Model | Parameters | Latency | Throughput |
|---|---|---|---|
| Iris | 131 | 103 ns | 9.6M inferences/sec |
| MNIST | 103K | 73 µs | 13.6K inferences/sec |
| Large NN | 1M | 410 µs | 2.4K inferences/sec |
Methodology follows Hoefler & Belli (SC'15): coefficient-of-variation-based stopping, with warmup iterations discarded.
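A sketch of what CV-based stopping means in practice (an illustration of the cited method, not realizar's benchmark harness):

```rust
// Coefficient-of-variation stopping rule in the style of
// Hoefler & Belli (SC'15): after discarding warmup runs, keep
// sampling until the relative spread of timings is small enough.

fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    var.sqrt() / mean // stddev relative to the mean
}

fn should_stop(samples: &[f64], min_samples: usize, cv_threshold: f64) -> bool {
    // require enough samples for a meaningful variance estimate
    samples.len() >= min_samples && coefficient_of_variation(samples) < cv_threshold
}
```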
## Quality

- 15,000+ tests across unit, integration, and property-based suites
- 95%+ line coverage via cargo-llvm-cov
- Zero clippy warnings with `-D warnings`
- Mutation testing via cargo-mutants
- Provable contracts -- 1,725 bindings with AllImplemented policy
## Sovereign AI Stack

realizar is the inference layer of the PAIML Sovereign AI Stack:
| Layer | Crate | Purpose |
|---|---|---|
| Compute | trueno | SIMD/GPU primitives (AVX2/AVX-512/NEON, wgpu) |
| ML | aprender | ML algorithms, APR v2 format |
| Training | entrenar | Autograd, LoRA/QLoRA, quantization |
| Inference | realizar | LLM inference, GPU kernels, model serving |
| Speech | whisper-apr | Pure Rust Whisper ASR |
| Distribution | repartir | Distributed compute (CPU/GPU/Remote) |
| Registry | pacha | Model registry with Ed25519 signatures |
| Orchestration | batuta | Stack coordination and CLI |
## Documentation

- API docs: docs.rs/realizar
- Repository: github.com/paiml/realizar
- Cookbook: github.com/paiml/apr-cookbook
## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines, or open an issue to discuss your idea first.
## License

MIT -- Pragmatic AI Labs