
realizar

Production LLM Inference Engine for the Sovereign AI Stack

crates.io | docs.rs | CI | MIT License | Rust 1.89+

What is realizar? | Installation | Usage | Features | Quality | Stack | Docs


Table of Contents

  • What is realizar?
  • Installation
  • Usage
  • Features
  • Benchmarks
  • Quality
  • Sovereign AI Stack
  • Documentation
  • Contributing
  • License

What is realizar?

realizar is a pure Rust LLM inference engine. It loads models in APR v2, GGUF, and SafeTensors formats, runs transformer inference with quantized kernels (Q4_K through Q8_0), and serves predictions over an OpenAI-compatible REST API.

Key design decisions:

  • Row-major mandate -- All tensors are row-major internally. GGUF column-major data is transposed at import by aprender. This matches PyTorch/SafeTensors layout and simplifies kernel implementations.
  • Pure Rust CUDA -- GPU acceleration via trueno-gpu generates PTX directly from Rust. No nvcc, no LLVM, no C++ dependencies.
  • Cost-based dispatch -- Backend selection (GPU/SIMD/scalar) uses a 5x PCIe cost model to avoid GPU overhead on small workloads (a minimal sketch follows below).
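
The real cost model lives inside realizar's dispatch layer; the sketch below is illustrative only, with hypothetical names and arbitrary cost constants, showing how a 5x PCIe transfer penalty pushes small workloads away from the GPU:

/// Hypothetical backend choices; realizar's dispatcher has its own types.
enum Backend { Gpu, Simd, Scalar }

/// Choose a backend for a workload of `flops` floating-point operations that
/// would need `bytes` moved over PCIe to run on the GPU. The 5x factor models
/// PCIe transfer cost relative to on-device compute (constants are illustrative).
fn choose_backend(flops: u64, bytes: u64, simd_available: bool) -> Backend {
    let gpu_cost = flops as f64 * 0.01 + bytes as f64 * 5.0;
    let simd_cost = flops as f64 * 0.2;
    let scalar_cost = flops as f64;

    if gpu_cost < simd_cost && gpu_cost < scalar_cost {
        Backend::Gpu
    } else if simd_available && simd_cost < scalar_cost {
        Backend::Simd
    } else {
        Backend::Scalar
    }
}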

Installation

CLI

cargo install realizar

Library

Add to your Cargo.toml:

[dependencies]
realizar = "0.8"

Usage

Serving

# Start demo server
realizar serve --demo --port 8080

# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics

OpenAI-Compatible API

# Chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Streaming
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
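
The same endpoint can be called from Rust. A minimal sketch using the reqwest (blocking + json features) and serde_json crates, which are this example's own assumptions rather than realizar dependencies:

use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    // Same payload as the curl example above.
    let body = json!({
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50
    });
    let resp = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()?;
    println!("{}", resp.text()?);
    Ok(())
}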

Library API

use realizar::chat_template::{auto_detect_template, ChatMessage};

// Pick a template from the model name; Qwen2-family models use ChatML.
let template = auto_detect_template("Qwen2-0.5B-Instruct");
let messages = vec![
    ChatMessage::system("You are a helpful assistant."),
    ChatMessage::user("Hello!"),
];
// `?` assumes the enclosing function returns a compatible Result.
let formatted = template.format_conversation(&messages)?;
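
For a ChatML-family model such as Qwen2 (see the chat template table under Features), the returned string roughly follows the <|im_start|>/<|im_end|> convention; exact whitespace and the trailing assistant turn depend on the template:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant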

Tracing

Use the X-Trace-Level header for inference debugging:

# Brick-level: token-by-token timing
curl -H "X-Trace-Level: brick" -X POST http://localhost:8080/v1/chat/completions ...

# Layer-level: per-layer timing breakdown
curl -H "X-Trace-Level: layer" -X POST http://localhost:8080/v1/chat/completions ...

Features

Model Formats

Format      | Description
----------- | -----------
APR v2      | Native format with LZ4/ZSTD compression, zero-copy loading, Int4/Int8 quantization
GGUF        | llama.cpp-compatible quantized models
SafeTensors | HuggingFace full-precision format

GPU Kernels

Kernel          | Purpose
--------------- | -------
GemmKernel      | Matrix multiplication (naive, tiled, tensor core)
AttentionKernel | FlashAttention-style tiled attention
SoftmaxKernel   | Numerically stable with warp shuffle
LayerNormKernel | Fused layer normalization
QuantizeKernel  | Q4_K dequantization fused with matmul
Q5KKernel       | Q5_K dequantization
Q6KKernel       | Q6_K dequantization

Quantization

Q4_0, Q8_0, Q4_K, Q5_K, Q6_K -- SIMD-accelerated on AVX2, AVX-512, and NEON. On the GPU, dequantization is fused with matrix operations to avoid memory round-trips.
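
For reference on what these kernels compute: GGUF's Q4_0 packs 32 weights into an 18-byte block, one f16 scale followed by 16 bytes of 4-bit values offset by 8. A scalar reference dequantizer (illustrative only, using the half crate for the f16 scale; realizar's SIMD and GPU paths are separate):

/// Dequantize one GGUF Q4_0 block: 2-byte f16 scale + 16 bytes of packed nibbles.
/// Each 4-bit value q decodes to scale * (q - 8); low nibbles fill the first
/// 16 outputs, high nibbles the last 16.
fn dequant_q4_0_block(block: &[u8; 18], out: &mut [f32; 32]) {
    let scale = half::f16::from_le_bytes([block[0], block[1]]).to_f32();
    for i in 0..16 {
        let byte = block[2 + i];
        out[i] = scale * ((byte & 0x0F) as i32 - 8) as f32;
        out[i + 16] = scale * ((byte >> 4) as i32 - 8) as f32;
    }
}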

KV Cache

Autoregressive decoding with persistent key-value cache. Supports grouped-query attention (GQA) for models like Qwen2.5 and Llama 3.
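
Under grouped-query attention, several query heads share one key/value head, so the cache stores only n_kv_heads entries per layer (Llama 3 8B, for instance, uses 32 query heads and 8 KV heads). A minimal sketch of the head mapping, illustrative rather than realizar's internal layout:

/// Map a query head index to the KV head it attends over under GQA.
/// With 32 query heads and 8 KV heads, heads 0..4 share KV head 0, and so on.
fn kv_head_for(query_head: usize, n_q_heads: usize, n_kv_heads: usize) -> usize {
    assert_eq!(n_q_heads % n_kv_heads, 0);
    query_head / (n_q_heads / n_kv_heads)
}

Per layer, the persistent cache then grows by 2 * n_kv_heads * head_dim values per decoded token rather than 2 * n_q_heads * head_dim.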

Chat Templates

Automatic template detection from model metadata:

Format  | Models                     | System Prompt
------- | -------------------------- | -------------
ChatML  | Qwen2, Yi, OpenHermes      | Yes
Llama2  | TinyLlama, Vicuna, LLaMA 2 | Yes
Mistral | Mistral-7B, Mixtral        | No
Phi     | Phi-2, Phi-3               | Yes
Alpaca  | Alpaca, Guanaco            | Yes
Raw     | Fallback                   | Passthrough
Custom  | Any (Jinja2)               | Configurable

Feature Flags

Flag       | Description
---------- | -----------
default    | server + cli + gpu
cuda       | NVIDIA CUDA support (pure Rust PTX, no nvcc)
minimal    | Core inference only
bench-http | External server benchmarking
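
For example, to build only the core inference path (assuming the flag names above), disable default features in Cargo.toml:

[dependencies]
realizar = { version = "0.8", default-features = false, features = ["minimal"] }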

Benchmarks

LLM Inference (GPU)

Model                | Size | Format | Backend         | Throughput
-------------------- | ---- | ------ | --------------- | ----------
Qwen2.5-Coder Q4_K_M | 1.5B | APR    | RTX 4090 (CUDA) | 240 tok/s
Phi-2 Q4_K_M         | 2.7B | GGUF   | RTX 4090 (CUDA) | 276 tok/s
Phi-2 Q4_K_M         | 2.7B | GGUF   | llama.cpp CUDA  | 256 tok/s
Phi-2 Q4_K_M         | 2.7B | GGUF   | Ollama CUDA     | 228 tok/s

On the Phi-2 Q4_K_M benchmark above, realizar is roughly 8% faster than llama.cpp (276 vs 256 tok/s) and 21% faster than Ollama (276 vs 228 tok/s), using pure Rust CUDA PTX generation.

Classical ML (APR Format)

Model    | Parameters | Latency | Throughput
-------- | ---------- | ------- | ----------
Iris     | 131        | 103 ns  | 9.6M inferences/sec
MNIST    | 103K       | 73 µs   | 13.6K inferences/sec
Large NN | 1M         | 410 µs  | 2.4K inferences/sec

Methodology follows Hoefler & Belli (SC'15): coefficient-of-variation-based stopping, with warmup iterations discarded.
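
Concretely, CV-based stopping repeats a measurement until the coefficient of variation (standard deviation divided by mean) of the retained samples falls below a threshold. A minimal sketch of that rule, not the harness realizar actually uses:

/// Run `measure` until the coefficient of variation of the samples drops below
/// `cv_threshold` (e.g. 0.02), discarding `warmup` iterations first.
fn run_until_stable(
    mut measure: impl FnMut() -> f64,
    warmup: usize,
    cv_threshold: f64,
    max_iters: usize,
) -> Vec<f64> {
    for _ in 0..warmup {
        measure(); // warmup results are discarded
    }
    let mut samples = Vec::new();
    while samples.len() < max_iters {
        samples.push(measure());
        if samples.len() >= 10 {
            let mean = samples.iter().sum::<f64>() / samples.len() as f64;
            let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>()
                / samples.len() as f64;
            if var.sqrt() / mean < cv_threshold {
                break; // measurements are stable enough
            }
        }
    }
    samples
}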

Quality

  • 15,000+ tests across unit, integration, and property-based suites
  • 95%+ line coverage via cargo-llvm-cov
  • Zero clippy warnings with -D warnings
  • Mutation testing via cargo-mutants
  • Provable contracts -- 1,725 bindings with AllImplemented policy

Sovereign AI Stack

realizar is the inference layer of the PAIML Sovereign AI Stack:

Layer         | Crate       | Purpose
------------- | ----------- | -------
Compute       | trueno      | SIMD/GPU primitives (AVX2/AVX-512/NEON, wgpu)
ML            | aprender    | ML algorithms, APR v2 format
Training      | entrenar    | Autograd, LoRA/QLoRA, quantization
Inference     | realizar    | LLM inference, GPU kernels, model serving
Speech        | whisper-apr | Pure Rust Whisper ASR
Distribution  | repartir    | Distributed compute (CPU/GPU/Remote)
Registry      | pacha       | Model registry with Ed25519 signatures
Orchestration | batuta      | Stack coordination and CLI

Documentation

API documentation for the realizar crate is published on docs.rs.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines, or open an issue to discuss your idea first.

License

MIT -- Pragmatic AI Labs