Last Updated: 2025-11-06
The MLXR daemon is the background service that orchestrates inference requests, manages models, handles continuous batching, and provides REST APIs (OpenAI-compatible and Ollama-compatible).
## Scheduler

Status: ✅ Complete
- Continuous batching with prefill/decode queues
- KV block allocation and preemption
- Request state management (WAITING → PREFILLING → DECODING → COMPLETED)
- Batch formation with token budget constraints
- Statistics tracking (queue depths, KV utilization, throughput)
Key Features:
- Separate prefill and decode queues for optimal batching
- KV cache block allocator with free list
- Preemption policy for OOM scenarios
- Configurable batch size and token limits
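The token-budget constraint above can be sketched as a simple budgeted dequeue (a minimal illustration; `PendingRequest` and `form_batch` are invented names, not the scheduler's actual types):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical request: number of tokens to process in this step
// (prompt length for prefill, 1 for decode).
struct PendingRequest {
    int id;
    size_t tokens;
};

// Form a batch under a token budget and a max-batch-size cap.
// Requests that do not fit stay queued for the next round (FCFS, no skipping).
std::vector<PendingRequest> form_batch(std::deque<PendingRequest>& queue,
                                       size_t token_budget,
                                       size_t max_batch) {
    std::vector<PendingRequest> batch;
    size_t used = 0;
    while (!queue.empty() && batch.size() < max_batch) {
        const PendingRequest& front = queue.front();
        if (used + front.tokens > token_budget) break;  // budget exhausted
        used += front.tokens;
        batch.push_back(front);
        queue.pop_front();
    }
    return batch;
}
```

With the documented defaults (4096-token budget, batch of up to 64), a 3000-token prefill and a 2000-token prefill would not be co-scheduled; the second waits for the next batch.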
Files:
- `scheduler/scheduler.h` - Scheduler interface and config
- `scheduler/scheduler.cpp` - Full implementation (430 lines)
- `scheduler/request.h` - Request data structures
## REST Server (OpenAI API)

Status: ✅ Complete
- OpenAI-compatible API endpoints
- SSE streaming support for chat/completions
- Scheduler integration with token callbacks
- HTTP server using cpp-httplib
- CORS support and API key authentication
Endpoints Implemented:
- ✅ `POST /v1/chat/completions` - Chat with streaming support
- ✅ `POST /v1/completions` - Text completion with streaming
- ✅ `POST /v1/embeddings` - Generate embeddings
- ✅ `GET /v1/models` - List available models
- ✅ `GET /v1/models/:id` - Get model info
- ✅ `GET /health` - Health check
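As a usage reference, a minimal request body for the streaming chat endpoint might look like this (the model name and token limit are illustrative; `"stream": true` selects SSE):

```json
{
  "model": "tinyllama",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "stream": true,
  "max_tokens": 64
}
```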
Files:
- `server/rest_server.h` - REST server interface (345 lines)
- `server/rest_server.cpp` - Full implementation (1440 lines)
- `server/sse_stream.h` - SSE streaming utilities
- `server/sse_stream.cpp` - SSE implementation
## Scheduler Worker

Status: ✅ Complete
- Background thread that executes batches from scheduler
- Calls `Engine::forward_prefill()` and `Engine::forward_decode()`
- Manages per-request KV caches
- Token callback invocation for streaming
- Error handling and cache cleanup
Features:
- Continuous polling loop with 1ms sleep when idle
- Automatic cache initialization and cleanup
- Sampler integration with request parameters
- Graceful shutdown handling
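The polling and shutdown behavior described above can be sketched roughly like this (class and method names are invented for illustration; the real `SchedulerWorker` executes batches from the scheduler rather than the `poll_once()` stub):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Minimal polling-worker sketch: run loop on a background thread,
// back off 1 ms when idle, stop via flag + join for graceful shutdown.
class PollingWorker {
public:
    void start() {
        running_.store(true);
        thread_ = std::thread([this] { loop(); });
    }
    void stop() {
        running_.store(false);          // request shutdown
        if (thread_.joinable()) thread_.join();  // wait for loop to exit
    }
    int iterations() const { return iterations_.load(); }

private:
    void loop() {
        while (running_.load()) {
            bool had_work = poll_once();  // stand-in for batch execution
            iterations_.fetch_add(1);
            if (!had_work)                // idle: sleep 1 ms before re-polling
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }
    bool poll_once() { return false; }

    std::atomic<bool> running_{false};
    std::atomic<int> iterations_{0};
    std::thread thread_;
};
```

The flag-plus-join pattern is what makes repeated start/stop cycles (exercised in the unit tests below) safe.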
Files:
- `server/scheduler_worker.h` - Worker interface (102 lines)
- `server/scheduler_worker.cpp` - Full implementation (230 lines)
## Inference Engine

Status: ✅ Complete
- High-level inference API
- Prefill and decode phases with KV cache
- Integration with MLX model and tokenizer
- Sampling strategies (temperature, top-k, top-p)
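As a rough illustration of top-k sampling with temperature (a sketch only, not the Engine's actual Sampler API; the function name and signature are invented):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Apply temperature to the logits, keep the k largest, renormalize with a
// numerically stable softmax, and draw one token id from the result.
int sample_top_k(const std::vector<float>& logits, int k, float temperature,
                 std::mt19937& rng) {
    // Rank vocabulary indices by logit, descending; only the top k matter.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the top-k logits (max subtracted to avoid overflow).
    std::vector<double> probs(k);
    double max_logit = logits[idx[0]] / temperature;
    double sum = 0.0;
    for (int i = 0; i < k; ++i) {
        probs[i] = std::exp(logits[idx[i]] / temperature - max_logit);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;

    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return idx[dist(rng)];
}
```

Top-p (nucleus) sampling differs only in the cutoff rule: instead of a fixed k, keep the smallest prefix of the sorted distribution whose cumulative probability exceeds p.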
Key Methods:
- `forward_prefill(tokens, cache)` - Process prompt and populate KV cache
- `forward_decode(token, cache)` - Generate next token using cache
- `generate_tokens(tokens)` - Legacy full-sequence generation
Files:
- `core/runtime/engine.h` - Engine interface (191 lines)
- `core/runtime/engine.cpp` - Full implementation (247 lines)
## Test Daemon

Status: ✅ Complete and Running
- Main executable at `build/cmake/bin/test_daemon`
- Initializes scheduler, worker, and REST server
- Listens on port 11434 (Ollama default)
- Graceful shutdown on SIGINT/SIGTERM
- Currently runs in "mock mode" (no model loaded)
Verified Working:
- ✅ Server starts and listens on port 11434
- ✅ Health endpoint returns `{"status":"ok"}`
- ✅ Models endpoint returns empty list
- ✅ Graceful shutdown works correctly
- ✅ Scheduler worker thread starts/stops cleanly
- ✅ TinyLlama model loads successfully (201 tensors, 2.0GB)
- ✅ Inference pipeline works end-to-end (prefill + decode)
- ✅ GQA attention with 4 KV heads and 32 query heads
- ✅ KV cache populated correctly across all 22 layers
- ✅ Chat completion generates tokens successfully
## Ollama API

Status: 🚧 Headers defined, implementation ~70% complete
Endpoints Defined:
- `/api/generate` - Ollama-style text generation
- `/api/chat` - Ollama-style chat
- `/api/embeddings` - Ollama-style embeddings
- `/api/pull` - Download model from registry
- `/api/create` - Create model from Modelfile
- `/api/tags` - List local models
- `/api/ps` - List running models
- `/api/show` - Show model info
- `/api/copy` - Copy model
- `/api/delete` - Delete model
Files:
- `server/ollama_api.h` - Complete interface (295 lines)
- `server/ollama_api.cpp` - Partial implementation (687 lines)
TODO:
- ⚠️ Wire Ollama endpoints into REST server routing
- ⚠️ Integrate with scheduler for `/api/generate` and `/api/chat`
- ⚠️ Implement model management endpoints with registry
## Model Registry

Status: 🚧 Full implementation, needs integration
Features:
- SQLite-backed model catalog
- GGUF parser for model metadata extraction
- Model info, adapters, tags
- Query and filtering
Files:
- `registry/model_registry.h` - Registry interface (302 lines)
- `registry/model_registry.cpp` - Full implementation (748 lines)
- `registry/gguf_parser.h` - GGUF parser interface
- `registry/gguf_parser.cpp` - GGUF implementation (530 lines)
TODO:
- ⚠️ Initialize registry in daemon startup
- ⚠️ Auto-discover models in configured directories
- ⚠️ Integrate with Ollama API model management
## Telemetry

Status: 🚧 Full implementation, needs wiring
Features:
- Counter, Gauge, Histogram metrics types
- Metrics registry singleton
- Standard metrics defined (requests, tokens, latency, memory)
- SystemMonitor for CPU/GPU tracking
- Prometheus and JSON export
Files:
- `telemetry/metrics.h` - Metrics interface (274 lines)
- `telemetry/metrics.cpp` - Full implementation (510 lines)
TODO:
- ⚠️ Implement `SystemMonitor::monitor_loop()` body (CPU/GPU/memory tracking)
- ⚠️ Wire metrics into REST server handlers
- ⚠️ Add metrics endpoint (`GET /metrics` for Prometheus format)
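For the planned `GET /metrics` endpoint, Prometheus's text exposition format looks like this (`requests_total` and `active_requests` are the metric names mentioned in this document; the labels and values here are illustrative):

```text
# HELP requests_total Total inference requests handled
# TYPE requests_total counter
requests_total{endpoint="/v1/chat/completions"} 42
# HELP active_requests Requests currently in flight
# TYPE active_requests gauge
active_requests 3
```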
## YAML Configuration

Status: ⏳ Not started
Required:
- `configs/server.yaml` parser
- Load server config (port, bind address, API key)
- Load scheduler config (batch sizes, KV blocks)
- Load model search paths
Suggested Library: yaml-cpp (available via Homebrew)
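A possible `configs/server.yaml` layout, using the defaults mentioned elsewhere in this document (the key names and the models search path are assumptions until the parser is written):

```yaml
# Illustrative layout, not a finalized schema
server:
  port: 11434
  bind_address: 127.0.0.1
  api_key: ""            # empty disables auth
scheduler:
  max_batch_size: 64
  token_budget: 4096
  kv_blocks: 1024
  kv_block_size: 16
models:
  search_paths:
    - ~/Library/Application Support/MLXRunner/models
```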
## Model Loading

Status: ⏳ Stub only
Required:
- Load GGUF models via mmap
- Load safetensors models
- Create Engine from loaded model
- Integrate with registry
Note: Current test_daemon runs without a loaded model (mock mode).
## Ollama Route Wiring

Status: ⏳ Handlers exist but not wired into server
Required:
- Add Ollama routes to `rest_server.cpp` alongside OpenAI routes
- Map `/api/*` paths to `OllamaAPIHandler` methods
- Enable SSE streaming for Ollama endpoints
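The path-to-handler mapping could be sketched with a simple dispatch table (the types here are simplified stand-ins; the real server routes through cpp-httplib request/response objects):

```cpp
#include <functional>
#include <map>
#include <string>

// Simplified stand-in for OllamaAPIHandler: handlers take a request body
// and return a response body.
struct OllamaHandlerSketch {
    std::string handle_generate(const std::string& body) { return "generate:" + body; }
    std::string handle_chat(const std::string& body)     { return "chat:" + body; }
};

using Route = std::function<std::string(const std::string&)>;

// Build the /api/* routing table; the server would register each entry
// as a POST route on the underlying HTTP server.
std::map<std::string, Route> build_ollama_routes(OllamaHandlerSketch& h) {
    return {
        {"/api/generate", [&h](const std::string& b) { return h.handle_generate(b); }},
        {"/api/chat",     [&h](const std::string& b) { return h.handle_chat(b); }},
    };
}
```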
Request flow (chat completion):

HTTP Request (POST /v1/chat/completions)
↓
RestServer::handle_chat_completion()
↓
Create scheduler::Request with token_callback
↓
scheduler->submit_request(request)
↓
Scheduler::get_next_batch() [continuous batching]
↓
SchedulerWorker::execute_batch()
↓
engine->forward_prefill(tokens, cache) [first token]
↓
engine->forward_decode(token, cache) [subsequent tokens]
↓
Sampler::sample(logits, context) → next_token
↓
request->add_generated_token(token)
↓
token_callback(token_id, finished)
↓
SSE chunk sent to client
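At the last step, each generated token reaches the client as an OpenAI-style SSE chunk, followed by a terminator event (field values here are illustrative):

```text
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: [DONE]
```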
Startup flow:

Daemon Startup
↓
Load server.yaml config
↓
Initialize ModelRegistry (SQLite)
↓
Scan models directory
↓
Load default model via GGUF/Safetensors loader
↓
Create Engine(model, tokenizer)
↓
Initialize Scheduler
↓
Start SchedulerWorker(scheduler, engine)
↓
Start RestServer (OpenAI + Ollama routes)
↓
Listen for HTTP requests
Next Steps:

- **Add YAML Config Support**
  - Install `yaml-cpp`: `brew install yaml-cpp`
  - Create `ConfigLoader` class
  - Parse `configs/server.yaml`
  - Load into `ServerConfig` and `SchedulerConfig`
- **Create Model Loader Utility**
  - `core/runtime/model_loader.h/.cpp`
  - `load_model_from_gguf(path) -> shared_ptr<LlamaModel>`
  - `load_model_from_safetensors(path) -> shared_ptr<LlamaModel>`
  - Integrate with `graph::load_llama_model()`
- **Update `test_daemon_main.cpp`**
  - Load config from YAML
  - Initialize model registry
  - Load a test model (e.g., TinyLlama)
  - Pass engine to SchedulerWorker
  - Test real inference flow
- **Wire Ollama API Routes**
  - Add Ollama endpoints to `rest_server.cpp`
  - Create `OllamaAPIHandler` instance
  - Route `/api/*` paths to handler methods
  - Set scheduler on Ollama handler
- **Implement SystemMonitor**
  - Complete `monitor_loop()` with macOS system APIs
  - Use `host_statistics()` for CPU/memory
  - Use `IOKit` or `Metal` APIs for GPU stats
  - Poll every 1 second, update gauges
- **Wire Metrics into REST Server**
  - Increment `requests_total` counter on each request
  - Record `request_duration_ms` histogram
  - Track `time_to_first_token_ms` in streaming
  - Update `active_requests` gauge
- **Add Metrics Endpoint**
  - `GET /metrics` → Prometheus format
  - `GET /v1/metrics` → JSON format
  - Enable via config flag
- **Create `mlxrunnerd` Main**
  - Production daemon binary (not test)
  - Config file path via CLI arg
  - Logging to `~/Library/Logs/mlxrunnerd.log`
  - PID file for process management
- **launchd Agent**
  - Create `.plist` for `~/Library/LaunchAgents/`
  - Auto-start on login
  - Respawn on crash
  - Environment variables
- **Unix Domain Socket Support**
  - Optional UDS listener in addition to HTTP
  - Path: `~/Library/Application Support/MLXRunner/run/mlxrunner.sock`
  - Capability token auth
File: tests/unit/scheduler_test.cpp
- ✅ Construction and initialization
- ✅ Request submission and state management
- ✅ Batch scheduling (prefill queue)
- ✅ Request cancellation
- ✅ Request lookup by ID
- ✅ KV block allocation and deallocation
- ✅ KV block exhaustion handling
- ✅ Concurrent request submission (thread safety - 4 threads, 100 requests)
- ✅ Scheduler shutdown
⚠️ 2 minor expectation mismatches (non-blocking)
File: tests/unit/scheduler_worker_test.cpp
- ✅ Worker construction with scheduler and engine
- ✅ Start/stop lifecycle
- ✅ Multiple start/stop cycles
- ✅ Worker thread stability
- ✅ Request processing without engine (graceful degradation)
- ✅ Multiple requests without engine
- ✅ Stop while processing (clean shutdown)
- ✅ Repeated start/stop cycles (5 iterations)
- ✅ Scheduler shutdown coordination
Key Fix: Added null engine checks in scheduler_worker.cpp to prevent segfaults during testing without a full inference engine.
- Daemon starts and stops cleanly
- Health endpoint returns 200 OK
- Models endpoint returns empty list (no models loaded)
- Scheduler creates batches correctly
- SchedulerWorker thread starts and polls
- REST server handles CORS
- Graceful shutdown via SIGINT
- Load TinyLlama model successfully ✅ COMPLETED 2025-11-06
- Generate text via `/v1/chat/completions` ✅ COMPLETED 2025-11-06
- Streaming SSE works end-to-end
- Concurrent requests batch correctly
- KV cache blocks allocated and freed
- Preemption works when KV blocks exhausted
- Ollama `/api/generate` endpoint works
- Model registry CRUD operations
- Metrics endpoint returns valid Prometheus format
- SystemMonitor reports CPU/GPU stats
Known Issues:

- GQA Reshape Error (2025-11-06) - FIXED
  - Issue: `[reshape] Cannot reshape array of size 2304 into shape (1,9,32,64)`
  - Cause: MLX lazy evaluation creating non-contiguous tensors after repeat operations
  - Fix: Added strategic `mlx::core::eval()` calls after repeat and before concatenation
  - Documentation: docs/GQA_RESHAPE_FIX.md
| Component | Header | Implementation | Status |
|---|---|---|---|
| Scheduler | scheduler/scheduler.h | scheduler/scheduler.cpp | ✅ Complete |
| REST Server | server/rest_server.h | server/rest_server.cpp | ✅ Complete |
| Scheduler Worker | server/scheduler_worker.h | server/scheduler_worker.cpp | ✅ Complete |
| SSE Streaming | server/sse_stream.h | server/sse_stream.cpp | ✅ Complete |
| Inference Engine | core/runtime/engine.h | core/runtime/engine.cpp | ✅ Complete |
| Ollama API | server/ollama_api.h | server/ollama_api.cpp | 🚧 Partial |
| Model Registry | registry/model_registry.h | registry/model_registry.cpp | 🚧 Complete, needs integration |
| GGUF Parser | registry/gguf_parser.h | registry/gguf_parser.cpp | ✅ Complete |
| Telemetry | telemetry/metrics.h | telemetry/metrics.cpp | 🚧 Complete, needs wiring |
| Test Daemon | N/A | test_daemon_main.cpp | ✅ Complete |
| CMake Build | N/A | daemon/CMakeLists.txt | ✅ Complete |
The current implementation is designed for:
- Batch size: Up to 64 concurrent requests
- Token budget: 4096 tokens/batch (prefill + decode)
- KV blocks: 1024 blocks × 16 tokens = 16,384 cached tokens
- Latency target: <80ms per decode token
- Throughput: Optimized for M4 GPU (Metal kernels in Phase 2)
The MLXR daemon core is fully functional! ✅
The scheduler, REST server, scheduler worker, and inference engine all work together correctly. The daemon starts, listens for HTTP requests, and gracefully shuts down. The continuous batching logic is complete with KV cache management.
What's next:
- Load a real model (TinyLlama or Llama-2-7B)
- Test end-to-end inference
- Wire Ollama API endpoints
- Add telemetry and metrics
- Production deployment with config files
The foundation is solid. We're ready to move from "mock mode" to real inference! 🚀