Mistral.rs provides a lightweight OpenAI API-compatible HTTP server based on axum. The request and response formats are supersets of the OpenAI API.
The API consists of the following endpoints. They can be viewed interactively in your browser by going to http://localhost:<port>/docs.
ℹ️ Besides the HTTP endpoints described below, `mistralrs-server` can also expose the same functionality via the MCP protocol. Enable it with `--mcp-port <port>` and see MCP_SERVER.md for details.
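For example, a sketch of launching both servers at once (the model name is a placeholder, and the flag placement before the `run` subcommand is an assumption based on the serve command shown later in this document):

```bash
# Serve the HTTP API on port 8080 and the MCP protocol on port 4321.
./mistralrs-server --port 8080 --mcp-port 4321 run -m meta-llama/Llama-3.2-3B-Instruct
```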
To support additional features, we have extended the completion and chat completion request objects. Both have the same keys added:
- `top_k`: `int|null`. If non-null, it is only relevant if positive.
- `grammar`: `{"type": "regex" | "lark" | "json_schema" | "llguidance", "value": string}` or `null`. Grammar to use. This is mutually exclusive with the OpenAI-compatible `response_format`.
- `min_p`: `float|null`. If non-null, it is only relevant if `1 >= min_p >= 0`.
- `enable_thinking`: `bool`, defaults to `false`. Enables thinking for models that support it.
- `truncate_sequence`: `bool|null`. When `true`, requests that exceed the model context length will be truncated instead of rejected; otherwise the server returns a validation error. Embedding requests truncate tokens at the end of the prompt, while chat/completion requests truncate tokens at the start of the prompt.
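For illustration, a sketch of a chat completion request exercising several of these extensions (the sampling values are arbitrary, and the `grammar` value here is a minimal JSON schema passed as a string):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Reply with a JSON object."}],
    "top_k": 40,
    "min_p": 0.05,
    "enable_thinking": false,
    "truncate_sequence": true,
    "grammar": {"type": "json_schema", "value": "{\"type\": \"object\"}"}
  }'
```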
Mistral.rs validates that the `model` parameter in API requests matches the model that was actually loaded by the server, ensuring requests are processed by the intended model.
Behavior:
- If the `model` parameter matches the loaded model name, the request proceeds normally
- If the `model` parameter doesn't match, the request fails with an error message indicating the mismatch
- The special model name `"default"` can be used to bypass this validation entirely
Examples:
- ✅ Request with `"model": "meta-llama/Llama-3.2-3B-Instruct"` when `meta-llama/Llama-3.2-3B-Instruct` is loaded → succeeds
- ❌ Request with `"model": "gpt-4"` when `mistral-7b-instruct` is loaded → fails
- ✅ Request with `"model": "default"` regardless of loaded model → always succeeds
Usage: Use `"default"` in the `model` field when you need to satisfy API clients that require a `model` parameter but don't need to specify a particular model. This is demonstrated in all the examples below.
Process an OpenAI-compatible chat completion request, returning an OpenAI-compatible response when finished. Please find the official OpenAI API documentation here. To control the interval at which keep-alive messages are sent, set the `KEEP_ALIVE_INTERVAL` environment variable to the desired time in ms.
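For example, to emit a keep-alive message every 5000 ms, set the variable when launching the server (the model name is a placeholder; the `run` subcommand mirrors the serve command shown later in this document):

```bash
KEEP_ALIVE_INTERVAL=5000 ./mistralrs-server run -m meta-llama/Llama-3.2-3B-Instruct
```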
To send a request with the Python openai library:
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are Mistral.rs, an AI assistant."},
        {"role": "user", "content": "Write a story about Rust error handling."},
    ],
)

print(completion.choices[0].message)
```

Or with curl:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "messages": [
      {
        "role": "system",
        "content": "You are Mistral.rs, an AI assistant."
      },
      {
        "role": "user",
        "content": "Write a story about Rust error handling."
      }
    ]
  }'
```

A streaming request can also be created by setting `"stream": true` in the request JSON. Please see this guide.
ℹ️ Requests whose prompt exceeds the model's maximum context length now fail unless you opt in to truncation. Set `"truncate_sequence": true` to drop the oldest prompt tokens while reserving room (equal to `max_tokens` when provided, otherwise one token) for generation. Specifically, tokens from the front of the prompt are dropped.
Returns the running models.
Example with curl:
```bash
curl http://localhost:<port>/v1/models
```

Returns the server health.
Example with curl:
```bash
curl http://localhost:<port>/health
```

Returns the OpenAPI docs via SwaggerUI.
Example with curl:
```bash
curl http://localhost:<port>/docs
```

Process an OpenAI-compatible completions request, returning an OpenAI-compatible response when finished. Please find the official OpenAI API documentation here.
To send a request with the Python openai library:
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="EMPTY",
)

completion = client.completions.create(
    model="default",
    prompt="What is Rust?",
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)

# Completions responses expose the generated text via `.text`, not `.message`.
print(completion.choices[0].text)
```

Or with curl:
```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "prompt": "What is Rust?"
  }'
```

ℹ️ The `truncate_sequence` flag behaves the same way for the completions endpoint: keep it `false` (default) to receive a validation error, or set it to `true` to trim the prompt automatically.
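For instance, the same request opting into truncation (the prompt is a stand-in for one that exceeds the context length):

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "prompt": "<a prompt longer than the model context length>",
    "max_tokens": 64,
    "truncate_sequence": true
  }'
```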
Serve an embedding model (for example, EmbeddingGemma) to enable this endpoint:
```bash
./mistralrs-server run -m google/embeddinggemma-300m
```

In multi-model mode, include an `Embedding` entry in your selector config to expose it alongside chat models.
Create vector embeddings via the OpenAI-compatible endpoint. Supported request fields:
- `input`: a single string, an array of strings, an array of token IDs (`[123, 456]`), or a batch of token arrays (`[[...], [...]]`).
- `encoding_format`: `"float"` (default) returns arrays of `f32`; `"base64"` returns Base64 strings.
- `dimensions`: currently unsupported; providing it yields a validation error.
- `truncate_sequence`: `bool`, default `false`. Set to `true` to clip over-length prompts instead of receiving a validation error.
ℹ️ Requests whose prompt exceeds the model's maximum context length now fail unless you opt in to truncation. Embedding requests truncate tokens from the end of the prompt.
Example (Python openai client):
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

result = client.embeddings.create(
    model="default",
    input=[
        "Embeddings capture semantic relationships between texts.",
        "What is graphene?",
    ],
    # truncate_sequence is a mistral.rs extension, so the openai client
    # must send it via extra_body rather than as a named argument.
    extra_body={"truncate_sequence": True},
)

for item in result.data:
    print(item.index, len(item.embedding))
```

Example with curl:
```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "input": ["graphene conductivity", "superconductor basics"],
    "encoding_format": "base64",
    "truncate_sequence": false
  }'
```

Responses follow the OpenAI schema: `object: "list"`, `data[*].embedding` containing either float arrays or Base64 strings depending on `encoding_format`, and a `usage` block (`prompt_tokens`, `total_tokens`). At present those counters report 0 because token accounting for embeddings is not yet implemented.
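When `encoding_format` is `"base64"`, clients must decode each embedding back into floats themselves. A minimal sketch, assuming the payload is a raw little-endian `f32` buffer as in the OpenAI API:

```python
import base64

import numpy as np
import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    headers={"Authorization": "Bearer EMPTY"},
    json={
        "model": "default",
        "input": ["graphene conductivity"],
        "encoding_format": "base64",
    },
)
resp.raise_for_status()
for item in resp.json()["data"]:
    # Decode the Base64 string into a float32 vector.
    vec = np.frombuffer(base64.b64decode(item["embedding"]), dtype=np.float32)
    print(item["index"], vec.shape)
```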
Create a response using the OpenAI-compatible Responses API. Please find the official OpenAI API documentation here.
To send a request with the Python openai library:
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

# First turn
resp1 = client.responses.create(
    model="default",
    input="Apples are delicious!",
)
print(resp1.output_text)

# Follow-up - no need to resend the first message
resp2 = client.responses.create(
    model="default",
    previous_response_id=resp1.id,
    input="Can you eat them?",
)
print(resp2.output_text)
```

Or with curl:
```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "input": "Tell me about Rust programming"
  }'

# Follow-up using previous_response_id
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "previous_response_id": "resp_12345-uuid-here",
    "input": "What makes it memory safe?"
  }'
```

The API also supports multimodal inputs (images, audio) and streaming responses by setting `"stream": true` in the request JSON.
ℹ️ The Responses API forwards `truncate_sequence` to the underlying chat completions. Enable it if you want over-length conversations to be truncated rather than rejected.
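A minimal streaming sketch with the Python client, assuming the server emits the OpenAI Responses streaming event types:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

# stream=True yields server-sent events as the response is generated.
stream = client.responses.create(
    model="default",
    input="Tell me about Rust programming",
    stream=True,
)
for event in stream:
    # Print incremental text deltas; other event types mark lifecycle stages.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```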
Retrieve a previously created response by its ID.
Example with curl:
```bash
curl http://localhost:8080/v1/responses/resp_12345-uuid-here \
  -H "Authorization: Bearer EMPTY"
```

Delete a stored response and its associated conversation history.
Example with curl:
```bash
curl -X DELETE http://localhost:8080/v1/responses/resp_12345-uuid-here \
  -H "Authorization: Bearer EMPTY"
```

Reapply ISQ to the model if possible. Pass a JSON object mapping the key `ggml_type` to a string naming the quantization level.
Example with curl:
```bash
curl http://localhost:<port>/re_isq \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{"ggml_type": "4"}'
```