GGUF REST API Server

This repository is provided as an example/experimental project.

Feel free to copy, modify, and use any part of the code without restriction.

This example is a FastAPI-based REST API server that serves a Qwen3 thinking model in GGUF format, with llama-cpp-python as the inference backend.

Model training

If you want to train a model, feel free to use my notebook.

Or use Unsloth via

Installation

Create and activate 'venv' (not mandatory)

python3 -m venv venv

# Linux/macOS
source venv/bin/activate

# Windows
venv\Scripts\activate

Install from requirements.txt

pip install -r requirements.txt

Usage

Download model

Edit repo_id and filename in download_model.py
Model page used as an example: https://huggingface.co/TeichAI/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-GGUF

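from huggingface_hub import hf_hub_download

# Download the GGUF file from the Hugging Face Hub into ./models
model_path = hf_hub_download(
    repo_id="TeichAI/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-GGUF",
    filename="Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill.iq4_nl.gguf",
    local_dir="./models",
)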

Start the model download

python3 download_model.py

After the download completes, check that the model file (GGUF) is in the ./models folder.
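That check can also be done from Python. The following is a minimal sketch; the helper name `find_gguf_files` is hypothetical and not part of the repository, and it only assumes that the download lands in `./models`, as set by `local_dir` in the snippet above.

```python
from pathlib import Path


def find_gguf_files(models_dir: str = "./models") -> list[Path]:
    """Return all .gguf files under models_dir, sorted by path."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.gguf"))


if __name__ == "__main__":
    files = find_gguf_files()
    if not files:
        print("No GGUF file found in ./models; rerun download_model.py.")
    for path in files:
        size_gb = path.stat().st_size / 1e9
        print(f"{path} ({size_gb:.2f} GB)")
```

The path it prints is the value to use for MODEL_PATH in the next step.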

Set up environment variables

Open .env.example and adjust the values to match your model:

MODEL_PATH=./models/your-downloaded-model-file.gguf
MODEL_NAME=your-downloaded-model-name
HOST=0.0.0.0
PORT=8000
N_CTX=8192 # Model context window
N_THREADS=4
N_GPU_LAYERS=0
MAX_TOKENS=2048

Save the file, then rename it from .env.example to .env
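To illustrate how these variables could be consumed, here is a rough sketch that reads them into a typed settings object. It is not taken from main.py: the `Settings` class and `load_settings` function are hypothetical names, and the sketch assumes the variables are already present in the process environment (for example via a dotenv-style loader). The fallback values simply mirror .env.example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    model_name: str
    host: str
    port: int
    n_ctx: int
    n_threads: int
    n_gpu_layers: int
    max_tokens: int


def _get_int(env: Mapping[str, str], key: str, default: int) -> int:
    # Tolerate a trailing "# comment" in case the loader passed it through.
    raw = env.get(key, str(default))
    return int(raw.split("#", 1)[0].strip())


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the variables listed in .env.example, falling back to its sample values."""
    env = os.environ if env is None else env
    return Settings(
        model_path=env.get("MODEL_PATH", "./models/your-downloaded-model-file.gguf"),
        model_name=env.get("MODEL_NAME", "your-downloaded-model-name"),
        host=env.get("HOST", "0.0.0.0"),
        port=_get_int(env, "PORT", 8000),
        n_ctx=_get_int(env, "N_CTX", 8192),
        n_threads=_get_int(env, "N_THREADS", 4),
        n_gpu_layers=_get_int(env, "N_GPU_LAYERS", 0),
        max_tokens=_get_int(env, "MAX_TOKENS", 2048),
    )
```

N_CTX, N_THREADS, and N_GPU_LAYERS correspond by name to the `n_ctx`, `n_threads`, and `n_gpu_layers` arguments of llama-cpp-python's `Llama` constructor; whether main.py forwards them exactly this way is an assumption.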

Start server

python3 main.py

API endpoints

Check server and model status
GET /health

OpenAI-compatible chat endpoint
POST /v1/chat/completions

Raw text completion endpoint
POST /v1/completions
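For calling these endpoints from Python rather than curl, a small standard-library client might look like the sketch below. The names `BASE_URL`, `build_request`, and `call` are illustrative and not part of the repository. The sketch assumes the server listens on localhost:8000 (the PORT value from the example configuration) and that every endpoint answers with a JSON body, which is shown in this README for the two completion endpoints but not for /health.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumed; matches PORT=8000 in .env.example


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request (no payload) or a JSON POST request (with payload)."""
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(path: str, payload: Optional[dict] = None, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call("/health")` queries the status endpoint, and `call("/v1/completions", {"prompt": "What is 18 divided by 2?", "max_tokens": 64})` sends a raw completion request.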

Endpoint examples

/v1/chat/completions

Example request:

curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "messages": [{"role": "user", "content": "What is 3x3?"}],
    "max_tokens": 64,
    "strip_thinking": false,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "stream": false
}'

curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "messages": [{"role": "user", "content": "<|system|>\nYou are an AI assistant.\n<|user|>\nWhy trees occasionally resist stillness?\n<|assistant|>\n"}],
    "stop": ["<|user|>", "<|system|>"],
    "max_tokens": 128,
    "strip_thinking": false,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "stream": false
}'

Example response:

{
    "id": "chatcmpl-abc123",
    "model": "qwen3-4b-thinking",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "It expands to nine."},
        "finish_reason": "stop"
    }],
    "usage": {"prompt_tokens": 16, "completion_tokens": 10, "total_tokens": 26}
}
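The requests above set strip_thinking to false, so the returned content may include the model's reasoning along with the final answer. The sketch below separates the two on the client side. It assumes the reasoning is delimited by `<think>` and `</think>` tags, as is typical for Qwen3 thinking models, and that the opening tag may be missing from the output; how the server's own strip_thinking option works is not documented here. The function names are hypothetical.

```python
def assistant_text(response: dict) -> str:
    """Pull the assistant message out of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def split_thinking(content: str) -> tuple[str, str]:
    """Split model output into (thinking, answer) at the last </think> tag."""
    head, sep, tail = content.rpartition("</think>")
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", content.strip()
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()
```

For example, `split_thinking(assistant_text(response))` returns an empty thinking string and the unchanged answer when the response contains no `</think>` tag.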

/v1/completions

Example request:

curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "prompt": "What is 18 divided by 2?",
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": false
}'

curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "prompt": "<|system|>\nYou are an AI assistant.\n<|user|>\nWhy trees occasionally resist stillness?\n<|assistant|>\n",
    "stop": ["<|user|>", "<|system|>"],
    "max_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": false
}'

Example response:

{
    "model": "qwen3-4b-thinking",
    "choices": [{"text": "It equals 9.", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 4, "completion_tokens": 50, "total_tokens": 54}
}
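The second request above builds its prompt by hand from `<|system|>`, `<|user|>`, and `<|assistant|>` tags. A small helper can assemble the same payload, as sketched below. The tag format is copied from the example request and may not be the model's native chat template; `build_prompt` and `build_completion_payload` are hypothetical names.

```python
STOP_SEQUENCES = ["<|user|>", "<|system|>"]  # same stop list as the example request


def build_prompt(user: str, system: str = "You are an AI assistant.") -> str:
    """Assemble a raw prompt in the tag format used by the example request."""
    return f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"


def build_completion_payload(user: str, max_tokens: int = 128) -> dict:
    """Build a JSON-serializable body for POST /v1/completions."""
    return {
        "prompt": build_prompt(user),
        "stop": list(STOP_SEQUENCES),
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
        "stream": False,
    }
```

The stop sequences keep the model from continuing into a new user or system turn after it has answered.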

Closing

This project is an experimental setup exploring a FastAPI-based REST API powered by a Qwen3 thinking model in GGUF format, using llama-cpp-python as the inference backend. It’s a work in progress, intended for testing ideas and learning, so expect changes and rough edges. Contributions, feedback, and experimentation are always welcome.

