InferHarness

A local-first tool for testing, comparing, and evaluating AI models and inference servers.

InferHarness helps you answer practical questions before you rely on a model in real work:

Does this model answer consistently?
Does it follow the format I asked for?
Does it call tools correctly?
Is it fast enough on my machine or server?
Did a model, prompt, or inference-server upgrade change behavior?

All data stays on your machine. InferHarness does not require a cloud account, hosted service, or telemetry connection. It runs as a browser interface backed by a local API and a local SQLite database.

Why InferHarness Exists

Local AI stacks are powerful, but they are not automatically predictable. Two models can expose the same OpenAI-compatible API and still behave differently. The same model can also behave differently depending on whether it is served by Ollama, LM Studio, llama.cpp, vLLM, or another runtime.

InferHarness was created to make those differences visible. It started as a way to investigate compatibility and tool-calling behavior across local inference servers and open-weight models. It has grown into a local evaluation environment for checking model quality, response format, latency, throughput, tool use, and regressions over time.

The goal is simple: make model and server decisions based on repeatable evidence instead of one-off manual prompt checks.

Who It Is For

InferHarness is useful for:

AI engineers comparing local models and inference servers.
Product teams validating whether a model is reliable enough for a workflow.
Developers testing prompts, tool calls, structured output, and regression behavior.
Researchers or hobbyists who want local, inspectable evaluation data.
Teams that cannot send prompts, responses, or datasets to a hosted evaluation service.

You do not need to understand the internal architecture to use the app. You define what you want to test, choose which model or server should run it, and review the results.

What You Can Test

InferHarness can help evaluate:

Answer quality: accuracy, relevance, coherence, completeness, and helpfulness.
Performance: time to first token, total latency, token counts, and tokens per second.
Format compliance: whether the model returns valid JSON or another expected structure.
Tool calling: whether the model calls the right tool with the right arguments.
Regression risk: whether behavior changed after a model, prompt, or server update.
Server differences: whether the same model behaves differently across runtimes.
Model metadata: model format, base model identity, quantization, capabilities, and architecture details.
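
As a rough illustration of the performance metrics listed above, the sketch below derives time to first token, total latency, and tokens per second from three timestamps. The `Timing` class, its field names, and the choice to divide completion tokens by the decode window are assumptions made for this example, not a description of how InferHarness computes its metrics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Timing:
    """Hypothetical timestamps (in seconds) captured around one streamed request."""
    request_sent: float
    first_token: float
    last_token: float
    completion_tokens: int


def summarize(t: Timing) -> dict:
    ttft = t.first_token - t.request_sent
    total = t.last_token - t.request_sent
    decode = t.last_token - t.first_token
    # Guard against a zero-length decode window (for example, a one-token reply).
    tps: Optional[float] = t.completion_tokens / decode if decode > 0 else None
    return {"ttft_s": ttft, "total_latency_s": total, "tokens_per_second": tps}


example = summarize(
    Timing(request_sent=0.0, first_token=0.5, last_token=2.5, completion_tokens=100)
)
```

Other definitions are possible, such as dividing by total latency instead of decode time, so it is worth checking which one a given tool reports before comparing numbers.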

How the Test Pipeline Works

InferHarness tests are designed to be repeatable and comparable.

You define a reusable test: the prompt, dataset, expected behavior, metrics, and pass conditions.
You choose the target model, inference server, and runtime settings.
InferHarness records the exact model, server, settings, and dataset proof used for the run.
The run executes against the selected model or models.
InferHarness stores the raw responses, normalized results, metrics, warnings, and errors.
You compare results over time, across models, or across inference servers.

This means a result is more than a screenshot or a manually copied answer. It is a recorded run with enough context to explain what happened and compare it with future runs.
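
To make the idea of a recorded run more concrete, here is a minimal sketch of what such a record might contain. It assumes the "dataset proof" can be represented by a SHA-256 digest of the dataset bytes; that choice, the function names, and the record keys are all illustrative and are not taken from the InferHarness schema.

```python
import hashlib
import json


def dataset_fingerprint(raw: bytes) -> str:
    """Hash the dataset bytes so a later run can prove it used the same data."""
    return hashlib.sha256(raw).hexdigest()


def build_run_record(model: str, server: str, settings: dict, dataset_bytes: bytes) -> dict:
    # Hypothetical record layout: everything needed to explain and repeat the run.
    return {
        "model": model,
        "server": server,
        "settings": settings,
        "dataset_sha256": dataset_fingerprint(dataset_bytes),
    }


record = build_run_record(
    "example-model",
    "ollama-local",
    {"temperature": 0.0},
    b'{"text": "hi", "label": "greeting"}\n',
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Two runs whose records carry the same digest used byte-identical datasets, which is what makes their results comparable.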

Main Features

Server and model management: Register local or remote inference servers, discover available models, and maintain a model catalog with provider, format, quantization, capabilities, and base-model metadata.

Reusable test definitions: Create tests for one prompt, a dataset loop, tool-calling behavior, structured output, or multi-model comparisons.

Benchmark runs: Run the same test against one model, many models, or the same model served by different inference servers.

Automated metrics: Capture time to first token, total latency, prefill/decode timing, prompt tokens, completion tokens, and tokens per second.

Qualitative evaluation: Score model answers on accuracy, relevance, coherence, completeness, and helpfulness. Compare Mode runs the same prompt across up to four models side by side.

Leaderboard: Rank evaluated models by composite qualitative score and filter by date range or tag.

Model architecture inspection: Inspect supported open-weight models without loading weight tensors. For Hugging Face models and local GGUF files, InferHarness can show a layer tree and parameter summaries.
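
The leaderboard described above ranks models by a composite qualitative score. The README does not say how the composite is computed, so the sketch below assumes the simplest option, an unweighted mean of the five dimensions. The function names and sample scores are hypothetical.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "relevance", "coherence", "completeness", "helpfulness")


def composite(scores: dict) -> float:
    """Assumed composite: unweighted mean of the five qualitative dimensions."""
    return mean(scores[d] for d in DIMENSIONS)


def rank(evaluations: dict) -> list:
    """Return (model, composite) pairs, best first."""
    pairs = ((name, composite(scores)) for name, scores in evaluations.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


leaderboard = rank({
    "model-a": {"accuracy": 5, "relevance": 4, "coherence": 4, "completeness": 5, "helpfulness": 4},
    "model-b": {"accuracy": 3, "relevance": 4, "coherence": 3, "completeness": 3, "helpfulness": 4},
})
```

A weighted mean would fit the same shape; only `composite` would need to change.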

Example Test Definitions

These examples show the kinds of tests a user can define.

Single prompt regression: Check whether a model still answers one important prompt correctly.

Question: Does the model answer our support escalation prompt correctly?
Input: one customer-support scenario
Expected behavior: includes the required policy decision and avoids forbidden claims
Metrics: latency, output tokens, answer quality score
Pass condition: required decision is present and no forbidden claim appears

Dataset benchmark: Run the same task across a file of examples and aggregate the results.

Question: Can the model classify 1,000 support tickets accurately?
Dataset: JSONL, CSV, or JSON file with labeled examples
Expected behavior: returns the correct category for each ticket
Metrics: accuracy, invalid response rate, average latency, p95 latency
Pass condition: accuracy is at least 92% and invalid responses stay below 1%
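
The aggregation behind a dataset benchmark like this one can be sketched in a few lines. The row layout (`expected`, `predicted`, `latency_s`) and the nearest-rank p95 are assumptions chosen for the example; the thresholds come from the pass condition above.

```python
def p95(values: list) -> float:
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def aggregate(rows: list, valid_labels: set) -> dict:
    total = len(rows)
    correct = sum(1 for r in rows if r["predicted"] == r["expected"])
    invalid = sum(1 for r in rows if r["predicted"] not in valid_labels)
    accuracy = correct / total
    invalid_rate = invalid / total
    latencies = [r["latency_s"] for r in rows]
    return {
        "accuracy": accuracy,
        "invalid_rate": invalid_rate,
        "avg_latency_s": sum(latencies) / total,
        "p95_latency_s": p95(latencies),
        # Pass condition from the example: accuracy >= 92% and invalid responses < 1%.
        "passed": accuracy >= 0.92 and invalid_rate < 0.01,
    }


# Synthetic data: 100 tickets, 95 classified correctly, latencies of 1..100 seconds.
rows = [
    {"expected": "billing", "predicted": "billing" if i < 95 else "shipping", "latency_s": float(i + 1)}
    for i in range(100)
]
summary = aggregate(rows, valid_labels={"billing", "shipping"})
```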

Tool-calling compliance: Check whether a model calls the right tool with the right arguments.

Question: Can the model schedule a meeting using the available calendar tool?
Input: user asks for a meeting with date, attendees, and topic
Expected behavior: calls create_calendar_event once
Metrics: tool called, tool name match, argument validity, extra tool calls
Pass condition: exactly one valid tool call and no premature final answer
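
A check along these lines could look like the sketch below. It assumes the assistant message follows the OpenAI-compatible chat format, where `tool_calls` is a list and each call carries `function.name` and a JSON-encoded `function.arguments` string. The required argument names and the result keys are hypothetical.

```python
import json


def check_tool_call(message: dict, expected_name: str, required_args: set) -> dict:
    calls = message.get("tool_calls") or []
    result = {
        "tool_called": len(calls) > 0,
        "extra_tool_calls": max(len(calls) - 1, 0),
        "tool_name_match": False,
        "arguments_valid": False,
        # Any non-empty text alongside the call counts as a premature final answer.
        "premature_answer": bool((message.get("content") or "").strip()),
    }
    if calls:
        fn = calls[0].get("function", {})
        result["tool_name_match"] = fn.get("name") == expected_name
        try:
            args = json.loads(fn.get("arguments", ""))
        except (TypeError, json.JSONDecodeError):
            args = None
        result["arguments_valid"] = isinstance(args, dict) and set(required_args).issubset(args)
    result["passed"] = (
        len(calls) == 1
        and result["tool_name_match"]
        and result["arguments_valid"]
        and not result["premature_answer"]
    )
    return result


good = check_tool_call(
    {
        "content": None,
        "tool_calls": [{
            "type": "function",
            "function": {
                "name": "create_calendar_event",
                "arguments": '{"date": "2025-03-01", "attendees": ["a@example.com"], "topic": "Sync"}',
            },
        }],
    },
    expected_name="create_calendar_event",
    required_args={"date", "attendees", "topic"},
)
bad = check_tool_call({"content": "Sure, I booked it!"}, "create_calendar_event", {"date"})
```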

Structured output validation: Check whether a model returns data in the shape your application expects.

Question: Can the model extract invoice fields as valid JSON?
Input: invoice text
Expected behavior: JSON with invoice_number, vendor_name, amount, currency, due_date
Metrics: JSON parse success, schema validity, field completeness, extraction accuracy
Pass condition: output is valid JSON and all required fields are present
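
The pass condition above can be expressed as a small validator. This is a minimal sketch that covers JSON parse success and field completeness only; it does not check types or extraction accuracy, and the function name and result keys are made up for the example.

```python
import json

REQUIRED_FIELDS = ("invoice_number", "vendor_name", "amount", "currency", "due_date")


def validate_invoice_output(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"json_parse_success": False, "missing_fields": list(REQUIRED_FIELDS), "passed": False}
    if not isinstance(data, dict):
        # Valid JSON, but not an object, so no field can be present.
        return {"json_parse_success": True, "missing_fields": list(REQUIRED_FIELDS), "passed": False}
    # Treat null and empty-string values as missing.
    missing = [f for f in REQUIRED_FIELDS if data.get(f) in (None, "")]
    return {"json_parse_success": True, "missing_fields": missing, "passed": not missing}


ok = validate_invoice_output(
    '{"invoice_number": "INV-1", "vendor_name": "Acme", "amount": 120.5, '
    '"currency": "EUR", "due_date": "2025-04-30"}'
)
partial = validate_invoice_output('{"invoice_number": "INV-1", "amount": 120.5}')
garbage = validate_invoice_output("Here is the invoice: INV-1")
```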

Multi-model comparison: Run the same test against several models or servers.

Question: Which local setup gives the best quality and speed for coding tasks?
Test: generate a unit test from a function signature
Targets: the same model served by Ollama, LM Studio, and llama-server
Metrics: pass rate, compile success, latency, token usage
Output: one result per target, then a comparison table

Typical Use Cases

Compare inference servers: Register Ollama, LM Studio, and llama-server, point them at the same base model, and compare latency, throughput, and output quality.

Validate tool-calling behavior: Test whether different models call the expected function with the expected arguments before you use tool calling in an application.

Check structured output reliability: Measure how often a model returns valid JSON or another required response format.

Run regression tests before upgrades: Keep a fixed test suite for important prompts and run it before and after changing a model, prompt, quantization, or inference-server version.

Build a local model leaderboard: Score models with the same prompts and compare results over time.
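
For the regression use case, the core comparison is small: find the cases that passed before the change and fail after it. The sketch below assumes each run can be reduced to a mapping from a test-case ID to a pass/fail flag; the IDs and function name are hypothetical.

```python
def find_regressions(baseline: dict, candidate: dict) -> list:
    """Return test-case IDs that passed in the baseline run but not in the candidate run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        # A case missing from the candidate run is treated as a failure.
        if passed and not candidate.get(case_id, False)
    )


before = {"escalation-01": True, "refund-02": True, "tone-03": False}
after = {"escalation-01": True, "refund-02": False, "tone-03": True}
regressions = find_regressions(before, after)
```

Cases that newly pass (`tone-03` here) are improvements rather than regressions, so they are deliberately left out of the result.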

Supported Inference Servers and Cloud Providers

InferHarness supports local inference servers and native cloud provider APIs.

Local inference servers

Server	API family	Notes
Ollama	Ollama + OpenAI-compatible	Model discovery via `/api/tags`
LM Studio	OpenAI-compatible	Serves local GGUF and MLX models
llama-server (llama.cpp)	OpenAI-compatible	Single-model, low-level inference
vLLM	OpenAI-compatible	High-throughput GPU inference
Inferencer	OpenAI-compatible + Ollama	High-end MLX inference server
Any OpenAI/Ollama-compatible server	OpenAI/Ollama-compatible	Custom auth header and token supported
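
As an illustration of discovery through Ollama's `/api/tags`, the sketch below parses a response body into catalog-style entries. The sample payload is abbreviated and hand-written, and the output keys are hypothetical; no network call is made.

```python
import json


def parse_ollama_tags(payload: str) -> list:
    """Extract a few catalog fields from an Ollama /api/tags response body."""
    body = json.loads(payload)
    models = []
    for entry in body.get("models", []):
        details = entry.get("details") or {}
        models.append({
            "name": entry.get("name"),
            "format": details.get("format"),
            "quantization": details.get("quantization_level"),
        })
    return models


# Abbreviated, hand-written sample in the shape Ollama returns.
sample = json.dumps({
    "models": [
        {"name": "llama3:latest", "details": {"format": "gguf", "quantization_level": "Q4_0"}},
        {"name": "custom:latest"},
    ]
})
discovered = parse_ollama_tags(sample)
```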

Cloud providers

Provider	API family	Discovery endpoint	Auth
Anthropic	Anthropic native	`/v1/models`	`x-api-key` header
Google Gemini	Gemini native	`/v1beta/models`	`x-goog-api-key` header
OpenAI	OpenAI-compatible	`/v1/models`	Bearer token
Mistral	OpenAI-compatible	`/v1/models`	Bearer token
Groq	OpenAI-compatible	`/v1/models`	Bearer token
Together AI	OpenAI-compatible	`/v1/models`	Bearer token
Fireworks AI	OpenAI-compatible	`/v1/models`	Bearer token
OpenRouter	OpenAI-compatible	`/v1/models`	Bearer token
DeepSeek	OpenAI-compatible	`/v1/models`	Bearer token
xAI	OpenAI-compatible	`/v1/models`	Bearer token
Cerebras	OpenAI-compatible	`/v1/models`	Bearer token

Cloud providers use the direct public API (not Bedrock or Vertex AI). Tokens are never stored in plaintext; use the token_env field to reference an environment variable.
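
The sketch below shows one way a `token_env` reference could be resolved into the auth header listed in the table above. Only `token_env` and the header names come from this README; the `api_family` key, its values, and the function name are assumptions for illustration.

```python
import os
from typing import Mapping


def auth_headers(server: dict, env: Mapping[str, str] = os.environ) -> dict:
    """Look up the token by environment-variable name and build the auth header."""
    token = env.get(server["token_env"])
    if not token:
        raise KeyError(f"environment variable {server['token_env']} is not set")
    family = server.get("api_family")  # hypothetical config key
    if family == "anthropic":
        return {"x-api-key": token}
    if family == "gemini":
        return {"x-goog-api-key": token}
    # OpenAI-compatible providers use a bearer token.
    return {"Authorization": f"Bearer {token}"}


fake_env = {"ANTHROPIC_API_KEY": "sk-test", "OPENAI_API_KEY": "sk-other"}
anthropic = auth_headers({"token_env": "ANTHROPIC_API_KEY", "api_family": "anthropic"}, fake_env)
openai = auth_headers({"token_env": "OPENAI_API_KEY", "api_family": "openai"}, fake_env)
```

Because the configuration holds only the variable name, the secret itself never has to be written to disk with the server definition.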

Model formats supported in the catalog include GGUF, MLX, GPTQ, AWQ, and SafeTensors.

Screenshots

Catalog (Servers): Browse, add, and probe inference servers
Catalog (Models): Filter and inspect discovered models
Templates: Author reusable JSON and Python test templates
Run: Execute templates against one or more models
Results: Dashboard, history, and leaderboard
Evaluate: Score model responses on five qualitative dimensions

Setup

Requirements:

Node.js 22.19 or newer, below Node.js 26.
Python 3.10 or newer for model architecture inspection and Python-based tests.
At least one inference server if you want to run live model tests.

Install dependencies:

npm install
pip install -r backend/src/scripts/requirements.txt
cp .env.example .env

Edit .env and set at least INFERHARNESS_API_TOKEN.

Run

Development

npm run dev

To run the backend on a different port, set both the backend PORT and the frontend API base URL. For example, to use port 9090:

PORT=9090 VITE_INFERHARNESS_API_BASE_URL=http://localhost:9090 npm run dev

For a persistent local setup, put the same values in .env:

PORT=9090
VITE_INFERHARNESS_API_BASE_URL=http://localhost:9090

If you start services separately, use the same pairing:

PORT=9090 npm run start:backend
VITE_INFERHARNESS_API_BASE_URL=http://localhost:9090 npm -w frontend run dev
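
A mismatch between the two values is easy to introduce, so a quick sanity check can help. The sketch below is a hypothetical helper, not part of InferHarness; it compares the backend `PORT` with the port in `VITE_INFERHARNESS_API_BASE_URL`, using the documented defaults when a variable is unset.

```python
from typing import Mapping
from urllib.parse import urlsplit


def ports_agree(env: Mapping[str, str]) -> bool:
    """True when the frontend's API base URL points at the backend's port."""
    backend_port = int(env.get("PORT", "8080"))
    api_url = urlsplit(env.get("VITE_INFERHARNESS_API_BASE_URL", "http://localhost:8080"))
    # urlsplit(...).port is None when the URL has no explicit port.
    return api_url.port == backend_port


matched = ports_agree({"PORT": "9090", "VITE_INFERHARNESS_API_BASE_URL": "http://localhost:9090"})
mismatched = ports_agree({"PORT": "9090"})
```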

Production build

npm ci
pip install -r backend/src/scripts/requirements.txt
npm run build
npm start

Tests

npm -w backend run test
npm -w frontend run test

Environment Variables

Required

Variable	Default	Description
`INFERHARNESS_API_TOKEN`	-	Shared token for API authentication.

App URLs and ports

Variable	Default	Description
`PORT`	`8080`	Backend HTTP port.
`VITE_INFERHARNESS_API_BASE_URL`	`http://localhost:8080`	Backend URL used by the browser.
`VITE_INFERHARNESS_FRONTEND_BASE_URL`	`http://localhost:5173`	Frontend base URL.
`VITE_INFERHARNESS_API_TOKEN`	-	Alternate frontend token environment variable.

Storage and retention

Variable	Default	Description
`INFERHARNESS_DB_PATH`	`./backend/data/db/inferharness.sqlite`	SQLite database file path.
`INFERHARNESS_TEST_TEMPLATES_DIR`	`./backend/data/templates`	Test template storage directory.
`RETENTION_DAYS`	`30`	Days to keep run results.
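
To show what a retention window of `RETENTION_DAYS` means in practice, the sketch below computes the cutoff before which a run would be considered expired. How and when InferHarness actually prunes results is not described here, so treat the function names and the comparison as assumptions.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping


def retention_cutoff(now: datetime, env: Mapping[str, str] = os.environ) -> datetime:
    """Runs that finished before this moment fall outside the retention window."""
    days = int(env.get("RETENTION_DAYS", "30"))  # documented default is 30
    return now - timedelta(days=days)


def is_expired(finished_at: datetime, now: datetime, env: Mapping[str, str] = os.environ) -> bool:
    return finished_at < retention_cutoff(now, env)


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
cutoff = retention_cutoff(now, {})
```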

Inference connectivity

Variable	Default	Description
`INFERHARNESS_HEALTH_POLL_INTERVAL`	`30`	Seconds between inference-server health checks.
`INFERHARNESS_CONTEXT_PROBE_TIMEOUT_MS`	`300000`	Context probe and discovery timeout in milliseconds.
`CONNECTIVITY_TIMEOUT_MS`	`5000`	HTTP connectivity probe timeout in milliseconds.
`INFERHARNESS_BENCHMARK_DATASET_ROOT`	-	Absolute directory for server-side benchmark dataset files used by manifest-only runs.
`INFERHARNESS_INFERENCE_PROXY`	-	HTTP proxy for outbound inference-server requests.
`INFERHARNESS_INFERENCE_NO_PROXY`	`localhost,127.0.0.1`	Comma-separated no-proxy exceptions.
`INFERHARNESS_INFERENCE_TLS_INSECURE`	`false`	Set to `true` to disable TLS certificate verification for outbound inference-server requests, equivalent to curl `--insecure`.
`INFERHARNESS_PROXY_PERPLEXITY_DATASET`	-	Path to dataset file used by the proxy perplexity test protocol.
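
The following sketch reads several of the connectivity variables above and applies the documented defaults. It only illustrates how the values and units fit together; the function name, the returned keys, and the parsing rules (for example, treating only `true` as enabling insecure TLS) are assumptions.

```python
import os
from typing import Mapping


def load_connectivity_settings(env: Mapping[str, str] = os.environ) -> dict:
    no_proxy_raw = env.get("INFERHARNESS_INFERENCE_NO_PROXY", "localhost,127.0.0.1")
    return {
        "health_poll_interval_s": int(env.get("INFERHARNESS_HEALTH_POLL_INTERVAL", "30")),
        "context_probe_timeout_ms": int(env.get("INFERHARNESS_CONTEXT_PROBE_TIMEOUT_MS", "300000")),
        "connectivity_timeout_ms": int(env.get("CONNECTIVITY_TIMEOUT_MS", "5000")),
        "proxy": env.get("INFERHARNESS_INFERENCE_PROXY") or None,
        # Split the comma-separated list and drop empty entries.
        "no_proxy": [host.strip() for host in no_proxy_raw.split(",") if host.strip()],
        "tls_insecure": env.get("INFERHARNESS_INFERENCE_TLS_INSECURE", "false").strip().lower() == "true",
    }


defaults = load_connectivity_settings({})
custom = load_connectivity_settings({
    "INFERHARNESS_INFERENCE_NO_PROXY": "localhost, 10.0.0.5 ,",
    "INFERHARNESS_INFERENCE_TLS_INSECURE": "TRUE",
})
```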

Python and model inspection

Variable	Default	Description
`INFERHARNESS_PYTHON_BIN`	`python3`	Python executable for subprocesses.
`HF_TOKEN` / `HUGGINGFACE_HUB_TOKEN`	-	Hugging Face token for gated model inspection.

Test-only

Variable	Default	Description
`INFERHARNESS_DRY_RUN`	-	Set to `1` to skip live HTTP calls in tests.

Technical Architecture

InferHarness runs as a local web application.

Browser UI
-> local backend API
-> SQLite database, local files, inference servers, optional Python subprocess

Frontend: React single-page application served by Vite. It talks to the backend through the API and does not access the database directly.

Backend: Fastify HTTP server responsible for server registration, model discovery, test execution, evaluation records, leaderboard data, and persistence.

Persistence: SQLite stores application data. Local files store templates, cached metadata, and generated artifacts.

Python subprocess: Used when a feature needs Python tooling, such as architecture inspection or Python-based test logic. Architecture inspection reads model configuration or GGUF metadata without loading model weights.

Inference servers: External local or remote servers provide model inference through OpenAI-compatible, Ollama-compatible, Anthropic native, or Gemini native HTTP APIs.
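
To illustrate why GGUF metadata can be read without loading weights, as noted for the Python subprocess above, the sketch below parses only the fixed 24-byte header at the start of a GGUF file. It assumes a little-endian file using the version 2 or later layout (64-bit counts), and it is not how InferHarness or the `gguf` package implements inspection.

```python
import io
import struct
from typing import BinaryIO


def read_gguf_header(stream: BinaryIO) -> dict:
    """Read the fixed GGUF header: magic, version, tensor count, metadata key/value count."""
    header = stream.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # uint32 version, then two uint64 counts, all little-endian.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    if version < 2:
        raise ValueError("this sketch only handles GGUF version 2 or later")
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


# Synthetic header followed by padding; a real file would continue with metadata entries.
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24) + b"\x00" * 16)
info = read_gguf_header(fake_file)
```

Only 24 bytes are read here, and the metadata entries that follow the header also sit before the tensor data, which is why inspection can stay cheap even for very large model files.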

Technical Stack

Layer	Technology
Runtime	Node.js 22+
Language	TypeScript 5
Backend framework	Fastify
Persistence	SQLite (better-sqlite3)
Frontend	React 18, Vite 8, Tailwind CSS
Architecture inspection	Python 3.10+, `transformers`, `gguf`
Unit tests	Vitest
End-to-end tests	Playwright

For Contributors

The active backend schema catalog is documented in backend/src/schemas/README.md.

Troubleshooting

401 Unauthorized - confirm INFERHARNESS_API_TOKEN in .env matches the token used by the client.
409 Conflict with "Inference server has existing runs" - servers with runs must be archived, not deleted.
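no such table - delete the SQLite file and restart; the schema is applied on startup. Deleting the file also discards any data stored in it, so back it up first if you need the existing data.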
python3 not found - install Python 3.10+ and verify it is on PATH, or set INFERHARNESS_PYTHON_BIN.
