Quality control for AI pipelines — one MCP tool. Works with Claude Code, Claude Desktop, and any MCP client.
29.5% of teams do NO evaluation of AI outputs. (LangChain Survey) Knowledge workers spend 4.3 hours/week fact-checking AI outputs. (Microsoft 2025)
AgentDesk MCP fixes this. Add independent adversarial review to any AI pipeline in 30 seconds.
Quick start with npx:

```bash
npx @ezark-publish/agentdesk-mcp
```

Add to Claude Code:

```bash
claude mcp add agentdesk-mcp -- npx @ezark-publish/agentdesk-mcp
```

Or add to your MCP client configuration:

```json
{
  "mcpServers": {
    "agentdesk-mcp": {
      "command": "npx",
      "args": ["-y", "@ezark-publish/agentdesk-mcp"],
      "env": { "ANTHROPIC_API_KEY": "sk-ant-..." }
    }
  }
}
```

Run as an HTTP server for remote access, Smithery hosting, or multi-client setups:
```bash
# Start with HTTP transport on port 3100
MCP_HTTP_PORT=3100 npx @ezark-publish/agentdesk-mcp

# Or use the --http flag (defaults to port 3100)
npx @ezark-publish/agentdesk-mcp --http
```

- MCP endpoint: `POST http://localhost:3100/mcp`
- Health check: `GET http://localhost:3100/health`
Install from GitHub:

```bash
npm install github:Rih0z/agentdesk-mcp
```

Requires the `ANTHROPIC_API_KEY` environment variable (uses your own key: BYOK).
`review_output`: Adversarial quality review of any AI-generated output. An independent reviewer assumes the author made mistakes and actively looks for problems.
Input:
| Parameter | Required | Description |
|---|---|---|
| `output` | Yes | The AI-generated output to review |
| `criteria` | No | Custom review criteria |
| `review_type` | No | Category: code, content, factual, translation, etc. |
| `model` | No | Reviewer model (default: claude-sonnet-4-6) |
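For illustration, an MCP client invokes the tool with a standard `tools/call` JSON-RPC request; the argument values below are placeholders:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "review_output",
    "arguments": {
      "output": "The Eiffel Tower was completed in 1890.",
      "review_type": "factual"
    }
  }
}
```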
Output:
```json
{
  "verdict": "PASS | FAIL | CONDITIONAL_PASS",
  "score": 82,
  "issues": [
    {
      "severity": "high",
      "category": "accuracy",
      "description": "Claim about X is unsupported",
      "suggestion": "Add citation or remove claim"
    }
  ],
  "checklist": [
    {
      "item": "Factual accuracy",
      "status": "pass",
      "evidence": "All statistics match cited sources"
    }
  ],
  "summary": "Overall assessment...",
  "reviewer_model": "claude-sonnet-4-6"
}
```

`review_dual`: Dual adversarial review. Two independent reviewers assess the output from different angles, then a merge agent combines their findings.
- If either reviewer finds a critical issue → merged verdict is FAIL
- Takes the lower score
- Combines and deduplicates all issues
Use for high-stakes outputs where quality is critical.
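The merge rules above can be sketched as follows. This is a hypothetical illustration, not the actual AgentDesk implementation; the type names and the verdict tie-breaking (stricter verdict wins when no issue is critical) are assumptions:

```typescript
// Sketch of the dual-review merge rules (assumed shapes, not the real code).
type Verdict = "PASS" | "CONDITIONAL_PASS" | "FAIL";

interface Issue {
  severity: "low" | "medium" | "high" | "critical";
  category: string;
  description: string;
}

interface Review {
  verdict: Verdict;
  score: number;
  issues: Issue[];
}

// Higher rank = stricter verdict.
const RANK: Record<Verdict, number> = { PASS: 0, CONDITIONAL_PASS: 1, FAIL: 2 };

function mergeReviews(a: Review, b: Review): Review {
  // Combine and deduplicate issues by description.
  const issues: Issue[] = [];
  const seen = new Set<string>();
  for (const issue of [...a.issues, ...b.issues]) {
    if (!seen.has(issue.description)) {
      seen.add(issue.description);
      issues.push(issue);
    }
  }
  // Any critical issue forces FAIL; otherwise keep the stricter verdict.
  const hasCritical = issues.some((i) => i.severity === "critical");
  const verdict: Verdict = hasCritical
    ? "FAIL"
    : RANK[a.verdict] >= RANK[b.verdict]
      ? a.verdict
      : b.verdict;
  // Take the lower score.
  return { verdict, score: Math.min(a.score, b.score), issues };
}
```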
Same parameters as `review_output`.
- Adversarial prompting: The reviewer is instructed to assume mistakes were made. No benefit of the doubt.
- Evidence-based checklist: Every PASS item requires specific evidence. Items without evidence are automatically downgraded to FAIL.
- Anti-gaming validation: If >30% of checklist items lack evidence, the entire review is forced to FAIL with a capped score of 50.
- Structured output: Verdict + numeric score + categorized issues + checklist (not just "looks good").
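The evidence and anti-gaming rules above can be sketched like this. It is a hypothetical illustration under assumed field names (matching the example output schema), not the actual AgentDesk implementation:

```typescript
// Sketch of evidence-based downgrading and anti-gaming validation (assumed shapes).
interface ChecklistItem {
  item: string;
  status: "pass" | "fail";
  evidence?: string;
}

interface Validated {
  verdict: "PASS" | "CONDITIONAL_PASS" | "FAIL";
  score: number;
  checklist: ChecklistItem[];
}

function validateReview(
  verdict: Validated["verdict"],
  score: number,
  checklist: ChecklistItem[]
): Validated {
  // A PASS item with no supporting evidence is downgraded to FAIL.
  const items = checklist.map((c) =>
    c.status === "pass" && !c.evidence?.trim()
      ? { ...c, status: "fail" as const }
      : c
  );
  // If more than 30% of items lack evidence, force FAIL and cap the score at 50.
  const missing = items.filter((c) => !c.evidence?.trim()).length;
  if (items.length > 0 && missing / items.length > 0.3) {
    return { verdict: "FAIL", score: Math.min(score, 50), checklist: items };
  }
  return { verdict, score, checklist: items };
}
```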
- Code review: Check for bugs, security issues, performance problems
- Content review: Verify accuracy, readability, SEO, audience fit
- Factual verification: Validate claims in AI-generated text
- Translation quality: Check accuracy and naturalness
- Data extraction: Verify completeness and correctness
- Any AI output: Summaries, reports, proposals, emails, etc.
Self-review has systematic leniency bias. An LLM reviewing its own output shares the same blind spots that created the errors. Research shows models are 34% more likely to use confident language when hallucinating.
AgentDesk uses a separate reviewer invocation with adversarial prompting — fundamentally different from self-review.
| Feature | AgentDesk MCP | Manual prompt | Braintrust | DeepEval |
|---|---|---|---|---|
| One-tool setup | Yes | No | No | No |
| Adversarial review | Yes | DIY | No | No |
| Dual reviewer | Yes | DIY | No | No |
| Anti-gaming validation | Yes | No | No | No |
| No SDK required | Yes | Yes | No | No |
| MCP native | Yes | No | No | No |
- Prompt injection: Like all LLM-as-judge systems, adversarial inputs could attempt to manipulate reviewer verdicts. The anti-gaming validation layer mitigates superficial gaming, but determined adversarial inputs remain a challenge. For high-stakes use cases, combine with deterministic validation.
- BYOK cost: each `review_output` call makes 1 LLM API call; `review_dual` makes 3. Factor this into your pipeline costs.
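A quick way to budget API calls, using the per-tool call counts above (the function name is illustrative):

```typescript
// Per-tool LLM API call counts, as documented above.
const CALLS_PER_TOOL = { review_output: 1, review_dual: 3 } as const;

// Estimate total API calls for a pipeline run.
function totalApiCalls(usage: { review_output?: number; review_dual?: number }): number {
  return (
    (usage.review_output ?? 0) * CALLS_PER_TOOL.review_output +
    (usage.review_dual ?? 0) * CALLS_PER_TOOL.review_dual
  );
}

// 100 single reviews plus 20 dual reviews: 100 * 1 + 20 * 3 = 160 calls.
```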
For teams that prefer HTTP integration, a hosted REST API with additional features (agent marketplace, context learning, workflows) is available at agentdesk.usedevtools.com.
```bash
git clone https://github.com/Rih0z/agentdesk-mcp.git
cd agentdesk-mcp
npm install
npm test        # 35 tests
npm run build
```

License: MIT
Built by EZARK Consulting | Web Version