
Sentient-SDK

A developer-first SDK for automated RAG evaluation using LLM-as-a-Judge and Deterministic Guards.



The Problem

RAG (Retrieval-Augmented Generation) systems have a critical flaw: hallucinations kill trust.

When your LLM generates a response, how do you know if it's:

  • Faithful to the retrieved context?
  • Relevant to the user's query?
  • Free of hallucinations?

Manually reviewing responses doesn't scale, and traditional NLP metrics such as n-gram overlap don't capture semantic meaning. You need a systematic, automated approach.

The Solution

Sentient-SDK automates the Evaluation (Evals) phase of your RAG lifecycle:

import { evaluate } from 'sentient-sdk';

const result = await evaluate(
  {
    context: 'Paris is the capital of France, established in 508 AD.',
    query: 'What is the capital of France?',
    response: 'The capital of France is Paris, a city founded in ancient Roman times.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);

console.log(result);
// {
//   verdict: 'FAIL',
//   faithfulness: { score: 0.6, rationale: 'Incorrect founding date claim...' },
//   hallucination: { detected: true, evidence: '"ancient Roman times"...' },
//   guards: { piiLeak: false, forbiddenTerms: [] },
//   latencyMs: 1234,
// }

Architecture

Application
  └── uses Sentient SDK
          ├── Judge (LLM-based)        → OpenAI / Claude
          ├── Guards (Deterministic)   → PII / Forbidden Terms
          ├── Scorer (combines signals)→ Configurable thresholds
          ├── Reporter (structured)    → JSON output
          └── Shadow Runner (async)    → Non-blocking evaluation

Clean Architecture

sentient-sdk/
├── src/
│   ├── domain/           # Core types & interfaces
│   │   ├── types.ts
│   │   ├── Judge.ts
│   │   └── Guard.ts
│   ├── application/      # Use cases
│   │   ├── EvaluateRAG.ts
│   │   ├── ScoringPolicy.ts
│   │   └── ShadowRunner.ts
│   ├── infrastructure/   # Implementations
│   │   ├── judges/
│   │   │   ├── OpenAIJudge.ts
│   │   │   └── ClaudeJudge.ts
│   │   ├── guards/
│   │   │   ├── PiiGuard.ts
│   │   │   └── ForbiddenTermsGuard.ts
│   │   └── reporters/
│   │       └── JsonReporter.ts
│   ├── cli/
│   │   └── shadow.ts
│   └── index.ts
└── tests/

Installation

npm install sentient-sdk
# or
pnpm add sentient-sdk

Quick Start

Basic Evaluation

import { evaluate } from 'sentient-sdk';

const result = await evaluate(
  {
    context: 'The Eiffel Tower is 330 meters tall.',
    query: 'How tall is the Eiffel Tower?',
    response: 'The Eiffel Tower is 330 meters tall.',
  },
  {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  }
);

if (result.verdict === 'PASS') {
  console.log('Response is reliable');
} else {
  console.log('Response failed evaluation');
  console.log('Reason:', result.faithfulness.rationale);
}

Advanced Configuration

import { EvaluateRAG, OpenAIJudge, PiiGuard, ForbiddenTermsGuard, ScoringPolicy } from 'sentient-sdk';

// Create a custom evaluator
const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({
    apiKey: process.env.OPENAI_API_KEY,
    model: 'gpt-4o',
    temperature: 0,
  }),
  guards: [
    new PiiGuard({ types: ['email', 'phone', 'ssn'] }),
    new ForbiddenTermsGuard({
      terms: ['competitor_name', 'internal_secret'],
      caseInsensitive: true,
    }),
  ],
  scoringPolicy: new ScoringPolicy({
    faithfulnessThreshold: 0.8,  // Stricter than default 0.7
    relevanceThreshold: 0.6,
    failOnGuardViolation: true,
  }),
});

const result = await evaluator.run({ context, query, response });

Shadow Testing (Production)

Run evaluations on a percentage of live traffic without affecting latency:

import { ShadowRunner, EvaluateRAG, OpenAIJudge, PiiGuard } from 'sentient-sdk';

const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({ apiKey: process.env.OPENAI_API_KEY }),
  guards: [new PiiGuard()],
});

const shadow = new ShadowRunner({
  evaluator,
  sampleRate: 0.1, // 10% of requests
  onResult: (result, input) => {
    // Log to your observability platform
    metrics.record('rag.evaluation', {
      verdict: result.verdict,
      faithfulness: result.faithfulness.score,
      latencyMs: result.latencyMs,
    });
  },
  onError: (error, input) => {
    logger.error('Evaluation failed', { error, input });
  },
});

// In your RAG handler:
app.post('/chat', async (req, res) => {
  const response = await generateRAGResponse(req.body);
  
  // Fire-and-forget - doesn't block the response
  shadow.maybeEvaluate({
    context: response.retrievedContext,
    query: req.body.query,
    response: response.text,
  });
  
  return res.json(response);
});

CLI Usage

Evaluate a Single Response

export OPENAI_API_KEY="sk-..."

sentient eval \
  --context "Paris is the capital of France." \
  --query "What is the capital of France?" \
  --response "The capital of France is Paris."

Shadow Evaluation from JSONL

# Input file: evaluations.jsonl
# {"context": "...", "query": "...", "response": "..."}
# {"context": "...", "query": "...", "response": "..."}

sentient shadow \
  --input evaluations.jsonl \
  --output results.jsonl \
  --sample-rate 1.0 \
  --judge openai \
  --verbose

Evaluation Result Schema

interface EvaluationResult {
  // LLM-based evaluations
  faithfulness: {
    score: number;           // 0-1, higher is more faithful
    rationale: string;       // Explanation
    unsupportedClaims?: string[];
  };
  relevance: {
    score: number;           // 0-1, higher is more relevant
    rationale?: string;
  };
  hallucination: {
    detected: boolean;
    confidence: number;      // 0-1
    evidence?: string;       // Quote of hallucinated content
  };
  
  // Deterministic checks
  guards: {
    piiLeak: boolean;
    forbiddenTerms: string[];
    piiDetails?: string[];
  };
  
  // Final verdict
  verdict: 'PASS' | 'FAIL';
  evaluatedAt: string;       // ISO timestamp
  latencyMs: number;
}
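
As an illustration of consuming this shape, a caller might surface the rationale, unsupported claims, and guard hits when a response fails. This is a minimal sketch: logFailure is a hypothetical helper, and the type import assumes EvaluationResult is exported from the package.

import type { EvaluationResult } from 'sentient-sdk';

// Hypothetical helper that reports why a response failed, using only fields from the schema above.
function logFailure(result: EvaluationResult): void {
  if (result.verdict === 'PASS') return;

  console.warn('Faithfulness rationale:', result.faithfulness.rationale);
  for (const claim of result.faithfulness.unsupportedClaims ?? []) {
    console.warn('Unsupported claim:', claim);
  }
  if (result.hallucination.detected) {
    console.warn('Hallucination evidence:', result.hallucination.evidence);
  }
  if (result.guards.piiLeak) {
    console.warn('PII detected:', result.guards.piiDetails);
  }
  if (result.guards.forbiddenTerms.length > 0) {
    console.warn('Forbidden terms:', result.guards.forbiddenTerms.join(', '));
  }
}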

Testing Philosophy

This SDK follows TDD (Test-Driven Development):

  1. Tests are written before implementation
  2. Every feature has corresponding test cases
  3. Mock judges make tests deterministic and fast

# Run all tests
pnpm test

# Run with coverage
pnpm test:coverage

# Run specific test files
pnpm test:run tests/domain

Current test suite: 68 tests across 8 test files.
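
As an illustration of the mock-judge approach, a test can stub the judge so no LLM call is made. The stub's shape and the use of Vitest here are assumptions for illustration, not the SDK's actual test setup.

import { describe, expect, it } from 'vitest';
import { EvaluateRAG } from 'sentient-sdk';

// A stub judge returning fixed scores so the test never calls a real LLM.
// Its shape is illustrative; the real contract lives in src/domain/Judge.ts.
const stubJudge = {
  evaluate: async () => ({
    faithfulness: { score: 1, rationale: 'Response matches the context verbatim.' },
    relevance: { score: 1 },
    hallucination: { detected: false, confidence: 0.99 },
  }),
};

describe('EvaluateRAG with a stub judge', () => {
  it('passes a faithful response', async () => {
    const evaluator = new EvaluateRAG({ judge: stubJudge as any, guards: [] });
    const result = await evaluator.run({
      context: 'The Eiffel Tower is 330 meters tall.',
      query: 'How tall is the Eiffel Tower?',
      response: 'The Eiffel Tower is 330 meters tall.',
    });
    expect(result.verdict).toBe('PASS');
  });
});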


Supported Judges

Judge         Model                 Use Case
OpenAIJudge   GPT-4o (default)      High accuracy, production use
ClaudeJudge   Claude 3.5 Sonnet     Alternative provider
Custom        Any Judge interface   Bring your own LLM
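
A bring-your-own-LLM judge wraps any model behind the Judge interface. In this sketch the evaluate() name, its signature, and the endpoint URL are assumptions for illustration; see src/domain/Judge.ts for the real contract.

import { EvaluateRAG } from 'sentient-sdk';

// Hypothetical custom judge wrapping an in-house model endpoint.
// The evaluate() name, its signature, and the URL are placeholders for illustration.
const myJudge = {
  evaluate: async (input: { context: string; query: string; response: string }) => {
    const reply = await fetch('https://llm.internal.example/judge', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    });
    return reply.json(); // expected to match the judge-produced portion of EvaluationResult
  },
};

const evaluator = new EvaluateRAG({ judge: myJudge as any, guards: [] });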

Built-in Guards

Guard                 Detects
PiiGuard              Emails, phones, SSNs, credit cards, IPs
ForbiddenTermsGuard   Custom banned words/phrases
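
For example, a response that leaks an email address should trip PiiGuard and, with failOnGuardViolation enabled, produce a FAIL verdict. This sketch reuses only options shown earlier in this README and assumes ScoringPolicy falls back to its default thresholds when only failOnGuardViolation is set.

import { EvaluateRAG, OpenAIJudge, PiiGuard, ScoringPolicy } from 'sentient-sdk';

// Hypothetical scenario: the model's answer leaks an email address that is not in the context.
const evaluator = new EvaluateRAG({
  judge: new OpenAIJudge({ apiKey: process.env.OPENAI_API_KEY }),
  guards: [new PiiGuard({ types: ['email'] })],
  scoringPolicy: new ScoringPolicy({ failOnGuardViolation: true }),
});

const result = await evaluator.run({
  context: 'Support hours are 9am to 5pm on weekdays.',
  query: 'How do I contact support?',
  response: 'Email our admin directly at jane.doe@example.com.',
});

// Illustrative expectation: result.guards.piiLeak === true and result.verdict === 'FAIL'
// because failOnGuardViolation is enabled.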

API Reference

evaluate(input, options)

Quick evaluation function for simple use cases.

EvaluateRAG

Main evaluation orchestrator with full configuration.

ShadowRunner

Async, sampled evaluation for production traffic.

ScoringPolicy

Configurable thresholds for pass/fail determination.

JsonReporter

Structured JSON output for evaluation results.


What This Enables in Production

  • CI/CD Integration: Fail builds if response quality drops (see the sketch after this list)
  • Observability: Track faithfulness scores over time
  • A/B Testing: Compare prompt versions objectively
  • Compliance: Detect PII leaks before they reach users
  • Quality Gates: Block low-quality responses from reaching users
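
For instance, a CI quality gate might evaluate a fixed set of cases and exit non-zero if any fails. The cases file and script below are hypothetical, not part of the SDK.

import { readFileSync } from 'node:fs';
import { evaluate } from 'sentient-sdk';

// ci-eval.ts (hypothetical script): fail the build when any case fails evaluation.
const cases = readFileSync('evaluations.jsonl', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line));

let failures = 0;
for (const c of cases) {
  const result = await evaluate(c, {
    judge: 'openai',
    openaiApiKey: process.env.OPENAI_API_KEY,
  });
  if (result.verdict === 'FAIL') {
    failures += 1;
    console.error(`FAIL: ${c.query}: ${result.faithfulness.rationale}`);
  }
}

if (failures > 0) {
  console.error(`${failures}/${cases.length} evaluation cases failed`);
  process.exit(1);
}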

Contributing

Contributions are welcome! Please read the contributing guidelines first.

  1. Fork the repository
  2. Create a feature branch
  3. Write tests first (TDD)
  4. Submit a PR

Built with ❤️ by Dharmik, for RAG reliability.
