kpkaranam/code-dna

code-dna

Zero-Token Pre-Analysis Layer — give any LLM instant codebase understanding

The Problem

LLMs waste 50,000–200,000 tokens exploring unfamiliar codebases. Typical workflows involve asking the model to read file trees, open individual files, trace imports, and re-derive architecture facts it will forget next session. Context packers ship raw source code. Knowledge graphs need infrastructure.

The result: slow, expensive, and inconsistent onboarding every time a new LLM session touches your codebase.

The Solution

code-dna runs static analysis in under 5 seconds and produces a compact 5–10k token "DNA file" that gives any LLM architectural understanding — without reading source files.

The DNA file captures:

  • The project's module structure and symbol inventory
  • Architectural style, detected framework, and layer organisation
  • Coding conventions derived from the actual codebase
  • Hot files, risk scores, and dependency centrality
  • Git churn data and ownership information

Give any LLM the DNA file as its first context document and it hits the ground running.

Quick Start

# Run once, output to stdout
npx code-dna analyze

# Save to a file (recommended)
npx code-dna analyze --output CODEBASE-DNA.md

# YAML output for programmatic consumption
npx code-dna analyze --format yaml --output CODEBASE-DNA.yaml

# Analyse a specific directory
npx code-dna analyze /path/to/project --output CODEBASE-DNA.md

What It Extracts (4 Layers)

code-dna runs four analysis layers. Layers 1 and 2 are independent and execute in parallel; Layers 3 and 4 then run in sequence, because each builds on earlier results:
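The layer ordering can be pictured with a small orchestration sketch. This is an illustration only, not code-dna's engine: the layer functions (`runSkeleton`, `runGit`, `runPatterns`, `runRisk`) and their result shapes are hypothetical placeholders.

```typescript
// Hypothetical result shapes; the real code-dna types may differ.
type Skeleton = { files: string[] };
type GitInfo = { commits: number } | null; // null when git is unavailable
type Patterns = { style: string };
type Risk = { ranked: string[]; basis: string };

// Placeholder layer implementations, for illustration only.
async function runSkeleton(): Promise<Skeleton> {
  return { files: ["src/b.ts", "src/a.ts"] };
}
async function runGit(): Promise<GitInfo> {
  return { commits: 42 };
}
async function runPatterns(s: Skeleton): Promise<Patterns> {
  return { style: s.files.length > 1 ? "layered" : "monolith" };
}
async function runRisk(s: Skeleton, g: GitInfo, p: Patterns): Promise<Risk> {
  // Git data is optional: Layer 2 may have been skipped.
  const basis = g === null ? "structure-only" : "structure+git";
  return { ranked: s.files.slice().sort(), basis: `${basis} (${p.style})` };
}

async function runPipeline() {
  // Layers 1 and 2 do not depend on each other, so they can run concurrently.
  const [skeleton, git] = await Promise.all([runSkeleton(), runGit()]);
  // Layer 3 needs Layer 1; Layer 4 needs everything before it.
  const patterns = await runPatterns(skeleton);
  const risk = await runRisk(skeleton, git, patterns);
  return { skeleton, git, patterns, risk };
}
```

Because Layer 2 may yield nothing, a design along these lines lets the later layers degrade to structure-only results rather than fail.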

Layer 1: Structural Skeleton

Discovers all source files, parses them with Tree-sitter AST grammars, and builds:

  • File tree with language and role annotations (controller, service, model, etc.)
  • Module map — hierarchical directory structure with per-file symbol inventories
  • Dependency graph — import/export edges with fan-in/fan-out metrics and circular dependency detection
  • Symbol index — every exported function, class, interface, type, and variable
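To make the dependency-graph metrics concrete, here is a minimal sketch of fan-in/fan-out counting and circular dependency detection over a list of import edges. The edge format, file names, and helper names are assumptions for illustration; code-dna's internal representation is not documented here.

```typescript
// An edge is an [importer, imported] pair; the file names are made up.
type Edge = [string, string];
type Degree = { fanIn: number; fanOut: number };

function degreeMetrics(edges: Edge[]): Map<string, Degree> {
  const metrics = new Map<string, Degree>();
  const get = (file: string): Degree => {
    let d = metrics.get(file);
    if (!d) {
      d = { fanIn: 0, fanOut: 0 };
      metrics.set(file, d);
    }
    return d;
  };
  for (const [from, to] of edges) {
    get(from).fanOut += 1; // "from" imports one more file
    get(to).fanIn += 1;    // "to" is imported by one more file
  }
  return metrics;
}

// Depth-first search with a "visiting" state; returns one cycle, or null.
function findCycle(edges: Edge[]): string[] | null {
  const adj = new Map<string, string[]>();
  for (const [from, to] of edges) {
    if (!adj.has(from)) adj.set(from, []);
    adj.get(from)!.push(to);
  }
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  const visit = (node: string): string[] | null => {
    state.set(node, "visiting");
    stack.push(node);
    for (const next of adj.get(node) || []) {
      if (state.get(next) === "visiting") {
        // Back edge: the cycle is the stack from "next" onwards.
        return stack.slice(stack.indexOf(next)).concat(next);
      }
      if (!state.has(next)) {
        const found = visit(next);
        if (found) return found;
      }
    }
    stack.pop();
    state.set(node, "done");
    return null;
  };

  for (const node of Array.from(adj.keys())) {
    if (!state.has(node)) {
      const found = visit(node);
      if (found) return found;
    }
  }
  return null;
}

const sampleEdges: Edge[] = [
  ["cli.ts", "engine.ts"],
  ["engine.ts", "types.ts"],
  ["parser.ts", "types.ts"],
  ["types.ts", "engine.ts"], // deliberate cycle for the example
];
const sampleMetrics = degreeMetrics(sampleEdges);
const sampleCycle = findCycle(sampleEdges);
```

In this sample, `types.ts` has a fan-in of 2, and the cycle found is `engine.ts` → `types.ts` → `engine.ts`.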

Layer 2: Git Archaeology

Queries the local git history to surface temporal patterns:

  • Commit heatmap — files ranked by total commits
  • Ownership map — primary author per file
  • Co-change coupling — files that change together frequently (configurable window)
  • Hot files — churn hotspots with commit counts and last-modified timestamps

Gracefully skipped when no git history is available.
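As a rough illustration of co-change coupling, the sketch below counts how often each pair of files appears in the same commit. The `Commit` shape and sample data are hypothetical, and the time window mentioned above is left out for brevity; code-dna's actual algorithm may differ.

```typescript
// A commit reduced to the files it touched (hypothetical input shape).
type Commit = { files: string[] };

// Count how often each unordered pair of files changes in the same commit.
function coChangeCounts(commits: Commit[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const commit of commits) {
    // De-duplicate and sort so each pair always produces the same key.
    const files = Array.from(new Set(commit.files)).sort();
    for (let i = 0; i < files.length; i++) {
      for (let j = i + 1; j < files.length; j++) {
        const key = `${files[i]} <-> ${files[j]}`;
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const sampleCommits: Commit[] = [
  { files: ["src/engine.ts", "src/types.ts"] },
  { files: ["src/engine.ts", "src/types.ts", "README.md"] },
  { files: ["README.md"] },
];
const coupling = coChangeCounts(sampleCommits);
```

Pairs with a high count relative to each file's own commit count are the ones worth surfacing as coupled.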

Layer 3: Pattern Inference

Uses Layer 1 results to infer higher-level patterns without configuration:

  • Framework detection — identifies Next.js, Express, FastAPI, Spring Boot, NestJS, and more from dependency manifests and file markers
  • Architecture style — classifies projects as MVC, hexagonal, layered, event-driven, or monolith
  • Naming conventions — detects camelCase, PascalCase, snake_case, kebab-case across files, functions, classes, and variables
  • File organisation — by-feature, by-layer, by-type, or hybrid
  • Import and export style — relative vs. aliased paths, named vs. default exports
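Naming-convention detection can be approximated with a few regular expressions and a majority vote. The sketch below is an assumption about how such inference might work, not code-dna's implementation; `classify` and `dominantConvention` are hypothetical names.

```typescript
type Convention = "camelCase" | "PascalCase" | "snake_case" | "kebab-case" | "unknown";

// Classify one identifier. A single lowercase word (e.g. "engine") is
// ambiguous; this sketch counts it as camelCase.
function classify(identifier: string): Convention {
  if (/^[a-z][a-z0-9]*(_[a-z0-9]+)+$/.test(identifier)) return "snake_case";
  if (/^[a-z][a-z0-9]*(-[a-z0-9]+)+$/.test(identifier)) return "kebab-case";
  if (/^[A-Z][a-zA-Z0-9]*$/.test(identifier)) return "PascalCase";
  if (/^[a-z][a-zA-Z0-9]*$/.test(identifier)) return "camelCase";
  return "unknown";
}

// Majority vote over a set of identifiers, ignoring unclassifiable ones.
function dominantConvention(identifiers: string[]): Convention {
  const tally = new Map<Convention, number>();
  for (const id of identifiers) {
    const c = classify(id);
    tally.set(c, (tally.get(c) || 0) + 1);
  }
  let best: Convention = "unknown";
  let bestCount = 0;
  for (const [convention, count] of Array.from(tally.entries())) {
    if (convention !== "unknown" && count > bestCount) {
      best = convention;
      bestCount = count;
    }
  }
  return best;
}
```

Running the vote separately for file names, functions, classes, and variables would give one convention per category, which matches the per-category listing described above.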

Layer 4: Risk Surface

Combines all previous layers to produce a risk-ranked file list:

  • Centrality score — files with the highest in-degree (most imported)
  • Churn score — correlation between frequency of change and dependency weight
  • Coverage proxy — estimated test coverage based on co-located test files
  • Composite risk score — 0–100 rank with per-factor breakdowns
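One plausible way to combine the factors into a 0–100 score with a per-factor breakdown is a weighted sum. The weights and the `RiskFactors` shape below are illustrative assumptions; the README does not state code-dna's actual formula.

```typescript
// All inputs are normalised to the range 0..1 (hypothetical shape).
type RiskFactors = { centrality: number; churn: number; coverage: number };

// Illustrative weights only; they sum to 1 so the score tops out at 100.
const RISK_WEIGHTS = { centrality: 0.4, churn: 0.4, uncovered: 0.2 };

function compositeRisk(f: RiskFactors): { score: number; breakdown: Record<string, number> } {
  const breakdown = {
    centrality: RISK_WEIGHTS.centrality * f.centrality * 100,
    churn: RISK_WEIGHTS.churn * f.churn * 100,
    // Low estimated coverage raises risk, so the proxy is inverted.
    uncovered: RISK_WEIGHTS.uncovered * (1 - f.coverage) * 100,
  };
  const score = Math.round(breakdown.centrality + breakdown.churn + breakdown.uncovered);
  return { score, breakdown };
}
```

Keeping the breakdown alongside the score makes it possible to report which factors drove a file's rank.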

Supported Languages

| Language | Extensions | Support Tier |
|----------|------------|--------------|
| TypeScript | .ts, .tsx | Full AST parsing |
| JavaScript | .js, .jsx, .mjs, .cjs | Full AST parsing |
| Python | .py, .pyi | Full AST parsing |
| Go | .go | File discovery + framework detection |
| Rust | .rs | File discovery + framework detection |
| Java | .java | File discovery + framework detection |
| Vue | .vue | File discovery + framework detection |
| C# | .cs | File discovery + framework detection |
| Ruby | .rb | File discovery + framework detection |
| Kotlin | .kt, .kts | File discovery + framework detection |
| Swift | .swift | File discovery + framework detection |
| PHP | .php | File discovery + framework detection |
| C / C++ | .c, .h, .cpp, .cc, .cxx, .hpp | File discovery + framework detection |
| Solidity | .sol | Discovery only |

Run code-dna info to verify the languages and tiers detected by your installed version.

CLI Usage

analyze [path]

Run the full analysis pipeline and output DNA.

code-dna analyze [path] [options]

Arguments:

| Argument | Description | Default |
|----------|-------------|---------|
| path | Directory to analyse | Current working directory |

Options:

| Flag | Description | Default |
|------|-------------|---------|
| -f, --format <format> | Output format: md or yaml | md |
| -o, --output <file> | Write output to file instead of stdout | stdout |
| -l, --layers <layers> | Comma-separated layers to run | 1,2,3,4 |
| --languages <langs> | Language filter, e.g. ts,py,go | all languages |
| --scope <dir> | Scope analysis to a subdirectory | none |
| --token-budget <n> | Target token count for Markdown output | 8000 |
| --git-depth <n> | Maximum git commits to traverse | 1000 |
| --no-git | Skip git archaeology (disables Layer 2) | false |
| -q, --quiet | Suppress progress output | false |

Examples:

# Full analysis, Markdown output to stdout
code-dna analyze

# Save to file with YAML format
code-dna analyze . --format yaml --output CODEBASE-DNA.yaml

# Structural skeleton and pattern inference only, no git or risk analysis
code-dna analyze --layers 1,3

# Analyse only TypeScript and Python files
code-dna analyze --languages ts,py

# Scope to a single service in a monorepo
code-dna analyze --scope services/api --output services/api/DNA.md

# Large repo with tight token budget
code-dna analyze --token-budget 5000 --git-depth 500

diff <dna-a> <dna-b>

Compare two DNA YAML snapshots and produce a Markdown diff report.

code-dna diff before.yaml after.yaml
code-dna diff before.yaml after.yaml --output diff-report.md

The diff report covers: files added/removed/modified, symbols added/removed, dependency graph changes, risk score movements, convention and framework shifts.
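The added/removed parts of such a report reduce to set differences between two snapshots. The sketch below shows the idea for any list of names (files or symbols); `diffSets` is a hypothetical helper, not part of code-dna's API.

```typescript
// Compare two snapshots of names and report what appeared and disappeared.
function diffSets(before: string[], after: string[]): { added: string[]; removed: string[] } {
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  return {
    added: after.filter((x) => !beforeSet.has(x)),
    removed: before.filter((x) => !afterSet.has(x)),
  };
}

const symbolDiff = diffSets(
  ["analyze", "formatMarkdown", "legacyExport"],
  ["analyze", "formatMarkdown", "formatYaml"],
);
```

Here `symbolDiff.added` contains `formatYaml` and `symbolDiff.removed` contains `legacyExport`.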

mcp

Start the code-dna MCP server over stdio for use with MCP-compatible clients.

code-dna mcp
code-dna mcp --path /path/to/project
code-dna mcp --path /path/to/project --watch

See MCP Integration for client configuration details.

info

Show version, Node.js version, platform, and supported languages with their tiers.

code-dna info

MCP Integration

code-dna exposes its analysis pipeline as an MCP server, allowing LLM clients to query codebase DNA directly without running CLI commands.

Starting the Server

# Start against current directory
code-dna mcp

# Start against a specific project
code-dna mcp --path /path/to/project

# Watch mode: auto-refresh cache on file changes
code-dna mcp --path /path/to/project --watch

Claude Code Configuration

Add code-dna to your .mcp.json (project-scoped) or your global Claude Code settings:

{
  "mcpServers": {
    "code-dna": {
      "command": "npx",
      "args": ["code-dna", "mcp", "--path", "/absolute/path/to/project", "--watch"]
    }
  }
}

Cursor Configuration

In Cursor settings, add a new MCP server:

{
  "mcp": {
    "servers": {
      "code-dna": {
        "command": "npx",
        "args": ["code-dna", "mcp", "--path", "${workspaceFolder}", "--watch"]
      }
    }
  }
}

Available MCP Resources

Once connected, clients can read these resources:

| URI | Content |
|-----|---------|
| codedna://full | Complete DNA Markdown output |
| codedna://skeleton | Architecture and Module Map sections |
| codedna://dependencies | Dependencies section |
| codedna://conventions | Conventions section |
| codedna://risks | Risk Surface and Hot Files sections |
| codedna://hotfiles | Hot Files section only |

Available MCP Tools

| Tool | Description |
|------|-------------|
| analyze | Run analysis on a directory, update the cache, return full DNA |
| diff | Compute a structural diff between two DNA Markdown strings |

See docs/MCP.md for the full MCP reference including tool parameter schemas.

Configuration

Create a .codedna.yaml file in your project root to customise analysis:

# Additional glob patterns to ignore (built-in ignores always apply)
ignore:
  - "generated/**"
  - "vendor/**"
  - "*.pb.go"

# Toggle individual analysis layers
layers:
  skeleton: true
  git: true
  patterns: true
  risk: true

# Git archaeology settings
git:
  max_commits: 1000
  max_blame_files: 50
  coupling_window: 30   # days

# Per-language overrides
languages:
  python:
    enabled: true
    framework: "fastapi"   # override auto-detection
  solidity:
    enabled: false         # skip entirely

# Output preferences
output:
  format: md
  token_budget: 8000
  filename: CODEBASE-DNA.md
  sections:
    architecture: 15
    module_map: 25
    dependencies: 15
    conventions: 15
    hot_files: 10
    risk_surface: 10
    api_surface: 5

# Monorepo: include/exclude sub-directories
scope:
  include:
    - "services/api"
    - "packages/shared"
  exclude:
    - "packages/legacy"

All fields are optional and fall back to sensible defaults.
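The README does not say how the `output.sections` numbers are interpreted. Assuming they are relative weights (the values above sum to 95, not 100), a token budget could be split proportionally as in this sketch; `allocateTokens` is a hypothetical helper and the real allocation logic may differ.

```typescript
// Section weights copied from the sample configuration above.
const sectionWeights: Record<string, number> = {
  architecture: 15,
  module_map: 25,
  dependencies: 15,
  conventions: 15,
  hot_files: 10,
  risk_surface: 10,
  api_surface: 5,
};

// Assumption: weights are relative shares, normalised by their sum.
function allocateTokens(budget: number, weights: Record<string, number>): Record<string, number> {
  const keys = Object.keys(weights);
  const total = keys.reduce((sum, k) => sum + weights[k], 0);
  const allocation: Record<string, number> = {};
  for (const key of keys) {
    // Round down so the sum never exceeds the budget.
    allocation[key] = Math.floor((budget * weights[key]) / total);
  }
  return allocation;
}

const sectionTokens = allocateTokens(8000, sectionWeights);
```

Under this assumption, the default 8000-token budget gives `module_map` the largest share and `api_surface` the smallest.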

Programmatic API

code-dna can be used as a library from TypeScript or JavaScript:

npm install code-dna

import { analyze, formatMarkdown, formatYaml } from 'code-dna/lib';

// Run the full 4-layer analysis
const dna = await analyze('/path/to/project', {
  layers: [1, 2, 3, 4],
  tokenBudget: 8000,
});

// Render as Markdown (token-budget aware); the second argument is the token budget
const markdown = formatMarkdown(dna, 8000);

// Render as YAML (full data, no truncation)
const yaml = formatYaml(dna);

See docs/API.md for the complete programmatic API reference.

Example Output

The following is a truncated excerpt from code-dna analysing itself:

# Codebase DNA -- code-dna

> Generated by code-dna v0.1.0 on 2026-03-26.
> Languages: typescript (99%), javascript (1%) | Files: 101 | LOC: 35,864

## Architecture

**Style:** layered (85% confidence)
**Framework:** Node.js / Commander CLI

### Layers
- **cli** (3 files): entry point, MCP command
- **core** (8 files): engine, types, diff engine, token budget
- **analyzers** (6 files): git, framework, architecture, conventions, risk
- **parsers** (19 files): Tree-sitter extractors for 14 languages
- **output** (3 files): Markdown and YAML formatters
- **mcp** (2 files): MCP server

## Conventions

- **Files:** kebab-case
- **Functions:** camelCase
- **Classes:** PascalCase
- **Exports:** named
- **Imports:** external-first, relative paths
- **Tests:** co-located

## Risk Surface

| File | Score | Factors |
|------|-------|---------|
| src/core/engine.ts | 82 | high-centrality, high-churn |
| src/core/types.ts | 74 | high-centrality |
| src/parsers/parser-engine.ts | 65 | high-centrality |

Contributing

  1. Clone the repository and install dependencies: npm install
  2. Build: npm run build
  3. Run all tests: npm test (1199 tests, Node.js 20+ required)
  4. Lint: npm run lint
  5. Typecheck: npm run typecheck

All code changes require tests written first (TDD). Commits follow Conventional Commits (feat(scope):, fix(scope):).

License

MIT
