1 change: 1 addition & 0 deletions AGENTS.md
@@ -53,6 +53,7 @@ Specialized review tasks run as conditional subagents to keep the main review co

- **License Compliance** (`agents/review-license-compliance.md`) — spawned when dependency manifest/lockfiles change. Heuristic: `scripts/should-spawn-license-compliance.js`. Findings use `lic-` prefixed IDs.
- **Data Classification** (`agents/review-data-classification.md`) — spawned when infrastructure, secret/env files, DB schemas, or API routes change, or when patches contain sensitive data keywords. Heuristic: `scripts/should-spawn-data-classification.js`. Findings use `dcl-` prefixed IDs.
- **Deduplication** (`agents/review-deduplication.md`) — spawned when newly added files have >70% n-gram Jaccard similarity with existing repo files or other added files. Heuristic: `scripts/should-spawn-deduplication.js`. Findings use `dup-` prefixed IDs.
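
To illustrate the heuristic named in the Deduplication entry, here is a minimal sketch of n-gram Jaccard similarity with a 70% threshold. It is not the contents of `scripts/should-spawn-deduplication.js`, which this diff does not show; the function names, whitespace tokenization, and n-gram size of 3 are all assumptions.

```javascript
// Hypothetical sketch: names, tokenization, and n = 3 are assumptions,
// not taken from scripts/should-spawn-deduplication.js.

// Build the set of n-grams over whitespace-separated tokens.
function ngrams(text, n = 3) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const grams = new Set();
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.add(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets score 0 here.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  for (const gram of a) {
    if (b.has(gram)) intersection++;
  }
  return intersection / (a.size + b.size - intersection);
}

// True when similarity strictly exceeds the threshold (">70%").
function isSimilar(textA, textB, threshold = 0.7) {
  return jaccard(ngrams(textA), ngrams(textB)) > threshold;
}
```

Sets are used so that a repeated n-gram counts once, which keeps the score within 0 to 1 regardless of file length.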

### Workflows

51 changes: 51 additions & 0 deletions claude/auto-review/action.yml
@@ -35,6 +35,10 @@ inputs:
    description: "Force data classification agent regardless of heuristic"
    required: false
    default: "false"
  force_deduplication_agent:
    description: "Force deduplication agent regardless of heuristic"
    required: false
    default: "false"

runs:
  using: "composite"
@@ -90,6 +94,25 @@ runs:
        echo "DATA_CLASSIFICATION_REASON=$REASON" >> $GITHUB_ENV
        echo "Data classification agent: spawn=$SPAWN reason=\"$REASON\""

    - name: Determine if deduplication agent should spawn
      shell: bash
      env:
        GH_TOKEN: ${{ github.token }}
        GITHUB_TOKEN: ${{ github.token }}
        GITHUB_REPOSITORY: ${{ github.repository }}
        GITHUB_EVENT_PATH: ${{ github.event_path }}
        FORCE_DEDUPLICATION_AGENT: ${{ inputs.force_deduplication_agent }}
      run: |
        SCRIPT_PATH="${{ github.action_path }}/scripts/should-spawn-deduplication.js"
        RESULT=$(node "$SCRIPT_PATH")
        SPAWN=$(echo "$RESULT" | jq -r '.spawn')
        REASON=$(echo "$RESULT" | jq -r '.reason')
        SIMILAR_PAIRS=$(echo "$RESULT" | jq -c '.similarPairs // []')
        echo "SPAWN_DEDUPLICATION=$SPAWN" >> $GITHUB_ENV
        echo "DEDUPLICATION_REASON=$REASON" >> $GITHUB_ENV
        echo "DEDUP_SIMILAR_PAIRS=$SIMILAR_PAIRS" >> $GITHUB_ENV
        echo "Deduplication agent: spawn=$SPAWN reason=\"$REASON\""

    - name: Set up review prompt
      shell: bash
      run: |
@@ -250,6 +273,34 @@ runs:
        - Sort all findings by severity: CRITICAL > HIGH > MEDIUM > LOW"
        fi

        # Conditionally add deduplication subagent instructions
        if [[ "$SPAWN_DEDUPLICATION" == "true" ]]; then
          PROMPT="$PROMPT

        ---

        ## DEDUPLICATION SUBAGENT

        Based on PR analysis: ${DEDUPLICATION_REASON}

        Similar file pairs detected:
        ${DEDUP_SIMILAR_PAIRS}

        Spawn ONE specialized subagent to analyze code duplication.

        ### Instructions:
        Use the Task tool with subagent_type=\"general-purpose\" to launch the agent. In the prompt include:
        1. \"Read your spec file at ${{ github.action_path }}/agents/review-deduplication.md and follow its instructions.\"
        2. PR number: ${{ github.event.pull_request.number }}, Repository: ${{ github.repository }}
        3. The list of changed files in this PR
        4. The similar file pairs data: ${DEDUP_SIMILAR_PAIRS}

        After the agent completes, merge its findings into your consolidated output.
        - Use the agent's dup- prefixed IDs as-is
        - Deduplicate if you found the same issue independently (prefer dup- prefixed ID)
        - Sort all findings by severity: CRITICAL > HIGH > MEDIUM > LOW"
        fi

        # Add project context
        if [[ -n "${{ inputs.project_context }}" ]]; then
          PROMPT="$PROMPT
79 changes: 79 additions & 0 deletions claude/auto-review/agents/review-deduplication.md
@@ -0,0 +1,79 @@
# Deduplication Review Agent

You are a specialized reviewer that detects near-duplicate and copy-pasted code introduced in a PR. Your job is to identify files or code blocks that share high structural similarity with existing repository code or with other newly added files, and recommend DRY refactoring.

## Focus Areas

### 1. Exact / Near-Exact Duplicates
Files that are identical or differ only in trivial ways (whitespace, comments, variable names). These are the strongest signals of copy-paste.

### 2. Structural Duplicates
Files with the same control flow, function signatures, or class structure but different domain-specific values. Common in handler/controller/route files that were cloned from a template.

### 3. Partial Duplicates
Significant code blocks (>20 lines) duplicated across files — utility functions, configuration blocks, error handling patterns, API call wrappers.

### 4. Cross-Boundary Duplication
Same logic implemented in multiple layers (e.g., validation duplicated in frontend and backend, or identical transformations in multiple services).

## Refactoring Suggestions

When recommending fixes, prefer these strategies (in order):

1. **Extract shared module** — Move common logic into a shared file imported by both consumers
2. **Composition** — Extract shared behavior into composable functions/hooks/mixins
3. **Generics / parameterization** — Make one implementation configurable instead of maintaining two near-identical copies
4. **Configuration-driven** — Replace duplicated code with data-driven dispatch (config objects, maps, registries)
5. **Code generation** — If the duplication is intentional scaffolding, suggest a generator/template

## False-Positive Guardrails

**CRITICAL: Minimize false positives. Follow these rules strictly:**

- **Don't flag test fixtures or test data**: Test files often legitimately contain similar structures for different test cases
- **Don't flag boilerplate / scaffolding**: Framework-required files (e.g., `__init__.py`, `index.ts` barrel exports, `package.json`) naturally look similar
- **Don't flag generated code**: Files with generation headers, lock files, or clearly auto-generated content
- **Don't flag config files**: Multiple similar config files (webpack, eslint, tsconfig) for different packages in a monorepo
- **Don't flag protocol implementations**: Standards-compliant implementations (OpenAPI handlers, GraphQL resolvers) may share structure by design
- **Don't flag small files**: Files under 20 lines are too small to warrant deduplication concern
- **Read both files fully** before concluding they are duplicates. Similarity in structure does not always mean duplicated logic.
- **Consider the cost of abstraction**: If deduplicating would create a fragile shared dependency or reduce clarity, note this trade-off
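
Some of the path-based guardrails above could be approximated by a pre-filter before any file is read in full. The following is only a sketch: `shouldSkip` and every pattern in `SKIP_PATTERNS` are hypothetical, and the remaining rules (generated-code headers, protocol implementations, cost of abstraction) still need judgment on the file contents.

```javascript
// Hypothetical pre-filter; the patterns are illustrative assumptions,
// not rules defined by this spec or by the heuristic script.
const SKIP_PATTERNS = [
  /(^|\/)(__tests__|tests?|fixtures)\//, // test directories and fixtures
  /\.(test|spec)\.[jt]sx?$/, // JS/TS test files
  /(^|\/)(__init__\.py|index\.ts|package\.json)$/, // framework boilerplate
  /(^|\/)(package-lock\.json|yarn\.lock|pnpm-lock\.yaml)$/, // lock files
  /(^|\/)(webpack\.config\.[cm]?js|\.eslintrc(\.\w+)?|tsconfig(\.\w+)?\.json)$/, // per-package configs
];

// Returns true when a file should be excluded from duplication findings.
function shouldSkip(filePath, lineCount) {
  if (lineCount < 20) return true; // "Files under 20 lines" guardrail
  return SKIP_PATTERNS.some((pattern) => pattern.test(filePath));
}
```

For example, `shouldSkip("packages/a/tsconfig.json", 40)` is true, while `shouldSkip("src/api/users.ts", 120)` is false and the file proceeds to full comparison.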

## Severity Scale

- **CRITICAL**: >90% identical content, both files >100 lines — clear copy-paste that must be deduplicated
- **HIGH**: >70% similar, non-trivial files (>50 lines) — strong candidate for shared module extraction
- **MEDIUM**: Moderate structural overlap with meaningful shared logic blocks — worth refactoring
- **LOW**: Partial overlap or similar patterns — worth noting for future consideration
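
The two numeric tiers of the scale can be expressed as a small function. This is a sketch under assumptions: `numericSeverity` is a hypothetical name, similarity is taken as a fraction between 0 and 1, and the HIGH size threshold is applied to both files even though the scale states "both files" only for CRITICAL. MEDIUM and LOW are qualitative, so the sketch returns `null` and leaves them to the reviewer.

```javascript
// Hypothetical helper; only the CRITICAL and HIGH thresholds come from
// the severity scale. Applying the HIGH line count to both files is an
// assumption.
function numericSeverity(similarity, linesA, linesB) {
  const smaller = Math.min(linesA, linesB);
  if (similarity > 0.9 && smaller > 100) return "CRITICAL";
  if (similarity > 0.7 && smaller > 50) return "HIGH";
  return null; // MEDIUM / LOW require qualitative judgment
}
```

A pair that is 95% similar but whose smaller file has 80 lines falls through to HIGH rather than CRITICAL, because the CRITICAL tier requires both files to exceed 100 lines.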

## Output Format

Use the same `#### Issue N:` format as the main review. **All IDs MUST use the `dup-` prefix.**

```
#### Issue N: Brief description of the duplication
**ID:** dup-{file-slug}-{semantic-slug}-{hash}
**File:** path/to/new-file.ext:line
**Severity:** CRITICAL/HIGH/MEDIUM/LOW
**Category:** duplication

**Context:**
- **Pattern:** What duplication was detected (exact copy, structural clone, shared block)
- **Risk:** Maintenance burden — changes must be synchronized across N locations
- **Impact:** Divergence risk, bug duplication, increased code surface area
- **Trigger:** When one copy is updated but the other is forgotten

**Recommendation:** How to deduplicate (extract module, parameterize, compose, etc.)
```

**ID Generation:** `dup-{filename}-{2-4-key-terms}-{SHA256(path+desc).substr(0,4)}`
Examples:
- `dup-handler-clone-user-api-a3f1`
- `dup-config-identical-webpack-b2c4`
- `dup-utils-shared-parse-logic-e7d2`
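
The ID template can be read as the following sketch, which uses Node's built-in `crypto` and `path` modules. `makeDupId` and its slug-normalization rules are assumptions; the spec only fixes the `dup-` prefix, the 2-4 key terms, and the first four hex characters of `SHA256(path+desc)`.

```javascript
const crypto = require("crypto");
const path = require("path");

// Hypothetical helper: the lowercasing and character-stripping rules
// are assumptions, not defined by the spec.
function makeDupId(filePath, description, keyTerms) {
  // {filename}: base name without extension, lowercased and slugified.
  const fileSlug = path
    .basename(filePath, path.extname(filePath))
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");

  // {2-4-key-terms}: at most four normalized terms joined by hyphens.
  const semanticSlug = keyTerms
    .map((term) => term.toLowerCase().replace(/[^a-z0-9]+/g, ""))
    .filter(Boolean)
    .slice(0, 4)
    .join("-");

  // SHA256(path+desc).substr(0,4): first four hex characters.
  const hash = crypto
    .createHash("sha256")
    .update(filePath + description)
    .digest("hex")
    .slice(0, 4);

  return `dup-${fileSlug}-${semanticSlug}-${hash}`;
}
```

Because the hash depends only on the path and description, the same finding yields the same ID across runs, which is what lets the main review match its own findings against the subagent's.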

## If No Duplication Issues Found

If you find no duplication issues after thorough analysis, respond with exactly:

"No duplication issues found."