A lightweight TypeScript framework that checks whether your prompt-engineered code review is actually detecting the same issues you defined as ground-truth.
The project loads sample code, asks an LLM to review it, compares the answer with an expected JSON, and returns an accuracy score – all with one command.
**Note:** All example folders ship with synthetic snippets purposely crafted to mimic real-world bugs. They are not excerpts from client projects. Their only goal is to stress-test the review workflow and make it easy to iterate on prompts/metrics.
Each example is a sub-directory inside `examples/` and must contain exactly three files:
| file | purpose |
|---|---|
| `file.(js\|ts)` | Source file in its current state (the code the reviewer LLM will analyse) |
| `diff.txt` | Unified diff with your proposed changes |
| `suggestions.json` | Ground-truth analysis in the expected JSON schema |
Why both `file` and `diff`? The reviewer LLM receives both artefacts so it can see the current state and the intended patch, mimicking a typical pull-request review.
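For instance, the `api_request_manager` case referenced later in this README would sit in a folder like this (layout illustrative; the file extension depends on the sample):

```
examples/
└── api_request_manager/
    ├── file.ts            # code under review, in its current state
    ├── diff.txt           # proposed changes as a unified diff
    └── suggestions.json   # ground-truth issues the reviewer should find
```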
```bash
# 1 – install dependencies
npm install

# 2 – provide API keys (.env in project root)
OPENAI_API_KEY=sk-…
GEMINI_API_KEY=…
ANTHROPIC_API_KEY=…

# 3 – run the full evaluation (build + exec)
npm start

# dev mode with flags
npm run dev -- --reps 3 --prompt ./prompts/my_reviewer.md
```
| flag | shortcut | default | description |
|---|---|---|---|
| `--limit` | `-l` | all | Run only the first N examples |
| `--reps` | `-r` | 1 | How many times to run each example (results are averaged) |
| `--prompt` | `-p` | internal template | Path to a custom reviewer prompt; must contain the placeholders `{file}` and `{diff}` |
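For instance, to evaluate only the first two examples, three times each, with a custom prompt (flag values illustrative):

```bash
npm run dev -- -l 2 -r 3 -p ./prompts/my_reviewer.md
```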
Running will:

- Detect every folder in `examples/`
- Call the reviewer LLM for each sample
- Feed both the ground-truth and the LLM answer to the evaluator LLM
- Print per-case accuracy and a global summary
- Write `evaluation_results.json` with the raw data
To create a JSONL dataset compatible with LangSmith:
```bash
# build examples ➜ output/dataset.jsonl
npm run build:dataset
```
The script scans every folder inside `examples/`, grabs the code, diff, and ground-truth suggestions, then writes one JSON-encoded line per example to `output/dataset.jsonl`. Each line follows this schema:
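(The authoritative schema is whatever `npm run build:dataset` emits; the sketch below only illustrates how the three artefacts of an example could be bundled, assuming a LangSmith-style `inputs`/`outputs` split.)

```json
{"inputs": {"file": "<contents of file.(js|ts)>", "diff": "<contents of diff.txt>"}, "outputs": {"suggestions": "<parsed suggestions.json>"}}
```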
You can now upload the resulting `dataset.jsonl` to LangSmith for prompt-quality tracking or further experimentation.
Everything is code-first: open (or copy) `src/index.ts` and edit the `EvaluatorConfig`:
```typescript
const config: EvaluatorConfig = {
  // Which LLM reviews the code
  reviewerLLM: {
    provider: 'gemini', // 'gemini' | 'openai' | 'claude'
    model: 'gemini-2.5-pro-preview-06-05',
    apiKey: process.env.GEMINI_API_KEY!,
    temperature: 0,
  },

  // Which LLM judges the review
  evaluatorLLM: {
    provider: 'openai',
    model: 'gpt-4o-mini',
    apiKey: process.env.OPENAI_API_KEY!,
    temperature: 0,
  },

  // Optional: run only a subset of folders (default = all)
  testCases: ['api_request_manager', 'task_queue'],
};
```
Changing `provider` automatically switches to the right SDK thanks to `LLMFactory.ts`.
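As a rough illustration of that switch (a simplified sketch of the pattern, not the actual contents of `LLMFactory.ts`; the wrapper classes are reduced to stubs and the method name `generate` is hypothetical):

```typescript
// Sketch: pick the right wrapper based on the configured provider.
type Provider = 'gemini' | 'openai' | 'claude';

interface LLMConfig {
  provider: Provider;
  model: string;
  apiKey: string;
  temperature?: number;
}

abstract class BaseLLM {
  constructor(protected config: LLMConfig) {
    // Basic config validation, as described for src/llm/BaseLLM.ts
    if (!config.apiKey) throw new Error(`Missing API key for ${config.provider}`);
  }
  abstract generate(prompt: string): Promise<string>;
}

// Stubs standing in for the real SDK-backed wrappers.
class GeminiLLM extends BaseLLM { async generate(prompt: string) { return `gemini(${this.config.model}): ${prompt}`; } }
class OpenAILLM extends BaseLLM { async generate(prompt: string) { return `openai(${this.config.model}): ${prompt}`; } }
class ClaudeLLM extends BaseLLM { async generate(prompt: string) { return `claude(${this.config.model}): ${prompt}`; } }

function createLLM(config: LLMConfig): BaseLLM {
  switch (config.provider) {
    case 'gemini': return new GeminiLLM(config);
    case 'openai': return new OpenAILLM(config);
    case 'claude': return new ClaudeLLM(config);
    default: throw new Error(`Unknown provider: ${config.provider}`);
  }
}
```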
Under the hood:

- TestCaseLoader reads each example and returns a typed object.
- The Reviewer Prompt (see `prompts.ts`) asks the first LLM to produce a JSON array of issues.
- The Evaluator Prompt compares that answer with the ground truth using a second LLM.
- Accuracy is computed from the evaluator JSON and printed.
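Per-case accuracy is simply the share of ground-truth suggestions the evaluator marked as matched; a minimal sketch inferred from the result fields below (not taken from `Evaluator.ts`):

```typescript
// Accuracy per case: matched ground-truth suggestions / total ground-truth suggestions.
function caseAccuracy(matchedSuggestions: number, totalSuggestions: number): number {
  if (totalSuggestions === 0) return 0;
  return (matchedSuggestions / totalSuggestions) * 100; // e.g. 4 of 5 ➜ 80.0
}
```

The raw data written to `evaluation_results.json` has this shape: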
```json
[
  {
    "testCase": "api_request_manager",
    "overallMatch": true,
    "overallSimilarity": 93,
    "totalSuggestions": 5,
    "matchedSuggestions": 4,
    "accuracy": 80.0,
    "suggestionMatches": [
      {
        "groundTruth": { /* one suggestion */ },
        "llmResponse": { /* matched suggestion or null */ },
        "isMatch": true,
        "similarity": 95,
        "reasoning": "Both flagged missing retry on 500 errors…"
      }
    ],
    "detailedAnalysis": "…long string from evaluator LLM…"
  },
  "…more cases…"
]
```
The dataset is intentionally minimal. Feel free to add more synthetic scenarios to cover edge cases we miss today:
- Create `examples/<your_case>/`.
- Drop `file.js` (or `file.ts`), `diff.txt`, and `suggestions.json` (a minimal `diff.txt` is sketched below).
- Run `npm start` or `npm run dev` to see the impact.
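For reference, `diff.txt` is just a standard unified diff; a minimal, made-up example could be:

```diff
--- a/file.js
+++ b/file.js
@@ -12,3 +12,3 @@ async function fetchUser(id) {
-  const res = await fetch(`/users/${id}`);
+  const res = await fetchWithRetry(`/users/${id}`); // retry on 5xx responses
   return res.json();
```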
This repo validates the review pipeline itself: it is not a linting tool, nor does it aim for full code-quality coverage. The objective is to make sure a prompt-engineered reviewer catches the same issues we have in our ground truth, so we can:

- Experiment with different prompt styles quickly (pass a new file with `--prompt`).
- Benchmark LLM versions/providers side by side.
- Gradually expand the dataset with tougher cases and measure progress.
| file | role |
|---|---|
| `src/llm/BaseLLM.ts` | Common abstract class (validates config) |
| `src/llm/GeminiLLM.ts` | Wrapper around `@google/genai` (`GoogleGenAI`) |
| `src/llm/OpenAILLM.ts` | Wrapper around the `openai` SDK (Chat Completions) |
| `src/llm/ClaudeLLM.ts` | Wrapper around `@anthropic-ai/sdk` |
| `src/llm/LLMFactory.ts` | Switch that instantiates the correct wrapper |
| `src/prompts.ts` | Markdown prompt templates with JSON specs |
| `src/TestCaseLoader.ts` | Filesystem helper to read examples |
| `src/Evaluator.ts` | The orchestrator (loads ➜ prompts ➜ calls LLMs ➜ aggregates) |
| script | what it does |
|---|---|
| `npm run dev` | Run `src/index.ts` directly with `ts-node` |
| `npm start` | Compile TypeScript ➜ run the compiled `dist/index.js` |
| `npm run test:dev` | Quick sanity test on the first available example |
| `npm run build` | Compile TypeScript to `dist/` |
MIT