Why can't I just compare the AI's output to a reference solution?

Because two correct solutions to the same problem are almost never textually identical. A function that reverses a string with slicing and one that uses a reversed-iterator join are both correct, but a string-match check fails one of them. A deterministic harness evaluates behaviour — does the code pass the test cases, handle the edge cases, and meet the performance and security constraints — rather than matching the output string.

Why is retrieval needed if modern models have million-token context windows?

A large context window is not free. Stuffing an entire repository into every request raises latency and cost, and dilutes the signal — the model spends attention on files irrelevant to the question. Retrieval keeps the prompt focused on the handful of files that actually matter, which improves both answer quality and economics even when the window is large enough to hold everything.

What chunking strategy should I start with?

Start with function- or class-level chunks rather than fixed line counts. Splitting on syntactic boundaries keeps related logic together, which makes each chunk self-contained enough to be useful when retrieved in isolation. AST-based chunking is the most robust version of this for large repositories, at the cost of a language-specific parser in the indexing pipeline.

Deterministic LLM Eval Harnesses and Retrieval Over a Codebase

Pulling an LLM into a development workflow is the easy part. Making it trustworthy at scale comes down to two questions the demo never forces you to answer: how do you know the generated code is actually correct, and how do you give the model accurate context from a repository far too large to fit in a single prompt? This post covers the two systems that answer them — a deterministic evaluation harness and a retrieval layer over the codebase.

Retrieval: feeding the model only the slice that matters

A repository is far too large to drop into one prompt. The retrieval layer chunks and embeds the codebase ahead of time, then — for each question — embeds the query, finds the closest chunks, and hands only those to the model. That is what keeps the answer grounded in your actual code.

Why these two problems, together

AI-assisted development covers a lot of ground — code generation, test creation, reviews, bug fixing, documentation, refactoring. Every one of those tasks runs into the same two walls. The first is measurement: model output varies between runs, so you cannot tell whether a change to the prompt, the model, or the temperature made things better or worse without a way to score outputs consistently. The second is context: a model that has not seen the relevant files will confidently invent APIs that do not exist in your code. A deterministic harness solves the first; retrieval solves the second. The strongest internal copilots combine both.

Part 1 — A deterministic LLM evaluation harness

An evaluation harness is a testing framework whose job is to answer a single question: does this generated code correctly solve the given problem? Unlike a traditional unit test on deterministic code, the thing under test here is non-deterministic — the same prompt can produce different source across runs. A deterministic harness removes that ambiguity by scoring every submission against fixed, predefined criteria rather than against an exact expected string.

Why exact-match evaluation fails

Take the prompt "write a function to reverse a string." Two correct answers:

# Solution A
def reverse_string(s):
    return s[::-1]

# Solution B
def reverse_string(s):
    return ''.join(reversed(s))

Both are correct; they are textually different. Any evaluation that compares against a reference string marks one of them wrong. A robust harness assesses behaviour instead — functional correctness, edge-case handling, performance requirements, security compliance, and coding standards — none of which depend on the source matching a template.

The four components

1. Test-case execution. Generated code runs against predefined assertions. If they all pass, the solution is functionally correct for the cases you defined.

assert reverse_string("abc") == "cba"
assert reverse_string("") == ""        # empty edge case
assert reverse_string("racecar") == "racecar"  # palindrome
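
A minimal sketch of how those assertions can be turned into a reusable check, assuming the candidate source defines a function called reverse_string. The helper names (load_function, run_test_cases) and the case format are hypothetical, and exec() is used only to keep the example short; it provides no isolation, which is what the next component is for.

```python
from typing import Any, Callable

# (arguments, expected result) pairs; the same cases as the assertions above.
TEST_CASES = [
    (("abc",), "cba"),
    (("",), ""),
    (("racecar",), "racecar"),
]


def load_function(source: str, name: str) -> Callable[..., Any]:
    """Execute candidate source in a fresh namespace and return one function.

    NOTE: exec() offers no isolation; this is for illustration only.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[name]


def run_test_cases(source: str, name: str, cases) -> list:
    """Return one pass/fail flag per case; any exception counts as a fail."""
    try:
        func = load_function(source, name)
    except Exception:
        # The source did not compile, or it did not define the function.
        return [False] * len(cases)
    results = []
    for args, expected in cases:
        try:
            results.append(func(*args) == expected)
        except Exception:
            results.append(False)
    return results


SOLUTION_A = "def reverse_string(s):\n    return s[::-1]\n"
SOLUTION_B = "def reverse_string(s):\n    return ''.join(reversed(s))\n"

for label, source in (("A", SOLUTION_A), ("B", SOLUTION_B)):
    print(label, run_test_cases(source, "reverse_string", TEST_CASES))
```

Both solutions pass every case even though their source text differs, which is exactly the property a string comparison cannot give you.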

2. Sandboxed execution. Running model-generated code directly is a security risk: it may read the filesystem, open network connections, or burn unbounded CPU. Execute it in an isolated environment that restricts filesystem access, blocks unauthorised network requests, caps CPU and memory, and contains anything malicious. Docker containers, Kubernetes pods, and Firecracker microVMs are the common choices. Containers and pods share the host kernel by default, so a microVM, which runs each workload under its own guest kernel, gives the strongest isolation of the three.
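
To show where the isolation boundary sits, here is a rough sketch that runs candidate code in a separate Python process with a wall-clock timeout. The function name run_untrusted and the returned dictionary shape are assumptions for illustration. A child process is not a sandbox: it does not restrict filesystem or network access, so treat this as a stand-in for a container or microVM launch.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_untrusted(source: str, timeout_s: float = 5.0) -> dict:
    """Run source in a child Python process and capture its output.

    NOT a real sandbox: only the timeout and the throwaway working
    directory are enforced here.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}


print(run_untrusted("print(1 + 1)"))
```

In a real harness the subprocess.run call would be swapped for whatever launches your container or microVM; the timeout handling and the captured-output result keep the same shape.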

3. Deterministic model configuration. For results you can compare across runs, pin the generation settings so the model behaves as reproducibly as it can:

{
  "temperature": 0,
  "top_p": 1,
  "seed": 42
}

Reducing randomness will not make a model perfectly deterministic — batching and hardware non-determinism still leak in — but it removes the largest source of run-to-run variance and makes benchmarking meaningful.
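
One way to keep benchmark runs comparable is to store a fingerprint of the pinned settings next to every result, so two scores are only compared when they were produced under the same configuration. This is a sketch of that idea, not something a particular provider requires; config_fingerprint is a hypothetical helper, and whether a given API honours a seed parameter varies by provider.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, stable hash of the generation settings.

    Keys are sorted before hashing, so key order does not change the result.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# The settings from the block above.
PINNED = {"temperature": 0, "top_p": 1, "seed": 42}

print(config_fingerprint(PINNED))
```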

4. Evaluation metrics. A handful of metrics covers most of what you need to track over time:

Metric	Purpose
Pass rate	Percentage of test executions that succeed
Accuracy	Correctness of the generated solution
Latency	Time required to generate a response
Token usage	Cost-efficiency measurement
Security score	Detection of vulnerabilities in the output
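
A small sketch of how three of those metrics might be aggregated over a batch of runs. The RunResult fields and the summarise helper are hypothetical, and pass rate is taken here to mean the share of runs in which every test case passed; your harness may define it per test case instead.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    passed: bool       # did every test case pass for this run?
    latency_s: float   # time taken to generate the response
    tokens: int        # prompt plus completion tokens


def summarise(results: list) -> dict:
    """Aggregate per-run results into the numbers you track over time."""
    if not results:
        raise ValueError("no results to summarise")
    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "mean_latency_s": mean(r.latency_s for r in results),
        "total_tokens": sum(r.tokens for r in results),
    }


runs = [RunResult(True, 1.0, 100), RunResult(False, 3.0, 300)]
print(summarise(runs))
```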

The evaluation workflow

Prompt
   ↓
LLM generates code
   ↓
Sandbox execution
   ↓
Run test cases
   ↓
Collect metrics
   ↓
Generate evaluation report

Wired into CI, this loop gives you reliable benchmarking, consistent quality measurements, regression detection when a model or prompt changes, and an automated gate before AI-generated code reaches a human reviewer. The payoff is trust: a number you can point at instead of a vibe.
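
The regression-detection step can be as simple as comparing per-task pass rates against a stored baseline and failing the build when any task drops. The sketch below assumes pass rates are kept as a dictionary keyed by task name; find_regressions, gate, and the 0.02 tolerance are illustrative choices, not a standard.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> list:
    """Names of tasks whose pass rate fell by more than `tolerance`.

    A task that is missing from `current` is also treated as a regression.
    """
    regressed = []
    for task, old_rate in baseline.items():
        new_rate = current.get(task)
        if new_rate is None or old_rate - new_rate > tolerance:
            regressed.append(task)
    return sorted(regressed)


def gate(baseline: dict, current: dict, tolerance: float = 0.02) -> int:
    """Exit code for a CI step: 0 when nothing regressed, 1 otherwise."""
    return 1 if find_regressions(baseline, current, tolerance) else 0


baseline = {"reverse_string": 1.0, "parse_date": 0.8}
current = {"reverse_string": 1.0, "parse_date": 0.5}
print(find_regressions(baseline, current), gate(baseline, current))
```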

Part 2 — Retrieval over a codebase

A real application is thousands of files, millions of lines, multiple services, and a pile of documentation. No model processes all of that in one request — and even where the context window is technically large enough, doing so is slow, expensive, and dilutes the signal. The answer is Retrieval-Augmented Generation (RAG): retrieve only the most relevant sections of the codebase, then hand those to the model.

Entire repository
      ↓
Retrieve relevant files
      ↓
Provide context to LLM
      ↓
Generate response

The result is better accuracy, higher relevance, and dramatically better economics, because the prompt stays focused on the handful of files that actually bear on the question.

Embeddings for code retrieval

An embedding is a numerical vector representation of text or code. This function —

def calculate_total(price, tax):
    return price + tax

— becomes something like [0.34, -0.12, 0.67, ...]. The useful property is that vectors capture semantic meaning: snippets with similar functionality land mathematically close together, so a similarity search finds code by what it does, not by the exact words it uses. The pipeline:

Code file
    ↓
Chunk code
    ↓
Generate embeddings
    ↓
Store in vector database
    ↓
Similarity search
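
The similarity search at the end of that pipeline is usually a nearest-neighbour lookup by cosine similarity. Here is a dependency-free sketch using hand-written three-dimensional vectors. The vectors, the chunk names other than calculate_total, and the nearest helper are made up for illustration; a real embedding model returns hundreds or thousands of dimensions and a vector database does the ranking for you.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, index: dict, k: int = 2) -> list:
    """Return the names of the k chunks whose vectors are closest to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy index: chunk name -> made-up embedding.
INDEX = {
    "calculate_total": [0.9, 0.1, 0.0],
    "apply_discount": [0.8, 0.3, 0.1],
    "send_email": [0.0, 0.2, 0.9],
}

# Pretend this is the embedding of the query "compute the order price".
query = [1.0, 0.2, 0.0]
print(nearest(query, INDEX))
```

The two pricing functions rank above send_email because their vectors point in nearly the same direction as the query, regardless of the words in their source.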

Common embedding models include OpenAI Embeddings, Voyage AI, Cohere Embed, and the open BGE family. Common vector stores include Pinecone, Weaviate, ChromaDB, Milvus, and FAISS.

Chunking strategies

A 3,000-line file cannot be embedded effectively as one block: a single vector for that much code blurs its semantics, and retrieving it returns far more text than the question needs. Split it, and retrieve only the chunks that matter. How you split is one of the biggest levers on retrieval quality.

Fixed-length chunking — e.g. 500 lines per chunk — is simple and fast to preprocess, but it happily splits related logic down the middle and reduces semantic coherence.
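
A tiny sketch makes the failure visible. The chunk_by_lines helper and the sample source are hypothetical, and the chunk size is set to 2 lines rather than 500 so the effect shows up in a few lines of output.

```python
def chunk_by_lines(source: str, lines_per_chunk: int) -> list:
    """Split source into consecutive chunks of at most lines_per_chunk lines."""
    lines = source.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


SOURCE = """def login(user, password):
    token = authenticate(user, password)
    return token

def register(user, password):
    create_account(user, password)
    return login(user, password)
"""

for chunk in chunk_by_lines(SOURCE, 2):
    print("-----")
    print(chunk)
```

The first chunk holds the start of login but not its return statement, so neither chunk is a complete unit when retrieved on its own.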

Function-level chunking makes each function its own chunk, preserving a complete unit of business logic and improving retrieval precision:

def login():
    pass

def register():
    pass
# → each function becomes a separate chunk

AST-based chunking parses the abstract syntax tree and aligns chunk boundaries with classes, methods, functions, and modules. It is usually the most effective strategy for large repositories — at the cost of a language-specific parser in your indexing pipeline.
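
For Python sources, the standard-library ast module is enough to sketch this. The example below emits one chunk per top-level function or class; chunk_by_definition is a hypothetical helper, and a multi-language repository would need a parser per language, as noted above.

```python
import ast


def chunk_by_definition(source: str) -> list:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment slices the original text using the node's
            # recorded start and end positions.
            text = ast.get_source_segment(source, node)
            chunks.append((node.name, text))
    return chunks


SOURCE = '''import os

def login(user):
    return user.lower()

def register(user):
    return login(user)

class Session:
    def close(self):
        pass
'''

for name, text in chunk_by_definition(SOURCE):
    print(f"--- {name} ---")
    print(text)
```

Each chunk is a complete definition, so it stays meaningful when retrieved in isolation. Comments above a definition and its decorators fall outside the node's span, so a production chunker would need to widen the slice to keep them.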

The retrieval workflow

Suppose a developer asks: "how does user authentication work in this project?" The system runs five steps:

1. Convert the question into an embedding vector.
2. Run a similarity search against the stored code embeddings.
3. Retrieve the relevant chunks, say AuthController, JWTService, LoginHandler.
4. Provide those chunks to the LLM as context.
5. Generate a context-aware answer.

The model answers accurately without ever loading the whole repository.
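
Those steps can be wired together in a few lines. In this sketch the embedding model and the LLM are passed in as plain callables (embed and ask_llm), both hypothetical stand-ins for whatever client you use, and the stored vectors are assumed to be unit-normalised so that a dot product ranks the same way as cosine similarity.

```python
def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def answer_question(question: str, chunks: list, embed, ask_llm, k: int = 3) -> str:
    """chunks: dicts with 'name', 'text', and a precomputed 'vector'."""
    # 1. Convert the question into an embedding vector.
    query_vec = embed(question)
    # 2. Run a similarity search against the stored embeddings.
    ranked = sorted(chunks, key=lambda c: dot(query_vec, c["vector"]), reverse=True)
    # 3. Keep only the most relevant chunks.
    selected = ranked[:k]
    # 4. Provide those chunks to the LLM as context.
    context = "\n\n".join(f"# {c['name']}\n{c['text']}" for c in selected)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # 5. Generate a context-aware answer.
    return ask_llm(prompt)


# Stubs so the sketch runs without any external service.
CHUNKS = [
    {"name": "AuthController", "text": "class AuthController: ...", "vector": [1.0, 0.0]},
    {"name": "JWTService", "text": "class JWTService: ...", "vector": [0.8, 0.6]},
    {"name": "Mailer", "text": "class Mailer: ...", "vector": [0.0, 1.0]},
]


def fake_embed(text: str) -> list:
    return [1.0, 0.0]  # pretend every question is about authentication


def echo_llm(prompt: str) -> str:
    return prompt  # return the prompt so you can see what the model would get


print(answer_question("How does authentication work?", CHUNKS, fake_embed, echo_llm, k=2))
```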

Context windows and why they still matter

A context window is how much information a model can process in a single request. Limits have grown fast:

Model	Context window
GPT-4-class	~128K tokens
Claude Sonnet-class	~200K tokens
Gemini Pro-class	1M+ tokens

Enterprise repositories still dwarf these limits, and sizing the context wrongly fails in either direction. Provide too little context and the model is missing dependencies, imports, or business logic, so it hallucinates or answers incompletely. Provide too much and you pay in cost and latency while relevance drops, because the model spends attention on files that do not matter.

Context-optimization techniques

Top-K retrieval — pass only the most relevant chunks (e.g. the top 5), not everything that scored above a threshold.
Reranking — use a second, more precise model to reorder the retrieved candidates by true relevance before they hit the prompt.
Context compression — summarise large code blocks before passing them in, trading some fidelity for room.
Hierarchical retrieval — drill down repository → module → file → function, retrieving progressively rather than all at once.
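
The first two techniques combine naturally with a token budget. The sketch below shortlists candidates by their first-pass similarity score, reorders the shortlist with a second scorer, then packs chunks until either K chunks are chosen or the budget is spent. The field names, the select_context helper, and the default numbers are all assumptions; rerank_score stands in for whatever reranking model you call.

```python
def select_context(candidates: list, rerank_score, k: int = 5,
                   shortlist: int = 20, token_budget: int = 4000) -> list:
    """candidates: dicts with 'name', 'score' (first-pass similarity), 'tokens'."""
    # First pass: keep a shortlist by the cheap similarity score.
    short = sorted(candidates, key=lambda c: c["score"], reverse=True)[:shortlist]
    # Reranking: reorder the shortlist with the more precise scorer.
    reranked = sorted(short, key=rerank_score, reverse=True)
    # Top-K under a budget: skip any chunk that would overflow the prompt.
    chosen, used = [], 0
    for chunk in reranked:
        if len(chosen) == k:
            break
        if used + chunk["tokens"] > token_budget:
            continue
        chosen.append(chunk)
        used += chunk["tokens"]
    return chosen


CANDIDATES = [
    {"name": "a", "score": 0.9, "tokens": 300},
    {"name": "b", "score": 0.8, "tokens": 900},
    {"name": "c", "score": 0.7, "tokens": 200},
    {"name": "d", "score": 0.1, "tokens": 100},
]
# Made-up reranker scores, keyed by chunk name.
FINE_SCORES = {"a": 0.2, "b": 0.9, "c": 0.8, "d": 1.0}


def fine_score(chunk: dict) -> float:
    return FINE_SCORES[chunk["name"]]


picked = select_context(CANDIDATES, fine_score, k=2, shortlist=3, token_budget=1200)
print([c["name"] for c in picked])
```

Chunk d never reaches the reranker because it falls outside the shortlist, which is the usual trade-off: the reranker is more precise but only sees what the first pass lets through.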

Combining evaluation and retrieval

The two systems compose into a single trustworthy pipeline: retrieval gives the model the right context, generation produces a candidate, and the harness verifies it before anything reaches a human.

Developer query
       ↓
Code retrieval (RAG)
       ↓
Relevant context
       ↓
LLM generates code
       ↓
Evaluation harness tests output
       ↓
Approved response

That architecture is what separates a convincing demo from a system you can put in front of an engineering org: accurate understanding of a large codebase, reliable generation, and automated quality assurance on every output.
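
As a rough sketch of how the pieces compose, the function below takes the three stages as callables and only returns code that the harness approved. The names (generate_with_gate, retrieve, generate, evaluate) are hypothetical, and the retry loop is an added assumption rather than part of the pipeline described above.

```python
def generate_with_gate(query: str, retrieve, generate, evaluate,
                       max_attempts: int = 3) -> dict:
    """Retrieve context once, then generate until the harness approves."""
    context = retrieve(query)
    for attempt in range(1, max_attempts + 1):
        candidate = generate(query, context)
        if evaluate(candidate):
            return {"approved": True, "code": candidate, "attempts": attempt}
    # Nothing passed the harness, so nothing is handed to a human.
    return {"approved": False, "code": None, "attempts": max_attempts}


# Stubs so the sketch runs on its own: the first candidate is wrong,
# the second one is correct.
drafts = iter([
    "def reverse_string(s):\n    return s\n",
    "def reverse_string(s):\n    return s[::-1]\n",
])


def fake_retrieve(query: str) -> list:
    return ["# utils/strings.py (retrieved chunk)"]


def fake_generate(query: str, context: list) -> str:
    return next(drafts)


def fake_evaluate(code: str) -> bool:
    namespace: dict = {}
    exec(code, namespace)  # illustration only; a real harness uses a sandbox
    return namespace["reverse_string"]("abc") == "cba"


print(generate_with_gate("reverse a string", fake_retrieve, fake_generate, fake_evaluate))
```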

When we prototype this loop in a ShareCode code space, the part that benefits most from a second pair of eyes is the gate itself — watching a teammate trace exactly which retrieved chunks fed the prompt, and where the harness rejected an output, is usually how a subtle retrieval or evaluation gap gets caught before it reaches the rest of the team.

Where teams actually use this

Intelligent code assistants that help developers understand an unfamiliar repository.
Automated refactoring that proposes safe, context-aware improvements.
Bug detection that reasons with repository-wide context rather than a single file.
Test generation that scaffolds unit and integration tests from the real code.
Internal engineering copilots tuned to an organisation's own systems and conventions.

The takeaway

Adopting AI-assisted development well is not about bolting a model onto an editor. A deterministic evaluation harness makes generated code reliable, secure, and measurable; retrieval over the codebase lets the model actually understand large repositories through embeddings, sensible chunking, and disciplined context management. Together they are the foundation for AI development systems you can trust in production — and the teams that invest in both will ship faster without trading away software quality.

References & Sources

The primary sources, specifications, and documentation behind this article.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, et al. · NeurIPS / arXiv:2005.11401 · 2020
The paper that named RAG — the pattern Part 2 builds on for feeding an LLM only the relevant slice of a repository.
arxiv.org
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. · OpenAI / arXiv:2107.03374 · 2021
Introduces Codex and the HumanEval benchmark, which scores functional correctness by executing tests rather than string-matching — the principle behind the harness in Part 1.
arxiv.org
Embeddings — measuring the relatedness of text strings
OpenAI · OpenAI Platform documentation
Reference for how embeddings turn code and queries into vectors whose distance captures semantic similarity, as described in the retrieval section.
platform.openai.com
Chunking Strategies for LLM Applications
Roie Schwaber-Cohen, Arjun Patel · Pinecone
A practical survey of fixed-size, content-aware, and semantic chunking — the trade-offs discussed when choosing how to split a large file before embedding.
pinecone.io
Faiss — a library for efficient similarity search
Meta AI (FAIR) · Facebook AI Research
Documentation for one of the vector stores named in the post; covers the approximate nearest-neighbour search behind similarity retrieval.
faiss.ai
Context windows
Anthropic · Anthropic documentation
Explains how a context window works and why accuracy and recall degrade as token count grows — the motivation for retrieval even when the window is large.
docs.anthropic.com

About the writers

Author

Kishan Vaghani

Founder & Lead Engineer, ShareCode

Founder of ShareCode. Writes the engineering deep-dives on this site — WebRTC, Firebase Auth, real-time sync, and the production patterns behind the editor itself.

Real-time collaboration & CRDTs · WebRTC & low-latency media · Firebase authentication & security rules · Next.js & full-stack JavaScript

Kajal Pansuriya

Developer Educator, ShareCode

Developer educator at ShareCode. Writes the tutorial track — Python, JavaScript debugging, coding-interview prep, and the everyday code-quality habits that hold up in real codebases.

Python fundamentals & teaching beginners · JavaScript debugging & DevTools · Coding-interview preparation · Clean code & code review
