Deterministic LLM Eval Harnesses and Retrieval Over a Codebase

Two building blocks every serious AI-assisted development system needs: a harness that reliably measures whether generated code works, and a retrieval layer that feeds the model only the slice of the repository that matters.

By Kishan Vaghani · Reviewed by Kajal Pansuriya · Published June 2, 2026

Pulling an LLM into a development workflow is the easy part. Making it trustworthy at scale comes down to two questions the demo never forces you to answer: how do you know the generated code is actually correct, and how do you give the model accurate context from a repository far too large to fit in a single prompt? This post covers the two systems that answer them — a deterministic evaluation harness and a retrieval layer over the codebase.

Why these two problems, together

AI-assisted development covers a lot of ground — code generation, test creation, reviews, bug fixing, documentation, refactoring. Every one of those tasks runs into the same two walls. The first is measurement: model output varies between runs, so you cannot tell whether a change to the prompt, the model, or the temperature made things better or worse without a way to score outputs consistently. The second is context: a model that has not seen the relevant files will confidently invent APIs that do not exist in your code. A deterministic harness solves the first; retrieval solves the second. The strongest internal copilots combine both.

Part 1 — A deterministic LLM evaluation harness

An evaluation harness is a testing framework whose job is to answer a single question: does this generated code correctly solve the given problem? Unlike a traditional unit test on deterministic code, the thing under test here is non-deterministic — the same prompt can produce different source across runs. A deterministic harness removes that ambiguity by scoring every submission against fixed, predefined criteria rather than against an exact expected string.

Why exact-match evaluation fails

Take the prompt "write a function to reverse a string." Two correct answers:

# Solution A
def reverse_string(s):
    return s[::-1]

# Solution B
def reverse_string(s):
    return ''.join(reversed(s))

Both are correct; they are textually different. Any evaluation that compares against a reference string marks one of them wrong. A robust harness assesses behaviour instead — functional correctness, edge-case handling, performance requirements, security compliance, and coding standards — none of which depend on the source matching a template.

The four components

1. Test-case execution. Generated code runs against predefined assertions. If they all pass, the solution is functionally correct for the cases you defined.

assert reverse_string("abc") == "cba"
assert reverse_string("") == ""        # empty edge case
assert reverse_string("racecar") == "racecar"  # palindrome
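As a minimal sketch of behaviour-based scoring, the runner below loads a candidate's source and checks it against the same three cases. The helper name `run_test_cases` and the result dictionary are illustrative assumptions, not part of any particular framework, and `exec()` is used here only for brevity; it provides no isolation.

```python
def run_test_cases(source: str, func_name: str, cases: list[tuple[tuple, object]]) -> dict:
    """Load candidate source and check each (args, expected) pair by behaviour."""
    namespace: dict = {}
    # NOTE: exec() runs the code in-process with no isolation. This is an
    # illustration only; a real harness would run this step inside a sandbox.
    exec(source, namespace)
    func = namespace[func_name]

    passed = 0
    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:  # a crash counts as a failed case
            failures.append((args, repr(exc)))
            continue
        if actual == expected:
            passed += 1
        else:
            failures.append((args, actual))
    return {"passed": passed, "total": len(cases), "failures": failures}


# The two textually different solutions from earlier in the post.
SOLUTION_A = "def reverse_string(s):\n    return s[::-1]\n"
SOLUTION_B = "def reverse_string(s):\n    return ''.join(reversed(s))\n"

CASES = [(("abc",), "cba"), (("",), ""), (("racecar",), "racecar")]

result_a = run_test_cases(SOLUTION_A, "reverse_string", CASES)
result_b = run_test_cases(SOLUTION_B, "reverse_string", CASES)
```

Both candidates score 3 out of 3 even though their source differs, which is exactly what an exact-match comparison cannot express.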

2. Sandboxed execution. Running model-generated code directly is a security risk: it may read the filesystem, open network connections, or burn unbounded CPU. Execute it in an isolated environment that restricts filesystem access, blocks unauthorised network requests, caps CPU and memory, and contains anything malicious. Docker containers, Kubernetes pods, and Firecracker microVMs are the common choices. Containers and pods share the host kernel, so they isolate less strongly than a microVM, which runs the untrusted code under its own guest kernel.
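To make the shape of that step concrete, here is a deliberately weak sketch that runs a candidate in a child interpreter with a wall-clock limit and a throwaway working directory. The function name `run_isolated` is an assumption for illustration. A child process is not a security boundary: it does not block network or filesystem access, so in practice this call would itself sit inside a container or microVM.

```python
import subprocess
import sys
import tempfile


def run_isolated(source: str, timeout_s: float = 2.0) -> dict:
    """Run candidate source in a child interpreter with a wall-clock limit.

    This only bounds runtime and keeps scratch files out of the project tree.
    It does NOT restrict network, filesystem, or memory use.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I is Python's isolated mode: ignores PYTHON* environment
                # variables and the user site-packages directory.
                [sys.executable, "-I", "-c", source],
                cwd=workdir,            # scratch directory, deleted afterwards
                capture_output=True,
                text=True,
                timeout=timeout_s,      # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"timed_out": True, "returncode": None, "stdout": "", "stderr": ""}
    return {
        "timed_out": False,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Returning a plain dictionary keeps the result easy to log alongside the other metrics; a runaway loop shows up as `timed_out` rather than hanging the harness.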

3. Deterministic model configuration. For results you can compare across runs, pin the generation settings so the model behaves as reproducibly as it can:

{
  "temperature": 0,
  "top_p": 1,
  "seed": 42
}

Reducing randomness will not make a model perfectly deterministic: batching and hardware non-determinism still leak in, and not every provider exposes a seed parameter. It does remove the largest source of run-to-run variance, which is what makes benchmarking meaningful.
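Pinned settings only help if every result records which settings produced it. One possible approach, sketched here under the assumption that the harness stores a short identifier with each run, is to hash the model name and configuration. The function name `config_fingerprint` and the model name in the usage line are hypothetical.

```python
import hashlib
import json


def config_fingerprint(model: str, config: dict) -> str:
    """Return a short, stable identifier for a model + generation config."""
    # sort_keys makes the hash independent of dictionary key order.
    payload = json.dumps(
        {"model": model, "config": config},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# "example-model" is a placeholder, not a real model identifier.
fingerprint = config_fingerprint(
    "example-model", {"temperature": 0, "top_p": 1, "seed": 42}
)
```

Two runs with the same fingerprint are comparable; a changed fingerprint signals that a score difference may come from the configuration rather than the prompt.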

4. Evaluation metrics. A handful of metrics covers most of what you need to track over time:

| Metric | Purpose |
| --- | --- |
| Pass rate | Percentage of test executions that succeed |
| Accuracy | Correctness of the generated solution |
| Latency | Time required to generate a response |
| Token usage | Cost-efficiency measurement |
| Security score | Detection of vulnerabilities in the output |
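A rough sketch of how three of these metrics might be aggregated over a batch of runs follows. The `RunResult` fields and the `summarise` helper are assumptions made for illustration; accuracy and security score are left out because they depend on scoring rules specific to each team.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    passed: bool       # did every test case succeed for this run?
    latency_s: float   # time taken to generate the response, in seconds
    tokens: int        # prompt + completion tokens, as reported by the provider


def summarise(results: list[RunResult]) -> dict:
    """Aggregate per-run records into the headline numbers for a report."""
    if not results:
        raise ValueError("no results to summarise")
    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "mean_latency_s": mean(r.latency_s for r in results),
        "total_tokens": sum(r.tokens for r in results),
    }


report = summarise([
    RunResult(passed=True, latency_s=1.0, tokens=100),
    RunResult(passed=False, latency_s=3.0, tokens=300),
    RunResult(passed=True, latency_s=2.0, tokens=200),
    RunResult(passed=False, latency_s=2.0, tokens=200),
])
```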

The evaluation workflow

Prompt
   ↓
LLM generates code
   ↓
Sandbox execution
   ↓
Run test cases
   ↓
Collect metrics
   ↓
Generate evaluation report

Wired into CI, this loop gives you reliable benchmarking, consistent quality measurements, regression detection when a model or prompt changes, and an automated gate before AI-generated code reaches a human reviewer. The payoff is trust: a number you can point at instead of a vibe.
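The regression-detection part of that loop can be as small as a comparison against a stored baseline. The sketch below assumes a pass rate between 0 and 1 and a tolerance chosen by the team; the name `regression_gate` and the default tolerance are illustrative, not a recommendation.

```python
def regression_gate(
    baseline_pass_rate: float,
    current_pass_rate: float,
    tolerance: float = 0.02,
) -> bool:
    """Return True when the current run is no worse than baseline minus tolerance."""
    # The tolerance absorbs the residual run-to-run noise that pinned
    # settings cannot remove entirely.
    return current_pass_rate >= baseline_pass_rate - tolerance


# A CI step would fail the build when this is False.
should_pass = regression_gate(baseline_pass_rate=0.90, current_pass_rate=0.91)
should_fail = regression_gate(baseline_pass_rate=0.90, current_pass_rate=0.80)
```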

Part 2 — Retrieval over a codebase

A real application is thousands of files, millions of lines, multiple services, and a pile of documentation. No model processes all of that in one request — and even where the context window is technically large enough, doing so is slow, expensive, and dilutes the signal. The answer is Retrieval-Augmented Generation (RAG): retrieve only the most relevant sections of the codebase, then hand those to the model.

Entire repository
      ↓
Retrieve relevant files
      ↓
Provide context to LLM
      ↓
Generate response

The result is better accuracy, higher relevance, and dramatically better economics, because the prompt stays focused on the handful of files that actually bear on the question.

Embeddings for code retrieval

An embedding is a numerical vector representation of text or code. This function —

def calculate_total(price, tax):
    return price + tax

— becomes something like [0.34, -0.12, 0.67, ...]. The useful property is that vectors capture semantic meaning: snippets with similar functionality land mathematically close together, so a similarity search finds code by what it does, not by the exact words it uses. The pipeline:

Code file
    ↓
Chunk code
    ↓
Generate embeddings
    ↓
Store in vector database
    ↓
Similarity search

Common embedding models include OpenAI Embeddings, Voyage AI, Cohere Embed, and the open BGE family. Common vector stores include Pinecone, Weaviate, ChromaDB, Milvus, and FAISS.
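One common way to measure "mathematically close" is cosine similarity. The sketch below uses made-up three-dimensional vectors purely to show the arithmetic; real embeddings have hundreds or thousands of dimensions and come from an embedding model, and a vector store would normally do this comparison for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical vectors: two authentication snippets and one billing snippet.
login_vec = [0.9, 0.1, 0.0]
token_check_vec = [0.8, 0.2, 0.1]
invoice_vec = [0.0, 0.2, 0.9]

related = cosine_similarity(login_vec, token_check_vec)
unrelated = cosine_similarity(login_vec, invoice_vec)
```

With these toy values, the two authentication vectors score far higher against each other than either does against the billing vector, which is the property a similarity search relies on.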

Chunking strategies

A 3,000-line file cannot be embedded effectively as one block: it may exceed the embedding model's input limit, and a single vector for that much code blurs the semantics. Split it, and retrieve only the chunks that matter. How you split is the single biggest lever on retrieval quality.

Fixed-length chunking — e.g. 500 lines per chunk — is simple and fast to preprocess, but it happily splits related logic down the middle and reduces semantic coherence.
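A minimal sketch of fixed-length chunking by line count is shown below. The `overlap` parameter is an addition not discussed above: repeating a few lines between neighbouring chunks is a common way to soften the mid-logic splits, though it does not remove them. The function name `chunk_lines` is illustrative.

```python
def chunk_lines(text: str, size: int = 500, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `size` lines, sharing `overlap` lines."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    lines = text.splitlines()
    step = size - overlap
    chunks = []
    start = 0
    while start < len(lines):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # this chunk already reached the end of the file
        start += step
    return chunks


sample = "\n".join(f"line {i}" for i in range(10))
plain = chunk_lines(sample, size=4)                  # lines 0-3, 4-7, 8-9
overlapping = chunk_lines(sample, size=4, overlap=1)  # lines 0-3, 3-6, 6-9
```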

Function-level chunking makes each function its own chunk, preserving a complete unit of business logic and improving retrieval precision:

def login():
    pass

def register():
    pass
# → each function becomes a separate chunk

AST-based chunking parses the abstract syntax tree and aligns chunk boundaries with classes, methods, functions, and modules. It is usually the most effective strategy for large repositories — at the cost of a language-specific parser in your indexing pipeline.
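For Python sources, the standard library's `ast` module is enough to sketch both function-level and AST-based chunking. The version below only looks at top-level functions and classes, and the chunk dictionary layout is an assumption; other languages would need their own parser, and nested definitions and decorators would need extra handling.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in a Python file."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,       # line of the def/class keyword
                "end_line": node.end_lineno,
                # Exact source text of the node; decorators are not included.
                "text": ast.get_source_segment(source, node),
            })
    return chunks


SOURCE = "def login():\n    pass\n\ndef register():\n    pass\n"
chunks = chunk_by_definition(SOURCE)
```

Keeping the line range alongside the text lets the retrieval layer cite where a chunk came from when it hands context to the model.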

The retrieval workflow

Suppose a developer asks: "how does user authentication work in this project?" The system runs five steps:

  1. Convert the question into an embedding vector.
  2. Run a similarity search against the stored code embeddings.
  3. Retrieve the relevant chunks — say AuthController, JWTService, LoginHandler.
  4. Provide those chunks to the LLM as context.
  5. Generate a context-aware answer.

The model answers accurately without ever loading the whole repository.
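The five steps can be sketched end to end with a toy stand-in for the embedding model. Everything here is an assumption for illustration: `toy_embed` just counts words, so it matches on shared vocabulary rather than meaning, and the index contents are invented one-line summaries. A real system would call an embedding model and a vector store at those two points, and step 5 would send the prompt to the LLM.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, index: dict[str, str], top_k: int = 2) -> list[str]:
    """Steps 1-3: embed the question, rank stored chunks, keep the best."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda name: cosine(q, toy_embed(index[name])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, index: dict[str, str], top_k: int = 2) -> str:
    """Step 4: place the retrieved chunks in front of the question."""
    names = retrieve(question, index, top_k)
    context = "\n\n".join(f"# {name}\n{index[name]}" for name in names)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical chunk summaries keyed by name.
INDEX = {
    "AuthController": "class AuthController: handles user authentication login requests",
    "JWTService": "class JWTService: issues and verifies authentication tokens for a user",
    "InvoiceFormatter": "class InvoiceFormatter: renders invoice totals as pdf",
}

question = "how does user authentication work in this project?"
top = retrieve(question, INDEX)
prompt = build_prompt(question, INDEX)
```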

Context windows and why they still matter

A context window is how much information a model can process in a single request. Limits have grown fast:

| Model | Context window |
| --- | --- |
| GPT-4-class | ~128K tokens |
| Claude Sonnet-class | ~200K tokens |
| Gemini Pro-class | 1M+ tokens |

Even so, enterprise repositories still dwarf them, and sizing the context has two failure modes. Provide too little context and the model is missing dependencies, imports, or business logic, so it hallucinates or answers incompletely. Provide too much and you pay in cost and latency while relevance drops, because the model spends attention on files that do not matter.
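One way to steer between those two failure modes is to fill the prompt in relevance order until a token budget is spent. The sketch below is an assumption-heavy illustration: `estimate_tokens` uses a rough four-characters-per-token heuristic rather than a real tokenizer, and the greedy strategy in `pack_context` is just one reasonable choice.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real system would use the model's tokenizer."""
    return max(1, len(text) // 4)


def pack_context(scored_chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit in the budget."""
    selected = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        selected.append(text)
        used += cost
    return selected


# Hypothetical (similarity score, chunk text) pairs.
candidates = [(0.9, "a" * 400), (0.8, "b" * 400), (0.5, "c" * 40)]
packed = pack_context(candidates, budget_tokens=120)
```

Here the second chunk is skipped because it would overrun the budget, while the smaller third chunk still fits.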

Context-optimization techniques

Combining evaluation and retrieval

The two systems compose into a single trustworthy pipeline: retrieval gives the model the right context, generation produces a candidate, and the harness verifies it before anything reaches a human.

Developer query
       ↓
Code retrieval (RAG)
       ↓
Relevant context
       ↓
LLM generates code
       ↓
Evaluation harness tests output
       ↓
Approved response

That architecture is what separates a convincing demo from a system you can put in front of an engineering org: accurate understanding of a large codebase, reliable generation, and automated quality assurance on every output.
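The composition can be sketched as a single function that takes the three stages as callables. The name `answer_with_gate`, the result dictionary, and the retry loop are assumptions added for illustration; the diagram above only shows a single pass, and the stubs in the usage lines stand in for a real retriever, model call, and harness.

```python
from typing import Callable


def answer_with_gate(
    query: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str], bool],
    max_attempts: int = 2,
) -> dict:
    """Retrieve context once, then generate until a candidate passes the harness."""
    context = retrieve(query)
    for attempt in range(1, max_attempts + 1):
        candidate = generate(query, context)
        if evaluate(candidate):
            return {"approved": True, "candidate": candidate, "attempts": attempt}
    # Nothing passed: surface a rejection instead of an unverified answer.
    return {"approved": False, "candidate": None, "attempts": max_attempts}


# Stub stages: the first generation fails the harness, the second passes.
_outputs = iter(["def f(): return 0", "def f(): return 1"])
result = answer_with_gate(
    "write f",
    retrieve=lambda q: ["relevant chunk"],
    generate=lambda q, ctx: next(_outputs),
    evaluate=lambda code: "return 1" in code,
)
```

Passing the stages in as callables keeps the gate logic testable without a model or a vector store in the loop.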

Where teams actually use this

The takeaway

Adopting AI-assisted development well is not about bolting a model onto an editor. A deterministic evaluation harness makes generated code reliable, secure, and measurable; retrieval over the codebase lets the model actually understand large repositories through embeddings, sensible chunking, and disciplined context management. Together they are the foundation for AI development systems you can trust in production — and the teams that invest in both will ship faster without trading away software quality.

Related reading

The harness in Part 1 exists largely to catch the problems described in the five ways AI-generated code goes wrong, and the security score it tracks leans on the verification routine in catching plausible-but-wrong security advice from LLMs. The day-to-day workflow that surrounds all of this is covered in pair-programming with an LLM without losing the craft. All sit inside the ai-assisted-development topic.
