What Learning Zig Taught Me About Harness Engineering

Remember when you had to start a new chat in Claude because you’d hit the context limit? You’d been building up context for an hour: the codebase structure, the decisions you’d already made, what the model had tried and ruled out. Then the window filled and it was gone.

The context window is fixed. Tokens accumulate and stay; they don’t scroll out as new ones arrive. The context just fills. When it hits the limit, most interfaces compact: they summarise the oldest content to make room. The summary is cheaper to hold than the original, but it’s lossy. Details that didn’t make it into the summary are gone, and the conversation becomes shallower as it gets longer.

The problem with that model is that compaction happens reactively, at the limit, with no awareness of what’s about to be lost. Treating the context as a ring buffer gives you a different handle on the problem.

A ring buffer is a circular data structure with a fixed capacity. Write to the tail. When it fills, the tail wraps around and overwrites the head. The critical question, the one most implementations skip, is what to do at the wrap point.

Wait — what does any of this have to do with Zig?

Ring buffers. Pointers. Cache hierarchies. These are systems programming concepts. If you came here for AI engineering, it’s a fair question.

Several months of building in Zig is the answer. It didn’t give me patterns to apply. It changed what comes to mind first.

When I hit the context compaction problem, my first instinct was: store the content somewhere and keep a reference. Not: what do other harnesses do? Just: don’t throw away the bytes. That instinct came from months of manual memory management, where you learn that pointers are cheap and content is expensive. When I had to use an LLM to analyse a document bigger than an entire context window, I thought: fixed budget, circular overwrite, flush before eviction. The Zig vocabulary came first.

Pointers and document IDs

TypeScript doesn’t have pointers. Neither does Python, or JavaScript, or most languages people use to build AI tooling. You pass objects around and the memory model stays invisible: the language never asks you to decide whether you’re holding a value or a reference to one.

Zig asks every time. Value or pointer: the type signature says which, the compiler enforces it, and you feel it immediately when you get it wrong. After enough of that, reaching for a reference instead of copying content becomes automatic.

That reflex fired when I hit the context compaction problem. The naive approach is to dump the full document content into context every turn: it’s available, it’s simple, the model has what it needs. But dumping content into context is passing by value. It fills the ring buffer faster. When the buffer overflows, compaction kicks in. Compaction is lossy. Information that was there is now gone, degraded, or summarised past usefulness.

Storing a pointer instead breaks the chain before it starts. Store the document in SQLite or a key-value store. Embed the ID in context. Give the agent a retrieval tool to dereference it when it needs the content. The ID costs a few tokens, stays stable across turns, and doesn’t age out. The content lives outside the ring buffer entirely. Multiple agents or conversation turns can reference the same document without any of them holding a copy. You only pay the retrieval cost when you actually need the bytes.
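A minimal sketch of that pattern in Python, using the standard library's sqlite3 module. The class name `DocumentStore`, the `doc_` ID format, and the `context_reference` helper are illustrative assumptions, not the implementation referred to in this post; `get` is the function a retrieval tool would call to dereference an ID.

```python
import sqlite3
import uuid


class DocumentStore:
    """Keeps document bodies outside the context window, keyed by a short ID."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, title TEXT, body TEXT)"
        )

    def put(self, title: str, body: str) -> str:
        # Hypothetical ID scheme: a short random hex string with a prefix.
        doc_id = "doc_" + uuid.uuid4().hex[:8]
        self.db.execute("INSERT INTO docs VALUES (?, ?, ?)", (doc_id, title, body))
        self.db.commit()
        return doc_id

    def get(self, doc_id: str) -> str:
        # The "dereference": only called when the agent needs the bytes.
        row = self.db.execute(
            "SELECT body FROM docs WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            raise KeyError(doc_id)
        return row[0]


def context_reference(doc_id: str, title: str) -> str:
    """The only thing that enters the context: a pointer, not the content."""
    return f"[document id={doc_id} title={title!r}]"


store = DocumentStore()
body = "timeout_seconds = 30\n" * 5000       # a large document
doc_id = store.put("service config", body)
ref = context_reference(doc_id, "service config")
```

Here `ref` is a few dozen characters no matter how large `body` is, and any number of turns or agents can carry the same ID.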

I wrote about implementing this in more detail here. But the instinct came from Zig, from having to think about the difference between holding content and holding a reference to it until the distinction became automatic. In a language that hides the distinction, you never develop the habit of asking the question. You embed content into context without it registering as a choice. In a context window with a hard limit, you pay for that later.

Ring buffers and conversation memory

When a chat interface hits the context limit, it compacts: summarises the oldest content to make room and keeps going. Anything that didn’t make it into the summary is gone. This is reactive compaction: it fires at the limit, with no designed boundary and no control over what gets compressed.

The ring buffer model gives you a different handle. You maintain a head and a tail pointer. New messages write to the tail and advance it. When the tail catches up to the head, you’re about to overwrite the oldest content: that’s the wrap point. In a naive implementation, you just overwrite. But the wrap point is a boundary you can design around.
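The naive version can be sketched in a few lines of Python. The class and method names are illustrative, and the slots here hold whole messages rather than tokens, which is a simplification.

```python
class RingBuffer:
    """Fixed-capacity circular buffer that overwrites the oldest slot when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.slots = [None] * capacity
        self.head = 0    # index of the oldest item
        self.tail = 0    # index of the next write
        self.count = 0

    def write(self, item):
        """Write at the tail; return whatever was overwritten, if anything."""
        evicted = None
        if self.count == len(self.slots):
            # Wrap point: the tail has caught up with the head, so this
            # write lands on the oldest item.
            evicted = self.slots[self.head]
            self.head = (self.head + 1) % len(self.slots)
        else:
            self.count += 1
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % len(self.slots)
        return evicted

    def items(self):
        """Contents from oldest to newest."""
        return [
            self.slots[(self.head + i) % len(self.slots)]
            for i in range(self.count)
        ]


buf = RingBuffer(3)
dropped = [buf.write(m) for m in ["m1", "m2", "m3", "m4"]]
```

After the fourth write, `buf.items()` is `["m2", "m3", "m4"]` and `m1` has been overwritten. A naive caller ignores the return value of `write`; that return value is the hook the rest of this section builds on.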

You could flush before the wrap.

[Interactive figure: a ring buffer with a head pointer and an external store, stepping through the states Empty, Filled, Wrap point, Flushing, and New write. Initial state: an empty ring buffer. The tail writes incoming content; the head marks the oldest region, which gets overwritten first when the buffer fills.]

Before a region gets evicted, you know it’s about to be evicted. Summarise that region, store the summary somewhere accessible, then evict the originals. Three things are different from reactive compaction: the timing is proactive, the scope is bounded to the region about to leave, and you decide what the summary contains. Recent context stays verbatim. You only pay the compression cost at the designed boundary, not continuously, and not for content that doesn’t need it yet.
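As a rough sketch, flush-before-evict might look like the following in Python. Several things here are assumptions for illustration: the budget is counted with a whitespace word count rather than a real tokenizer, `summarise` is a hypothetical callable that would wrap an LLM call in practice, and the summaries simply collect in a list.

```python
from collections import deque
from typing import Callable, List


def rough_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


class FlushingContext:
    """Keeps recent messages verbatim and summarises a region before evicting it."""

    def __init__(self, budget: int, summarise: Callable[[List[str]], str]) -> None:
        self.budget = budget
        self.summarise = summarise   # hypothetical: an LLM call in a real harness
        self.recent = deque()        # verbatim messages, oldest first
        self.summaries = []          # one entry per flush
        self.used = 0

    def append(self, message: str) -> None:
        cost = rough_tokens(message)
        # Work out how many of the oldest messages must leave to make room.
        n, freed = 0, 0
        while n < len(self.recent) and self.used - freed + cost > self.budget:
            freed += rough_tokens(self.recent[n])
            n += 1
        if n:
            region = [self.recent[i] for i in range(n)]
            # Flush first: summarise only the region that is about to leave...
            self.summaries.append(self.summarise(region))
            # ...then evict the originals.
            for _ in range(n):
                self.recent.popleft()
            self.used -= freed
        # Note: a single message larger than the whole budget is still admitted.
        self.recent.append(message)
        self.used += cost


ctx = FlushingContext(
    budget=6,
    summarise=lambda region: f"summary of {len(region)} message(s)",
)
for msg in ["read the config file", "tests fail", "retry with timeout raised"]:
    ctx.append(msg)
```

The third message does not fit, so the oldest message is summarised and evicted while the other two stay verbatim. What `summarise` keeps is entirely up to the harness.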

You could go further and skip the summary entirely — flush the full content to external storage and keep a pointer in context instead. Retrieval costs a tool call, but nothing is lost. The loss is a choice, not a requirement of the architecture.
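The lossless variant could be sketched like this in Python. The in-memory `archive` dict stands in for real external storage, and the function names and the `ctx_` key format are made up for the example.

```python
import hashlib
from typing import Dict, List

# Stand-in for external storage (a database or key-value store in practice).
archive: Dict[str, List[str]] = {}


def flush_to_archive(region: List[str]) -> str:
    """Store the evicted messages verbatim and return a short key for them."""
    digest = hashlib.sha256("\n".join(region).encode("utf-8")).hexdigest()
    key = "ctx_" + digest[:8]
    archive[key] = list(region)
    return key


def pointer_line(key: str) -> str:
    """What stays in context in place of the evicted region."""
    return f"[archived {len(archive[key])} message(s), id={key}]"


def recall(key: str) -> List[str]:
    """The retrieval tool: costs a call, returns the original messages."""
    return archive[key]


key = flush_to_archive(["read the config file", "tests fail"])
line = pointer_line(key)
```

The pointer line is all that remains in context, and `recall(key)` returns the evicted messages unchanged.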

For memory that persists across sessions, this extends further. Summaries from multiple wrap events feed into a longer-term store: verbatim recent context at the top, progressively compressed older context below. A ring buffer feeding a ledger. I haven’t built this. But the structure follows from the primitive.

Ring buffers and sequential corpus analysis

You need to produce an LLM-powered analysis of a multi-day congressional hearing. Testimony runs to thousands of pages. A witness on day 10 refers to evidence from day 3. You have a 1M token context window, which sounds large until you do the maths: it won’t fit. What do you do?

The obvious answer is to process each day in isolation and combine the results. That’s map-reduce, and map-reduce loses the sequential dependencies. Day 10 testimony without day 3 context is a different document. The analysis is wrong in ways the model won’t notice.

The approach that works: maintain a fixed-budget evolving markdown. For each day of testimony, update it with key claims, named entities, cross-references, emerging themes, what shifted relative to prior testimony. When the markdown exceeds budget, flush the about-to-be-evicted region into a denser summary before overwriting. Process the next day’s testimony with the current markdown as context.
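One way to sketch that loop in Python, under some loud assumptions: the running markdown is held as a list of sections, `update` and `condense` are hypothetical callables that would each wrap an LLM call, "flush" is simplified to folding the two oldest sections into one denser section, and the budget uses a whitespace word count. The stand-in functions at the bottom exist only to make the example runnable.

```python
from typing import Callable, Iterable, List


def rough_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


def analyse_in_order(
    days: Iterable[str],
    budget: int,
    update: Callable[[str, int, str], str],
    condense: Callable[[str, str], str],
) -> str:
    """Read the days in sequence, carrying bounded running notes forward."""
    sections: List[str] = []   # running notes, oldest section first
    for day, testimony in enumerate(days, start=1):
        notes_so_far = "\n\n".join(sections)
        # The current notes are the context for reading the next day.
        sections.append(update(notes_so_far, day, testimony))
        # Over budget: fold the two oldest sections into one denser section
        # instead of silently dropping the oldest.
        while len(sections) > 1 and sum(map(rough_tokens, sections)) > budget:
            sections[:2] = [condense(sections[0], sections[1])]
    return "\n\n".join(sections)


# Stand-ins for the two LLM calls, for illustration only.
def fake_update(notes: str, day: int, testimony: str) -> str:
    return f"day{day}: {testimony}"


def fake_condense(older: str, newer: str) -> str:
    return " ".join(older.split()[:2] + newer.split()[:2])


result = analyse_in_order(
    ["alpha beta gamma delta", "epsilon zeta eta theta", "iota kappa lambda mu"],
    budget=12,
    update=fake_update,
    condense=fake_condense,
)
```

With these stand-ins, day 3 pushes the notes over budget, so days 1 and 2 are folded into one shorter section while day 3 stays verbatim. The eviction criteria live in `condense`, which is where the tuning would happen.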

The system can read a document longer than its own memory. The oldest detail from three days ago may be gone, but the important threads were compressed at the eviction boundary rather than just dropped. Explicit lossy compression, with full awareness of where the loss happens.

This beats map-reduce because the loss is designed, not because it’s lossless. You can inspect the budget, tune the eviction criteria, decide what gets written into the markdown before something gets dropped. Map-reduce buries the same tradeoff and calls the result a summary.

Cache hierarchy and context tiers

L1 cache is small and directly accessible. L2 is larger and slightly slower. L3 is larger still. Fetching from main memory is orders of magnitude slower than L1. A cache miss means a fetch from the next tier down, with real latency.

The context window is L1. Whatever lives there, the model can use directly. Retrieval (a vector search, a tool call, a lookup) is a cache miss: costs a round-trip, adds tokens, slows the response.

The tiers map cleanly:

L1 (context window). Fast, tiny, expensive per token. What the model currently holds.
L2 (retrievable summaries). Slower to access, larger, compressed. Structured reference pulled on demand.
L3 (full document store). Slowest, unlimited. The source of truth.

A cache miss is a design parameter, not a failure mode. In hardware, you architect access patterns to keep hot data in L1. In a harness, you make the same choice: which information needs to live in context at all times, which can sit in L2 until needed, which can stay in L3 until explicitly requested.
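A small Python sketch of the three tiers as a lookup path. The class, the `need_full` flag, and the miss counter are assumptions made for the example, and L1 eviction is deliberately left out to keep it short.

```python
from typing import Dict, Tuple


class TieredContext:
    """L1 = held in context, L2 = retrievable summaries, L3 = full documents."""

    def __init__(self, summaries: Dict[str, str], documents: Dict[str, str]) -> None:
        self.l1: Dict[str, Tuple[str, str]] = {}   # key -> (source tier, text)
        self.l2 = summaries
        self.l3 = documents
        self.misses = 0    # each miss stands for one retrieval round-trip

    def pin(self, key: str) -> None:
        """Load the full document into L1 up front, for information that must always be there."""
        self.l1[key] = ("L3", self.l3[key])

    def read(self, key: str, need_full: bool = False) -> Tuple[str, str]:
        """Return (tier that served the read, text)."""
        held = self.l1.get(key)
        # A held summary satisfies a normal read; only full text satisfies need_full.
        if held is not None and (held[0] == "L3" or not need_full):
            return "L1", held[1]
        self.misses += 1
        source = "L2" if (not need_full and key in self.l2) else "L3"
        text = self.l2[key] if source == "L2" else self.l3[key]
        self.l1[key] = (source, text)   # promote into context
        return source, text


tiers = TieredContext(
    summaries={"config": "timeout is 30s"},
    documents={"config": "timeout_seconds = 30\nretries = 3"},
)
first = tiers.read("config")                   # miss, served from L2
second = tiers.read("config")                  # hit in L1
third = tiers.read("config", need_full=True)   # summary is not enough, go to L3
```

The design choice sits in `pin` and `need_full`: what is loaded up front, and when a summary is acceptable versus when the source of truth is required.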

Hallucinations are often a cache miss problem. The model needs a specific fact, it’s not in context, and instead of returning uncertainty it fills in something plausible. Sometimes the fix is architecture: get the right information into L1 at the right time.

An agent debugging a test failure three hours into a session references a config value from a file it read at the start. That file was evicted from context twenty tool calls ago. The model doesn’t know it’s working from memory, because it no longer has the data to check against. The answer it produces is subtly wrong in a way that’s hard to trace back. A harness designed around the cache hierarchy would have kept that file in L2 and re-fetched it at the right moment. Same model, different architecture, different result.

Why this matters

Systems programming builds vocabulary for constraints that show up everywhere: bounded memory, bounded compute, the cost of crossing a boundary, what happens when a buffer overflows. These aren’t specific to hardware. They’re properties of any system where resources are finite and operations can fail. LLM harnesses have every one of them.

AI engineering is new enough that this vocabulary is still forming, and AI engineers keep rediscovering problems that systems engineers named and solved decades ago. But there’s a lot of free prior art in operating systems textbooks and distributed systems papers that applies more directly than it looks.

After enough time in systems code, the vocabulary starts to feel natural. The constraints rhyme, and you reach for the primitive before you’ve named the connection.

The only path to that is building the primitives by hand, without the abstractions that make them invisible. Use every abstraction available to ship fast. But understand the layer below, somewhere. You won’t see it until the constraints bite. And then it’s already there.