2026-03-31  ·  golang  ·  ai  ·  inference  ·  tooling  ·  learning

Building a Scratch Inference Engine in Go

The School of Dirty Hands is Always In Session

Inference is Kinda Like…You Know What, I Don’t Know

There are a number of analogies for the mechanics of inference I’ve seen, and they were mostly useful for making the point that LLMs are not sentient and are highly unlikely to become so. But beyond that? Hand-wave city. As a kid I was the despair of my parents’ possessions, as my idea of figuring out how something worked was to take it apart. Fortunately, this curiosity/methodology combo was limited to the mechanical rather than the biological. Computer science became an acceptable avenue of exploration, much to the anthropomorphized relief of every complex object in the house.

One of the nice things about software is that it is infinitely malleable. You can break it apart, put it back together, reorganize it, and actually improve it instead of just having a towel covered with parts that don’t quite want to go back where they came from. Inference was a hazy, sealed container that seemed ripe for disassembly and examination.

The Right Tools

The first question was “what’s out there already?” There might be an inference library that I could use as a starting point. The canonical answer appeared to be llama.cpp. It’s implemented in C++, unsurprisingly, and is a complex, production-grade project. This was not exactly great interactive tinkering material. C++ build times, lots of extraneous features, optimizations, and details - I’d be learning about the structure of llama.cpp, not inference. However, using it as a reference for core logic and a working example to build my own engine looked appealing.

In the past few months my language of choice for new projects has been Go. Not only do I really like the honed, restrained syntax, it has a number of useful characteristics that make the first question “is there some reason Go would be a bad idea?” rather than “what language is the best?” Go is a mature, capable language that presents a fantastic combination of low cognitive overhead, high runtime performance, an outstanding package manager and extensive ecosystem, and nonpareil cross-platform support and deliverability (build for all platforms from any platform? sure, why not?).

GPU acceleration was a non-negotiable requirement. Thirty minutes between prompts doesn’t allow for exploration. While directly using Georgi Gerganov’s llama.cpp or a binding to it wasn’t on the table, another of his libraries, ggml, looked like a great fit. It was purpose-built for machine learning GPU acceleration (llama.cpp uses it), and the API could easily be bound to Go via a thin pure C layer (another nice language feature).

Makefiles are a natural for maintaining dependencies in a Go project, so that rounded out the preliminary decisions. It was enough to be going on with. The rest of the choices would need to be discovered at the ore face.

A Most Exciting Answer to the Perennial Question 2+2=?

Even using an AI agent, it took two days of steady work to get the first model (a quantized Qwen3.5 variant with four billion parameters) working end to end.

From the beginning it was clear that surfacing inference via an OpenAI-compliant REST API was going to be the most flexible choice. Since the base Go installation includes a production-grade http server package, this wasn’t even a speed bump for the coding agent. The skeleton of the project came together quickly. The inference engine itself was where things were going to get interesting.

The initial implementation was actually via a C++ sidecar with Go bindings to integrate it into the rest of the project. This was a transitional phase selected for a specific reason: Inference is very finicky code. Minimizing potential points of confusion at each step of the way was going to be crucial. So starting with a C++ implementation that drew from llama.cpp to extract the loading and execution logic would be the most direct, least distorted (if expensive in tokens due to C++ verbosity) path. It took a day and a half to get the first non-garbage response from the model. One of the many things learned was that inference requires repetitions of expensive calculations, and memoization (caching) was critical. Getting the first coherent reply wasn’t the end of the implementation. It took another half day of work to get KV caching working. I preserved the stateless (non-caching) code path to maintain a baseline result against which to compare the caching code path whenever it changed. This is a general pattern that pays dividends on any project: if there’s a simple way of doing something that’s easy to maintain and a fast way that’s more complicated, keep the cheap, slow version around. It keeps the fast path honest so you can tell when you’ve “fixed it so good it don’t work no more”.

So the moment arrived where, in the immortal words of the mad doctor, “It’s alive!” It’s rare to be so happy about the result of simple addition, but there it was. This was just the first step, though. A C++ sidecar tacked on to the Go code base was more than a little ugly and certainly never meant to stay.

The next step was taking the working loading and execution logic (reduced in complexity after extraction from llama.cpp) over to Go. This is where the Go ggml binding would be critical§. Before this point ggml was being consumed directly by the C++ sidecar. Interestingly, this went fairly quickly — about another half day. The C++ sidecar version of inference was used to A/B test until behavior was identical, and then ditched. Unlike the uncached inference path, this was not cheap to maintain. I did, however, bring the stateless (non-caching) inference path over to Go.

Room for Improvement

So now Qwen3.5 (dense) was working — what about Llama or DeepSeek? Looking into this revealed something that the “just get one thing working” focus had obscured up to this point. A GGUF file describes a great deal about a model and contains all of the weights, parameters, and so on. But it does not describe the structural graph by which data flows through those pieces. A GGUF is an auto parts store shelf with things labeled and neatly stacked, but without anything like the assembly diagram to make a running vehicle. That is out-of-band data that lives in the implementation of each inference engine.

This raised the obvious question: how does llama.cpp represent the structural data? For all its production-grade features, it uses a one-off C++ file for each architecture it supports. There’s very little in the way of anatomical abstraction. I didn’t want to just replicate this pattern in Go and decided it was worth a detour. The clear answer was a data-driven, model-agnostic system that would build up the execution graph for an architecture from blocks. The description of the arrangement of those blocks would come from a data file in a domain-specific language (DSL).

TOML was a natural choice for this DSL. As with Go, for data language needs the first question is “why shouldn’t I use TOML?” It is much more human-friendly than JSON, supports comments, and has lower token overhead for LLMs. Outside of cases where JSON is a hard requirement (like OpenAI API payloads), there’s just no good reason to use JSON unless the data is being loaded straight into a JavaScript interpreter. It took about a half day to break up the model loading/graph building logic from the model-specific implementation to reusable blocks. A .arch.toml file referenced the blocks and described how they were wired together to make Qwen3.5 work. As with the C++ model builder, the known working implementation was used to validate the new approach. Also like the C++ model builder, as soon as the new approach was working the old one was dispensed with. It, too, was not reasonable to maintain, especially since the explicit design goal was to have zero model architecture specific code. All model specificity would live in the .arch.toml files.

There was an additional benefit to the data-driven approach. Automatic generation of diagrams from the model architecture graph definition became possible. For each block type defined in code, an SVG snippet could be created that visually described it. Writing a utility that combined the architecture graph from the .arch.toml with the SVG snippets allowed for automated graphing of model architectures. This wasn’t just neato, it was crucial to increasing my understanding of how these beasts worked.

qwen35.arch.toml

The ultimate payoff came when adding new model architectures. This can still incur an additional code file — but it would be a reusable block implementation, not a whole-model one-off. Adding support for Qwen3.5 MoE (mixture of experts) took a couple hours. In part, this was due to ironing out kinks in the brand new data-driven system. Adding support for Llama (a fairly straightforward architecture) and DeepSeek2 (with its “exotic” attention mechanism) took two hours total. Llama was done in twenty minutes, DeepSeek2, including some debugging and “what the hell?!” time, was about an hour and forty.

Validation and Rigor

We have an efficient, flexible, data-driven system that runs inference for different model architectures. It has the ability to make pretty pictures out of them, too. But is it really working, or just faking it good enough for “what’s 2+2?”. The answer to that can be obtained by using llama-serve as a known good reference implementation. Both it and the lab bench API server support the collection of diagnostic data. Collecting that data while posing the same prompts to the same models run by both systems provides the material for a baseline comparison. There’s a make target equiv-test in the project that runs exactly this test. For Qwen3.5 and Llama models we have results that differ by only floating point noise. That is to say: they are equivalent inference implementations.

It’s worth noting that the use of a coding agent was tremendously helpful in providing domain expertise I lacked, but there was considerable overhead in managing it. Creation and maintenance of AGENTS.md and ARCHITECTURE.md were crucial for keeping the coding assistance on track between context resets. Diligence was required as the agent would try to take shortcuts (writing code specific to whatever model was in use at the time) or use lazy anti-patterns (violations of separation of concerns, DRY, and such). Even with good practices, it was easy to miss things. I did several architectural review passes using agentic assistance and turned up a few dozen issues that required cleaning up. This all occurred while using a premium frontier model on a paid basis.

If this had been a vibe-coded toy I wouldn’t have bothered, but the intent was to reuse this project as well as share it, so there were standards that had to be kept.

Well That Was Fun, So Why Stop?

This project is definitely not done and dusted. There are more model architectures to handle as well as the Linux and Windows operating systems to support. While this was developed on a Mac, none of the primary code is Apple-specific. The ggml library is cross-platform, but there are platform-specific functions for state initialization that need to be bound for the other platforms and used from the Go side.

If you’d like to use it as a starting point for your own inference experiments, it’s available on GitHub. Break it apart, put it back together, have fun.

Footnotes

† ↩︎ That low overhead extends to people and agents. Python has dynamic hazards which require attention to keep track of. C++ is drastically complex, again, consuming brainpower and tokens just to manage it rather than solve problems. Go can be more verbose than Python, but adherence to DRY principles keeps that well under control. The entire go-inference-lab-bench project is under 6500 lines of Go with just over 250 lines of thin C wrapper for ggml.

‡ ↩︎ Another lesson learned during this project: two classes of LLM architecture are “dense” (whole model evaluated for all prompts) and “mixture of experts” (prompts are “triaged” and internal portions of the divided model are selectively used to reduce compute overhead). The Qwen3.5 family of models has one of each.

§ ↩︎ Technically, CGO could consume ggml’s pure C API, but it has hundreds of types and functions. Isolating just the ones we need and hiding all the concrete types as opaque void* makes for faster CGO processing, consistent handling of “unsafe” pointer types, and a general reduction of complexity.

‖ ↩︎ GGUF is a flexible format for storing model parameters. It supports multiple numerical types - FP32, FP16, FP8, as well as block compressed quantization formats. The quantized formats are the LLM tensor equivalent of graphics API texture compression formats like BC7.

¶ ↩︎ The “logprobs” parameter in the OpenAI API allows for asking for exactly this kind of diagnostic data. The go-inference-lab-bench API server implements this functionality so the comparison to llama.cpp can be made.