Previously on Battlestar Galactica
In the previous episode, my (somewhat) trusty AI agent and I created an inference lab bench from scratch in Go. Scratch means the inference engine is a fresh implementation, not zero-dependency†. It also doesn’t mean no-reference (that would be insane). llama.cpp has been not only a crucial resource for implementation logic, but also a critical ongoing referent for execution equivalence.
The point of the exercise was not only to clear the haze around inference mechanics but also to build a place for further experimentation and exploration. Additionally, it seemed like this would be something others could use as their own jumping-off point. Granted, they wouldn’t have gotten the experience I did from building it, but the project is well-documented, fairly stripped down, and AI-friendly. Pointing your favorite coding agent at it and interrogating the architecture would be a much more focused tour of the inference process than doing the same for a production project like llama.cpp.
The highlights of the design are that the engine is strictly model agnostic, it implements all logic in Go‡, it has instrumentation for extracting state metrics, and it has convenient testing facilities for verifying that the effective inference has not diverged from llama.cpp’s baseline.
All model-specific data live in .arch.toml files§. They describe the execution graph for each model architecture. Furthermore, the bench executable can produce SVG diagrams of the overall architecture and layer details from the .arch.toml files.
The OpenAI HTTP API served by bench mirrors llama-server’s in supporting the extraction of log probabilities. This is a diagnostic value from the process of decoding logits to tokens. The bench and llama-server values produced for the same model, same settings, same prompt, are fuzzy-equal with differences 10x smaller than what would risk a token change.
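To make "fuzzy-equal" a little more concrete, here is a minimal sketch of one way such a comparison could work. It is not bench's actual test code. The `TokenLogprob` type, the `fuzzyEqual` function, and the reading of "risk a token change" as the gap between the chosen token and the runner-up are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// TokenLogprob is a hypothetical record for one decoded position: the log
// probability of the chosen token and of the runner-up candidate.
type TokenLogprob struct {
	Chosen   float64
	RunnerUp float64
}

// fuzzyEqual reports whether two engines agree at every position. The drift
// allowed at a position is the reference gap between the chosen token and
// the runner-up, divided by safetyFactor.
func fuzzyEqual(ref, got []TokenLogprob, safetyFactor float64) bool {
	if len(ref) != len(got) {
		return false
	}
	for i := range ref {
		margin := ref[i].Chosen - ref[i].RunnerUp
		limit := margin / safetyFactor
		if math.Abs(ref[i].Chosen-got[i].Chosen) > limit {
			return false
		}
	}
	return true
}

func main() {
	// Made-up values standing in for llama-server (ref) and bench (got).
	ref := []TokenLogprob{{Chosen: -0.10, RunnerUp: -2.50}, {Chosen: -0.40, RunnerUp: -1.40}}
	got := []TokenLogprob{{Chosen: -0.12, RunnerUp: -2.48}, {Chosen: -0.41, RunnerUp: -1.39}}
	fmt.Println(fuzzyEqual(ref, got, 10))
}
```

With a safety factor of 10, a position where the winner leads by 1.0 tolerates a drift of 0.1 before the check fails.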
That wraps the recap. On to this week’s thrilling episode.
Act I: Gemma4
So far the lab bench had supported Qwen3.5 (dense and MoE), Llama (dense), and DeepSeek2 (dense and MoE) architectures. Gemma4 had just come out and was making a splash, so that sounded like a good next candidate. This turned out to be a good addition not only because the model was interesting, but also because it exposed some soft spots in bench’s data-driven design. The work ended up being significantly more than just “write a .arch.toml file”.
Before getting going on that, a few pieces of tech debt were becoming more obvious, so a little housekeeping was in order. The most pervasive was that logging was handled with printing to stdout/stderr, and redirects from shell commands were creating log files. A structured logging system was past the point of being put-offable. This project’s logging needs were fairly basic, but that could change. The go-to answer for cases like this is a simple abstract interface that could then be backed by a local implementation. It’s an approach that provides the option of integrating a feature-rich, heavier dependency later. This is exactly the kind of design decision you have to specify to a coding agent. If you just tell it to “implement logging” or “integrate a logging package”, you’ll likely get the heaviest dependency out there with capabilities and complexity appropriate for deploying a fully scaled backend service completely entwined with your whole code base. Getting a reasonable outcome requires being specific about the need for an abstract logging interface, the requirement that this interface is the exclusive project logging system (no sneaking dependency-defined constants out into the code base), and that swapping implementations under the interface must have zero effect on the rest of the project code.
On the plus side, an agent provided with clear instructions does this kind of refactor quite reliably. It was easy to verify that there were no lingering print statements, and adding an invariant to AGENTS.md that the logger was the sole path for printed output has proven to be sufficient for maintaining consistency. I decided to stick with the simple, local logger implementation for now. It’s well under 100 lines of code and does exactly what’s needed and no more. If the project ever needs something more capable, the abstract interface allows for a painless swap.
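For anyone who hasn't used the pattern, the following is a minimal sketch of an abstract logging interface backed by a small local implementation. It is not the project's actual logger. The `Logger` method set, the `level=... msg=...` line format, and the names `formatLine` and `NewLocalLogger` are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// Logger is the only logging surface the rest of the code base is allowed to
// touch. The method set here is an assumption made for illustration.
type Logger interface {
	Infof(format string, args ...any)
	Errorf(format string, args ...any)
}

// formatLine renders one structured log line.
func formatLine(level, format string, args ...any) string {
	return fmt.Sprintf("level=%s msg=%q\n", level, fmt.Sprintf(format, args...))
}

// localLogger is a small, dependency-free implementation of Logger.
type localLogger struct{ out io.Writer }

// NewLocalLogger returns the interface type, so callers never see the
// concrete implementation and a heavier backend can be swapped in later.
func NewLocalLogger(out io.Writer) Logger { return &localLogger{out: out} }

func (l *localLogger) Infof(format string, args ...any) {
	io.WriteString(l.out, formatLine("INFO", format, args...))
}

func (l *localLogger) Errorf(format string, args ...any) {
	io.WriteString(l.out, formatLine("ERROR", format, args...))
}

func main() {
	var log Logger = NewLocalLogger(os.Stderr)
	log.Infof("loaded %d tensors", 42)
	log.Errorf("tensor %s missing", "blk.0.attn_q.weight")
}
```

Because the constructor hands back the interface rather than the struct, nothing outside this file can come to depend on implementation details. That is the property that makes a later swap painless.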
Another lingering piece of tech debt was that the
integration-test make target’s result was eyeballs only.
With the logging system in place, it was trivially easy to add a scan
for error messages to the log file and exit 1 in the make
target steps when any were found. The test_equiv.sh script
got the same treatment, making it easier to catch when we “got lucky” on
an outcome in spite of something having gone wrong internally. This
arrangement allows for cheap runtime safety checks (i.e. “was this
interface’s contract violated by the caller” tests) to provide double
service. Not only do they prevent crashes at runtime, they act as
integration test checks.
A brief digression: unit tests are a great tool, but they are frequently misused. Hitting 90+% unit test coverage is not a security blanket, it’s a straitjacket. In chasing that last 30% or so of coverage, you end up with tests that do things like check whether a certain line was logged just to verify a code block was hit. You have a test suite that’s essentially a checksum of the comment-stripped code*, but with way more steps. It becomes impossible to make the smallest change without having to rewrite dozens of tests. Unit tests are for pieces that can be isolated (I love them for finicky math functions). And yes, you should design your systems to be testable, but as in all things, pragmatism needs to get a vote. Unit tests for interfaces, tricky logic, tricky math. Integration tests for how the whole system actually works. Reaching 90+% test coverage via a combination of unit and integration tests lets you build robust systems you can actually maintain. Excessive unit test coverage is a suit of armor with fixed joints that needs a plasma cutter and arc welder for every pose change. End of digression.
A last piece of infrastructure was needed before moving on to the fun part of adding the new model — memory usage tracking. Fortunately, ggml provided facilities for keeping track of consumed RAM and VRAM, so it was primarily a case of wiring up the bindings and adding some basic structs to hold the results. A few checks in key places should keep us from asking for too much memory when the cupboard is bare**.
All of this took about a day, but that kind of foundation-shoring work reliably pays off, crossing the break-even point into positive-ROI territory within the first week.
Back to Gemma4. It turned out there was a hidden piece of model-specific cruft in the tokenizer — a complex regular expression used for finding special tokens during model loading had a hard-coded list of patterns related to known models. The fix for that was to just use the token list from the GGUF as the starting point and build the special tokens list using the token metadata. As expected, a new layer block definition was also going to be needed, but the current breakdown of blocks and graph construction logic were too simple. They were structurally based on the similarities of the models seen so far. It didn’t require an excessive amount of surgery, but it did indicate that an eye would need to be kept on complexity growth of the data-driven system. As new architectures were added, it would be extremely easy to slip into a “death by flags” anti-pattern where it became impossible to easily follow the logic of model graph construction. In fact, part of the refactor for Gemma4 was breaking up the block builder to avoid this exact scenario.
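The tokenizer fix can be sketched roughly as follows. This is not bench's code. The token type codes match the ones llama.cpp writes into the GGUF `tokenizer.ggml.token_type` array, but treating both control and user-defined tokens as "special" is my assumption, as is the `specialTokens` helper itself.

```go
package main

import "fmt"

// Token type codes as stored in the GGUF tokenizer.ggml.token_type array.
const (
	tokenTypeNormal      = 1
	tokenTypeControl     = 3
	tokenTypeUserDefined = 4
)

// specialTokens returns the token strings whose metadata marks them as
// control or user-defined. No model names or hard-coded patterns involved.
func specialTokens(tokens []string, types []int32) ([]string, error) {
	if len(tokens) != len(types) {
		return nil, fmt.Errorf("token list has %d entries but type list has %d", len(tokens), len(types))
	}
	var out []string
	for i, t := range types {
		if t == tokenTypeControl || t == tokenTypeUserDefined {
			out = append(out, tokens[i])
		}
	}
	return out, nil
}

func main() {
	// Made-up vocabulary slice; a real one comes from tokenizer.ggml.tokens
	// and tokenizer.ggml.token_type in the GGUF.
	tokens := []string{"<pad>", "hello", "<bos>", "<custom_tool>"}
	types := []int32{tokenTypeControl, tokenTypeNormal, tokenTypeControl, tokenTypeUserDefined}
	fmt.Println(specialTokens(tokens, types))
}
```

The resulting list can then feed whatever matcher the tokenizer uses to split special tokens out of incoming text.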
Another issue Gemma4 turned up was that some architectures supported
both dense and MoE models under the same name. Qwen3.5 has two
separately named architectures in the .gguf files: “qwen35” and
“qwen35moe”. Gemma4 used “gemma4” for both, and the dense/MoE
differentiation needed to be inferred from the model metadata (there
wasn’t an “is_moe: true” equivalent). This led to adding an
ffn_alt section to the .arch.toml file format so that both
configurations could be adequately described in one file. It also
cascaded to how the diagram builder worked. Now, one .arch.toml file
could produce two .arch.svg files: one for dense, one for MoE. In
theory, it would have been possible to just merge the dense and MoE
diagrams in a side-by-side configuration, but since the entire point of
that feature is to make it easy for someone to see what’s going on
inside a model, simpler visual output in separate files was the obvious
design choice.
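As a rough illustration of inferring dense versus MoE from metadata, here is a hypothetical sketch. The `<arch>.expert_count` key follows llama.cpp's GGUF naming convention, but I am not claiming this is the signal bench uses. The `ffn` section name, the choice of `ffn_alt` as the MoE variant, and the metadata values are all made up for the example.

```go
package main

import "fmt"

// ffnSection picks which feed-forward description to build from. The
// "<arch>.expert_count" key follows llama.cpp's GGUF naming convention; that
// bench uses it, and that "ffn" is the dense section's name, are assumptions.
func ffnSection(arch string, meta map[string]uint32) string {
	if n, ok := meta[arch+".expert_count"]; ok && n > 0 {
		return "ffn_alt" // assumed here to describe the MoE variant
	}
	return "ffn" // assumed here to describe the dense variant
}

func main() {
	// Made-up metadata values for illustration only.
	dense := map[string]uint32{"gemma4.block_count": 32}
	moe := map[string]uint32{"gemma4.block_count": 32, "gemma4.expert_count": 8}
	fmt.Println(ffnSection("gemma4", dense), ffnSection("gemma4", moe))
}
```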
All told, the Gemma4 (dense) implementation, with its required reworking passes, took about two days. Adding Gemma4 (MoE) support took about a half day, mostly due to the .arch.toml format change and diagramming code changes. The integration tests and llama-equivalence tests were passing, so now we had the cool, new model working in the lab bench. Adding it had taken more time than expected, but it had resulted in a more robust data-driven system. I’d take that as a fair trade.
If you want to see what the guts look like, here are the diagrams for the dense architecture:
[Diagrams generated from gemma4.arch.toml (dense)]
Act II: LLaDA
Bench had support for multiple auto-regressive transformer architectures, but none for text diffusion. LLaDA was a model I’d come across early in the project’s development. I had downloaded it and gotten a .arch.toml put together that allowed it to load nicely, but I didn’t realize it used a completely different generation process than the other models until I tried to run it. One “d’oh!” later, I decided to park it since I was already in the middle of getting one big set of features working.
After having done the rework needed to get Gemma4 supported, it seemed like a good time to add text diffusion generation. This turned out to be another fine learning opportunity.
What quickly became evident about text diffusion is that it is substantially different from auto-regression in several key ways. First is that the token count of the response is fixed at request time. You also have to tell it how many times to refine its thinking before considering the job done. Lastly, you need to tell it how much of the response length to address each time you “turn the thinking crank”. The fixed-size response may actually have a bunch of pad tokens filling out the end, but the whole response buffer is computed no matter what. So if you specify a response size of 128 tokens but prompt with “answer 2+2=? in one digit.”, you’ll get “4<EOS><PAD><PAD>…”, which any reasonable server will truncate for you to just the useful part. But you will have paid the compute cost for all 128 tokens.
What this all boils down to is that you can’t just set a max response token count of 8K and forget about it. In auto-regression, you only pay the cost of the output tokens that are generated. You turn the crank and keep getting output tokens until the model tells you it’s done with an EOS (end of sequence) token. With text diffusion, the parameters you specify for response token count and step count dictate a fixed compute cost. And the extra cost is not cheap. Diffusion has a theoretical efficiency advantage for larger responses, but the crossover point is well beyond the range of short test questions. There is a reason text diffusion is an outlier architecture rather than the standard approach for frontier text models. The diffusion approach is actually great for image and audio generation, where the output size is known ahead of time and is well beyond the break-even efficiency threshold.
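A back-of-the-envelope comparison makes the cost gap tangible. The sketch below counts token positions pushed through the model, which is only a crude proxy for compute. It assumes the auto-regressive path uses a KV cache and that each diffusion step re-runs the prompt plus the whole response buffer with nothing cached between steps. The 12-token prompt length is made up.

```go
package main

import "fmt"

// autoRegressivePositions counts token positions pushed through the model
// with a KV cache: the whole prompt once (which also yields the first output
// token), then one position for each further output token.
func autoRegressivePositions(promptLen, outputLen int) int {
	if outputLen < 1 {
		return promptLen
	}
	return promptLen + outputLen - 1
}

// diffusionPositions assumes every refinement step re-runs the prompt plus
// the full response buffer, with nothing cached between steps.
func diffusionPositions(promptLen, responseLen, steps int) int {
	return steps * (promptLen + responseLen)
}

func main() {
	const promptLen = 12 // made-up length for "answer 2+2=? in one digit."

	fmt.Println(autoRegressivePositions(promptLen, 2)) // "4" then EOS: prints 13
	fmt.Println(diffusionPositions(promptLen, 128, 32)) // 128 tokens, 32 steps: prints 4480
}
```

Under those assumptions the diffusion run does a few hundred times the work for the same one-character answer, which is why the response size and step count can't be treated as "set it high and forget it" knobs.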
As for getting the lab bench support for LLaDA underway, there was a
minor hitch. Text diffusion is so much of an outlier that
llama-server does not actually support LLaDA. The llama.cpp
project does have an implementation .cpp file and a
llama-diffusion-cli utility for running text diffusion
models, so adding this model was not without a solid reference. The
llama equivalence test wasn’t going to be doable, however, because
llama-diffusion-cli does not provide a way of extracting
the logprobs stats. The equivalence check would have to use the less
reliable final text output instead. It also meant that the
test_inference.sh script was going to need some surgery to
maintain its functionality for the new model class.
The implementation process played out similarly to that for Gemma4 —
some adjustments to model construction logic, a small change to the
.arch.toml format. The main wrinkle was, of course, the new diffusion
execution path. Complete development time from cutting the branch to
diffusion producing sane output was about a day. The defaults in
test_inference.sh for text diffusion parameters (128 output
tokens, 32 diffusion steps, diffusion block length 64) allow for short
answers in reasonable time, but these are not values that make it
suitable for general use. Creating a heuristic system for tuning these
values adaptively to a given prompt would have been a bottomless rabbit
hole, so the fixed values will do for testing purposes.
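For a feel of what the diffusion execution path does with those three parameters, here is a toy sketch of block-wise masked decoding. It is not bench's implementation and it glosses over sampling details. The `predictFn` stand-in for a forward pass, the `maskID` sentinel, and the even split of steps across blocks are assumptions about the general shape of the algorithm.

```go
package main

import (
	"errors"
	"fmt"
)

const maskID = -1 // hypothetical sentinel for a position that is still masked

// predictFn stands in for one full forward pass over the whole sequence. For
// every position it returns a proposed token and a confidence score.
type predictFn func(seq []int) (tokens []int, conf []float64)

// diffuse fills respLen masked positions after the prompt. The response is
// split into blocks of blockLen, the step budget is divided evenly among the
// blocks, and each step commits the most confident masked positions of the
// current block.
func diffuse(prompt []int, respLen, steps, blockLen int, predict predictFn) ([]int, error) {
	if blockLen <= 0 || respLen <= 0 || respLen%blockLen != 0 {
		return nil, errors.New("response length must be a positive multiple of block length")
	}
	numBlocks := respLen / blockLen
	if steps < numBlocks || steps%numBlocks != 0 {
		return nil, errors.New("steps must be a positive multiple of the block count")
	}
	stepsPerBlock := steps / numBlocks

	seq := make([]int, len(prompt)+respLen)
	copy(seq, prompt)
	for i := len(prompt); i < len(seq); i++ {
		seq[i] = maskID
	}

	for b := 0; b < numBlocks; b++ {
		start := len(prompt) + b*blockLen
		end := start + blockLen
		for s := 0; s < stepsPerBlock; s++ {
			tokens, conf := predict(seq) // cost: the full sequence, every step
			masked := 0
			for i := start; i < end; i++ {
				if seq[i] == maskID {
					masked++
				}
			}
			// Spread the remaining masked positions over the remaining steps
			// (rounded up), so the last step of a block always finishes it.
			remaining := stepsPerBlock - s
			quota := (masked + remaining - 1) / remaining
			for k := 0; k < quota; k++ {
				best := -1
				for i := start; i < end; i++ {
					if seq[i] == maskID && (best < 0 || conf[i] > conf[best]) {
						best = i
					}
				}
				seq[best] = tokens[best]
			}
		}
	}
	return seq[len(prompt):], nil
}

func main() {
	// Toy predictor: proposes token 100+i at position i, more confident earlier.
	toy := func(seq []int) ([]int, []float64) {
		tokens := make([]int, len(seq))
		conf := make([]float64, len(seq))
		for i := range seq {
			tokens[i] = 100 + i
			conf[i] = 1 / float64(1+i)
		}
		return tokens, conf
	}
	out, err := diffuse([]int{1, 2}, 8, 4, 4, toy)
	fmt.Println(out, err)
}
```

Plugging in the test defaults (128 output tokens, 32 steps, block length 64) gives two blocks of 16 steps each, committing four positions per step in this scheme.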
If you look back at the Gemma4 layer diagram, you’ll note that the diffusion layers diagram looks very much like the auto-regression version. The difference is in how the graph is executed.
[Diagram generated from llada.arch.toml]
Act III: Safetensors
In adding the previous architectures, one point of friction that came
up was that finding GGUF equivalents of the original models as released
by the creators was not always possible. I could just use llama.cpp’s
utility for going from huggingface.co’s Safetensors directories to GGUF
and call it done, but the next capability I wanted for the lab bench was
handling the “test” part of the modify/test cycle for model
manipulation. Again, convert_hf_to_gguf.py would work, but
it’s not a quick process. If there’s one thing I’m a fiend about, it’s
iteration speed.
The approach I chose was to create an abstract interface for model loading. It would present the same surface area as the existing GGUF loader, and the GGUF loader would just be one implementation of that interface. The abstract interface would use the same symbol names from the .arch.toml files that the GGUF loader had been using to this point. This would allow all of the model graph construction and execution code to remain unchanged.
One of the things that fell out of this was that a new TOML file was going to be needed that mapped names that worked for the GGUF loader to names found in the Safetensors metadata. It also turned out that GGUFs contain some derived values (both numerical and tensor) that had to be either stored in the new .arch.stmap.toml files or computed at run time by the new loader.
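The name-mapping half of that can be pictured with a small sketch. The tensor names below follow the common llama-family conventions on each side (GGUF `blk.N...` versus Hugging Face `model.layers.N...`), but the `{i}` placeholder syntax, the `stmap` table, and the `safetensorsName` helper are illustrative assumptions. They are not the real .arch.stmap.toml format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stmap maps GGUF-style tensor name templates to Safetensors-style ones.
// "{i}" is a hypothetical placeholder for the layer index.
var stmap = map[string]string{
	"token_embd.weight":       "model.embed_tokens.weight",
	"blk.{i}.attn_q.weight":   "model.layers.{i}.self_attn.q_proj.weight",
	"blk.{i}.ffn_down.weight": "model.layers.{i}.mlp.down_proj.weight",
}

// safetensorsName resolves a template and layer index to the name to look up
// in the Safetensors header. Templates without "{i}" ignore the layer.
func safetensorsName(template string, layer int) (string, bool) {
	st, ok := stmap[template]
	if !ok {
		return "", false
	}
	return strings.ReplaceAll(st, "{i}", strconv.Itoa(layer)), true
}

func main() {
	fmt.Println(safetensorsName("blk.{i}.attn_q.weight", 3))
	fmt.Println(safetensorsName("token_embd.weight", 0))
	fmt.Println(safetensorsName("blk.{i}.unknown.weight", 0))
}
```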
This was not an easy lift. A compounding difficulty was that Anthropic was having a meltdown. There were actually multiple cache invalidation bugs they rolled out in sequence. It was not a good time for them or their users. I tried using another agent/model combo and it was ultimately more expensive and less effective. Once the dust settled, I switched back to Claude Code.
What made the preceding particularly painful was that this was finicky, complex work of the type that doesn’t have thousands of training examples baked into coding models. The agent needed to be closely watched and carefully guided the whole way. Once the first pass at the Safetensors loader was done, it was easy to verify that GGUF loading functioned as before. The architectural strategy worked as intended. The same could not be said for that first pass at Safetensors loading. It crashed catastrophically, even freezing my machine during bug hunting. What became clear was that I needed to put a pin in the loader work to do a full runtime safety pass. The memory tracking work done during the Gemma4 support effort had closed a few gaps, but obviously not enough of them.
Two architecture reviews & burndown lists later, there were more systems in place for preventing catastrophes. The biggest was that the model loader interface now had a method for estimating minimum memory requirements for a given model. That, combined with a check against available resources, prevented terminal overdrafts. Runtime safety checks were also added to the ggml bindings. These checks validated all required parameters and errored out appropriately on failure to meet the contract.
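The memory guard amounts to something like the following sketch. The `memoryEstimator` interface, the `EstimateMinBytes` method name, and the idea of holding back a fixed reserve are assumptions made for illustration. The real loader method and the way available memory is queried will differ.

```go
package main

import "fmt"

// memoryEstimator is a hypothetical slice of the loader interface.
type memoryEstimator interface {
	EstimateMinBytes() (uint64, error)
}

// checkFits refuses a load when the estimate plus a safety reserve exceeds
// what is available. All three values are in bytes.
func checkFits(required, available, reserve uint64) error {
	if required > available || available-required < reserve {
		return fmt.Errorf("need %d bytes plus %d reserve, only %d available", required, reserve, available)
	}
	return nil
}

// guardLoad asks the loader for its estimate before anything is allocated.
func guardLoad(e memoryEstimator, available, reserve uint64) error {
	required, err := e.EstimateMinBytes()
	if err != nil {
		return fmt.Errorf("estimating memory: %w", err)
	}
	return checkFits(required, available, reserve)
}

// fixedEstimate is a stand-in loader that reports a constant requirement.
type fixedEstimate uint64

func (f fixedEstimate) EstimateMinBytes() (uint64, error) { return uint64(f), nil }

func main() {
	const gib = uint64(1) << 30
	fmt.Println(guardLoad(fixedEstimate(6*gib), 16*gib, gib))
	fmt.Println(guardLoad(fixedEstimate(12*gib), 8*gib, gib))
}
```

Doing the comparison up front turns a machine-freezing overcommit into an ordinary error message.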
Once the runtime safety work was done, the Safetensors loader quest
resumed. The model graph construction catastrophes were tracked down and
eliminated. Then came the smoke tests — good ol’ “2+2=?”. Apparently
“2+2=kwyjibo”? That’s not right. The bug hunting went from “please don’t
die” to “please stop being incoherent”, which was a definite step up.
Back to the salt mine. There are a number of data digestion steps
convert_hf_to_gguf.py performs during its conversion.
Getting them implemented in Go to work at load-time was more than a
little touchy. There was an endless sequence of “just one more fix”
moments.
Ultimately, after a fair amount of swearing and a great deal of
stubbornness, model-agnostic code was able to correctly load Qwen3.5-9B
(dense) Safetensors via the .arch.stmap.toml file. Or apparently
correctly, at least. Inference produced coherent output, so the next
step was to verify equivalence with the GGUF-loaded version. An
additional gguf-st mode was added to test_equiv.sh to show
whether or not I’d stuck the landing. The test passed on the first try,
and I will not lie about my reaction. Not all expletives are angry.
Safetensors loading was over the hump. The next step was getting another architecture working. I chose LLaDA because, being substantially different, it was most likely to reveal additional requirements. It turned out that a description of certain types of transformations to be applied to tensors was needed, so this got added to the .arch.stmap.toml file format. This wasn’t 15 minutes, but it took significantly less time than was needed to get the first model working. Two hours was a notable improvement over two days.
There are some major verdicts still TBD on this effort. Will adding Safetensors loading for further models be one painful architecture-modification slog after another? Will running models straight from their Safetensors ultimately pay off during model manipulation experiments? If those questions turn up snake eyes, this code will be removed (don’t throw good tokens after bad). Either way, the educational value was an unqualified win, so I’ll take that and run.
Tune in For Next Week’s Thrilling Conclusion
Among the many lessons learned on the whole lab bench journey is that auto-regressive transformers produce inference in two phases. The first is prefill, the second is decoding. Prefill is done by converting your text prompt to tokens (elements of the model’s internal “language”). Each token is used to look up a vector from an embedding table (it’s part of the model but not in the graph). Those vectors are stacked up to create the embedding matrix, which is fed through the entire graph of the model’s tensors. This process pre-loads the KV cache¶, preparing the live model for the decoding phase. In other words, your question gets turned into a big matrix, that matrix is pushed through the machinery, and this “primes the pump”.
At the end of the prefill process, a single output token is produced, and it is used to start the decoding phase. That first token is stored in an output buffer and then used to look up a vector from the embedding table (the same one used to make the embedding matrix — tokens have to become vectors to get processed). This found vector is then run through the tensors of the model graph‖ and another token is produced. Like the preceding token, it is stored, used to look up another vector, and the process continues. It is stopped when the model produces a special EOS (end of sequence) token. From there, the collected output tokens are converted from the model’s internal language back to a human-readable form (text, in this case)††.
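The two phases reduce to a surprisingly small loop. The sketch below is a toy, not bench's engine. The `Model` interface, the `eos` value, and the `countdown` stand-in model are all hypothetical. A real `Forward` would do the embedding lookup, run the graph, extend the KV cache, and decode logits to a token.

```go
package main

import "fmt"

const eos = 0 // hypothetical EOS token id for the toy model

// Model is a stand-in for the engine. Forward pushes tokens through the
// graph, extends the KV cache, and returns the next token.
type Model interface {
	Forward(tokens []int) int
}

// generate runs prefill once, then decodes one token per pass until the
// model emits EOS or maxNew tokens have been collected.
func generate(m Model, prompt []int, maxNew int) []int {
	var out []int
	next := m.Forward(prompt) // prefill: the whole prompt in one pass
	for len(out) < maxNew {
		if next == eos {
			break
		}
		out = append(out, next)
		next = m.Forward([]int{next}) // decode: one token per pass
	}
	return out
}

// countdown is a toy model that emits n, n-1, ..., 1 and then EOS. It tracks
// how many positions have been cached so the cost is visible.
type countdown struct {
	next   int
	cached int
}

func (c *countdown) Forward(tokens []int) int {
	c.cached += len(tokens)
	t := c.next
	if c.next > 0 {
		c.next--
	}
	return t
}

func main() {
	m := &countdown{next: 3}
	out := generate(m, []int{9, 9, 9, 9}, 10)
	fmt.Println(out, m.cached) // prints [3 2 1] 7
}
```

The four prompt tokens go through in a single pass, while each of the three output tokens costs its own pass. That asymmetry is the whole story behind the input/output pricing split.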
With the tooling in one hand and the squishy wetware-cached knowledge in the other, the next step is to try some “what the hell, let’s see if this does anything” experiments. Whatever happens, it’s bound to be informative. I’ll let you know how that works out.
Footnotes
† ↩︎ ggml, gguf-parser-go, and gonja were all indispensable
‡ ↩︎ CGO is used only to bind ggml. There is no utility code in C. Keeping the binding layer maximally thin is a design invariant.
§ ↩︎ Thank you, BurntSushi/toml
* ↩︎ And about as informative.
** ↩︎ The “shouldgun” over the fireplace…
¶ ↩︎ The KV cache mechanism is a form of memoization. It stores a queryable table of vectors that are the results of previous calculations. Without it, every decoding step would have to push the entire sequence so far back through the model, so the number of token positions processed would grow as O(n^2) in the output length. Each doubling of output token count would roughly quadruple the compute effort.
‖ ↩︎ “Run through” is a definite hand wave. This paragraph needs to end sometime and the level of detail presented was enough for getting on with.
†† ↩︎ This is why inference providers charge for API usage with rates measured in both input tokens and output tokens. It is also why output tokens are generally billed at a substantially higher rate. All of the input tokens get processed via the embedding matrix in one pass through the model’s tensor graph. Each output token comes at the cost of a complete “turn of the wheel”, making it more expensive.