Local LLMs are Cool

I have been experimenting with a ton of frontier and open source models: GPT 5.4, Opus 4.6, Composer 2, and MiniMax M2.7, just to name a few. All of them have been great, but lately each new release has felt like a rather minor improvement in intelligence. That’s not to say the small differences aren’t significant, but for me they have lost their novelty. So I opted for a different approach and looked into models I can run locally.

In terms of specs, I run the following setup: a Ryzen 7 3700X, 32GB of Corsair Vengeance DDR4 running at 3200 MHz, an NVIDIA GeForce RTX 3060 Ti TUF edition, and a 1TB EVO 980, all on Fedora 43. I do not have the latest and greatest hardware, but it is a solid rig, and I was amazed that my slightly outdated machine could run some local models.

For reference, I spent a good weekend finding out which models my hardware could run. In that short time I got to experiment with OpenHermes 2.5, Gemma 3, and Mistral on Ollama. I ended up settling on Gemma 3 e2b, as it was the only model that gave me a reliable tokens-per-second (TPS) rate. I tried it through a private fork of Fabs Chat (yet another shameless plug), and I came away pretty impressed with the idea of running a model locally. Despite that initial impression, I soon got bored with a chat model that could not reliably call tools, until Gemma 4 dropped earlier this week.

I was blown away by a model running on my own machine that can reliably call tools and use web search, with the searches costing me only pennies each. However, a few turns into a session, I noticed a clear degradation in performance. Digging through my journalctl logs, I discovered that the KV cache no longer fit in VRAM, so it spilled into system RAM and dragged the CPU into inference. Ultimately, this led me to introduce a pretty rough patch that only remembers the last ten messages of the current session, which keeps the model’s performance on my machine from degrading.
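To make the "last ten messages" idea concrete, here is a minimal sketch of that kind of sliding window. It is not the actual patch from my fork. The `Message` shape, the `trim_history` name, and the choice to always keep a leading system prompt are assumptions made for illustration.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str      # e.g. "system", "user", "assistant", or "tool"
    content: str


def trim_history(messages: list[Message], max_messages: int = 10) -> list[Message]:
    """Return a leading system prompt (if any) plus the newest max_messages messages.

    Assumption: the system prompt should survive trimming so the model
    keeps its instructions even after older turns are dropped.
    """
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    # Keep the first message only when it is a system prompt.
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    # Slicing from the end keeps the most recent messages in order.
    return system + rest[-max_messages:]
```

Calling `trim_history(history)` right before each request keeps the prompt, and with it the KV cache, from growing without bound. A fixed message count is only a rough proxy for token count, so a handful of very long messages could still overflow VRAM; trimming by estimated tokens would be the next refinement.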

Despite the rough chat experience, I was inspired by the potential of these local models to power smaller AI tools. My goal with my next project, Fabs Review, is to build a lightweight code review tool inspired by services like CodeRabbit and Greptile. I want to see how far I can push these models within the limits of my current hardware, and whether the result is good enough to finally justify cancelling another subscription.