The Latest Open-Source Models Disappoint Me

As of May, I’ve had time to experiment with some of the newest Chinese open-weight model releases, such as Kimi K2.6, DeepSeek V4, and others. I have to say, I’m pretty disappointed with this latest batch of models. I’ve encountered two very glaring problems that defeat the purpose of using them over a frontier model like GPT-5.5 in low-reasoning mode: over-eagerness to help and excessive use of reasoning tokens.

Let’s start with the most obvious issue: reasoning tokens, especially in Kimi K2.6. Despite the impressive tokens-per-second performance these open-weight models achieve across various providers, you often end up waiting an absurd amount of time for the model to “reason.” Initially, I assumed the long reasoning traces were due to the complexity of my problem. I had these models attempt to write a termios handler in C# after I ran into an issue where the size of the termios struct differs between macOS and Linux.
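For context, here is a rough sketch of the size mismatch itself. It is written in Python with ctypes rather than C#, and both layouts are assumptions on my part (glibc on x86-64 Linux and 64-bit macOS), so check them against the actual termios.h on each platform. The class names LinuxTermios and DarwinTermios are made up for illustration.

```python
import ctypes


# Assumed layout for glibc on x86-64 Linux:
# tcflag_t and speed_t are 32-bit, cc_t is 8-bit, NCCS is 32.
class LinuxTermios(ctypes.Structure):
    _fields_ = [
        ("c_iflag", ctypes.c_uint32),
        ("c_oflag", ctypes.c_uint32),
        ("c_cflag", ctypes.c_uint32),
        ("c_lflag", ctypes.c_uint32),
        ("c_line", ctypes.c_uint8),
        ("c_cc", ctypes.c_uint8 * 32),
        ("c_ispeed", ctypes.c_uint32),
        ("c_ospeed", ctypes.c_uint32),
    ]


# Assumed layout for 64-bit macOS:
# tcflag_t and speed_t are 64-bit, NCCS is 20, and there is no c_line field.
class DarwinTermios(ctypes.Structure):
    _fields_ = [
        ("c_iflag", ctypes.c_uint64),
        ("c_oflag", ctypes.c_uint64),
        ("c_cflag", ctypes.c_uint64),
        ("c_lflag", ctypes.c_uint64),
        ("c_cc", ctypes.c_uint8 * 20),
        ("c_ispeed", ctypes.c_uint64),
        ("c_ospeed", ctypes.c_uint64),
    ]


if __name__ == "__main__":
    print("Linux termios size:", ctypes.sizeof(LinuxTermios))
    print("macOS termios size:", ctypes.sizeof(DarwinTermios))
```

Under these assumed layouts the two structs come out to 60 and 72 bytes, which is why a single fixed-size struct definition cannot be passed to tcgetattr on both platforms.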

However, after several runs with each model, I kept seeing the same pattern: a chain of thought that spiraled in unproductive directions. The models consistently took over four minutes just to write a simple library import. After reviewing benchmarks from Artificial Analysis, it became clear that the issue was not my prompt or the model setup. In their benchmarks, Kimi K2.6 in max reasoning mode produced 156,531,946 reasoning tokens against 9,014,992 answer tokens, a ratio of about 17:1. GPT-5.5 xhigh came in at 67,662,076 reasoning tokens against 7,525,826 answer tokens, or roughly 9:1.
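The ratios are easy to check by hand, but here is a small sketch of the arithmetic using the token counts quoted above. The function name reasoning_ratio and the FIGURES dictionary are my own for illustration and are not anything Artificial Analysis publishes.

```python
def reasoning_ratio(reasoning_tokens: int, answer_tokens: int) -> float:
    """Return how many reasoning tokens were spent per answer token."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return reasoning_tokens / answer_tokens


# Token counts as quoted in this post: (reasoning tokens, answer tokens).
FIGURES = {
    "Kimi K2.6 (max reasoning)": (156_531_946, 9_014_992),
    "GPT-5.5 xhigh": (67_662_076, 7_525_826),
}

if __name__ == "__main__":
    for model, (reasoning, answer) in FIGURES.items():
        ratio = reasoning_ratio(reasoning, answer)
        print(f"{model}: {ratio:.1f} reasoning tokens per answer token")
```

Put differently, on these numbers Kimi K2.6 spends close to twice as many reasoning tokens per answer token as GPT-5.5 xhigh.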

Unfortunately, the excessive reasoning is only one of the issues with these models. The other is their over-eagerness to help. Very frequently, I see the same loop no matter what the prompt asks for: search the codebase for files to target → inspect the relevant file → notice an unrelated missing implementation → create a solution for the original prompt while also implementing the unrelated functionality.

This happens regardless of whether the prompt depends on that missing implementation. While this may initially seem like a bonus, in many of my use cases the missing implementation is critical and not something I am comfortable delegating to AI. Depending on the AI harness and whether a revert option exists, this can quickly become an awful experience.

This often forces me to review unrelated code changes in an attempt to restore the original implementation. In my experience, asking a model to undo its last changes is effectively a gamble: either it partially reverts the previous turn, or it ignores the instruction entirely and resets back to the last commit, wiping out all changes made during the session.

Ultimately, these models feel like they have been heavily reinforced and fine-tuned for agentic coding, though I do not have the knowledge or experience to make that claim with certainty. What I can say is that these models are not yet mature enough to replace frontier models, but they are a cheaper alternative for certain agentic use cases outside of programming.

Personally, I am going to stick with GPT-5.5 in low-reasoning mode for programming and use DeepSeek V4 Flash, now priced at $0.14 in and $0.28 out, for search, documentation, and OpenClaw.