TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs. Will be interesting to see if independent benchmarks resolve in that direction as well.
The 20B model runs on my Mac laptop using less than 15GB of RAM.
I tried to generate a streamlit dashboard with MACD, RSI, MA(200). 1:0 for qwen3 here.
qwen3-coder-30b 4-bit mlx took on the task w/o any hiccups with a fully working dashboard, graphs, and recent data fetched from yfinance.
gpt-oss-20b mxfp4's code had a missing datetime import, and once that was fixed it delivered a dashboard with no data and a start date of Aug 2020. After adjusting the date, the update methods did not work and just displayed error messages.
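For context, the kind of script both models were being asked to produce is roughly this minimal sketch (my own illustration, not either model's output; the ticker and indicator windows are arbitrary):

```
# Minimal sketch of the requested dashboard (illustrative, not model output).
# Assumes streamlit, yfinance and pandas are installed.
from datetime import date, timedelta

import pandas as pd
import streamlit as st
import yfinance as yf

st.title("Price dashboard: MACD, RSI, MA(200)")

ticker = st.text_input("Ticker", "AAPL")
start = st.date_input("Start date", date.today() - timedelta(days=3 * 365))

# Fetch recent daily data.
data = yf.download(ticker, start=start)

if data.empty:
    st.error("No data returned for this ticker/date range.")
else:
    close = data["Close"].squeeze()

    # 200-day simple moving average.
    ma200 = close.rolling(200).mean()

    # MACD: 12/26 EMA difference plus a 9-period signal line.
    ema12 = close.ewm(span=12, adjust=False).mean()
    ema26 = close.ewm(span=26, adjust=False).mean()
    macd = ema12 - ema26
    signal = macd.ewm(span=9, adjust=False).mean()

    # RSI(14) from average gains and losses.
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rsi = 100 - 100 / (1 + gain / loss)

    st.line_chart(pd.DataFrame({"Close": close, "MA(200)": ma200}))
    st.line_chart(pd.DataFrame({"MACD": macd, "Signal": signal}))
    st.line_chart(rsi.rename("RSI(14)"))
```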
For now, I wouldn't rank any model from OpenAI highly on coding benchmarks. Despite all the false messaging they put out, almost every model OpenAI has launched, even the expensive high-end o3 models, is monumentally bad at coding tasks. So this is expected.
If it's decent at other tasks, which I do often find OpenAI better at than others, then I think it's a win, especially a win for the open-source community: even the AI labs that pioneered the Gen AI hype and never wanted to launch open models are now being forced to launch them. That is definitely a win, and not something that was certain before.
NVIDIA will probably give us nice, coding-focused fine-tunes of these models at some point, and those might compare more favorably against the smaller Qwen3 Coder.
>I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs
I have a bunch of scripts that use tool calling. Qwen-3-32B handles everything flawlessly at 60 tok/sec. Gpt-oss-120B breaks in some cases and runs at a mere 35 tok/sec (it doesn't fit on the GPU).
But I hope there's still some ironing out to do in llama.cpp and in the quants. So far it feels lackluster compared to Qwen3-32B and GLM-4.5-Air
The space invaders game seems like a poor benchmark. Both models understood the prompt and generated valid, functional JavaScript. One just added more fancy graphics. It might simply have "use fancy graphics" in its system prompt for all we know.
Still, if you asked this open model to generate a fancy space invaders game with polish, and asked the other model to generate a bare-bones space invaders game with the fewest lines of code, I think there's a good chance they'd switch places. This doesn't really test the models' ability to generate a space invaders game so much as their tendency toward an elaborate vs. a simple solution.
Not really. The other commenters are correct, I feel: this is not really proving anything about the fundamental capability of the model. It's just a hello-world benchmark adding no real value, just driving blog traffic for you.
The space invaders benchmark proves that the model can implement a working HTML and JavaScript game from a single prompt. That's a pretty fundamental capability for a model.
Implement or retrieve? That's an important distinction. When evaluating models, you run a variety of tests, and the benchmarks that aren't publicly disclosed are the most reliable. Your Space Invaders game isn't really a benchmark of anything: just Google it and you'll find plenty of implementations.
I see that criticism a lot - that benchmarks like space invaders don't make sense because they're inevitably in the training data - and I don't buy that at all.
Firstly, 12GB is not enough space to hold verbatim copies of that much training data and just regurgitate them back out again.
You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...
Illustrative extract:
> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.
> When direction changes, move all aliens down by step (e.g., 10 px).
The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge).
12.8 GB is roughly 102 Gbits. Even at 4.25 bits/weight the network stores ~24 billion "micro weights". A 1.4k-token space invaders snippet occupies ~1.1 kB compressed, so the model could parametrize thousands of such snippets and still have more than 99% of its capacity left. This paper about LLM memorization is interesting if you want to know more: https://arxiv.org/abs/2312.11658 and another recent paper, the SWE-bench illusion one, shows SOTA code LLM results collapsing once memorised GitHub issues are filtered out: https://arxiv.org/pdf/2506.12286v1
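Spelling the back-of-envelope numbers out (my own rough rendering of the estimate above, with rounded figures):

```
# Back-of-envelope version of the capacity argument (rounded, illustrative numbers).
file_size_bytes = 12.8e9             # ~12.8 GB checkpoint
total_bits = file_size_bytes * 8     # ~102 Gbit
weights = total_bits / 4.25          # ~24 billion "micro weights" at 4.25 bits each

snippet_bytes = 1.1e3                # ~1.1 kB compressed space invaders snippet
snippets_per_percent = 0.01 * file_size_bytes / snippet_bytes

print(f"{total_bits / 1e9:.0f} Gbit, ~{weights / 1e9:.0f}B weights")
print(f"~{snippets_per_percent:,.0f} such snippets would fit in just 1% of the file")
```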
Add to this that the Common Crawl slices used for the Pile/C4 mirror much of what you can find on GitHub. So when the training data contains dozens of near-duplicate solutions, the network only needs to interpolate between them.
As to the CoT-style dumps you've shown, they are easy to misinterpret. Apple's "Illusion of Thinking" paper shows that models will happily backfill plausible-sounding rationales that do not correspond to the computation that actually produced the answer, and other evaluation work shows that when you systematically rewrite multiple-choice distractors so that memorisation can't help, accuracy drops by 50-90%, even on "reasoning" models: https://arxiv.org/abs/2502.12896
So a cool-looking bullet list about "edge detection" could just be narrative overspray, not really evidence of algorithmic planning.
If you actually want to know whether a model can plan an arcade game (or whatever) rather than recall it, then you need a real benchmark: metamorphic rewrites, adversarial "none of the others" options, etc. Until a benchmark controls for leakage in these ways, a perfect space invaders score mostly shows that the model has good pattern matching for code it has already seen.
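To make "metamorphic rewrites" concrete, here is a rough sketch of the idea (my own illustration, not an established benchmark; the names, canvas sizes and the ask_model/check_output hooks are placeholders): generate surface-level variants of the same task and see whether the pass rate holds up.

```
# Sketch of metamorphic prompt variants for leakage-controlled testing.
import random

BASE_PROMPT = (
    "Write an HTML+JavaScript {game} game on a {w}x{h} canvas. "
    "Call the player sprite `{player}` and the enemies `{enemy}`."
)

def metamorphic_variants(n=5, seed=0):
    """Yield n surface-level rewrites of the same underlying task."""
    rng = random.Random(seed)
    games = ["space shooter", "alien invasion", "descending-grid shooter"]
    players = ["ship", "cannon", "defender"]
    enemies = ["invader", "hostile", "blob"]
    for _ in range(n):
        yield BASE_PROMPT.format(
            game=rng.choice(games),
            w=rng.choice([640, 800, 1024]),
            h=rng.choice([480, 600, 768]),
            player=rng.choice(players),
            enemy=rng.choice(enemies),
        )

def pass_rate(ask_model, check_output, n=5):
    """ask_model(prompt) -> code string; check_output(code) -> bool,
    e.g. run the generated game headlessly and assert basic behaviour."""
    results = [check_output(ask_model(p)) for p in metamorphic_variants(n)]
    return sum(results) / len(results)
```

A model that genuinely plans the game should score about the same across these variants; one that pattern-matches a canonical prompt tends to drop sharply.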
If the models are memorizing and regurgitating from their training data, how come every model I've tried this with produces entirely different code?
Presumably this is because "the network only needs to interpolate between them". That's what I want it to do!
I tried the space invaders thing on a 4GB Qwen model today and it managed to produce a grid of aliens that advanced one step... and then dropped off the page entirely.
A transformer does not need to emit a byte-for-byte clone of a training example to benefit from having seen it. It can store a distributed representation of many near-duplicate implementations and then sample a novel linear combination. That still short-circuits algorithm design: the burden of discovering the game loop, collision logic, sprite sheet etc. was ALREADY SOLVED during pre-training.
When you temperature-sample the same model twice you also get "different" code; diversity alone is not evidence of new reasoning. What matters is functional novelty under controlled transformations (renamed variables, resized canvas, obfuscated asset file names etc.). On such metamorphic rewrites, models that appear brilliant on canonical prompts suddenly collapse, a hallmark of shallow pattern matching.
The paper I mentioned in my previous comment shows SOTA coding LLMs scoring 70%+ on SWE-bench Verified yet dropping 10-47% when the very same issues are paraphrased or drawn from unseen repos, even though the task semantics are identical. That is classic memorisation, just fuzzier than a CRC match.
As to Qwen: even at 4 bits per weight, a 4B model retains ≈ 2 GB of entropy, enough to memorise tens of thousands of full game loops. The reason it garbled the alien movement logic is probably that its limited capacity forced lossy compression, so the behaviour you saw is typical of partially recalled code patterns whose edge cases were truncated during training. That's still interpolation over memorised fragments, just with fewer fragments to blend.
And this has actually been shown (https://arxiv.org/abs/2406.15720v1): controlled fact-memorisation studies and extraction attacks up through 70B params show a monotone curve, so each extra order of magnitude adds noticeably more verbatim or near-verbatim recall. A 20B model succeeds where a 4B one fails because the former crossed the "capacity per training token" threshold for that exemplar. Nothing magical there.
Don't get me wrong, I'm not arguing against interpolation per se; generalising between held-out exemplars is precisely what we want. The problem is that most public "just write space invaders" demos never verify that the endpoints were truly unseen. Until they do, a perfect clone is compatible with nothing deeper than glorified fuzzy lookup.
This is a great explanation, thanks for putting it together.
It more or less fits my fuzzy mental model of how this stuff works.
I'm completely fine with my test prompt taking advantage of this - the point of "implement space invaders" is to explore how well it can construct a game of that shape based on the examples that it has seen in its training data, especially in comparison to other models.
I'm not trying for a test of ability to produce a unique new game - I want a short prompt that gets it to output some HTML and JavaScript that I can then interact with.
There is no way that gpt-oss-120b can beat the much larger Kimi-K2-Instruct, Qwen3 Coder/Instruct/Thinking, or GLM-4.5. How did you arrive at this rather ridiculous conclusion? The current sentiment in r/LocalLLaMA is that gpt-oss-120b is around Llama-4 Scout level. But it is indeed the best at refusals.
> TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs.
That's just straight-up not the case. Not sure how you can jump to that conclusion, not least when you stated in your post that you haven't tested tool calling.
Many people in the community are finding it substantially lobotomized, to the point that there are "safe" memes everywhere now. Maybe you need to develop better tests and pay more attention to benchmaxxing.
There are good things that came out of this release from OpenAI, but we'd appreciate more objective analyses...
> I’m waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it’s likely that OpenAI now offer the best available open weights models.
You told me off for jumping to conclusions and in the same comment quoted me saying "I think OpenAI may have taken" - that's not a conclusion, it's tentative speculation.
I did read that, and it doesn't change what I said about your comment on HN: I was calling out the fact that you are making a very bold statement without having done careful analysis.
You know you have a significant audience, so don't act like you don't know what you're doing when you chose to say "TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs" and then defend against what I was calling out by leaning on word choices like "conclusions" (I'm sure you have read conclusions in academic journals?), "I think", and "speculation".
> I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs.
IMO, the "I think..." bit could be ambiguous and read as, "In my opinion, OpenAI may have...".
I agree with you it's not a hard/bold endorsement but perhaps leading with the disclaimer that you're reserving final judgement could assuage these concerns.
My LLM agent is currently running an experiment generating many pelicans. It will compare various small-model consortiums against the same model running solo.
It should push new pelicans to the repo after the run.
Horizon-beta is up already; it's not small or open source, but I tested it anyway, and you can already see an improvement using 2+1 (2 models + the arbiter) for that model.
What did you set the context window to? That's been my main issue with models on my MacBook: you have to set the context window so short that they are way less useful than the hosted models. Is there something I'm missing there?
Nice test. I tested it with llama.cpp and the 20b model. It got the answer correct on the first try on my side. The internal/thinking dialogue is fun to read - it could be my own inner dialogue ;)
```
> "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???
<|channel|>analysis<|message|>The question: "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???". It's a common riddle. The answer is Bob? Actually "Bob's dad has five daughters: Lala, Lele, Lili, Lolo and..." The riddle is to trick: it's "Bob" himself is a daughter? Wait Bob's dad has five daughters: Lala, Lele, Lili, Lolo, and Bob? But Bob is a name often male. But the riddle says Bob's dad. So Bob is a person who has a dad, and that dad has five daughters.
But Bob could be the fifth daughter? That would be weird: Bob is likely the name of the fifth daughter. So answer: Bob. But sometimes the answer is "Lulu" or "Bob's sister"? Let's think. The riddle is a trick: It's "Bobs dad has five daughters: Lala, Lele, Lili, Lolo, and ... Bob." Because the riddle begins with "Bobs dad has five daughters" implying Bob is one of the daughters. So the missing daughter is Bob. So answer: "Bob".
Thus answer: Bob. But sometimes the riddle goes "Bob's dad has five daughters, Lala, Lele, Lili, Lolo, and ... Who is the fifth?" The answer: Bob (the name of the person speaking, "Bobs"). But in riddle it's "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???". The missing name is "Bob." So answer: Bob.
Thus the answer: Bob. Or maybe it's "Boba"? But no.
Hence answer: Bob.
We'll answer accordingly.<|start|>assistant<|channel|>final<|message|>The missing daughter is *Bob*. The riddle starts with “Bob’s dad …” so Bob himself is the fifth daughter.
```
The original riddle has a girl's name. It's a common riddle, so the LLM recognizes the pattern of the riddle and assumes that name is the answer.
But if it's good at reasoning it should understand that if the name is a boy's name, then he is not the 5th daughter.
Now the last daughter could be called anything, but because there is a vowel pattern it should pick Lulu to solve the puzzle - even though that is the wrong answer in the original puzzle.
But that makes no sense, if you ask me that John's dad has two daughters, Mary and ???, there's no way I will say "John" as an answer. I'll simply assume he also has at least one son.
I’m also very interested to know how well these models handle tool calling as I haven’t been able to make it work after playing with them for a few hours. Looks promising tho.
Update: I've tried to use LM Studio (like the author) and the tool request kept failing due to a mismatch in the prompt template. I guess they'll fix it, but it seems sloppy of LM Studio not to have tested this before release.
I was road-testing tool calling in LM Studio a week ago against a few models marked with tool support; none worked, so I believe it may be a bug. Had much better luck with llama.cpp's llama-server.
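For anyone debugging this, a minimal tool-calling smoke test against a local OpenAI-compatible endpoint (both llama-server and LM Studio expose one) looks roughly like this; the port, model name and get_weather tool are placeholders:

```
# Minimal tool-calling smoke test against a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-oss-20b",  # whatever name the server reports
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# A model that handles tool calling should return a tool_calls entry here
# instead of answering in plain text.
print(resp.choices[0].message.tool_calls)
```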
Interestingly, I am also on an M2 Max, and I get ~66 tok/s in LM Studio with the same 131072 context.
I have full offload to GPU. I also turned on flash attention in advanced settings.
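If you're driving llama.cpp directly instead of LM Studio, the same knobs look roughly like this through llama-cpp-python (a sketch; the model path is a placeholder and flash_attn needs a reasonably recent build):

```
# Rough equivalent of those LM Studio settings via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b-mxfp4.gguf",  # placeholder path
    n_ctx=131072,      # context window
    n_gpu_layers=-1,   # full offload to GPU/Metal
    flash_attn=True,   # flash attention
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in five words."}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```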
I found this surprising because that's such an old test that it must certainly be in the training data. I just tried to reproduce and I've been unable to get it (20B model, lowest "reasoning" budget) to fail that test (with a few different words).