TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs. Will be interesting to see if independent benchmarks resolve in that direction as well.
The 20B model runs on my Mac laptop using less than 15GB of RAM.
I tried to generate a streamlit dashboard with MACD, RSI, MA(200). 1:0 for qwen3 here.
qwen3-coder-30b 4-bit mlx took on the task w/o any hiccups with a fully working dashboard, graphs, and recent data fetched from yfinance.
gpt-oss-20b mxfp4's code had a missing datetime import, and once that was fixed it delivered a dashboard with no data and a start date of Aug 2020. After adjusting the date, the update methods did not work and just displayed error messages.
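For context, the kind of script both models were being asked to produce is roughly this minimal sketch (my own illustration, not either model's output; the ticker and indicator windows are arbitrary):

```
# Minimal sketch of the requested dashboard (illustrative, not model output).
# Assumes streamlit, yfinance and pandas are installed.
from datetime import date, timedelta

import pandas as pd
import streamlit as st
import yfinance as yf

st.title("Price dashboard: MACD, RSI, MA(200)")

ticker = st.text_input("Ticker", "AAPL")
start = st.date_input("Start date", date.today() - timedelta(days=3 * 365))

# Fetch recent daily data.
data = yf.download(ticker, start=start)

if data.empty:
    st.error("No data returned for this ticker/date range.")
else:
    close = data["Close"].squeeze()

    # 200-day simple moving average.
    ma200 = close.rolling(200).mean()

    # MACD: 12/26 EMA difference plus a 9-period signal line.
    ema12 = close.ewm(span=12, adjust=False).mean()
    ema26 = close.ewm(span=26, adjust=False).mean()
    macd = ema12 - ema26
    signal = macd.ewm(span=9, adjust=False).mean()

    # RSI(14) from average gains and losses.
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rsi = 100 - 100 / (1 + gain / loss)

    st.line_chart(pd.DataFrame({"Close": close, "MA(200)": ma200}))
    st.line_chart(pd.DataFrame({"MACD": macd, "Signal": signal}))
    st.line_chart(rsi.rename("RSI(14)"))
```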
For now, I wouldn't rank any model from OpenAI highly on coding benchmarks. Despite all the false messaging they put out, almost every model OpenAI has launched, even the expensive high-end o3 models, is monumentally bad at coding tasks. So this is expected.
If it's decent at other tasks, which I do often find OpenAI better at than others, then I think it's a win, especially a win for the open-source community: even the AI labs that pioneered the Gen AI hype and never wanted to launch open models are now being forced to launch them. That is definitely a win, and not something that was certain before.
NVIDIA will probably give us nice, coding-focused fine-tunes of these models at some point, and those might compare more favorably against the smaller Qwen3 Coder.
>I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs
I have a bunch of scripts that use tool calling. Qwen-3-32B handles everything flawlessly at 60 tok/sec. Gpt-oss-120B breaks in some cases and runs at a mere 35 tok/sec (it doesn't fit on the GPU).
But I hope there's still some ironing out to do in llama.cpp and in the quants. So far it feels lackluster compared to Qwen3-32B and GLM-4.5-Air
The space invaders game seems like a poor benchmark. Both models understood the prompt and generated valid, functional JavaScript. One just added more fancy graphics. It might simply have "use fancy graphics" in its system prompt for all we know.
Still, if you asked this open model to generate a fancy space invaders game with polish, and asked the other model to generate a bare-bones space invaders game with the fewest lines of code, I think there's a good chance they'd switch places. This doesn't really test the models' ability to generate a space invaders game so much as their tendency toward an elaborate vs. a simple solution.
Not really. The other commenters are correct, I feel: this is not really proving anything about the fundamental capability of the model. It's just a hello-world benchmark adding no real value, just driving blog traffic for you.
The space invaders benchmark proves that the model can implement a working HTML and JavaScript game from a single prompt. That's a pretty fundamental capability for a model.
Implement or retrieve? That's an important distinction. When evaluating models, you run a variety of tests, and the benchmarks that aren't publicly disclosed are the most reliable. Your Space Invaders game isn't really a benchmark of anything: just Google it and you'll find plenty of implementations.
I see that criticism a lot - that benchmarks like space invaders don't make sense because they're inevitably in the training data - and I don't buy that at all.
Firstly, 12GB is not enough space to hold verbatim copies of that much training data and just regurgitate them back out again.
You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...
Illustrative extract:
> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.
> When direction changes, move all aliens down by step (e.g., 10 px).
The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge).
12.8 GB is roughly 102 Gbits. Even at 4.25 bits/weight the network stores ~24 billion "micro weights". A 1.4k-token space invaders snippet occupies ~1.1 kB compressed, so the model could parametrize thousands of such snippets and still have more than 99% of its capacity left. This paper about LLM memorization is interesting if you want to know more: https://arxiv.org/abs/2312.11658 and another recent paper, the SWE-bench illusion one, shows SOTA code LLM results collapsing once memorised GitHub issues are filtered out: https://arxiv.org/pdf/2506.12286v1
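Spelling the back-of-envelope numbers out (my own rough rendering of the estimate above, with rounded figures):

```
# Back-of-envelope version of the capacity argument (rounded, illustrative numbers).
file_size_bytes = 12.8e9             # ~12.8 GB checkpoint
total_bits = file_size_bytes * 8     # ~102 Gbit
weights = total_bits / 4.25          # ~24 billion "micro weights" at 4.25 bits each

snippet_bytes = 1.1e3                # ~1.1 kB compressed space invaders snippet
snippets_per_percent = 0.01 * file_size_bytes / snippet_bytes

print(f"{total_bits / 1e9:.0f} Gbit, ~{weights / 1e9:.0f}B weights")
print(f"~{snippets_per_percent:,.0f} such snippets would fit in just 1% of the file")
```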
Add to this that the Common Crawl slices used for the Pile/C4 mirror much of what you can find on GitHub. So when the training data contains dozens of near-duplicate solutions, the network only needs to interpolate between them.
As to the CoT-style dumps you've shown, they are easy to misinterpret. Apple's "Illusion of Thinking" paper shows that models will happily backfill plausible-sounding rationales that do not correspond to the computation that actually produced the answer, and other evaluation work shows that when you systematically rewrite multiple-choice distractors so that memorisation can't help, accuracy drops by 50-90%, even on "reasoning" models: https://arxiv.org/abs/2502.12896
So a cool-looking bullet list about "edge detection" could just be narrative overspray, not really evidence of algorithmic planning.
If you actually want to know whether a model can plan an arcade game (or whatever) rather than recall it, then you need a real benchmark: metamorphic rewrites, adversarial "none of the others" options, etc. Until a benchmark controls for leakage in these ways, a perfect space invaders score mostly shows that the model has good pattern matching for code it has already seen.
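To make "metamorphic rewrites" concrete, here is a rough sketch of the idea (my own illustration, not an established benchmark; the names, canvas sizes and the ask_model/check_output hooks are placeholders): generate surface-level variants of the same task and see whether the pass rate holds up.

```
# Sketch of metamorphic prompt variants for leakage-controlled testing.
import random

BASE_PROMPT = (
    "Write an HTML+JavaScript {game} game on a {w}x{h} canvas. "
    "Call the player sprite `{player}` and the enemies `{enemy}`."
)

def metamorphic_variants(n=5, seed=0):
    """Yield n surface-level rewrites of the same underlying task."""
    rng = random.Random(seed)
    games = ["space shooter", "alien invasion", "descending-grid shooter"]
    players = ["ship", "cannon", "defender"]
    enemies = ["invader", "hostile", "blob"]
    for _ in range(n):
        yield BASE_PROMPT.format(
            game=rng.choice(games),
            w=rng.choice([640, 800, 1024]),
            h=rng.choice([480, 600, 768]),
            player=rng.choice(players),
            enemy=rng.choice(enemies),
        )

def pass_rate(ask_model, check_output, n=5):
    """ask_model(prompt) -> code string; check_output(code) -> bool,
    e.g. run the generated game headlessly and assert basic behaviour."""
    results = [check_output(ask_model(p)) for p in metamorphic_variants(n)]
    return sum(results) / len(results)
```

A model that genuinely plans the game should score about the same across these variants; one that pattern-matches a canonical prompt tends to drop sharply.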
If the models are memorizing and regurgitating from their training data, how come every model I've tried this with produces entirely different code?
Presumably this is because "the network only needs to interpolate between them". That's what I want it to do!
I tried the space invaders thing on a 4GB Qwen model today and it managed to produce a grid of aliens that advanced one step... and then dropped off the page entirely.
A transformer does not need to emit a byte-for-byte clone of a training example to benefit from having seen it. It can store a distributed representation of many near-duplicate implementations and then sample a novel linear combination. That still short-circuits algorithm design: the burden of discovering the game loop, collision logic, sprite sheet etc. was ALREADY SOLVED during pre-training.
When you temperature-sample the same model twice you also get "different" code; diversity alone is not evidence of new reasoning. What matters is functional novelty under controlled transformations (renamed variables, resized canvas, obfuscated asset file names etc.). On such metamorphic rewrites, models that appear brilliant on canonical prompts suddenly collapse, a hallmark of shallow pattern matching.
The paper I mentioned in my previous comment shows SOTA coding LLMs scoring 70%+ on SWE-bench Verified yet dropping 10-47% when the very same issues are paraphrased or drawn from unseen repos, even though the task semantics are identical. That is classic memorisation, just fuzzier than a CRC match.
As to Qwen: even at 4 bits per weight, a 4B model retains ≈ 2 GB of entropy, enough to memorise tens of thousands of full game loops. The reason it garbled the alien movement logic is probably that its limited capacity forced lossy compression, so the behaviour you saw is typical of partially recalled code patterns whose edge cases were truncated during training. That's still interpolation over memorised fragments, just with fewer fragments to blend.
And this has actually been shown (https://arxiv.org/abs/2406.15720v1): controlled fact-memorisation studies and extraction attacks up through 70B params show a monotone curve, so each extra order of magnitude adds noticeably more verbatim or near-verbatim recall. A 20B model succeeds where a 4B one fails because the former crossed the "capacity per training token" threshold for that exemplar. Nothing magical there.
Don't get me wrong, I'm not arguing against interpolation per se; generalising between held-out exemplars is precisely what we want. The problem is that most public "just write space invaders" demos never verify that the endpoints were truly unseen. Until they do, a perfect clone is compatible with nothing deeper than glorified fuzzy lookup.
This is a great explanation, thanks for putting it together.
It more or less fits my fuzzy mental model of how this stuff works.
I'm completely fine with my test prompt taking advantage of this - the point of "implement space invaders" is to explore how well it can construct a game of that shape based on the examples that it has seen in its training data, especially in comparison to other models.
I'm not trying for a test of ability to produce a unique new game - I want a short prompt that gets it to output some HTML and JavaScript that I can then interact with.
There is no way that gpt-oss-120b can beat the much larger Kimi-K2-Instruct, Qwen3 Coder/Instruct/Thinking, or GLM-4.5. How did you arrive at this rather ridiculous conclusion? The current sentiment in r/LocalLLaMA is that gpt-oss-120b is around Llama-4 Scout level. But it is indeed the best at refusals.
> TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs.
That's just straight-up not the case. Not sure how you can jump to that conclusion, not least when you stated in your post that you haven't tested tool calling.
Many people in the community are finding it substantially lobotomized, to the point that there are "safe" memes everywhere now. Maybe you need to develop better tests and pay more attention to benchmaxxing.
There are good things that came out of this release from OpenAI, but we'd appreciate more objective analyses...
> I’m waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it’s likely that OpenAI now offer the best available open weights models.
You told me off for jumping to conclusions and in the same comment quoted me saying "I think OpenAI may have taken" - that's not a conclusion, it's tentative speculation.
I did read that, and it doesn't change what I said about your comment on HN: I was calling out the fact that you are making a very bold statement without having done careful analysis.
You know you have a significant audience, so don't act like you don't know what you're doing when you chose to say "TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs" and then defend against what I was calling out by leaning on word choices like "conclusions" (I'm sure you have read conclusions in academic journals?), "I think", and "speculation".
> I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs.
IMO, the "I think..." bit could be ambiguous and read as, "In my opinion, OpenAI may have...".
I agree with you it's not a hard/bold endorsement but perhaps leading with the disclaimer that you're reserving final judgement could assuage these concerns.
My LLM agent is currently running an experiment generating many pelicans. It will compare various small-model consortiums against the same model running solo.
It should push new pelicans to the repo after the run.
Horizon-beta is up already; it's not small or open source, but I tested it anyway, and you can already see an improvement using 2+1 (2 models + the arbiter) for that model.
What did you set the context window to? That's been my main issue with models on my MacBook: you have to set the context window so short that they are way less useful than the hosted models. Is there something I'm missing there?
Nice test. I tested it with llama.cpp and the 20b model. It got the answer correct on the first try on my side. The internal/thinking dialogue is fun to read - it could be my own inner dialogue ;)
```
> "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???
<|channel|>analysis<|message|>The question: "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???". It's a common riddle. The answer is Bob? Actually "Bob's dad has five daughters: Lala, Lele, Lili, Lolo and..." The riddle is to trick: it's "Bob" himself is a daughter? Wait Bob's dad has five daughters: Lala, Lele, Lili, Lolo, and Bob? But Bob is a name often male. But the riddle says Bob's dad. So Bob is a person who has a dad, and that dad has five daughters.
But Bob could be the fifth daughter? That would be weird: Bob is likely the name of the fifth daughter. So answer: Bob. But sometimes the answer is "Lulu" or "Bob's sister"? Let's think. The riddle is a trick: It's "Bobs dad has five daughters: Lala, Lele, Lili, Lolo, and ... Bob." Because the riddle begins with "Bobs dad has five daughters" implying Bob is one of the daughters. So the missing daughter is Bob. So answer: "Bob".
Thus answer: Bob. But sometimes the riddle goes "Bob's dad has five daughters, Lala, Lele, Lili, Lolo, and ... Who is the fifth?" The answer: Bob (the name of the person speaking, "Bobs"). But in riddle it's "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???". The missing name is "Bob." So answer: Bob.
Thus the answer: Bob. Or maybe it's "Boba"? But no.
Hence answer: Bob.
We'll answer accordingly.<|start|>assistant<|channel|>final<|message|>The missing daughter is *Bob*. The riddle starts with “Bob’s dad …” so Bob himself is the fifth daughter.
```
The original riddle has a girl's name. It's a common riddle, so the LLM recognizes the pattern of the riddle and assumes that name is the answer.
But if it's good at reasoning it should understand that if the name is a boy's name, then he is not the 5th daughter.
Now the last daughter could be called anything, but because there is a vowel pattern it should pick Lulu to solve the puzzle - even though that is the wrong answer in the original puzzle.
But that makes no sense, if you ask me that John's dad has two daughters, Mary and ???, there's no way I will say "John" as an answer. I'll simply assume he also has at least one son.
I’m also very interested to know how well these models handle tool calling as I haven’t been able to make it work after playing with them for a few hours. Looks promising tho.
Update: I've tried to use LM Studio (like the author) and the tool request kept failing due to a mismatch in the prompt template. I guess they'll fix it, but it seems sloppy of LM Studio not to have tested this before release.
I was road-testing tool calling in LM Studio a week ago against a few models marked with tool support; none worked, so I believe it may be a bug. Had much better luck with llama.cpp's llama-server.
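For anyone debugging this, a minimal tool-calling smoke test against a local OpenAI-compatible endpoint (both llama-server and LM Studio expose one) looks roughly like this; the port, model name and get_weather tool are placeholders:

```
# Minimal tool-calling smoke test against a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-oss-20b",  # whatever name the server reports
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# A model that handles tool calling should return a tool_calls entry here
# instead of answering in plain text.
print(resp.choices[0].message.tool_calls)
```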
Interestingly, I am also on an M2 Max, and I get ~66 tok/s in LM Studio with the same 131072 context.
I have full offload to GPU. I also turned on flash attention in advanced settings.
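If you're driving llama.cpp directly instead of LM Studio, the same knobs look roughly like this through llama-cpp-python (a sketch; the model path is a placeholder and flash_attn needs a reasonably recent build):

```
# Rough equivalent of those LM Studio settings via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b-mxfp4.gguf",  # placeholder path
    n_ctx=131072,      # context window
    n_gpu_layers=-1,   # full offload to GPU/Metal
    flash_attn=True,   # flash attention
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in five words."}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```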
I found this surprising because that's such an old test that it must certainly be in the training data. I just tried to reproduce and I've been unable to get it (20B model, lowest "reasoning" budget) to fail that test (with a few different words).