
I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazur/nyt-connections/). It scores 20.0 compared to 10.0 for Haiku 3.5, 19.2 for Sonnet 3.7, 26.6 for Sonnet 4.0, and 46.1 for Sonnet 4.5.


This is such a cool benchmark idea, love it

Do you have any other cool benchmarks you like? Especially any related to tools?


You could try Wordle on it. In my experience, though, all of them are pretty bad: they're not smart enough to pick up the colours when they're represented as letters. The only one that was actually good, surprisingly, was Qwen.
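
For anyone wondering what "colours represented as letters" might look like in practice, here is a rough sketch. It assumes G for green, Y for yellow, and B for grey; that letter scheme and the function name `wordle_feedback` are made up for illustration and are not taken from any particular benchmark or harness.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Encode Wordle feedback as letters (assumed scheme: G/Y/B)."""
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    result = ["B"] * len(guess)

    # Answer letters that were not matched exactly are still
    # available to be reported as yellow.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact matches are green.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"

    # Second pass: misplaced letters are yellow, but only as many
    # times as the letter is still available in the answer.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)


print(wordle_feedback("speed", "abide"))
```

A model playing the game would then see a string like this after each guess and have to map it back to letter positions, which seems to be the part they struggle with.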



