I've benchmarked it on the Extended NYT Connections (https://github.com/lechmazu...

whatreason · 2025-10-16T04:55:26 1760590526

This is such a cool benchmark idea, love it.

Do you have any other cool benchmarks you like, especially any related to tools?

shangofox · 2025-10-16T09:30:17 1760607017

You could try Wordle on it. In my own experience, though, nearly all models are pretty bad at it: they're not smart enough to pick up on the colours when they're represented as letters. The only one that was actually good was Qwen, surprisingly.
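For anyone wanting to try this, here is a minimal sketch of what "colours represented as letters" could look like. The G/Y/B encoding (green, yellow, black) and the function name `wordle_feedback` are assumptions for illustration, not a fixed format from any particular benchmark.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Encode Wordle colours as letters (assumed scheme):
    G = right letter, right place; Y = right letter, wrong place; B = not in word.
    """
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    result = ["B"] * len(guess)

    # Count answer letters that were NOT matched exactly; only these
    # can still earn a yellow, which handles repeated letters correctly.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: mark exact matches as green.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"

    # Second pass: mark misplaced letters as yellow while copies remain.
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)


if __name__ == "__main__":
    print(wordle_feedback("speed", "abide"))
```

The model would then get a string like this after each guess and have to map the letters back to positions, which is the step that seems to trip most of them up.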