What? You and I can't see his "undisclosed" tests... but you'd better believe that whatever model he's testing is specifically watching for those tests coming in over the API, or, you know, scanning absolutely everything for the cops.
Yes — the "Pelican on a Bicycle" test is a quirky benchmark created by Simon Willison to evaluate how well different AI models can generate SVG images from prompts.
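Mechanically the benchmark is about as simple as it sounds: send the prompt to a model, pull the SVG out of the reply, and save it. Here is a minimal sketch assuming an OpenAI-compatible API; Willison's own runs reportedly go through his `llm` CLI across many providers, so the client, model name, and extraction logic here are purely illustrative.

```python
# Minimal sketch of a "pelican on a bicycle"-style run.
# Assumes an OpenAI-compatible API and the openai>=1.0 Python client;
# this is an illustration, not Willison's actual harness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The commonly quoted prompt for this benchmark.
PROMPT = "Generate an SVG of a pelican riding a bicycle"

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to compare
    messages=[{"role": "user", "content": PROMPT}],
)
text = response.choices[0].message.content

# Models often wrap the SVG in markdown fences or prose, so extract
# just the <svg>...</svg> element before writing it to disk.
start = text.find("<svg")
end = text.find("</svg>")
if start == -1 or end == -1:
    raise ValueError("model reply did not contain an <svg> element")

with open("pelican.svg", "w") as f:
    f.write(text[start:end + len("</svg>")])
```

Judging the output is then just opening `pelican.svg` in a browser and eyeballing it, which is exactly the informal, vibes-based scoring the rest of this thread is arguing about.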
All of Hacker News (and Simon's blog) is undoubtedly in the training data for LLMs. If they specifically tried to cheat at this benchmark, it would be obvious and they would be called out.
> If they specifically tried to cheat at this benchmark, it would be obvious and they would be called out.
I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”
Most people trust a benchmark if it seems like a reasonable test of something relevant to them. They may not actually want a pelican on a bicycle, but they do want an LLM that could produce one.