What? You and I can't see his "undisclosed" tests... but you'd better believe that whatever model he's testing is specifically watching for those tests coming in over the API, or, you know, scanning absolutely everything for the cops.
Yes — the "Pelican on a Bicycle" test is a quirky benchmark created by Simon Willison to evaluate how well different AI models can generate SVG images from prompts.
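Mechanically the benchmark is about as simple as it sounds: send the prompt to a model, pull the SVG out of the reply, and save it. Here is a minimal sketch assuming an OpenAI-compatible API; Willison's own runs reportedly go through his `llm` CLI across many providers, so the client, model name, and extraction logic here are purely illustrative.

```python
# Minimal sketch of a "pelican on a bicycle"-style run.
# Assumes an OpenAI-compatible API and the openai>=1.0 Python client;
# this is an illustration, not Willison's actual harness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The commonly quoted prompt for this benchmark.
PROMPT = "Generate an SVG of a pelican riding a bicycle"

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to compare
    messages=[{"role": "user", "content": PROMPT}],
)
text = response.choices[0].message.content

# Models often wrap the SVG in markdown fences or prose, so extract
# just the <svg>...</svg> element before writing it to disk.
start = text.find("<svg")
end = text.find("</svg>")
if start == -1 or end == -1:
    raise ValueError("model reply did not contain an <svg> element")

with open("pelican.svg", "w") as f:
    f.write(text[start:end + len("</svg>")])
```

Judging the output is then just opening `pelican.svg` in a browser and eyeballing it, which is exactly the informal, vibes-based scoring the rest of this thread is arguing about.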
All of Hacker News (and Simon's blog) is undoubtedly in the training data for LLMs. If they specifically tried to cheat at this benchmark, it would be obvious and they would be called out.
> If they specifically tried to cheat at this benchmark, it would be obvious and they would be called out.
I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”
Most people trust a benchmark if it seems like a reasonable test of something relevant to them. They may not actually want a pelican on a bicycle, but they do want an LLM that could produce one.