Hacker News

I recently went to the LLM chat arena and tried my "test input" against the latest frontier models that GPT 3 failed on. This test snippet simply repeats the same four-letter word in a paragraph many times using all of its various possible meanings simultaneously. The request to the AI is to put the meaning of each usage of the word next to it in brackets.

None of the frontier models can do this perfectly. They all screw up to various degrees in various interesting ways. A schoolkid could do this flawlessly.

This is not some contrived test with bizarre picture puzzles as seen in ARC-AGI or testing obscure knowledge about bleeding-edge scientific research. It's simple English comprehension using a word my toddler knows already!

It does reveal the fundamental flaw in all transformer-based models: They're just shifting vectors around with matrices, and are unable to deal with many categories of inputs that cause overlaps or bring too many of the tokens too close to each other in some internal representation. They get muddled up and confused, resulting in errors in the output.
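To see why overlapping meanings are hard for this architecture, consider that a tokenizer assigns one token id (and thus one input embedding) per surface form, no matter which sense is meant. A minimal sketch, using a made-up toy vocabulary rather than any real tokenizer:

```python
# Toy sketch (hypothetical vocab, not a real tokenizer): a word like
# "bank" gets a single token id regardless of its sense, so both usages
# start from the identical input vector. Telling the senses apart is
# left entirely to the contextual mixing in later layers.
vocab = {"the": 0, "bank": 1, "river": 2, "loan": 3, "approved": 4, "my": 5}

def tokenize(text):
    return [vocab[w] for w in text.lower().split()]

financial = tokenize("the bank approved my loan")
geographic = tokenize("the river bank")

# Both senses of "bank" map to the same id, hence the same embedding.
print(financial[1] == geographic[2])  # → True
```

When a paragraph stacks many senses of the same word, every occurrence enters the model as the same vector, and the per-usage disambiguation has to be reconstructed from context alone.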

I see similar effects when using LLMs for programming: They get confused when there are many usages of the same identifier or keyword, but with some subtle difference such as being inside a comment, string, or in a local context where the meaning is different.
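As a concrete (contrived, illustrative) example of this failure mode: in the snippet below the name `value` plays four different roles, and an instruction like "rename the variable `value`" requires keeping the string literal and the comment untouched.

```python
# The name "value" appears as a parameter, inside a comment, inside a
# string literal, and as a module-level variable. An LLM editing one of
# these must not disturb the others.

def scale(value):          # parameter named "value"
    # "value" here refers to the parameter above, not the global below
    label = "value"        # a string literal that merely spells the word
    return value * 2, label

value = 10                 # a module-level variable shadowed inside scale()
print(scale(value))        # → (20, 'value')
```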

I suspect this will be eventually fixed, but I haven't seen any fundamental improvement in three years.



So the focus is on finding things LLMs are bad at, rather than finding what they're good at and building applications around that?

That's basically like trying to embarrass an IQ-180 student on emotional intelligence.

But I guess it's human nature to expect a machine to be 100x better than humanity on the first try.


On the contrary, this is testing the LLMs on inputs they're supposed to be good at.

Fundamentally, this kind of problem is the same as language translation, text comprehension, or coding tasks. It just probes the boundaries of LLM capability by pushing the model to its limits.

I've noticed the LLMs bumping up against those very same limits in ordinary coding tasks. For example, if you have a prefix-suffix type naming convention for identifiers, depending on how the tokenizer splits these, the LLMs can either do very well or get muddled up. Similarly, they're not great at spotting small typos with very long identifiers because in their internal vector representations the correct and typo versions are very "close".
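The "closeness" of a typo can be sketched with a rough stand-in for the model's internal representation. Here a simple character-bigram vectorization (an assumption for illustration, not how any particular model actually encodes identifiers) shows that a long identifier and its one-character typo are nearly identical:

```python
from collections import Counter
from math import sqrt

# Hypothetical illustration: under a character-bigram bag-of-features
# vectorization, a one-character typo in a long identifier barely moves
# the vector, so the correct and misspelled names are nearly identical.

def bigram_vector(s):
    return Counter(s[i:i+2] for i in range(len(s) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)   # Counter returns 0 for missing keys
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

correct = "customer_account_balance_updater"
typo    = "customer_acount_balance_updater"   # one missing "c"

print(round(cosine(bigram_vector(correct), bigram_vector(typo)), 3))  # → 0.985
```

The longer the identifier, the smaller the relative contribution of the typo, which matches the observation that small typos in very long names are the ones that slip through.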


> This test snippet simply repeats the same four-letter word in a paragraph many times using all of its various possible meanings simultaneously

This sounds like fun. How does it do with an arbitrary quantity of "buffalo"s?


That's a known thing that would be in its training set.

I just made up my own thing that no AI model would have seen anywhere before.

It's pretty easy to create your own: just pick a word that is highly overloaded. It helps if it's also used in proper names, business names, place names, etc.


List of double-meaning words as food for thought: https://word-lists.com/word-lists/100-english-words-with-mul...



