Hacker News

I recently went to the LLM chat arena and tried my "test input" against the latest frontier models that GPT 3 failed on. This test snippet simply repeats the same four-letter word in a paragraph many times using all of its various possible meanings simultaneously. The request to the AI is to put the meaning of each usage of the word next to it in brackets.

None of the frontier models can do this perfectly. They all screw up to various degrees in various interesting ways. A schoolkid could do this flawlessly.

This is not some contrived test with bizarre picture puzzles as seen in ARC-AGI or testing obscure knowledge about bleeding-edge scientific research. It's simple English comprehension using a word my toddler knows already!

It does reveal the fundamental flaw in all transformer-based models: They're just shifting vectors around with matrices, and are unable to deal with many categories of inputs that cause overlaps or bring too many of the tokens too close to each other in some internal representation. They get muddled up and confused, resulting in errors in the output.
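To see why overlapping meanings are hard for this architecture, consider that a tokenizer assigns one token id (and thus one input embedding) per surface form, no matter which sense is meant. A minimal sketch, using a made-up toy vocabulary rather than any real tokenizer:

```python
# Toy sketch (hypothetical vocab, not a real tokenizer): a word like
# "bank" gets a single token id regardless of its sense, so both usages
# start from the identical input vector. Telling the senses apart is
# left entirely to the contextual mixing in later layers.
vocab = {"the": 0, "bank": 1, "river": 2, "loan": 3, "approved": 4, "my": 5}

def tokenize(text):
    return [vocab[w] for w in text.lower().split()]

financial = tokenize("the bank approved my loan")
geographic = tokenize("the river bank")

# Both senses of "bank" map to the same id, hence the same embedding.
print(financial[1] == geographic[2])  # → True
```

When a paragraph stacks many senses of the same word, every occurrence enters the model as the same vector, and the per-usage disambiguation has to be reconstructed from context alone.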

I see similar effects when using LLMs for programming: They get confused when there are many usages of the same identifier or keyword, but with some subtle difference such as being inside a comment, string, or in a local context where the meaning is different.
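As a concrete (contrived, illustrative) example of this failure mode: in the snippet below the name `value` plays four different roles, and an instruction like "rename the variable `value`" requires keeping the string literal and the comment untouched.

```python
# The name "value" appears as a parameter, inside a comment, inside a
# string literal, and as a module-level variable. An LLM editing one of
# these must not disturb the others.

def scale(value):          # parameter named "value"
    # "value" here refers to the parameter above, not the global below
    label = "value"        # a string literal that merely spells the word
    return value * 2, label

value = 10                 # a module-level variable shadowed inside scale()
print(scale(value))        # → (20, 'value')
```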

I suspect this will be eventually fixed, but I haven't seen any fundamental improvement in three years.



So the focus is on finding things LLMs are bad at, rather than finding what they're good at and building applications around that?

That's basically like trying to embarrass an IQ-180 student on emotional intelligence.

But I guess it's human nature to expect a machine to be 100x better than humanity on the first try.


On the contrary, this is testing the LLMs on inputs they're supposed to be good at.

Fundamentally, this kind of problem is the same as language translation, text comprehension, or coding tasks. It just probes the boundaries of LLM capability by pushing the model to its limits.

I've noticed the LLMs bumping up against those very same limits in ordinary coding tasks. For example, if you have a prefix-suffix type naming convention for identifiers, depending on how the tokenizer splits these, the LLMs can either do very well or get muddled up. Similarly, they're not great at spotting small typos with very long identifiers because in their internal vector representations the correct and typo versions are very "close".
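The "closeness" of a typo can be sketched with a rough stand-in for the model's internal representation. Here a simple character-bigram vectorization (an assumption for illustration, not how any particular model actually encodes identifiers) shows that a long identifier and its one-character typo are nearly identical:

```python
from collections import Counter
from math import sqrt

# Hypothetical illustration: under a character-bigram bag-of-features
# vectorization, a one-character typo in a long identifier barely moves
# the vector, so the correct and misspelled names are nearly identical.

def bigram_vector(s):
    return Counter(s[i:i+2] for i in range(len(s) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)   # Counter returns 0 for missing keys
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

correct = "customer_account_balance_updater"
typo    = "customer_acount_balance_updater"   # one missing "c"

print(round(cosine(bigram_vector(correct), bigram_vector(typo)), 3))  # → 0.985
```

The longer the identifier, the smaller the relative contribution of the typo, which matches the observation that small typos in very long names are the ones that slip through.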


> This test snippet simply repeats the same four-letter word in a paragraph many times using all of its various possible meanings simultaneously

This sounds like fun. How does it do with an arbitrary quantity of "buffalo"s?


That's a known thing that would be in its training set.

I just made up my own thing that no AI model would have seen anywhere before.

It's pretty easy to create your own: just pick a word that is highly overloaded. It helps if it's also used in proper names, business names, place names, etc.


List of double-meaning words as food for thought: https://word-lists.com/word-lists/100-english-words-with-mul...



