Incomplete training data is kind of a pointless thing to measure.
Isn’t incomplete data the whole point of learning in general? The reason we have machine learning is that data is incomplete. If we had complete data, we wouldn’t need ML; we would just build a function that maps input to output based on the complete data. Machine learning is about filling in the gaps based off of a prediction.
In fact this is what learning in general is doing. It means this whole thing about incomplete data applies to human intelligence and learning as well.
Everything this theory is going after basically applies to learning and intelligence in general.
So sure you can say that LLMs will always hallucinate. But humans will also always hallucinate.
The real problem that needs to be solved is: how do we get LLMs to hallucinate in the same way humans hallucinate?
> Machine learning is about filling in the gaps based off of a prediction.
I think this is a generous interpretation of network-based ML. ML was designed to solve problems: we had lots of data, and we found that large amounts of data could be used to derive functions (networks), as opposed to the deliberate construction of algorithms in GOFAI.
But "intelligence" with ML as it stands now is not how humans think. Humans do not need millions of examples of cats to know what a cat is. They might need two or three, and they can permanently identify them later. Moreover, they don't need to see all sorts of "representative" cats. A human could see a single instance of a black cat and correctly identify all other types of house cats as cats. (And they do: just observe children.)
Intelligence is the ability to come up with a solution without previous knowledge. The more intelligent an entity is, the less data it needs. As we approach more intelligent systems, they will need less data to be effective, not more.
> Humans do not need millions of examples of cats to know what a cat is.
We have evolved over time to recognize things in our environment. We also don’t need to be told that snakes are dangerous as many humans have an innate understanding of that. Our training data is partially inherited.
DNA is around 725 megabytes. There is no "snakes are dangerous" encoded in there. Our training data is the instinctual behaviors we recognize from our sensory inputs.
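For what it's worth, that ballpark figure is usually derived from a back-of-envelope calculation: roughly 3 billion base pairs, each one of 4 possible letters (2 bits), stored uncompressed. A quick sketch of the arithmetic (the base-pair count is an approximation, not an exact figure):

```python
# Back-of-envelope for the "DNA is ~725 MB" claim: ~3 billion base
# pairs, 2 bits per base pair, uncompressed.
base_pairs = 3_000_000_000
bits = base_pairs * 2
mib = bits / 8 / 2**20  # convert bits -> bytes -> MiB

print(round(mib))  # ~715 MiB, in the same ballpark as the figure above
```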
The idea that we're "pre-trained" the way an LLM is (with hundreds of lifetimes of actual sensory experience) is incorrect.
> "The results show that the brain has special neural circuits to detect snakes, and this suggests that the neural circuits to detect snakes have been genetically encoded," Nishijo said.
> The monkeys tested in the experiment were reared in a walled colony and neither had previously encountered a real snake.
All this is saying is our neural architecture has specific places that respond to danger and the instinct of fear. Anyone who has seen an MRI knows this is the case. It does not mean actual knowledge of snakes is encoded in our DNA.
Our brains are "trained" on the data they receive during early development. To the degree that evolutionary pressures stored "data," it stored data about how to make our brains (compute) or physiology more effective.
Modern ML tries to shortcut this by making the architecture dumb and the data numerous. If the creators of ML were in charge of a forced evolution, they'd be arguing we need to make DNA 100s of gigabytes and that we needed to store all the memories of our ancestors in it.
Yes, we are definitely talking about two different processes. Biology is far more complex, nuanced and inscrutable. We don’t understand what all is in our DNA. We do have strong ideas about it.
When it comes down to it, code is data and DNA is code. There are natural pressures to have less DNA so the hundreds of MB of DNA in humans might be argued to be somewhat minimal. If you have ever dealt with piles of handcrafted code that is meant to be small, you’ll likely have seen some form of spaghetti code… which is what I liken DNA to. Instead of it being written with thought and intention, it’s written with predation, pandemics, famine, war etc.
I agree we tend to simplify our artificial networks, largely because we haven’t figured out how to do better yet. The space is wide open and biology has extreme variety in the examples to choose from. Nature “figured out” how to encode information into the very structure of a neural network. The line defining “code” and “data” is thus heavily blurred and any argument about how humans are far superior because of the “reduced number of training examples” is definitely missing the millennia of evolution that created us in the first place.
If we decided to do evolution and self modifying networks then we will likely look for solutions that converge to the smallest possible network. It will be interesting to watch this play out :)
> The line defining “code” and “data” is thus heavily blurred and any argument about how humans are far superior because of the “reduced number of training examples” is definitely missing the millennia of evolution that created us in the first place.
I disagree. The line is quite clear. Our factual memories do not persist from one generation to another. Yet this is what modern ML does.
The "data" encoded in DNA is not about knowledge or facts; it is knowledge about our architecture. Modern ML is like a factory that outputs widgets based upon knowledge of lots of pre-existing widgets. DNA is like a factory that outputs widgets based upon lots of pre-existing factories.
The factory is the "code" or "verb." The widgets are the "data" or "nouns." Completely separable and objectively so.
If you are talking about facts like "cos(0) = 1" then yes, of course I agree; those kinds of facts do not persist simply by giving birth. However, that's a very narrow view of "data" or "knowledge" when talking about biological systems. Humans use culture and collaboration to pass that kind of knowledge on. Spiders don't have culture and collaboration in the same sense. We are wired/evolved with the ability to form communities, which is a different kind of knowledge altogether.
It seems like you are simultaneously arguing that humans (who have a far more complex network, or set of networks than current LLMs) can recognize a class of something given a single or few specific examples of the thing while also arguing that the structure has nothing to do with the success of that. The structure was created over many generations going all the way back to the first single-celled organisms over 3.7 billion years ago. The more successful networks for what eventually became humans were able to survive based on the traits that we largely have today. Those traits were useful back then for understanding that one cat might act like another cat without needing to know all cats. There are things our brains can just barely do (eg: higher level mathematics) that a slightly different structure might enable... and may have existed in the past only to be wiped out by someone who could think a little faster.
Also, check out epigenetics. DNA is expressed differently based on environmental factors of previous and current generations. The "factories" you speak of aren't so mechanical as you would make them seem.
All of this is to say: human biology is wonderfully complicated. Comparing LLMs to humanity is going to be fraught with issues because they are two very different things. Human intelligence is a combination of our form, our resources, and our passed-on knowledge. So far, LLMs are simply another representation of our passed-on knowledge.
>> Machine learning is about filling in the gaps based off of a prediction.
>I think this is a generous interpretation of network-based ML.
This is false.
What you actually do with machine learning is literally filling in the gaps based on prediction. If you can't see this, you may not intuitively understand what ML is actually doing.
Let's examine ML in its simplest form: linear regression over 2 data points, with a single input X and single output Y:
(0, 0), (3, 3)
With linear regression this produces a model equivalent to: y = x
With y = x you've literally filled an entire domain of infinite possible inputs and outputs, from negative infinity to positive infinity. From two data points I can now output points like (1, 1), (2, 2), (343245, 343245), straight from the model y = x.
The amount of data given by the model is so overwhelmingly huge that it's effectively infinite. Feed random inputs into the model at 5 billion numbers per nanosecond and you will NEVER hit an original data point; you will always be creating novel data from the model.
And there's no law that says the linear regression line even has to TOUCH a data point.
ML is simply a more complex form of what I described above, with thousands of input values, thousands of output values, thousands of data points, and a best-fit curve (as opposed to a straight line) fitted to the data. Even with thousands of data points, the equation for a best-fit curve covers a continuous space and thus holds an almost infinite amount of creative data compared with the number of actual data points.
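The two-point case above can be sketched as plain least squares (purely illustrative, no libraries needed):

```python
# Least-squares line fit through the two points from the example above.
# With (0, 0) and (3, 3) the best-fit line is y = x, and the model then
# "fills in" every point on the real line -- outputs that were never in
# the training data.

def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_line([(0, 0), (3, 3)])

def predict(x):
    return a * x + b

print(a, b)            # 1.0 0.0  ->  the model is y = x
print(predict(343245)) # 343245.0, a point never seen in training
```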
Make no mistake: all of ML is about novel data output. It is literally pure creativity, not memory at all. I'm baffled by all the people thinking that ML models are just memorizing and regurgitating.
The problem this paper is talking about is that the outputs are often illusory. Or to circle back to my comment the "predictions" are not accurate.
Without any ability to reason about the known facts, are we better off with LLMs trying to interpolate at all rather than acting as a huge search space that returns only references?
If an LLM has the exact answer needed it could simply be returned without needing to be rephrased or predicted at all.
If the exact answer is not found, or if the LLM attempts to paraphrase the answer through prediction, isn't it already extrapolating? That doesn't even get to the point where it is attempting to combine multiple pieces of training data or fill in blanks that it hasn't seen.
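The two behaviors can be contrasted with a toy sketch (the data and function names here are made up for illustration, not any real system):

```python
# Retrieval returns only exact, referenced facts; interpolation predicts
# between known points and can therefore be wrong.

known = {0.0: 0.0, 3.0: 3.0}  # the training "facts"

def retrieve(x):
    """Return the stored answer, or None if there is no exact match."""
    return known.get(x)

def interpolate(x):
    """Linear interpolation between the two known points (y = x here)."""
    (x0, y0), (x1, y1) = sorted(known.items())
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

print(retrieve(3.0))    # 3.0  -- an exact, referenced fact
print(retrieve(1.5))    # None -- retrieval refuses to guess
print(interpolate(1.5)) # 1.5  -- a prediction, plausible but unverified
```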