Hacker News

> it obviously doesn't

Why?



Simply because I think it's statistically unlikely that, just because my first word started with "A", the next word should start with "B", then "C", and so on.


If the first few words are "Please make each successive line start with the next letter of the alphabet", that does make it "statistically" unlikely (i.e., it reduces the probability) that the first line will start with anything other than A. Then the complete text composed of the initial instructions + the line starting with A makes it unlikely that the next output line will start with anything other than B.
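As a toy sketch (the probabilities are hypothetical, invented purely for illustration, not from any real model), conditioning on the instruction shifts the distribution over the next line's first letter:

```python
# Toy illustration with made-up numbers: conditioning on the instruction
# collapses probability mass onto "A" for the first line's starting letter.
def first_letter_probs(prompt):
    """Hypothetical distribution over the first letter of the next line."""
    if "next letter of the alphabet" in prompt:
        return {"A": 0.97, "other": 0.03}  # instruction present: "A" dominates
    return {"A": 0.06, "other": 0.94}      # no instruction: near base rate

with_rule = first_letter_probs(
    "Please make each successive line start with the next letter of the alphabet")
without_rule = first_letter_probs("Tell me about cats")
print(with_rule["A"], without_rule["A"])
```

The same logic applies one step later: once the context contains the instruction plus an A line, mass collapses onto "B".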

The input-so-far influences the probability of the next word in complex ways. Given the number of parameters in the model, this dependency can be highly nontrivial, on par with the complexity of a computer program. Just as a computer program can trivially generate an A line and then switch its internal state so that the next generated line is a B line, so can the transformer, since it is essentially evaluating an extremely complex function of the entire input so far.
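The "program with internal state" half of the analogy really is trivial; a minimal sketch:

```python
import string

def alphabet_lines(n):
    """Emit n lines, each starting with the next letter of the alphabet."""
    state = 0  # internal state: index of the letter the next line must start with
    lines = []
    while state < n:
        lines.append(f"{string.ascii_uppercase[state]} is line {state + 1}")
        state += 1  # switch state so the next line starts one letter later
    return lines

print(alphabet_lines(3))
# ['A is line 1', 'B is line 2', 'C is line 3']
```

A transformer has no mutable `state` variable; instead it re-reads the whole input-so-far each step, which lets it recover the equivalent state (here, "how many lines have been emitted") implicitly.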


My understanding is, if you have 175 billion parameters of 16-bit values that all interact (e.g., multiply) together, each parameter can take 2^16 = 65,536 values, so the space of possible weight configurations is 65,536^175,000,000,000; really rather a large number of encodable potentials.

The length and number of probability chains that can be discovered in such a space are therefore sufficient for the level of complexity being analysed and effectively "encoded" from the source text data. Which is why it works.
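That configuration count is too large to write out, but its size can be checked directly (the 175-billion / 16-bit figures are taken from the comment above; the model and its exact parameter count are otherwise an assumption):

```python
import math

params = 175_000_000_000  # 175 billion parameters (figure from the comment)
bits_per_param = 16       # each parameter stored as a 16-bit value

# Total configurations = (2**16) ** params = 2 ** (16 * params).
# Far too big to materialize, so work with its base-10 logarithm instead.
total_bits = bits_per_param * params
digits = total_bits * math.log10(2)  # number of decimal digits, roughly 8.4e11
print(f"about 10^{digits:.3e} possible weight configurations")
```

So the space has on the order of 10^(8.4×10^11) points; roughly 840 billion decimal digits just to write the count down.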

Obviously, as the weights become fixed on particular values by the end of training, not all of those possibilities are required. But they are all in some sense "available" during training, and in that sense required and utilised.

Think of it as expanding the corpus, as water molecules, into a large cloud of possible complexity; analysing it to find the channels of condensation that will form drops; then compressing it by encoding only the final droplet locations.


It's statistically unlikely if this rule wasn't specified beforehand. It's statistically likely if it was.



