If the first few words are "Please make each successive line start with the next letter of the alphabet", that makes it statistically unlikely (i.e., it sharply reduces the probability) that the first line will start with anything other than A. Then the complete text so far, the initial instruction plus a line starting with A, makes it unlikely that the next output line will start with anything other than B.
The input-so-far influences the probability of the next word in complex ways. Given the number of parameters in the model, this dependency can be highly nontrivial, on par with the complexity of a computer program. Just as a program can trivially emit an A line and then update its internal state so that the next line it emits is a B line, so can the transformer, because it is essentially emulating an extremely complex function of the entire prefix.
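Here is a minimal toy sketch of that idea, written in Python purely for illustration. It is not a real transformer: the hand-written function next_letter_distribution (a hypothetical name, standing in for the vastly more complex function the model computes) returns a probability distribution over the first letter of the next line, conditioned only on the prefix, and the distribution shifts from A to B to C as lines accumulate, with no explicit mutable state anywhere.

```python
# Toy illustration: the "state" that picks the next letter is implicit in the
# prefix, not stored anywhere. This stands in for the far more complex function
# a transformer computes; all names here are hypothetical.
import string

PROMPT = "Please make each successive line start with the next letter of the alphabet"

def next_letter_distribution(prefix: str) -> dict[str, float]:
    """Return P(first letter of the next line | prefix)."""
    lines_so_far = [l for l in prefix.splitlines()[1:] if l.strip()]
    if prefix.startswith(PROMPT):
        expected = string.ascii_uppercase[len(lines_so_far) % 26]
        # Concentrate most of the probability mass on the expected letter.
        dist = {c: 0.1 / 25 for c in string.ascii_uppercase}
        dist[expected] = 0.9
        return dist
    # Without the instruction, the distribution is flat.
    return {c: 1 / 26 for c in string.ascii_uppercase}

def generate(prefix: str, n_lines: int) -> str:
    """Greedy decoding: always pick the most probable first letter."""
    text = prefix
    for _ in range(n_lines):
        dist = next_letter_distribution(text)
        letter = max(dist, key=dist.get)
        text += f"\n{letter} is the letter this line starts with"
    return text

print(generate(PROMPT, 4))
```

Each call sees only the text so far, yet the output lines march through A, B, C, D, which is the sense in which the prefix alone carries the "internal state".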