> A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
vs
> An attention mechanism allows the modelling of dependencies without regard for the distance in either input or output sequences.
See the difference? In Markov chains it is enough to know the previous state, while in transformers you need all previous states. It would be great if we could reduce the dependency on all previous states, like an RNN does; maybe RWKV will do it.
The state for old-school text Markov chains is the last N words or similar: you use the past N words to generate a new word, append it to the last N-1 words, and now you have your next state. That is exactly what these language models do: you feed them a limited number of words as a state, and the next state is the generated word appended to the previous ones, with anything in excess of the model's limit cut off.
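To make that sliding-window idea concrete, here is a minimal sketch (not from the original comment) of an order-N word-level Markov chain text generator; the names `build_chain` and `generate`, the order `N = 2`, and the toy corpus are all illustrative:

```python
import random
from collections import defaultdict

N = 2  # order of the chain: the state is the last N words

def build_chain(words):
    """Map each N-word state to the words observed to follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - N):
        state = tuple(words[i:i + N])
        chain[state].append(words[i + N])
    return chain

def generate(chain, start_state, length=20):
    """Sample a next word, append it, and slide the N-word state window forward."""
    out = list(start_state)
    for _ in range(length):
        candidates = chain.get(tuple(out[-N:]))
        if not candidates:
            break
        out.append(random.choice(candidates))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat".split()
chain = build_chain(corpus)
print(generate(chain, corpus[:N]))
```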
The attention layer just looks at that bounded state. GPT-3, for example, looks at a few thousand tokens; those are its state. Because the window is bounded, it doesn't look at all previous tokens.
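The generation loop for such a model is structurally the same as the Markov chain one above. This is a rough sketch under the assumption of some `sample_next_token` function standing in for a real model call (it is not an actual API), with `CONTEXT_LIMIT` playing the role of GPT-3's few-thousand-token window:

```python
CONTEXT_LIMIT = 2048  # illustrative: "a few thousand tokens"

def generate_bounded(sample_next_token, prompt_tokens, steps):
    """Autoregressive loop where the model only ever sees a bounded state."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        state = tokens[-CONTEXT_LIMIT:]          # the bounded "state"
        tokens.append(sample_next_token(state))  # next token depends only on that state
    return tokens
```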
If you continue reading that Wikipedia article, you'll reach this point:
> A second-order Markov chain can be introduced by considering the current state and also the previous state, as indicated in the second table.
i.e., a higher-order Markov chain can depend on several of the previous states.
So, if a certain transformer model accepts up to 20k tokens as input, it can certainly be seen as a 20,000th-order Markov chain (whether that is a useful way to look at it can be debated, but not the fact that it can be seen as such, since it complies with the definition of a Markov chain).
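The standard trick behind this is that a higher-order chain can be re-encoded as an ordinary first-order chain once the "state" is redefined as the tuple of the last few symbols. A minimal sketch with made-up numbers (a second-order weather chain, purely illustrative) shows the re-encoding; the same move with 20k-token tuples is what lets a context-limited transformer satisfy the Markov property:

```python
# Second-order chain: P(next | previous two symbols)
second_order = {
    ("sunny", "sunny"): {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "rainy"): {"sunny": 0.4, "rainy": 0.6},
    ("rainy", "sunny"): {"sunny": 0.6, "rainy": 0.4},
    ("rainy", "rainy"): {"sunny": 0.3, "rainy": 0.7},
}

# First-order chain over pair-states: each transition depends only on the current state.
first_order = {
    (a, b): {(b, c): p for c, p in nxt.items()}
    for (a, b), nxt in second_order.items()
}

print(first_order[("sunny", "rainy")])
# {('rainy', 'sunny'): 0.4, ('rainy', 'rainy'): 0.6}
```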