> A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
vs
> An attention mechanism allows the modelling of dependencies without regard for the distance in either input or output sequences.
See the difference? In Markov chains it is enough to know the previous state, while in transformers you need all previous states. It would be great if we could reduce the dependency on all previous states, like an RNN does; maybe RWKV will do it.
The state for old-school text Markov chains is the last N words or similar: you use the past N words to generate a new word, append it to the last N-1 words, and now you have your next state. That is exactly what these language models do: you feed them a limited number of words as a state, and the next state is the generated word appended to the previous ones, with anything in excess of the model's limit cut off.
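To make that sliding-window idea concrete, here is a minimal sketch (not from the original comment) of an order-N word-level Markov chain text generator; the names `build_chain` and `generate`, the order `N = 2`, and the toy corpus are all illustrative:

```python
import random
from collections import defaultdict

N = 2  # order of the chain: the state is the last N words

def build_chain(words):
    """Map each N-word state to the words observed to follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - N):
        state = tuple(words[i:i + N])
        chain[state].append(words[i + N])
    return chain

def generate(chain, start_state, length=20):
    """Sample a next word, append it, and slide the N-word state window forward."""
    out = list(start_state)
    for _ in range(length):
        candidates = chain.get(tuple(out[-N:]))
        if not candidates:
            break
        out.append(random.choice(candidates))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat".split()
chain = build_chain(corpus)
print(generate(chain, corpus[:N]))
```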
The attention layer just looks at that bounded state. GPT-3, for example, looks at a few thousand tokens; those are its state. Because the window is bounded, it doesn't look at all previous tokens.
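The generation loop for such a model is structurally the same as the Markov chain one above. This is a rough sketch under the assumption of some `sample_next_token` function standing in for a real model call (it is not an actual API), with `CONTEXT_LIMIT` playing the role of GPT-3's few-thousand-token window:

```python
CONTEXT_LIMIT = 2048  # illustrative: "a few thousand tokens"

def generate_bounded(sample_next_token, prompt_tokens, steps):
    """Autoregressive loop where the model only ever sees a bounded state."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        state = tokens[-CONTEXT_LIMIT:]          # the bounded "state"
        tokens.append(sample_next_token(state))  # next token depends only on that state
    return tokens
```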
If you continue reading that Wikipedia article, you'll reach this point:
> A second-order Markov chain can be introduced by considering the current state and also the previous state, as indicated in the second table.
i.e., a higher-order Markov chain can depend on several of the previous states.
So, if a certain transformer model accepts up to 20k tokens as input, it can certainly be seen as a 20,000th-order Markov chain (whether that is a useful way to look at it can be debated, but not the fact that it can be seen as such, since it complies with the definition of a Markov chain).
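The standard trick behind this is that a higher-order chain can be re-encoded as an ordinary first-order chain once the "state" is redefined as the tuple of the last few symbols. A minimal sketch with made-up numbers (a second-order weather chain, purely illustrative) shows the re-encoding; the same move with 20k-token tuples is what lets a context-limited transformer satisfy the Markov property:

```python
# Second-order chain: P(next | previous two symbols)
second_order = {
    ("sunny", "sunny"): {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "rainy"): {"sunny": 0.4, "rainy": 0.6},
    ("rainy", "sunny"): {"sunny": 0.6, "rainy": 0.4},
    ("rainy", "rainy"): {"sunny": 0.3, "rainy": 0.7},
}

# First-order chain over pair-states: each transition depends only on the current state.
first_order = {
    (a, b): {(b, c): p for c, p in nxt.items()}
    for (a, b), nxt in second_order.items()
}

print(first_order[("sunny", "rainy")])
# {('rainy', 'sunny'): 0.4, ('rainy', 'rainy'): 0.6}
```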