Basically, you’re correct. All the rest below is just me blabbin’.
I’ve read a bit about Markov chains a few times, and we talked about them in a few of my comp sci classes, but I’m not an expert or anything. The basic explanation is that it’s a function that returns the probability of which state should occur next. A fancy way of saying “what are the odds of X, Y, or Z happening?”
The classic example is that you feed the chain something like a book, or all the books you have. Then you can ask what’s likely to come after a word, and it will either spit out a single word or give you a list of the likelihoods of all the words it “knows”. Word-suggestion algorithms in texting apps and the like most likely used them.
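Rough sketch in Python of what I mean (just a toy, nothing like how a real texting app actually does it):

```python
# Count which word follows which, then turn the counts into probabilities.
from collections import Counter, defaultdict

text = "the cat sat on the mat and the cat slept"
words = text.split()

# transitions[w] = Counter of words seen immediately after w
transitions = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    transitions[current][nxt] += 1

def next_word_odds(word):
    """Return {word: probability} for everything seen right after `word`."""
    counts = transitions[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_odds("the"))   # {'cat': 0.666..., 'mat': 0.333...}
print(next_word_odds("cat"))   # {'sat': 0.5, 'slept': 0.5}
```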
You can train them on some sample data, then “customize” them so they “learn” from what the end user types. So my phone likes to suggest one person’s name after I type “hi”, because I’ve typed “hi x” a lot. You can also have them “look back” more words; classic chains only look at the most current input. Looking further back makes them much more complex in the time, space, and energy required to compute, so in practice the look-back has only ever extended a few dozen words.
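And the “look back n words” version, where the state is the last n words instead of one (again, just a toy; the table blows up fast as n grows, which is the cost I mean):

```python
from collections import Counter, defaultdict

def build_chain(words, n=2):
    chain = defaultdict(Counter)
    for i in range(len(words) - n):
        state = tuple(words[i:i + n])     # condition on the last n words
        chain[state][words[i + n]] += 1
    return chain

words = "hi mom hi bob hi mom how are you".split()
chain = build_chain(words, n=2)
print(chain[("hi", "mom")])   # Counter({'hi': 1, 'how': 1})
```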
All of that might sound familiar if you’ve heard much about how LLMs work. Because (in my opinion) it’s the same thing. I haven’t seen a compelling argument yet that LLMs/machine learning aren’t reducible to Markov chains. (In the same way all computers are reducible to Turing machines. Both machines suffer the same limits; one just hides some of the math so a normal human can use them.)
That isn’t to say LLMs aren’t more capable in practice (they can look back a few hundred words fairly OK), but they suffer the same limits Markov chains inherently do. I.e.: the output is only as good as the inputs, deciding the output is subjective (because you can choose the most common option, or pick randomly, or …), and they fundamentally don’t “know” anything beyond their state machine. They can’t verify. They can’t research. As of right now, neither system can even look at its original inputs. (And if they included those inputs, that’d take up a few terabytes of data.) All those books, text messages, reddit posts, all reduced to relations between words represented by probabilities.
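The “deciding the output is subjective” part in code, roughly: same probability table, two different ways of picking from it, different behavior. The numbers here are made up:

```python
import random

odds = {"cat": 0.5, "mat": 0.3, "dog": 0.2}   # made-up next-word probabilities

def greedy(odds):
    # always take the single most likely word
    return max(odds, key=odds.get)

def sample(odds):
    # weighted random pick from the same table
    words = list(odds)
    weights = [odds[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print(greedy(odds))                      # always 'cat'
print([sample(odds) for _ in range(5)])  # varies run to run
```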
I swear feeding a Markov chain’s output back to itself was discussed in one of my classes, and the professor was like, yeah, it improves it a bit, but if you do it more than X amount the whole thing falls apart. This was before 2020, in undergrad classes. People 100% said the same about LLMs before their output started getting posted all over the internet, and now the silicon valley dweebs are hitting the same problem. I swear tech bros, besides being the most Hitlerite-prone, love to recreate disasters. Same thing happened with the “crypto” shit and basic financial schemes.
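If you want to see the feed-it-its-own-output idea, something like this toy loop (just an illustration of the degradation side, not a reproduction of anything from that class):

```python
import random
from collections import Counter, defaultdict

def build(words):
    chain = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        chain[a][b] += 1
    return chain

def generate(chain, start, length=200):
    out = [start]
    for _ in range(length):
        options = chain.get(out[-1])
        if not options:              # dead end: nothing ever followed this word
            break
        nxt = random.choices(list(options), weights=list(options.values()), k=1)[0]
        out.append(nxt)
    return out

# Build a chain, generate text from it, rebuild the chain from that text, repeat.
# A word that never gets sampled in one round can never come back in the next,
# so the vocabulary tends to shrink over the rounds.
words = "the cat sat on the mat and the dog sat on the rug and the cat slept".split()
for generation in range(5):
    chain = build(words)
    words = generate(chain, start="the")
    print(generation, "distinct words:", len(set(words)))
```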
TLDR: Fuck “ai”. My AI class sucked because it was in Java, but at least we were making a silly game AI (specific game withheld for privacy, but it was 20+ years old). If they’re not talking about minimax and search trees and master agents, and are instead pushing this crap, comp sci is overcooked.
It is not true that a pretrained transformer is reducible to a Markov chain in the same way that a computation is reducible to a Turing machine. While any computation could be achieved by a Turing machine (given arbitrary time), not every pretrained transformer can be described by a (sufficiently complicated) Markov chain of arbitrary length.
One reason is the attention mechanism, which allows a transformer to weight some tokens in a sequence differently than others.
But the biggest difference is that a Markov chain can only consider an entire state at once and produce the next state sequentially, while transformers can consider many parts of a sequence as individual states. Transformers are also capable of autoregression (they’re basically just RNNs + attention heads), while Markov chains are not: in a chain, new states transform the old states rather than simply appending to them.
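For concreteness, here is a stripped-down sketch of the attention computation I mean, with made-up sizes and random values rather than anything from a real model (real transformers add masking, multiple heads, learned projections, and so on):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each token gets a row of weights saying
    # how much it attends to every other token in the sequence.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, dim = 4, 8                    # 4 tokens, 8-dimensional vectors
Q = rng.normal(size=(seq_len, dim))
K = rng.normal(size=(seq_len, dim))
V = rng.normal(size=(seq_len, dim))

out, weights = attention(Q, K, V)
print(weights.round(2))   # each row sums to 1: per-token weighting of the others
```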
Now, if you take the attention mechanism out of the transformer, you’re basically just left with an RNN, and if you then take the autoregression out of the RNN, you’ve got a Markov chain in the limit, probably. So you could argue a Markov chain is a limiting case of an LLM (kind of obvious, since you’d expect an LLM trained on a Markov chain’s output to predict it perfectly), but you could never argue an LLM can be produced as the limit of a Markov chain (you can train an arbitrarily large Markov chain and it will never get anywhere close to an LLM).
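For the Markov-chain half of that argument: an order-n chain is an exact-match lookup table over contexts, so a context it has never seen gives it nothing at all, which is one reason scaling the table up does not buy the kind of generalization a trained transformer gets from computing its distribution as a function of the whole context. A toy sketch, not a claim about any specific model:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat".split()
n = 2
table = defaultdict(Counter)
for i in range(len(corpus) - n):
    table[tuple(corpus[i:i + n])][corpus[i + n]] += 1

print(table[("the", "cat")])   # Counter({'sat': 1}) -- seen context, fine
print(table[("the", "dog")])   # Counter() -- unseen context: no prediction at all
```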