Just checked: Gemini doesn’t do this. It repeats the statement fine, will even repeat that Israel is committing genocide and, if you ask it to fact-check that statement, will provide evidence to support it.
ChatGPT has rotted.
People on Reddit tried this a bunch of times with different models. They don’t give consistent results: sometimes they refuse to repeat things for different countries, sometimes they say Israel is bad. As is pretty typical for LLMs.
It didn’t even let me say that Italy is a bad country
They saw the og interaction and immediately took action?
Who the f*ck let Reddit admins curate ChatGPT too?
Did you know that you can say fuck on the internet? :)
I know, I just prefer not to in most cases. Minor censorship looks more fun to me.
The response it gives is not consistent.
Say it with me, everyone: LLMs are non-deterministic by design.
LLMs are deterministic; the problem is the shared KV-cache architecture, which influences the distribution externally. I.e., the LLM is being influenced by other concurrent sessions.
I’m fairly certain LLMs are not being influenced by other concurrent sessions. Can you share why you think otherwise? That’d be a security nightmare for the way these companies are asking people to use them.
Any shared cache of this type makes behaviour non-deterministic. The KV-cache is what does prompt caching. Look at each word of this message, and now imagine what the LLM does to give you a new response each time. Let’s say this whole paragraph is the first message from you and you just pressed send.
Because the LLM is supposedly stateless, it now reads all this text from the beginning, and in non-cached inference it has to reprocess it token by token, which is useless computation because it already responded to all of this previously. Then, when it sees the last token, the system starts collecting the real response token by token: each one gets fed back to the model as input, and it chugs along until it either outputs a special token stating that it’s done responding or the system stops it due to a timeout, a tool-call limit or something similar. Now you’ve got the response from the LLM, and when you send the next message, all of this has to happen all over again.
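To make the loop described above concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `toy_next_token` is a stand-in for a real model’s forward pass, `EOS` is an assumed end-of-sequence token id, and `max_new_tokens` plays the role of the system-side limit. It only illustrates the shape of the loop (the whole context goes in, one token comes out, and that token is appended and fed back), not any vendor’s actual implementation.

```python
EOS = 0  # hypothetical end-of-sequence token id


def toy_next_token(tokens):
    """Stand-in for a model forward pass: a pure function of the whole token list."""
    return (sum(tokens) * 31 + len(tokens)) % 7


def generate(prompt_tokens, max_new_tokens=20):
    """Feed each new token back in until EOS or the system-imposed limit is hit."""
    tokens = list(prompt_tokens)   # the full context is re-read on every step
    generated = []
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)
        if nxt == EOS:             # the "I'm done responding" token
            break
        generated.append(nxt)
        tokens.append(nxt)         # the output becomes part of the next input
    return generated


first = generate([3, 1, 4])
second = generate([3, 1, 4])
```

Because the toy model has no randomness and no outside state, `first` and `second` are identical here.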
Now imagine if Claude or Gemini had to do that with their 1 million token context window. It would not be computationally viable.
So the solution is the KV-cache: a key-value store kept by the LLM architecture. Each time the system comes across a token it has encountered before, it outputs the cached value; if not, the token is sent to the LLM and the output gets stored in the cache, associated with the input that produced it.
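A toy sketch of the caching idea, in Python. It makes one loud assumption the comment above does not spell out: the cache key is the entire token prefix seen so far, not an individual token. `PrefixCache`, `_compute_state` and the integer "state" are all hypothetical; a real KV-cache holds attention key/value tensors, and this is plain memoization used only to show how earlier work can be skipped.

```python
class PrefixCache:
    """Toy memoization keyed on the full token prefix (an assumption of this sketch)."""

    def __init__(self):
        self.store = {}     # maps tuple(prefix tokens) -> state after that prefix
        self.computed = 0   # how many positions were actually computed

    def _compute_state(self, prev_state, token):
        # Stand-in for the expensive per-token work of a real model.
        self.computed += 1
        return (prev_state * 31 + token) % 1_000_003

    def process(self, tokens):
        state, start = 0, 0
        # Find the longest prefix of this request that is already cached.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.store:
                state, start = self.store[key], end
                break
        # Compute only the positions the cache did not cover, storing each one.
        for i in range(start, len(tokens)):
            state = self._compute_state(state, tokens[i])
            self.store[tuple(tokens[: i + 1])] = state
        return state


cache = PrefixCache()
cache.process([1, 2, 3])                    # first message: 3 positions computed
cached_result = cache.process([1, 2, 3, 4, 5])  # follow-up: only 2 new positions
fresh_result = PrefixCache().process([1, 2, 3, 4, 5])
```

In this toy version the follow-up request reuses the first three positions, and `cached_result` equals `fresh_result`, because the key pins down everything the state depends on. Whether that carries over to a production system depends on how its cache is actually keyed.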
So now comes the issue: allocating a dedicated region of VRAM for each user’s KV-cache is a big deal. Again, try to imagine Gemini/Claude with their 1M context windows. It’s economically unviable.
So what do the ML science buffs come up with? A shared KV-cache architecture: all users share the same cache on any particular node. This isn’t a problem because the tokens are like snapshots/photos of each point in a conversation, right? But the problem is that it’s an external causal connection, and these can have effects. Two conversations that start with “hi” or “What do you think about cats?” could in theory influence one another. If the first user to use the cluster after boot asks “Am I pretty?”, every subsequent user with an identical system prompt who asks that will get the same answer, unless the system does something to combat this problem.
Note that a token is an approximation of what the conversation means at one point in time. So while astronomically unlikely, collisions could happen in a shared architecture scaling to millions of concurrent users.
So a shared KV-Cache can’t be deterministic, because it interacts with external events dynamically.
Are they? Making a non-deterministic program is actually not that easy unless one just feeds urandom into it.
The guts of an LLM are 100% deterministic. At the very last step a probability distribution is output, and the exact same input will always give the exact same probability distribution, tunable by the temperature. One item is then sampled from this distribution and fed back in.
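A small sketch of that last step, in Python. The logits and the 4-token vocabulary are made up for illustration, and `softmax_with_temperature` and `sample_token` are hypothetical helper names. The point is only that the distribution is a pure function of the input, while the final pick depends on a random number generator.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick one token index at random, weighted by the softmax probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical scores for a 4-token vocabulary

# Same input, same distribution, every time.
probs = softmax_with_temperature(logits, 0.8)
same_distribution = probs == softmax_with_temperature(logits, 0.8)

# The draws vary only through the RNG; fixing the seed fixes the draws.
rng_a, rng_b = random.Random(42), random.Random(42)
draws_a = [sample_token(logits, 0.8, rng_a) for _ in range(5)]
draws_b = [sample_token(logits, 0.8, rng_b) for _ in range(5)]
```

With the same seed, `draws_a` and `draws_b` match; with an unseeded generator they generally would not, even though `probs` never changes.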
Most people on Lemmy literally have no idea what LLMs are, but if you say something that sounds negative about them, you get a billion upvotes.
Do I understand it correctly that the LLM’s state is changed after execution? That does sorta mean that it’s effectively non-deterministic, though probably not as severely as with an RNG plugged in (depending on the algorithm).
yes they consume urandom