Easy, because you can't feed a system shit it generated in a virtual ouroboros-meets-AI-centipede. Data raters are encouraged to do research, and how do you do that? Search engines. And what do those search engines put at the top of their results? AI-generated poopoopeepee. The circle of filth continues. The LLM generates shit, notices patterns, and if all it notices is its own patterns with diminishing human input, don't whine when what comes out is highly refined waste.
Who knew the self-reinforcing Markov chain would do this?
Literally anyone with a brain in compsci 5 years ago? Nah that's "hallucinations".
I don't know what a Markov chain is, nor do I know dick about computer science, so correct me if I'm wrong, but I'm assuming the AI is picking up its own shitty data and regurgitating it, but worse? And this just happens over and over again?
The more I think about this, the more it seems like AI is just going to destroy the internet completely, huh?
A Markov chain predicts the next state based on the current state. If today is sunny, how likely is it that tomorrow will be rainy? Mathematically, this can be reduced to a Markov chain (so we don't have to take into account the season, weather patterns or anything like that for this example).
But a Markov chain isn't just saying how likely it is to be rainy on a given day; it says how likely it is to be rainy tomorrow based on today. If today is sunny, there's a, let's say, 70% chance that tomorrow will be rainy. If today is rainy, there's a 40% chance that tomorrow will be rainy (and conversely a 60% chance that tomorrow will be sunny, because the probabilities of all possible next states must always add up to 100%).
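If it helps, here's a tiny sketch of that weather chain in Python (the transition probabilities are just made up for the example):

```python
import random

# Toy Markov chain for the weather example above.
# Made-up transition probabilities: P(next state | current state).
transitions = {
    "sunny": {"sunny": 0.3, "rainy": 0.7},
    "rainy": {"sunny": 0.6, "rainy": 0.4},
}

def next_state(current):
    states = list(transitions[current].keys())
    weights = list(transitions[current].values())
    return random.choices(states, weights=weights)[0]

# Simulate a week of weather starting from a sunny day.
day = "sunny"
forecast = [day]
for _ in range(6):
    day = next_state(day)
    forecast.append(day)
print(forecast)
```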
Autocorrect works similarly. It predicts the next word based on the current word you've typed out. LLMs are kinda glorified Markov chains because they also predict words (well, tokens, which are about 3 to 4 characters each), but they do it over a much larger "current state": the chat history, custom instructions if you gave any on ChatGPT, etc. The context that is passed along with your prompt consists of many tokens, and the AI generates one token at a time until, little by little, it has formed a full response that it outputs.
So the Markov-chain view of an LLM is: if I give it the sentence "Hello! My name is", for example, it will predict which token is most likely to follow and output it. We can assume this should be a name, but truthfully we don't know the exact probabilities of the next state. If I give it "Hello, my name is" instead, changing just one character might also change the prediction weighting. I say "might" because the AI is a black box and we don't really see what happens when the data passes through the neurons.
However, if you send that sentence to ChatGPT, it will correctly tell you that your message got cut off and ask you to finish it. They do some post-training fine-tuning to get it to do that. Compare to DeepSeek without the reasoning model:
Basically, you're correct. All the rest below is just me blabbin'.
I've read a bit about Markov chains a few times, and we talked about them in a few of my comp sci classes, but I'm not an expert or anything. The basic explanation is that it's a function that returns the probability of which state should occur next. A fancy way of saying "what are the odds of X, Y, or Z happening?".
The classic example is that you feed the chain something like a book, or all the books you have. Then you can ask what's likely to come after a word, and it will either spit out a single word or give you a list with the likelihood of every word it "knows". Word-suggestion algorithms in texting apps and the like most likely used them.
You can train them on some sample data, then "customize" them so they "learn" from what the end user types. So my phone likes to say one person's name after I type "hi" because I've typed "hi x" a lot. You can also have them "look back" more words; classic chains only look at the most current input. Higher look-back makes them much more complex, and the time, space, and energy required to compute them grows, so in general the look-back has only extended to a few dozen words.
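Roughly what a toy word-level chain with a configurable look-back might look like (the corpus here is obviously a stand-in for the book you'd actually feed it):

```python
import random
from collections import defaultdict

# Toy word-level Markov chain with a configurable "look back" (order).
corpus = "the cat sat on the mat and the cat slept on the mat".split()

def train(words, order=2):
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])       # the last `order` words
        model[state].append(words[i + order])   # what followed them
    return model

def generate(model, length=10, order=2):
    state = random.choice(list(model.keys()))
    out = list(state)
    for _ in range(length):
        choices = model.get(tuple(out[-order:]))
        if not choices:          # dead end: no observed continuation
            break
        out.append(random.choice(choices))
    return " ".join(out)

model = train(corpus)
print(generate(model))
```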
All of that might sound familiar if you've heard much about how LLMs work. Because (in my opinion) it's the same. I haven't seen a compelling argument that LLMs/machine learning aren't reducible to Markov chains yet. (In the same way all computers are reducible to Turing machines. Both machines suffer the same limits; one just hides some of the math so a normal human can use them.)
That isn't to say they're more powerful (LLMs can look back a few hundred words fairly OK); they suffer the same limits Markov chains inherently do. I.e.: the output is only as good as the inputs, deciding the output is subjective (because you can choose either the most common option, or pick randomly, or ...), and they fundamentally don't "know" anything beyond their state machine. They can't verify. They can't research. As of right now, both systems can't even look at their original inputs. (And if they included those inputs, that'd take up a few terabytes of data.) All those books, text messages, reddit posts, all reduced to relations of words represented by probabilities.
I swear feeding a Markov chain's output back to itself was discussed in one of my classes, and the professor was like: yeah, it improves it a bit, but if you do it more than X amount, the whole thing falls apart. This was before 2020, in undergrad classes. People 100% said the same about LLMs before they started posting online, and now the Silicon Valley dweebs are hitting the same problem. I swear tech bros, besides being the most hitlerite-prone, love to recreate disasters. Same happened with "crypto" shit and basic financial schemes.
TLDR: Fuck "AI". My AI class sucked because it was in Java, but at least we were making a silly game AI (specific game withheld for privacy, but it was 20+ years old). If they're not talking about min-maxing and search trees and master agents, and instead are pushing this crap, comp sci is overcooked.
It is not true that a pretrained transformer is reducible to a Markov chain in the same way that a computation is reducible to a Turing machine. While any computation could be achieved by a Turing machine (given arbitrary time), not every pretrained transformer could be described by a (sufficiently complicated) Markov chain of arbitrary length.
One reason is the attention mechanism, which allows a transformer to weight some tokens in a sequence differently from others.
But the biggest difference is that a Markov chain can only consider an entire state at once and produce the next state sequentially, while transformers can consider many parts of a sequence as individual states. Transformers are also capable of autoregression (they're basically just RNNs + attention heads), while Markov chains are not - new states transform old states, and do not simply append.
Now, if you take the attention mechanism out of the transformer, you're basically just left with an RNN, and if you then take the autoregression out of the RNN, you've got a Markov chain in the limit, probably. So you could argue a Markov chain is in the limit of an LLM (kind of obvious, since you should expect an LLM trained on a Markov chain to predict it totally), but you could never argue an LLM can be produced in the limit of a Markov chain (you can train any arbitrarily large Markov chain and it will never get anywhere close to an LLM).
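For what it's worth, a bare-bones sketch of the attention idea (single head, no learned projections, random made-up values) looks something like this; a real transformer learns the Q/K/V matrices and stacks many of these:

```python
import numpy as np

# Minimal single-head scaled dot-product attention, to illustrate the
# "weight some tokens differently than others" idea. Values are random
# placeholders; a real transformer learns the Q/K/V projections.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                        # weighted mix of the value vectors

seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))
print(attention(Q, K, V).shape)  # (4, 8): one mixed vector per token
```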
It happened way fucking quicker than I expected
I need a burning "AI" emoji. Quotation marks because it's not AI and AI isn't real.
AI and crypto have turned me into a goddam luddite.
It's less that AI isn't real and more that it's a nebulous marketing term.
A videogame enemy is AI, a spam classifier is AI, a computer vision motion alarm is AI, an extremely convincing text emitter is AI.
I honestly can't understand why people use these things outside of applications for transforming text you created, where you're checking the output. Their whole thing is learning how to output text which is an extremely convincing facsimile of human writing. Like, their whole thing, literally just the thing they do and optimise for, is forging being a thinking human being. That's quite useful for, say, summarising a body of text you wrote, or helping you soften the tone of something, or drafting rhymes or something. If you use them to learn anything, though, or to produce something you don't entirely understand, then inside that output will be things with only a limited relationship to reality, and every single piece of it will look extremely convincing.
Like holy shit, even putting aside all other concerns you are exposing yourself to a specialised misleading machine. Why not just take a hammer to your own head while you're at it?
the luddites were cool
"hallucination" was the dumbest successful marketing term for "being so incredibly incorrect that it's not even recognisable as reality"
According to the article, it's a bigger problem for the "reasoning models" than for the older-style LLMs. Since those explicitly break problems down into multiple smaller steps, I wonder if that's creating higher hallucination rates because each step introduces the potential for errors/fabrications. Even a very small amount of "cognitive drift" might have a very large impact on the final answer if it compounds across multiple steps.
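Rough back-of-the-envelope on the compounding, with made-up per-step accuracies:

```python
# Illustrative only: if each reasoning step is independently correct with
# probability p, the chance an n-step chain is correct end to end shrinks
# geometrically. The numbers below are made up.
for p in (0.99, 0.95, 0.90):
    for n in (1, 5, 10, 20):
        print(f"p={p}, steps={n}: P(all correct) = {p ** n:.2f}")
```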
AI alchemists discovered that the statistics machine will be in a better ballpark if you give it multiple examples and clarifications as part of your asks. This is called chain-of-thought prompting. Example:
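(The original example isn't shown here, but a generic few-shot chain-of-thought prompt, with made-up questions and worked reasoning, might look something like this:)

```python
# Hypothetical few-shot chain-of-thought prompt (contents made up for illustration).
# The worked example shows the model the *style* of reasoning you want it to imitate.
cot_prompt = """
Q: A farmer has 3 crates with 12 apples each and sells 10 apples. How many are left?
A: Let's think step by step.
   3 crates x 12 apples = 36 apples.
   36 - 10 sold = 26 apples.
   The answer is 26.

Q: A library has 5 shelves with 40 books each and lends out 35 books. How many remain?
A: Let's think step by step.
"""
# cot_prompt would then be sent to whatever chat-completion API you're using.
```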
Then the AI alchemists said: hey, we can automate this by having the model eat more of its own shit. So a reasoning model will ask itself "What does the user want when they say <your prompt>?" This generates text that gets added to your query to produce the final answer. All models with "chat memory" effectively eat their own shit; the tech works by reprocessing the whole chat history (sometimes there's a cache) every time you reply. Reasoning models, because of this emulation of chain of thought, eat more of their own shit than non-reasoning models do.
Some reasoning models are worse than others, because some refeed the entire history of the reasoning while others only refeed the current prompt's reasoning.
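Rough sketch of that refeed loop, with a stand-in `model()` instead of a real API:

```python
# Rough sketch of the refeed loop described above (no real API calls;
# `model()` is a stand-in for whatever LLM endpoint you'd hit).
def model(prompt: str) -> tuple[str, str]:
    """Pretend LLM: returns (reasoning, answer) for a given prompt."""
    return "<generated reasoning>", "<generated answer>"

history = []
for user_turn in ["first question", "follow-up question"]:
    history.append(f"User: {user_turn}")
    prompt = "\n".join(history)          # the ENTIRE history goes back in every turn
    reasoning, answer = model(prompt)
    # A reasoning model also appends its own chain of thought to the transcript,
    # so later turns are conditioned on even more model-generated text.
    history.append(f"Reasoning: {reasoning}")
    history.append(f"Assistant: {answer}")
```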
Essentially it's a form of compound error.
Interesting. If that's right, it makes a lot of sense that models with this kind of recursive style would generate errors at a much higher rate. If you're taking everything in the session so far as an input and there's some chance for every input that the model produces an error, the errors will rapidly stack up with this kind of functionality. I've seen those time-lapses of how far generative AI can drift over 100 (or whatever) iterations of "reproduce this photo without making any changes" type prompts, with the output of each generation fed back in as input. This strikes me as the same kind of problem, but with text.
It happens faster with images because of the way LLMs work. LLMs work on "tokens"; a token for text is typically a character, a fragment of a word, a word, or a fragment of a sentence. With language it's much easier to encode meaning and be more precise, because that's what language already does. The reason NLP is/was difficult is that language is not algorithmically consistent: it evolves, and rules are constantly broken. For example, Kai Cenat is credited with more contributions to the English language than the vast majority of people, because children decided to talk like him. Point being, language does the heavy lifting in terms of encoding a string of characters into something meaningful.
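Toy illustration of the text side (a real tokenizer like BPE learns its splits from data; this just fakes the "roughly 4 characters per token" idea):

```python
# Toy illustration only -- real tokenizers (BPE etc.) learn their splits from data.
def toy_tokenize(text: str) -> list[str]:
    tokens = []
    for word in text.split():
        # Pretend tokenizer: short words stay intact, longer words
        # get chopped into ~4-character fragments.
        if len(word) <= 4:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return tokens

print(toy_tokenize("language does the heavy lifting"))
# ['lang', 'uage', 'does', 'the', 'heav', 'y', 'lift', 'ing']
```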
With images, it's a whole different ball game. Image tokenizers often work in several different ways; broadly there are two types of token, hard and soft. Hard tokens, for example, could be regions of the image, the colors, or the alpha channels.
Hard tokens are also the visual encoders of meaning, so a chair, table, or car could be a hard token based on its bounding box. These tokenization techniques are built on a lot of other types of machine learning.
Note that these tokens often overlap in practice and consume regions of other tokens; however, as "hard" tokens they are considered distinct entities, and this is where the trouble starts, especially for image generation (that's roughly why a lot of AI did and still does things like draw extra fingers).
The next type of tokens are soft tokens, and they're a bit harder to explain, but basically the idea is that soft tokens are encoded by detecting continuous statistical distributions within images. It's a very abstract way of reading an image. Here's where the trouble compounds.
So now, when we're writing an image, what do we write the image with? You guessed it: tokens. The reason those AI drift time-lapses exist is that LLMs are statistical, not "functional". They don't have the mathematical concept of "identity". Otherwise they'd try to recreate the same image by copying the data in the exact tokens (or just copying the image itself); instead, they try to regenerate the image by generating new tokens with the same attributes that they read from the image.
To illustrate this, let's say an image contains a blue car and the AI can only tokenize it as "blue car". Asking an LLM to run an identity function on that image will result in a different car, because the resolution of the token is only, like, 2 dimensions, "blue" and "car", which roughly means it will output the average "blue" "car" from its training data. Now, with human-made things it's actually a lot easier: there's a finite variation of cars. However, there's an infinite variation of things that can happen to a car. So an AI theoretically can run an identity function off a particular make/model/year of a vehicle, but if the paint is scratched or the paint job is unique it will start to introduce drift. There are also other sources of drift, like camera angle, etc. With natural objects this becomes a whole different ball game because of the level of variation, and this complexity compounds with scenes.
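Toy sketch of that "blue car" point: if the encoder only keeps a couple of coarse tokens, the decoder can only hand you back some average blue car, not yours:

```python
import random

# Toy sketch only: if the only thing the encoder keeps is ("blue", "car"),
# the decoder can only give you back *a* blue car, not *your* blue car.
# Everything not captured by the tokens is lost.
def encode(image_description: str) -> tuple[str, str]:
    # Pretend vision tokenizer: keeps only colour and object class.
    return ("blue", "car")

def decode(tokens: tuple[str, str]) -> str:
    colour, obj = tokens
    # Pretend generator: samples some plausible instance of a "blue car".
    make = random.choice(["sedan", "hatchback", "coupe"])
    return f"a {colour} {make} {obj}"

original = "my scratched-up 2009 blue hatchback with a roof rack"
print(decode(encode(original)))   # e.g. "a blue coupe car" -- the details are gone
```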
So identity functions on text are extremely easy in comparison for example:
This works because the tokens are simpler and there is less loss of "resolution" from the text to the tokenized form. E.g. the word "Poopy" is the token "poopy". But once you get into interpreting an image and re-encoding those interpretations onto a new canvas, it becomes much more difficult. E.g. an image of "Dwayne the Rock Johnson" is most likely a series of tokens like "buff man", "bad actor", etc.
This is a rough explanation because there's a lot of voodoo, and I'm more of a software engineer than a statistics/data guy, so I approach the alchemy a little bit from an alchemical standpoint.
Well, the models are always refeeding their own output back into the model recurrently; CoT prompting works by explicitly having the model write out the intermediate steps to reduce logical jumps as it writes. The production of the reasoning text is still happening statistically, so it's still prone to hallucination. My money is on the higher hallucination rate being a result of the data being polluted with synthetic information. I think it's model collapse.
Another point of anecdata: I've read vibe coders say that non-reasoning models lead to better results for coding tasks, because they're faster and tend to hallucinate less, since they don't pollute the context with automated CoT. I've seen people recommend the DeepSeek V3 03/2025 release (with deep think turned off) over R1 for that reason.
My money is on the higher hallucination rate being a result of the data being polluted with synthetic information. I think it's model collapse.
But that is effectively what's happening with RLMs and refeed. LLMs have statistical weights between the model and inputs. For example, RAG models will add higher weights to the text retrieved from your source documents. RLM reasoning is a fully automated CoT prompting technique: you don't provide the chain, you don't ask the LLM to create the chain, it just does it all the time for everything. Meaning the input becomes more polluted with generated text, which reinforces the existing biases in the model.
For example, take the em-dash issue: the idea is that LLMs already generate more em dashes than exist in human-written text. Let's say on turn 1 you get an output with em dashes. On turn 2 this is fed back into the machine, which reinforces that over-indexing on em dashes in your prompt. This means turn 2's output is potentially going to have even more em dashes, because the input on turn 2 contained output from turn 1 that had more em dashes than normal. Your input over time ends up accumulating the model's biases through the history. The shorter your inputs on each turn and the longer the conversation, the faster the conversation input converges on being mostly LLM-generated text.
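Back-of-the-envelope with made-up token counts (one long opening prompt, short follow-ups, longer model replies):

```python
# Back-of-the-envelope sketch of the accumulation argument above.
# Made-up numbers: one long initial prompt, then short follow-ups,
# with a longer model reply every turn.
initial_prompt = 500
followup = 20
model_reply = 200   # bump this up further for a reasoning model's CoT

user_total, model_total = initial_prompt, 0
for turn in range(1, 11):
    model_total += model_reply
    if turn > 1:
        user_total += followup
    frac = model_total / (user_total + model_total)
    print(f"turn {turn}: {frac:.0%} of the next input is model-generated")
```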
When you do this with an RLM, you have even more output being added to the input automatically with a CoT prompt. Meaning that any model biases accumulate in the input even faster.
Another reason I suspect CoT refeed rather than training-data pollution is that GPT-4.5, which is the latest (Feb 2025) non-reasoning model, seems to have a lower hallucination rate on SimpleQA than o1. If the training data were the issue, we'd see rates closer to o3/o4.
The other big difference between o1 and o3/o4 that may explain the higher rate of hallucinations is that o1's reasoning is not user-accessible, and it's purposefully trained not to have safeguards on reasoning, whereas o3 and o4 have public reasoning and reasoning safeguards. I think safeguards may be a significant source of hallucination because they change prompt intent, encoding and output. So on a non-o1 model that safeguard process is happening twice per turn, once for reasoning and once for output, then being accumulated into the next turn's input. On an o1 model it's happening once per turn, only for output, and then being accumulated.
But asking the LLM to provide the CoT doesn't pollute the prompt history any more than the policy being tuned via RL or SFT techniques to generate the chain of thought - the chain of thought is still being generated. I have seen it proposed to use automated CoT prompting such that CoT examples are automatically injected into the prompt, but I haven't been able to find information on whether this is actually implemented in any of the widely available reasoning models.
I'm not saying this can't, or doesn't, affect it significantly, as the trajectory of the policy goes wayyyyy outside its normal operating bounds, but I don't think that's what's happening here (and after digging into it I don't think it's model collapse either). If the hallucination rate is increasing across the board regardless of conversation length, then I don't think that's necessarily a result of CoT, but indicative of an issue with the trained policy itself, which might be because of fine-tuning. Especially when you consider that the SimpleQA metric is short-form fact answering.
And on GPT-4.5 performance: it's also larger than even GPT-4, which had 1.76T params (16-expert MoE model), while o1 has something like 300B. o1's accuracy/consistency on SimpleQA also still outperforms GPT-4o, and it has a lower hallucination rate, but 4o is smaller anyway at ~200B. source for param counts
As it turns out, after doing research, o3-mini only has 3B parameters, so I don't even think it's model collapse; it's just a tiny-as-hell model completely dedicated to math and science reasoning, which might be causing a "catastrophic forgetting" effect w.r.t. its ability to perform on other tasks, since it still outperforms GPT-4.5 on math/sci reasoning metrics (but does shit on coding), and based on this metric https://github.com/vectara/hallucination-leaderboard it actually hallucinates the least w.r.t. document summarization. So maybe the performance on SimpleQA should be taken as a reflection of how much "common sense" is baked into the parameters. o3-mini and o4-mini both still outperform GPT-4o-mini on SimpleQA despite being smaller, newer, and CoT models. And despite the higher hallucination rate for o4-mini, it also has a higher accuracy rate than o3-mini on SimpleQA, per the SimpleQA repo. So I don't even think this is telling us anything about CoT or dataset integrity. I think measuring hallucination for CoT vs a standard model will require a specific experiment on a base model, tbh.
I am also skeptical that the public reasoning and reasoning safeguards are a cause of the hallucinations beyond the same issues as fine-tuning by RL. AFAIK neither of those changes the prompt at all; public reasoning just exposes the CoT (which is still part of the output), and the reasoning safeguards are trained into the responses or applied by evaluating the input with a separate model. So I don't think there are additional turns or prompt rewriting being injected at all (but I could be wrong on this, I had some trouble finding information).
Ugh. This is all machine woo. I need a drink
Ah, so ChatGPT works slightly differently from what I'm used to. I have really only explored AI locally via ollama, because I'm an OSS zealot and I need my tooling to be deterministic, so I need to be able to eliminate service changes from the equation.
My experience is that with ollama and DeepSeek R1, it reprocesses the think tags; they get referenced directly.
At this point my debug instincts have been convinced that my idea is unlikely.
Have we tried asking ChatGPT what's wrong before chain of thought prompting the benchmark data for each model to itself?
Ugh. This is all machine woo. I need a drink
I was trying to make a joke about this and was trying to remember the disease that tech-priests get in Rogue Trader, and Google's search AI hallucinated WH40K lore based on someone's homebrew... Not only that, it hallucinated the character's backstory, which isn't even in the post, to give them a genetic developmental disorder... to answer my question... I feel gross. Fucking man-made horrors.
My experience is that with ollama and DeepSeek R1, it reprocesses the think tags; they get referenced directly.
This does happen (and I fucked with weird prompts for DeepSeek a lot, with very weird results) and I think it does cause what you described, but like... the CoT would get reprocessed in models without think tags too, just by normal CoT prompting, and I also would just straight up get other command tokens output even on really short prompts with minimal CoT. So I kind of attributed it to issues with local DeepSeek being as small as it is. I can't find the paper, but naive CoT prompting works best with models that are already of a sufficient size, and the errors do compound on smaller models with less generalization. Maybe something you could try would be parsing the think tags to remove the CoT before re-injection? I was contemplating doing this but I would have to set ollama up again.
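If the reasoning really is wrapped in `<think>...</think>` tags, stripping it before re-injection could be as simple as something like this (untested sketch):

```python
import re

# Minimal sketch: strip DeepSeek-style <think>...</think> blocks from a reply
# before appending it to the chat history, so the CoT isn't re-fed next turn.
# Assumes the reasoning really is delimited by those exact tags.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_cot(reply: str) -> str:
    return THINK_RE.sub("", reply).strip()

reply = "<think>lots of rambling intermediate reasoning...</think>\nFinal answer: 42"
print(strip_cot(reply))   # "Final answer: 42"
```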
It's tough to say. I think an ideal experiment in my mind would be to measure the hallucination rate in a baseline model, a baseline model with CoT prompting, and the same baseline model tuned by RL to do CoT without prompting. I would also want to measure hallucination rate against conversation length separately for all of those models. And I would also want to measure hallucination rate with/without CoT re-injection into chat history for the tuned CoT model. And also measure hallucination rate across task domains with task-specific fine-tuning...
Not only that, it hallucinated the character's backstory, which isn't even in the post, to give them a genetic developmental disorder
Information isn't attached to the user's query; the CoT still happens in the output of the model, like in the first example that you gave. This can be done without any fine-tuning of the policy, but reinforcement learning can also be used to encourage the chat output to break the problem down into "logical" steps. Chat models have always passed the chat history back into the next input while appending the user's turn; that's just how they work (I have no idea if o1 passes the CoT into the chat history though, so I can't comment). But it wouldn't solely account for the massive degradation of performance between o1 and o3/o4.
From my other comment about o1 and o3/o4 potential issues:
The other big difference between o1 and o3/o4 that may explain the higher rate of hallucinations is that o1's reasoning is not user-accessible, and it's purposefully trained not to have safeguards on reasoning, whereas o3 and o4 have public reasoning and reasoning safeguards. I think safeguards may be a significant source of hallucination because they change prompt intent, encoding and output. So on a non-o1 model that safeguard process is happening twice per turn, once for reasoning and once for output, then being accumulated into the input. On an o1 model it's happening once per turn, only for output, and then being accumulated.
That's plausible. I suspect the whole reinforcement learning step the models use only gets you so far. I think the neurosymbolic approach is actually more promising. The idea there is pretty clever: you use a deep neural network to parse noisy data from the outside world and classify it, then you use a symbolic logic engine to operate on the classified data, and now you have actual reasoning happening within the system.
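A toy sketch of that split, with a pretend classifier and a dead-simple rule engine standing in for the real components:

```python
# Toy sketch of the neurosymbolic split described above: a (pretend) neural
# network turns noisy input into symbols, and a trivial rule engine then
# reasons over those symbols deterministically.
def neural_classifier(pixels) -> set[str]:
    # Stand-in for a real vision model; pretend it recognised these objects.
    return {"traffic_light_red", "pedestrian"}

RULES = [
    ({"traffic_light_red"}, "stop"),
    ({"pedestrian"}, "yield"),
    ({"traffic_light_green"}, "go"),
]

def symbolic_engine(facts: set[str]) -> list[str]:
    # Fire every rule whose conditions are all present in the facts.
    return [action for conditions, action in RULES if conditions <= facts]

facts = neural_classifier(pixels=None)
print(symbolic_engine(facts))   # ['stop', 'yield']
```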
The few times I've had a local one share the "reasoning", the machine mainly just ties itself in knots over trivial nonsense for thousands of words before ignoring all of that and going with the first answer it came up with. Machine God is a Redditor.
It's also bc they're... newer. The damn data's gone sour! (The data for the newer models probably includes a lot of synthetic crap.)
Because the Internet is quickly becoming an ouroboros of LLM shit
Dead internet theory, now with necromancy!
Its training on ai generated content
You'd expect them to control for that.
But who the hell knows with these people.
How could they? The Internet is flooded with that garbage, and not just from their own model.
Yeah, they would have to have a surefire way to identify the content as 100% AI-generated so they could ditch it. If they're training on Reddit comments, then they're fucked.
To be fair, they do thanks the gold for the stranger fuck CCP bacon.
ChatGPT doesn't have access to verified not-bots. Google and Facebook do (they can read all of your messages). Expect them to become the privacy-invasive leaders on this.
Exactly it.
Meta will be able to have bots dominate their platforms while being able to distinguish and train on the real human users.
Human interaction with bots will also allow labelling some bot-generated data as acceptable for reuse as training data when real human data isn't enough.
Social media platforms will have a massive advantage for LLMs in the long run. Glorified search engine platforms such as ChatGPT are a relic of this current era.
They could use human filters to discard obvious bots... but that would make this shit even more expensive and unprofitable so they have to cut corners and make their own project fail.
ding ding ding! The models are collapsing
(e: or maybe its just cause these specific models are tiny as hell and trained for a specific task lol)
Habsburg AI confirmed
Jesus Christ, 50% or more is hallucination.
The tech hasn't gotten much better since GPT-3.5, and everything we see is the result of fine-tuning, which uses human input to bias the model (OpenAI is 80% a proxy for human labor).
Synthetic data probably