this post was submitted on 06 May 2025
84 points (100.0% liked)

technology

[–] [email protected] 6 points 3 days ago (2 children)

Well, the models are always refeeding their own output back into themselves recurrently. CoT prompting works by explicitly having the model write out the intermediate steps to reduce logical jumps. The reasoning model's text is still produced statistically, so it's still prone to hallucination. My money is on the higher hallucination rate being a result of the training data being polluted with synthetic information. I think it's model collapse.

[–] [email protected] 6 points 2 days ago

Another point of anecdata: I've read vibe coders saying that non-reasoning models give better results for coding tasks, because they're faster and tend to hallucinate less since they don't pollute the context with automated CoT. I've seen people recommend the DeepSeek V3 03/2025 release (with deep think turned off) over R1 for that reason.

[–] [email protected] 5 points 2 days ago* (last edited 2 days ago) (1 children)

My money is on the higher hallucination rate being a result of the training data being polluted with synthetic information. I think it's model collapse.

But that is effectively what's happening with RLMs and refeed. LLMs put statistical weights on their inputs; RAG models, for example, add higher weights to the text retrieved from your source documents. RLM reasoning is a fully automated CoT prompting technique: you don't provide the chain, you don't ask the LLM to create the chain, it just does it all the time for everything. That means the input becomes more polluted with generated text, which reinforces the existing biases in the model.

For example, take the em dash issue: the idea is that LLMs already generate more em dashes than exist in human-written text. Say on turn 1 you get an output with em dashes. On turn 2 this is fed back into the machine, which reinforces that over-indexing on em dashes in your prompt. So turn 2's output is potentially going to have even more em dashes, because the input on turn 2 contained output from turn 1 that had more em dashes than normal. Over time your input ends up accumulating the model's biases through the history. The shorter your inputs on each turn and the longer the conversation, the faster the conversation input converges on being mostly LLM-generated text.

When you do this with an RLM, even more output is being added to the input automatically via the CoT, which means any model biases accumulate in the input even faster.
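
Here's a toy sketch of what I mean. The token counts are made up, and it assumes the reasoning block gets re-sent with the chat history (which is exactly the part being debated); it just shows how quickly the prompt becomes mostly model-generated text, with and without a CoT block riding along.

```python
# Toy illustration only: made-up token counts, assuming the full chat history
# (including any CoT block) is re-sent to the model on every turn.

def generated_fraction_of_input(turns, user_tokens=30, answer_tokens=200, cot_tokens=0):
    """Fraction of each turn's prompt that is prior model output."""
    fractions = []
    user_total, model_total = 0, 0
    for _ in range(turns):
        user_total += user_tokens                  # this turn's user message
        fractions.append(model_total / (user_total + model_total))
        model_total += answer_tokens + cot_tokens  # answer (and CoT) joins the history
    return fractions

plain = generated_fraction_of_input(8)                      # only answers re-fed
with_cot = generated_fraction_of_input(8, cot_tokens=600)   # CoT block re-fed too

for turn, (p, c) in enumerate(zip(plain, with_cot), start=1):
    print(f"turn {turn}: plain={p:.0%}  with CoT refeed={c:.0%}")
```

With short user turns the prompt is mostly model-generated within a couple of turns either way, but the CoT version gets there faster and settles higher.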

Another reason I suspect CoT refeed rather than training-data pollution is that GPT-4.5, the latest (Feb 2025) non-reasoning model, seems to have a lower hallucination rate on SimpleQA than o1. If the training data were the issue, we'd see rates closer to o3/o4.

The other big difference between o1 and o3/o4 that may explain the higher rate of hallucinations is that o1's reasoning is not user accessible, and it's purposely trained not to have safeguards on its reasoning, whereas o3 and o4 have public reasoning and reasoning safeguards. I think safeguards may be a significant source of hallucination because they change prompt intent, encoding, and output. So on a non-o1 model that safeguard process is happening twice per turn, once for reasoning and once for output, and then being accumulated into the next turn's input; on an o1 model it's happening once per turn, only for the output, before being accumulated.

[–] [email protected] 2 points 2 days ago (1 children)

But asking the LLM to provide the CoT doesn't pollute the prompt history any more than a policy tuned via RL or SFT to generate the chain of thought does; the chain of thought is still being generated either way. I have seen automated CoT prompting proposed, where CoT examples are automatically injected into the prompt, but I haven't been able to find information on whether that is actually implemented in any of the widely available reasoning models.
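
For reference, this is roughly what I mean by automatically injecting CoT examples into the prompt. The exemplars here are made up for illustration; as far as I understand, the actual Auto-CoT work selects and generates them automatically (e.g. by clustering questions and producing rationales with "Let's think step by step").

```python
# Made-up exemplars, just to show the shape of automatic CoT example injection;
# a real Auto-CoT pipeline would select/generate these itself.

COT_EXEMPLARS = [
    {
        "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
        "rationale": "Speed is distance over time: 60 / 1.5 = 40.",
        "answer": "40 km/h",
    },
    {
        "question": "If 3 pencils cost 45 cents, how much do 7 pencils cost?",
        "rationale": "One pencil costs 45 / 3 = 15 cents, so 7 cost 7 * 15 = 105 cents.",
        "answer": "105 cents",
    },
]

def build_auto_cot_prompt(question: str) -> str:
    """Prepend worked CoT exemplars so the model produces a chain of thought
    without the user ever asking for one."""
    blocks = [
        f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}."
        for ex in COT_EXEMPLARS
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

print(build_auto_cot_prompt("A car uses 8 litres per 100 km. How much fuel for 250 km?"))
```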

I'm not saying this can't, or doesn't, affect it significantly when the trajectory of the policy goes way outside its normal operating bounds, but I don't think that's what's happening here (and after digging into it, I don't think it's model collapse either). If the hallucination rate is increasing across the board regardless of conversation length, then I don't think that's necessarily a result of CoT; it points more to an issue with the trained policy itself, which might be because of fine-tuning. Especially when you consider that SimpleQA is a short-form fact-answering metric.

And on GPT-4.5's performance: it's also larger than even GPT-4, which had 1.76T params (a 16-expert MoE model), while o1 has something like 300B. o1 still outperforms GPT-4o on SimpleQA accuracy/consistency and has a lower hallucination rate, but 4o is smaller anyway at ~200B (source for param counts).

As it turns out, after doing some research, o3-mini only has ~3B parameters, so I don't even think it's model collapse; it's just a tiny-as-hell model completely dedicated to math and science reasoning, which might be causing a "catastrophic forgetting" effect w.r.t. its ability to perform on other tasks. It still outperforms GPT-4.5 on math/science reasoning metrics (but does shit on coding), and based on this metric https://github.com/vectara/hallucination-leaderboard it actually hallucinates the least w.r.t. document summarization. So maybe the performance on SimpleQA should be taken as a reflection of how much "common sense" is baked into the parameters.

o3-mini and o4-mini both still outperform GPT-4o-mini on SimpleQA despite being smaller, newer, and CoT models, and despite its higher hallucination rate, o4-mini also has a higher accuracy rate than o3-mini on SimpleQA (per the SimpleQA repo). So I don't think this is telling us anything about CoT or dataset integrity. I think measuring hallucination for CoT vs a standard model will require a specific experiment on a base model, tbh.

I am also skeptical that public reasoning and reasoning safeguards are a cause of the hallucinations beyond the same issues as fine-tuning by RL. AFAIK neither of those changes the prompt at all: public reasoning just exposes the CoT (which is still part of the output), and the reasoning safeguards are either trained into the responses or applied by a separate model evaluating the input. So I don't think there are additional turns or prompt rewriting being injected at all (but I could be wrong on this, I had trouble finding information).

Ugh. This is all machine woo. I need a drink

[–] [email protected] 3 points 2 days ago* (last edited 2 days ago) (1 children)

Ah, so ChatGPT works slightly differently than what I'm used to. I have really only explored AI locally via ollama, because I'm an OSS zealot and I need my tooling to be deterministic, so I have to be able to eliminate service changes from the equation.

My experience is that with ollama and DeepSeek R1 it reprocesses the think tags; they get referenced directly.

At this point my debugging instincts are convinced that my idea is unlikely.

Have we tried asking ChatGPT what's wrong before chain of thought prompting the benchmark data for each model to itself?

Ugh. This is all machine woo. I need a drink

I was trying to make a joke about this and was trying to remember the disease that tech priests get in Rogue Trader, and Google's search AI hallucinated WH40K lore based on someone's homebrew... Not only that, it hallucinated the character's backstory, which isn't even in the post, to give them a genetic developmental disorder... to answer my question... I feel gross. Fucking man-made horrors.

[–] [email protected] 2 points 2 days ago

My experience is that with ollama and DeepSeek R1 it reprocesses the think tags; they get referenced directly.

This does happen (and I fucked with weird prompts for DeepSeek a lot, with very weird results) and I think it does cause what you described, but like... the CoT would get reprocessed in models without think tags too, just by normal CoT prompting, and I would also just straight up get other command tokens in the output even on really short prompts with minimal CoT. So I kind of attributed it to issues with local DeepSeek being as small as it is. I can't find the paper, but naive CoT prompting works best with models that are already of a sufficient size, and the errors compound on smaller models with less generalization. Maybe something you could try would be parsing the think tags to remove the CoT before re-injection? I was contemplating doing this but I would have to set ollama up again.
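
Something like this is what I had in mind, just as a sketch. It assumes the model wraps its reasoning in <think>...</think> tags the way the R1 distills do under ollama, and the history handling is simplified.

```python
# Sketch only: strip the <think>...</think> reasoning block before the
# assistant message is appended to the history you re-send next turn.
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(response_text: str) -> str:
    """Drop the reasoning block, keeping only the final answer."""
    return THINK_BLOCK.sub("", response_text).strip()

history = []  # the list of {"role": ..., "content": ...} messages re-sent each turn

def add_assistant_turn(raw_response: str) -> None:
    # Only the stripped answer goes back into the history, so the CoT
    # never gets re-fed on the next turn.
    history.append({"role": "assistant", "content": strip_think(raw_response)})

add_assistant_turn("<think>The user wants a greeting, keep it short.</think>Hello!")
print(history)  # [{'role': 'assistant', 'content': 'Hello!'}]
```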

It's tough to say. The ideal experiment in my mind would be to measure the hallucination rate in a baseline model, the same baseline with CoT prompting, and the same baseline tuned by RL to do CoT without prompting. I would also want to measure hallucination rate against conversation length separately for all of those models, and with/without CoT reinjection into chat history for the tuned CoT model. And also measure hallucination rate across task domains with task-specific fine-tuning...
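
Just to make that design concrete for myself, the condition grid would look something like this (everything here is a placeholder; nothing maps to a real model or eval harness):

```python
# Placeholder condition grid for the experiment described above.
from itertools import product

model_variants = ["baseline", "baseline + CoT prompting", "RL-tuned CoT"]
conversation_lengths = [1, 5, 20]   # turns per conversation
cot_reinjection = [False, True]     # re-feed the CoT into chat history?

conditions = []
for variant, length, reinject in product(model_variants, conversation_lengths, cot_reinjection):
    # Per the design above, the reinjection toggle only applies to the RL-tuned model.
    if reinject and variant != "RL-tuned CoT":
        continue
    conditions.append({"model": variant, "turns": length, "cot_reinjected": reinject})

for c in conditions:
    print(c)
```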

Not only that, it hallucinated the character's backstory, which isn't even in the post, to give them a genetic developmental disorder

yikes