Reddit Migration

### About Community

Tracking and helping #redditmigration to Kbin and the Fediverse. Say hello to the decentralized and open future. For the latest Reddit blackout info, see: https://reddark.untone.uk/

founded 1 year ago

Overwriting Comments w/ AI Output Is the Quickest Way to Make Reddit's Data Useless to LLM Firms (arxiv.org)

submitted 1 year ago by [email protected] to c/[email protected]

A new study shows that LLMs fed too much LLM-generated content eventually collapse. Essentially, AI-generated text is poison if it makes its way into an LLM's training data. If the model eats too much of this poison, the model dies. By replacing your Reddit comments with AI-generated text, you can effectively increase the toxicity of Reddit's dataset, and thereby decrease its value to firms training new LLMs. This will probably happen naturally anyway as spam bots and the like continue taking over Reddit, but if you want to go out in a petty way, this is a good option.

I linked the actual study, but I first read about this on Platformer, where the author was writing more broadly about how AI is filling up the web with synthetic content and the problems that is causing. He used this study to point out that developers will find it increasingly hard to source good training content for LLMs, both because so much content is now AI-generated and because of the risk of the LLMs consuming too much of it. Here is what he wrote:

A second, more worrisome study comes from researchers at the University of Oxford, University of Cambridge, University of Toronto, and Imperial College London. It found that training AI systems on data generated by other AI systems — synthetic data, to use the industry’s term — causes models to degrade and ultimately collapse.

While the decay can be managed by using synthetic data sparingly, researchers write, the idea that models can be “poisoned” by feeding them their own outputs raises real risks for the web.

And that’s a problem, because — to bring together the threads of today’s newsletter so far — AI output is spreading to encompass more of the web every day.

“The obvious larger question,” Clark writes, “is what this does to competition among AI developers as the internet fills up with a greater percentage of generated versus real content.”

When tech companies were building the first chatbots, they could be certain that the vast majority of the data they were scraping was human-generated. Going forward, though, they’ll be ever less certain of that — and until they figure out reliable ways to identify chatbot-generated text, they’re at risk of breaking their own models.

Even the study's abstract doesn't make a lot of sense to me, so here is an AI-generated ELI5 (I am fully aware of the irony):

This paper is about how computers learn to write like humans. They use a lot of text from the internet to learn how to write. But if they use too much text that they wrote themselves, they start to forget how humans write. This is bad because we want computers to write like humans. So we need to make sure that computers learn from humans and not just from other computers.
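For anyone who wants to see the "forgetting" effect rather than take it on faith, here is a toy sketch in Python. It is not the study's actual experiment and there is no LLM in it: the "model" is just a bell curve fitted to a handful of numbers, and each generation is trained only on the previous generation's output. The function name `simulate_collapse` and its parameters (`generations`, `sample_size`, `seed`) are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Toy stand-in for recursive training (hypothetical helper, not from the study).

    Each generation fits a Gaussian to the previous generation's samples,
    then produces its own samples from that fit. Returns the spread
    (population standard deviation) of the data at every generation.
    """
    rng = random.Random(seed)

    # Generation 0: "human" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = [statistics.pstdev(data)]

    for _ in range(generations):
        # "Train" the next model: estimate mean and spread from current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)

        # The next generation sees only what this model generates.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        spreads.append(statistics.pstdev(data))

    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    for gen in (0, 50, 100, 200, 300):
        print(f"generation {gen:3d}: spread = {spreads[gen]:.3g}")
```

With a small sample per generation, the estimation error compounds and the spread tends to drift toward zero, so later generations only reproduce a narrow slice of the original data. That is roughly the intuition behind "they start to forget how humans write", although real language models are far more complicated than this.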
