There have been a number of papers recently ("TinyStories" and "Textbooks are all you need" are two that come to mind) showing that higher quality training data for LLMs leads to significantly better models. So yes, this is technically true.
Does anyone really need to read / publish papers to come to conclusion that better training data creates better models? Obvious no? Am I missing something here?
Yeah. Science is often about rigorously verifying the things that seem obvious, and often going into significantly more detail about them too. TinyStories, for example, not only confirms that better quality data gives better quality models, but also shows how to go about getting that data, what parameters of the models lead to what observed effects, and how all of that can be used in the future. The simple one-line summary of the conclusion might be "better training data creates better models", but that does nothing to really explain what better quality actually means, how to go about getting that better quality data, or why the data used for previous models was worse quality.
This does seem to be a big revelation to ML/AI types. I was at a conference last year for NLP and one of the data scientists was reporting back on some other conference they'd been to where they had worked out that you don't get better performance by running your data through a dozen different models, but by running better data through a single model.
I guess @[email protected] is right that you need to confirm these things but it definitely spoke to quite a naive mindset.
This does seem to be a big revelation to ML/AI types
The big revelation is not that, but how to go about doing that for the very specific set of circumstances that we are currently dealing with. Just a cursory search on Google Scholar shows that even 20 years ago it was known that good data leads to better predictive models, but with more advanced models come more complications as to what constitutes better quality data. The paper I linked is concerned with random errors, which mattered for simple regression models (which were state of the art back then) but is now pretty obvious, and no one is really concerned about that. The TinyStories paper I mentioned in a previous comment is not so much concerned with better quality data as with what sort of changes we need to make to the data used to train a model (like generating synthetic data consisting of short stories that only contain words that typical 3 to 4-year-olds usually understand) to make it better suited to getting that model to exhibit certain features (producing fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrating reasoning capabilities), along with how tweaking certain parameters affects the results. We have long known that better quality teaching leads to better quality learning in humans, yet we are still actively researching how to better teach students.
I get that this is a complicated topic, but to represent it as "this is obvious idk why the scientists think this is new" is just being willfully ignorant of what is actually happening. It is never as simple as "running better data through a single model", especially considering that ensemble learning is a pretty well established means of getting better outputs from the exact same data (which is important because it is very difficult to get better quality data). If you want to actually know what is going on rather than listening to the reinterpretation of (what is probably) a single work from a single conference by a participant of that conference (and not the author of that work), just read the paper. For context, the TinyStories paper is 27 pages long; a one-sentence summary is not going to give you the full detail.
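For anyone who hasn't run into ensemble learning, here is a minimal sketch of the idea using plain majority voting. Everything in it is made up for illustration: the three "models" are hypothetical one-feature threshold rules and the six data points are a toy set, not anything from the papers mentioned above. It only shows how combining several weak predictors can beat each of them on the same exact data.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(models, x):
    """Ask every model for a label and keep the majority answer."""
    return majority_vote([model(x) for model in models])

def accuracy(predict, data):
    """Fraction of (features, label) pairs that predict gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)

# Hypothetical weak models: each one looks at a single feature only.
models = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[2] > 0.5),
]

# Toy data (assumption): the true label is 1 when at least two features exceed 0.5.
data = [
    ((0.9, 0.8, 0.2), 1),
    ((0.1, 0.7, 0.9), 1),
    ((0.8, 0.1, 0.9), 1),
    ((0.2, 0.1, 0.7), 0),
    ((0.6, 0.3, 0.1), 0),
    ((0.4, 0.9, 0.2), 0),
]

individual = [accuracy(model, data) for model in models]
combined = accuracy(lambda x: ensemble_predict(models, x), data)

print("individual accuracies:", individual)
print("ensemble accuracy:", combined)
```

On this toy set each rule alone gets 4 of 6 right while the vote gets all 6, but that is by construction. Real ensembles only help when the individual models make reasonably independent mistakes.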
I took it the other way, instead of us teaching the machines, we should have the machines teaching us, or develop the machines to teach us at least.
What's the point of having all the information in the world at our fingertips if we're just going to ignore it in preference of what we already agree with?
Oh, I guess that's a valid interpretation. Especially since we are doing that too! I haven't been reading up on the intelligent tutoring systems side of AI/pedagogy in a while but I know that we have already attempted to teach math and computer science using them. I am technically working with a research team on creating a creativity support tool/tutoring system for analyst tasks using LLMs right now. Or I would be if I got off Lemmy 😆
It's true of humans too.
She ain't wrong.
Oh don't worry, we are still teaching. waves to chatgpt
I like this one!