Curated data should improve the results as well. Jamming in all that trash data is why the models they keep shoving into everything produce such trash results.
the LLM's dataset uses only public domain and openly licensed material.
I'm curious about the specifics of all this. Probably the best-known "openly licensed" licenses (aside from licenses intended specifically for software) are the Creative Commons family, all of which require attribution. So the question becomes: "if you've used any of my CC-licensed content in training this model, am I attributed somewhere?" If so, surely the list is extremely long. Or maybe Creative Commons wasn't "open" enough and they excluded all CC-licensed content from the training set.
Also, the public domain is definitely strongly biased toward very old content. You'd think a lot of the answers you got from that LLM would be based on some very outdated information. Maybe they specifically limited it to (or at least adjusted weights or something to make it prefer) recent materials in the public domain.
But then the article also says:
It performed about as well as Meta's similarly sized Llama 2-7B from 2023.
On top of all this, I have to say the LLM sphere really is just scams piled on top of scams, so it's fairly probable either that it doesn't perform anywhere near as well as Llama 2-7B and they're just lying, or that Llama 2-7B (and indeed LLMs in general) is just total shit too.
You can, but the results are going to be essentially unusable compared to SOTA. You'd still be handing the big AI companies a massive monopoly just so the big copyright companies can make even more money.
Only one solution is good for us. In both cases, though, the ones who actually created the data get screwed.