submitted 2 days ago by [email protected] to c/[email protected]

AI companies claim their tools couldn't exist without training on copyrighted material. It turns out they could; it just takes more work. To prove it, AI researchers trained a model on a dataset that uses only public domain and openly licensed material.

What makes it difficult is curating the data, but once the data has been curated, in principle everyone can use it without having to go through the painful part. So the whole "we have to violate copyright and steal intellectual property" argument is (as everybody already knew) total BS.

[email protected] 12 points 2 days ago

Curated data should improve the results as well. Just jamming in all the trash data is why the models they keep jamming into everything have such trash results.

[email protected] 2 points 2 days ago

"the LLM's dataset uses only public domain and openly licensed material."

I'm curious about the specifics of all this. Probably the most well-known "openly licensed" sort of licenses (aside from licenses specifically intended only for software) are the Creative Commons family of licenses, all of which require attribution. So then the question would become "if you've used any of my CC-licensed content in training this model, am I attributed somewhere?" If so, surely the list is extremely long. Or maybe Creative Commons wasn't "openly"-enough licensed and they excluded all CC-licensed content from the training set.

Also, the public domain is definitely strongly biased toward very old content. You'd think a lot of the answers you got from that LLM would be based on some very outdated information. Maybe they specifically limited it to (or at least adjusted weights or something to make it prefer) recent materials in the public domain.

But then the article also says:

"It performed about as well as Meta's similarly sized Llama 2-7B from 2023."

On top of all this, I have to say that the LLM sphere really is just scams piled on top of scams, so it's fairly probable either that it doesn't perform anywhere near as well as Llama 2-7B and they're just lying or that actually Llama 2-7B (and indeed all LLMs as well) is just total shit too.

[email protected] 1 point 2 days ago

You can, but the results are going to be essentially unusable compared to SOTA. You are still giving the big AI companies a massive monopoly just so the big copyright companies can make even more money.

Only one solution is good for us. In both cases, the ones that actually created the data get screwed though.

this post was submitted on 06 Jun 2025
50 points (100.0% liked)
