this post was submitted on 06 Sep 2024

Technology


Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.

This is fundamentally different from copying a book or song. It's more like the long-standing artistic tradition of being influenced by others' work. The law has always recognized that ideas themselves can't be owned - only particular expressions of them.

Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

[–] [email protected] 1 points 2 months ago

I never anthropomorphized the technology; unfortunately, because of how language works, it's easy to misread it that way. I was indeed trying to explain overfitting. You are forgetting that current AI technology (artificial neural networks) is loosely modeled on biological neural networks, and it exhibits a range of quirks that biological neural networks show as well. It is not human, nor anything close, but that does not mean there are no similarities that can rightfully be pointed out.

Overfitting isn't just what you describe, though. It also occurs when the prompt guides the AI towards a very specific part of its training data, to the point where its calculations are extremely certain about which words come next. Overfitting here isn't caused by an abundance of data, but rather by a lack of it. The training data isn't being pulled out of the model; it emerges as a statistical inevitability of the mathematical version of your prompt. That is why this amounts to tricking the AI: an AI doesn't understand copyright, it just performs the calculations. But you do. So using that as an example is like saying "Ha, stupid gun. I pulled the trigger and you shot this man in front of me, don't you know murder is illegal buddy?"

Nobody should be expecting a machine to use itself ethically. Ethics is a human thing.

People who use AI have an ethical obligation to avoid overfitting. People who produce AI also have an ethical obligation to reduce overfitting. But a prompt has a practically unlimited number of combinations (within the token limits) to consider, so overfitting will happen in fringe situations. That's not because the data is actually present in the model, but because the combination of the prompt with the model pushes the calculation towards a very specific prediction, which can heavily resemble the original text or match it verbatim. (Note: I do really dislike companies that try to hide the existence of overfitting from users, and you can rightfully criticize them for claiming it doesn't exist.)
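To make the "very specific prediction" part concrete, here is a minimal sketch using a toy word-level n-gram counter. This is an assumption-heavy stand-in, not how an LLM works internally: unlike a neural network, this toy stores its counts explicitly, and the corpus, function names, and context length are all made up for illustration. It only shows the certainty effect: a context seen many times in varied forms gives a spread-out prediction, while a context seen exactly once gives a prediction with probability 1, and greedy decoding then walks straight back through the original line.

```python
from collections import Counter, defaultdict

def train(corpus, order=2):
    """Count which token follows each `order`-token context."""
    counts = defaultdict(Counter)
    for text in corpus:
        tokens = text.split()
        for i in range(len(tokens) - order):
            context = tuple(tokens[i:i + order])
            counts[context][tokens[i + order]] += 1
    return counts

def next_token_probs(counts, context):
    """Turn the raw follow-up counts for one context into probabilities."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()} if total else {}

def greedy_continue(counts, prompt, order=2, max_tokens=20):
    """Repeatedly append the most likely next token until nothing is known."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        probs = next_token_probs(counts, tokens[-order:])
        if not probs:
            break
        tokens.append(max(probs, key=probs.get))
    return " ".join(tokens)

# Hypothetical toy corpus: three varied sentences and one line seen only once.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the floor",
    "twas brillig and the slithy toves did gyre and gimble in the wabe",
]
model = train(corpus)

common = next_token_probs(model, ["on", "the"])       # spread over three words
rare = next_token_probs(model, ["slithy", "toves"])   # a single certain word
memorized = greedy_continue(model, "twas brillig")    # retraces the rare line

print(common)
print(rare)
print(memorized)
```

The prompt "twas brillig" does nothing special by itself; it just lands in a region where every step has exactly one known continuation, which is the fringe situation described above.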

> This isn’t akin to anything human, people can’t repeat pages of text verbatim like this and no toddler can be tricked into repeating a random page from a random book as you say.

This is incorrect. A toddler can and will repeat nursery rhymes verbatim after hearing them. It's literally one of their defining features, to the dismay of parents and grandparents around the world. I can also whistle pretty much my entire music collection exactly as it was produced, because I've listened to each song hundreds if not thousands of times. And I'm quite certain you have a situation like that too. An AI's mind does not decay or degrade (nor does it change for the better like a human's), and the data encoded in it is far greater, so it will present more of these situations in its fringes.

> but it isn’t crafting its own sentences, it’s using everyone else’s.

How do you think toddlers learn to make their own first sentences? It's why parents spend so much time saying "Papa" or "Mama" to their toddler: they want the child to copy them verbatim. Eventually the corpus of their knowledge grows big enough that they start to experiment and develop their own style of talking. But it's still heavily based on the information they take in. It's why we have dialects and languages. Take a look at what happens when children don't learn from others: https://en.wikipedia.org/wiki/Feral_child So yes, the AI is using its training data, nobody's arguing it doesn't. But it's trivial to see how it's crafting its own sentences from that data in the vast majority of situations. It's also why you can ask it to talk like a pirate, and it will suddenly know how to mix the essence of talking like a pirate into its responses. Or how it can remember names and mix those into sentences.

> Therefore it is factually wrong to state that it doesn’t keep the training data in a usable format

If your argument is that it can produce something that happens to align with its training data given the right prompt, well yeah, that's not incorrect. But it is heavily misguided, and borders on bad faith, to suggest that this tiny minority of cases where overfitting occurs is indicative of the rest. LLMs are prediction machines, so if you know how to guide one towards what you want it to predict, and that is in the training data, it's most likely going to predict exactly that. Under normal circumstances, where the prompt you give it is neutral and unique, you will basically never encounter overfitting. With most AI models you really have to try.
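If you want to check for yourself how often an output actually lines up with a known text, one rough way is to measure the longest run of consecutive words the two share. The sketch below is only an illustration under stated assumptions: the function names, the sample strings, and the threshold of 8 words are all made up here, and a real check would need proper tokenization and a much larger reference set.

```python
from difflib import SequenceMatcher

def longest_shared_run(output, source):
    """Return the longest run of consecutive words found in both texts."""
    a, b = output.split(), source.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

def looks_verbatim(output, source, threshold=8):
    """Flag an output whose shared run reaches an (arbitrary) word threshold."""
    return len(longest_shared_run(output, source)) >= threshold

# Hypothetical sample texts for illustration only.
source = "it was the best of times it was the worst of times it was the age of wisdom"
paraphrase = "those years were both wonderful and terrible at once"
recital = "as the book says it was the best of times it was the worst of times indeed"

print(longest_shared_run(recital, source))
print(looks_verbatim(paraphrase, source))
print(looks_verbatim(recital, source))
```

A paraphrase shares no meaningful run with the source, while a recital shares a long one, which is the difference between ordinary output and the fringe cases being argued about.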

But then again, you might be arguing this based on a specific AI model that is very prone to overfitting, while I am arguing about the technology as a whole.

> This isn’t originality, creativity or anything that it is marketed as. It is storing, encoding and copying information to reproduce in a slightly different format.

It is originality, as these AIs can easily produce material never seen before in the vast, vast majority of situations. That is also what we often refer to as creativity, because it has to be able to mix information and still retain legibility. Humans also constantly reuse the phrases, ideas, visions and ideals of other people. It is intellectually dishonest to ignore these similarities with human psychology and then treat AI as having to be perfect all the time, never once saying the same thing as someone else. There are only a finite number of ways to convey certain information within the English language.