Technology

59559 readers

3339 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related content.
Be excellent to each another!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, to ask if your bot can be added please contact us.
Check for duplicates before posting, duplicates may be removed

Approved Bots

founded 1 year ago

MODERATORS

[email protected]

134

Child safety org launches AI model trained on real child sex abuse images (arstechnica.com)

submitted 1 day ago by [email protected] to c/[email protected]

85 comments fedilink hide all child comments

Today, a prominent child safety organization, Thorn, in partnership with a leading cloud-based AI solutions provider, Hive, announced the release of an AI model designed to flag unknown CSAM at upload. It's the earliest AI technology striving to expose unreported CSAM at scale.

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 43 points 1 day ago (2 children)

I am a bit confused how it is legal for them to have the training data here?

Like is there anything a corpo can't do?

Like why can't subway Jared and Catholic church "train the AI"

Only half way joking, what's the catch here?

[–] [email protected] 0 points 7 hours ago

I don't think you even need the actual stuff to train a neural network to recognize it. For example, if I wanted to train a neural network to recognize pictures of lions, but I didn't have any actual pictures of lions, I could use pictures of lion-shaped things, lion-colored things and locations where lions might appear. If a picture is hitting all three of those, it's very likely to be a lion. Very likely is all a neural network can do, so it's good enough for my purposes.

[–] [email protected] 33 points 1 day ago (1 children)

There are laws around it. Law enforcement doesn't just delete any digital CSAM they seize.

Known CSAM is archived and analyzed rather than destroyed, and used to recognize additional instances of the same files in the wild. Wherever file scanning is possible.

Institutions and corporation can request licenses to access the database, or just the metadata that allows software to tell if a given file might be a copy of known CSAM.

This is the first time an attempt is being made at using the database to create software able to recognize CSAM that isn't already known.

I'm personally quite sceptical of the merit. It may well be useful for scanning the public internet, but I'm guessing the plan is to push for it to be somehow implemented for private communication, no matter how badly that compromises the integrity of encryption.

[–] [email protected] 13 points 1 day ago (1 children)

So doesn't that make the law enforcement having the biggest CP collection from everybody? This sounds kinda dangerous...

[–] [email protected] 21 points 1 day ago* (last edited 1 day ago)

It does. Kinda.

The police are seldom allowed to be in possession of CSAM, except for in terms of grabbing the hardware which contains it in an arrest. The database used in modern detection tools is maintained by NCMEC which has special permission to do so.

And of course there are risks, but it's just digital data. Unless you are creating more, you're not actively harming anyone. And law enforcement absolutely needs that data to take some of the most obvious steps to prevent it being spread further.

Obviously, someone has access, but to get to the actual media files wouldn't be simple. What typically happens, is that anyone wanting to detect CSAM, is given a hashed version of the database. They can then scan their systems for CSAM by hashing any media they are hosting, and seeing whether there are any matches.

Whenever possible, people aren't handling the actual media. But for any detection to be possible to begin with, the database of the actual media does need to be maintained somewhere.

AI is a touchier subject, as you can't train a model to recognize CSAM not already in the database using hashes, so in those cases you have to work with actual real media. This is only recently becoming a thing.

It also leaves open the possibility for false positives. An oft cited example is parents taking pictures of their own children for innocent reasons, or doctors and parents handling images for valid medical reasons. In a system that flagged such content, it would mean someone else would be seeing that "private" content because it was flagged.