this post was submitted on 11 Oct 2024

[–] [email protected] 52 points 1 month ago* (last edited 1 month ago) (2 children)

The Wayback Machine’s site has been breached, but its founder says the data is still there.

One concern I do have that's maybe worth considering is that The Wayback Machine is often used as an authoritative source of what a website was like at some point. Like, if you're citing information, it's considered appropriate to link to The Wayback Machine.

There are entities who would potentially be interested in being able to modify that authoritative history.

I don't think that that's likely an issue here -- someone who wanted to secretly modify the history probably wouldn't have also modified the site to indicate that it was compromised -- but the ability to modify such a representation might have a lot of potential room for abuse.

It might be worthwhile, if the infrastructure permits it, to use some sort of storage mechanism that makes it hard to spoof old data.

If you're familiar with blockchains, they leverage a chain of hashes, so that there's a piece of data dependent on all prior entries. That sort of dependency didn't originate with blockchain -- the CBC cipher mode does the same thing, off the top of my head -- and I don't think that a fully-distributed mode of operation is required here.
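To make the chained-hash idea concrete, here is a minimal Python sketch. The function name `chain_hashes` and the sample snapshots are made up for illustration; this is not how any particular blockchain or The Internet Archive actually stores data.

```python
import hashlib


def chain_hashes(entries, seed=b""):
    """Return one hex digest per entry.

    Each digest covers the entry itself plus the digest of everything
    before it, so changing an old entry changes every later digest.
    """
    digests = []
    prev = hashlib.sha256(seed).digest()  # fixed 32-byte starting value
    for entry in entries:
        prev = hashlib.sha256(prev + entry).digest()
        digests.append(prev.hex())
    return digests


# Hypothetical archived snapshots, oldest first.
original = chain_hashes([b"page v1", b"page v2", b"page v3"])
tampered = chain_hashes([b"page v1", b"page v2 (edited)", b"page v3"])

# The first digest is unaffected; the edited entry and everything after it differ.
print(original[0] == tampered[0], original[1] == tampered[1], original[2] == tampered[2])
```

The point is that the latest digest alone is enough to detect a change anywhere earlier in the chain.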

However, it might be interesting to use some sort of verifiable storage format where hashes of checkpoints are distributed elsewhere, so that if someone does manage to get into The Internet Archive, they can't go fiddle with past things without it becoming visible.
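A rough Python sketch of what checking against an externally held checkpoint could look like. Everything here is an assumption for illustration: a checkpoint is modeled as an `(entry_count, head_hash)` pair that some third party stored earlier, and the names `head_hash` and `verify_checkpoint` are hypothetical.

```python
import hashlib


def head_hash(entries):
    """Hash a list of entries as a chain and return the final hex digest."""
    h = hashlib.sha256(b"").digest()
    for entry in entries:
        h = hashlib.sha256(h + entry).digest()
    return h.hex()


def verify_checkpoint(entries, checkpoint):
    """Check that the first `count` entries still match a stored checkpoint."""
    count, expected = checkpoint
    if len(entries) < count:
        return False  # history was truncated
    return head_hash(entries[:count]) == expected


# The archive as it looked when the checkpoint was taken.
archive = [b"snapshot A", b"snapshot B"]
checkpoint = (len(archive), head_hash(archive))  # stored by a third party

# Appending new snapshots later does not invalidate the old checkpoint.
archive.append(b"snapshot C")
ok_after_append = verify_checkpoint(archive, checkpoint)

# Rewriting an old snapshot does.
archive[0] = b"snapshot A (rewritten)"
ok_after_tamper = verify_checkpoint(archive, checkpoint)

print(ok_after_append, ok_after_tamper)
```

An attacker who controls only the archive cannot make the second check pass without also changing the copies of the checkpoint held elsewhere.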

Git repositories take advantage of this with their signed commits and hash trees.
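As a loose illustration of the hash-tree part, here is a small Python sketch. It is in the spirit of Git's blob and tree objects but does not reproduce Git's actual object format or hash algorithm; `hash_node` and the sample site layout are hypothetical.

```python
import hashlib


def hash_node(node):
    """Hash a file (bytes) or a directory (dict of name -> node).

    A directory's hash covers the names and hashes of its children, so a
    change to any file changes the hash of every directory above it.
    """
    if isinstance(node, bytes):
        return hashlib.sha256(b"blob\0" + node).hexdigest()
    h = hashlib.sha256(b"tree\0")
    for name in sorted(node):
        child = hash_node(node[name])
        h.update(name.encode() + b"\0" + child.encode() + b"\n")
    return h.hexdigest()


# Hypothetical archived site.
site = {"index.html": b"<h1>hello</h1>", "css": {"main.css": b"body {}"}}
root_before = hash_node(site)

# Change one deeply nested file and the root hash changes too.
site["css"]["main.css"] = b"body { color: red }"
root_after = hash_node(site)

print(root_before != root_after)
```

A single root hash therefore vouches for the whole tree, which is what makes it cheap to publish or sign.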

If someone gets into The Internet Archive, they could potentially compromise a copy before it gets hashed (though if they supported the submitter signing commits, a la git, that'd avoid that for information that originated from somewhere other than The Internet Archive). This can't protect against that. But it can protect the integrity of information archived prior to the compromise, which could be rather important.

[–] [email protected] 6 points 1 month ago (2 children)

However, it might be interesting to use some sort of verifiable storage format where hashes of checkpoints are distributed elsewhere, so that if someone does manage to get into The Internet Archive, they can't go fiddle with past things without it becoming visible.

Why not use a write only medium, like CDs but obviously with bigger capacity? A write-once, read-many kind of thing. It's an archive; it should not be possible to change it.

[–] [email protected] 1 points 1 month ago

write only medium

I guess you meant "write once"?


Anyway, this won't prevent attacks that somehow swap the CD being read, or the backend logic for where to read the data from.

[–] [email protected] 1 points 1 month ago* (last edited 1 month ago)

That's a thought, though my guess is that access time constraints for something like a CD might mean that it could at most be a secondary form of storage.

[–] [email protected] 2 points 1 month ago

You cited Git as an example, but in Git it's possible to e.g. force-push a branch, and if someone later fetches it with no previous knowledge they will get the rewritten version, with nothing to tell them that the original was replaced.

The problem is the "with no previous knowledge" part, and it is the reason this isn't a storage issue. The way you would solve this in Git would be to fetch a specific commit, i.e. you need to already know the hash of the data you want.

For the Wayback Machine this could be as simple as embedding that hash in the URL. That way, when someone fetches that URL in the future, they know what to expect and can verify that the website data matches the hash.
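A minimal Python sketch of that idea. The URL layout is entirely made up: the Wayback Machine does not offer a `#sha256=` fragment as far as I know, and `archive.example`, `make_pinned_url` and `verify_pinned` are hypothetical names.

```python
import hashlib
from urllib.parse import parse_qs, urlsplit


def make_pinned_url(base_url, content):
    """Append the SHA-256 of the content to the URL as a fragment."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{base_url}#sha256={digest}"


def verify_pinned(url, content):
    """Return True if the content matches the hash carried in the URL."""
    fragment = urlsplit(url).fragment
    expected = parse_qs(fragment).get("sha256", [None])[0]
    if expected is None:
        raise ValueError("URL carries no sha256 pin")
    return hashlib.sha256(content).hexdigest() == expected


# Whoever cites the page builds the link once, from the content they saw.
page = b"<html>archived page</html>"
link = make_pinned_url("https://archive.example/web/2024/https://example.com/", page)

# A later reader checks whatever the archive serves against the link.
matches_original = verify_pinned(link, page)
matches_modified = verify_pinned(link, b"<html>quietly edited page</html>")

print(matches_original, matches_modified)
```

The verification happens on the reader's side, so it keeps working even if the archive itself is later compromised.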

This won't work, however, if you don't already have such a hash or you don't trust its source, and I don't think there's anything that will ever work in those cases.