this post was submitted on 16 Dec 2023
22 points (76.2% liked)

No Stupid Questions


I know we can't do this with any copyrighted materials. But a lot of books, music, art, and knowledge is released under Creative Commons licenses. Is it possible to create one massive torrent that includes everything that can legally be included, and then have people download only what they actually want?

[–] [email protected] 17 points 11 months ago* (last edited 11 months ago) (2 children)

So like… a meta-torrent that is a torrent of all the other torrents?

Edit: or we could just create a website that had a list of all the torrents. And give it a clever name that reminds us of the fact that we’re pirating things. Oh… wait

[–] [email protected] 4 points 11 months ago

With the way the BitTorrent v2 protocol works, the files from the original, underlying torrents wouldn't have to be re-seeded. Each file has its own hash, so the meta torrent could reuse those hashes and incorporate the files without anyone necessarily having to download or upload any part of the meta torrent itself.

That said, the .torrent file would be massive and might run up against certain limits in the current protocol.
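To make the per-file hash idea concrete, here's a rough Python sketch. It is only loosely modeled on BitTorrent v2 (SHA-256 Merkle tree over 16 KiB blocks) and is not byte-exact with the spec; `file_root` and `build_index` are names I made up for illustration. The point is just that identical files end up with the same root no matter which torrent lists them.

```python
import hashlib

BLOCK_SIZE = 16 * 1024  # 16 KiB leaf blocks, the block size BitTorrent v2 uses


def file_root(data: bytes) -> str:
    """Return a SHA-256 Merkle root over one file's 16 KiB blocks (simplified)."""
    # Hash each block to build the leaf layer; an empty file gets one leaf.
    leaves = [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
              for i in range(0, len(data), BLOCK_SIZE)]
    if not leaves:
        leaves = [hashlib.sha256(b"").digest()]
    # Pad the leaf layer with zero hashes up to the next power of two.
    size = 1
    while size < len(leaves):
        size *= 2
    leaves += [bytes(32)] * (size - len(leaves))
    # Hash pairs together until a single root is left.
    layer = leaves
    while len(layer) > 1:
        layer = [hashlib.sha256(layer[i] + layer[i + 1]).digest()
                 for i in range(0, len(layer), 2)]
    return layer[0].hex()


def build_index(torrents: dict) -> dict:
    """Map each file root to every 'torrent/path' entry that contains that file."""
    index = {}
    for torrent_name, files in torrents.items():
        for path, data in files.items():
            index.setdefault(file_root(data), []).append(f"{torrent_name}/{path}")
    return index
```

If the same book appears in a small torrent and in the mega torrent, both entries land under one root in the index, which is the property that would let the two swarms overlap.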

[–] [email protected] 2 points 11 months ago

But don't lose the list. Losing the list would be bad. We'll need to keep the list in a safe place.

[–] [email protected] 14 points 11 months ago (1 children)

I mean... That pretty much describes torrents period... What is the functional difference between hosting a single torrent with everything, and hosting a torrent per item?

If the expectation is that you only include files you want when downloading the torrent, you're only going to be seeding that portion.

Seems like it would just make the search function harder, and make it harder to determine the "health" of individual items...

I don't understand the benefit...

[–] [email protected] 1 points 11 months ago (1 children)

For example, zlibrary is 220TB of books and scientific articles. Having that in the torrent would be great, along with all the art and music.

Basically it would be a way to combat media vanishing off the net over time: a Noah's Ark for all of mankind's knowledge.

It would be great to have everything in one single spot to make it easier to contribute and get stuff. It would also be easier to combine forces to create and maintain the thing.

[–] [email protected] 2 points 11 months ago (1 children)

That makes a great focal point for what I was saying actually ;)

It's 220TB, so you'll have incredibly few people who download the whole torrent. Most will open the torrent's file list and select a small number of items to download. The files selected the most will get seeded frequently; the ones that never get selected by anyone will have only the originator seeding them (if they continue to do so).

It's functionally no different than if each individual file is a torrent... Except that the seeding info is going to be wonky on the single 220TB torrent because nobody is downloading it intact, only in pieces.
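Here's a toy simulation of that, in case it helps. Everything in it is an assumption for illustration: the popularity curve (file k gets weight 1/(k+1)), the number of peers, and the function name `simulate_seeders`. It just counts how many peers end up holding each file when everyone picks only a handful.

```python
import random


def simulate_seeders(num_files, num_peers, picks_per_peer, seed=0):
    """Count seeders per file when each peer downloads only a few files."""
    rng = random.Random(seed)
    # Assumed popularity curve: file 0 is the most popular, the tail is obscure.
    weights = [1 / (k + 1) for k in range(num_files)]
    # The originator holds every file, so each file starts with one seeder.
    seeders = [1] * num_files
    for _ in range(num_peers):
        # A peer may draw the same file twice; the set keeps it counted once.
        chosen = set(rng.choices(range(num_files), weights=weights, k=picks_per_peer))
        for f in chosen:
            seeders[f] += 1
    return seeders


if __name__ == "__main__":
    counts = simulate_seeders(num_files=1000, num_peers=500, picks_per_peer=5)
    only_originator = sum(1 for c in counts if c == 1)
    print("best-seeded file:", max(counts), "seeders")
    print("files held only by the originator:", only_originator)
```

No peer in this model ever holds the complete torrent, so a "seeders" number for the torrent as a whole says very little about any individual file.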

It's also much easier to find a specific file if it is its own torrent vs. one of a billion files in a single mega torrent.

Just because you put it on an index in a torrent doesn't mean the file still exists somewhere. That media can still vanish...

What would actually do what you want this torrent to do (which a torrent by itself cannot) is a yottabyte-capable computer somewhere storing all those files... You'd need that to keep the whole torrent seeded.

[–] [email protected] 1 points 11 months ago (1 children)

Maybe one could tweak BitTorrent for this one mega torrent so that you have to seed/leech one of the least popular files along with every popular file?
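Something like this, as a sketch of the rule only. This is not how any existing BitTorrent client behaves; `files_to_fetch` and the seed counts are made up for illustration.

```python
def files_to_fetch(wanted, seed_counts):
    """Return the wanted files plus the least-seeded file the peer did not ask for."""
    # Only files the peer has not already chosen can be the "tax" file.
    candidates = [name for name in seed_counts if name not in wanted]
    if not candidates:
        return list(wanted)
    # Pick the lowest seed count; break ties by name so the result is stable.
    rarest = min(candidates, key=lambda name: (seed_counts[name], name))
    return list(wanted) + [rarest]


if __name__ == "__main__":
    counts = {"popular.mkv": 900, "mid.epub": 40, "obscure.pdf": 1}
    print(files_to_fetch(["popular.mkv"], counts))
```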

[–] [email protected] 2 points 11 months ago

Ok.. well now we're getting crazy :)

A much better approach to what you're talking about with that one is probably to approach the problem from the other end of the snake.

Torrents work at keeping files intact communally specifically because they're popular files, and the more popular, the more "healthy" a torrent is, because it's being transferred more often and stored in chunks in a bunch of places.

If you're trying to keep an archive of everything (and frankly, what I'm about to suggest could literally store the whole ass internet), you need to focus on the obscure crap nobody is likely to ever look for... The stuff that can't survive over torrent because it's obscure.

You can do that by sharing, similar to a torrent, but you wouldn't want a setup that encourages users to share files; you'd want a setup that encourages users to share storage.

Say you set up a hypothetical tnerrot network (made up just now, torrent backwards). As a condition of using it, you give it, say, 20GB on your hard drive to store the actual files (or more: as the internet, drives, and games get bigger, this allocation can grow too), and in exchange you can pull any file stored in the tnerrot network. Instead of Marvel movies (or whatever legal file has that kind of oomph) having a billion seeds and an obscure science report having one, everything would have 2 or 3 dedicated seeds, because every file would be seeded by whatever computers (2 or 3 separate ones, for redundancy) tnerrot stores it on.
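A minimal sketch of how the placement could work, since tnerrot doesn't exist: every name here (`Node`, `place`, the greedy "most free space first" rule) is an assumption for illustration, not a real protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One participant who donates a fixed amount of disk space."""
    name: str
    free_gb: float = 20.0  # the 20GB allocation from the example
    files: list = field(default_factory=list)


def place(files, nodes, replicas=3):
    """Store each file on `replicas` distinct nodes, filling the emptiest nodes first."""
    placement = {}
    # Place the largest files first so they are not squeezed out later.
    for fname, size_gb in sorted(files.items(), key=lambda kv: -kv[1]):
        # Nodes with enough room, emptiest first (ties keep their original order).
        candidates = sorted((n for n in nodes if n.free_gb >= size_gb),
                            key=lambda n: -n.free_gb)
        if len(candidates) < replicas:
            raise RuntimeError(f"not enough donated space for {fname}")
        chosen = candidates[:replicas]
        for node in chosen:
            node.free_gb -= size_gb
            node.files.append(fname)
        placement[fname] = [node.name for node in chosen]
    return placement


if __name__ == "__main__":
    nodes = [Node(f"n{i}") for i in range(4)]
    print(place({"a.iso": 10, "b.pdf": 5}, nodes, replicas=2))
```

The network decides where files live, so the obscure science report gets the same number of copies as the blockbuster. What this sketch ignores entirely is nodes going offline, which a real system would have to handle by re-replicating.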

You'd need a few commercial servers, because hosting a file that gets thousands of download requests a day wouldn't be friendly for random guy in Ohio or wherever, but for the vast vast majority of the files, you shouldn't have major issues.

Space sharing, not file sharing, is what you'd need to do what you're thinking. You'd need to invent the tnerrot...

[–] [email protected] 8 points 11 months ago

Not quite what you’re getting at, but the entirety of Wikipedia without images is available as a 20-30GB download: https://en.wikipedia.org/wiki/Wikipedia:Database_download

[–] [email protected] 6 points 11 months ago

You're describing leeching from something like Anna's Archive datasets.

[–] [email protected] 4 points 11 months ago

You could seed the Internet Archive's torrents.

[–] [email protected] 4 points 11 months ago

I think there’s a handful of problems with the idea. For starters (I’m just going with the first returned result because the actual numbers don’t matter as much as the magnitudes), the internet comprised around 64 zettabytes as of 2020, or 64 trillion GB. That’s going to be one hell of a zip file. In fact, pretty much the only thing capable of storing that much information is, well, the Internet itself.
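For anyone who wants to check the magnitude, here's the arithmetic in Python, using decimal (SI) units. The 20 TB drive size is my own assumption, picked only to give a sense of scale.

```python
# Decimal (SI) units in bytes.
GB = 10**9
TB = 10**12
ZB = 10**21

internet_bytes = 64 * ZB

# 64 ZB expressed in gigabytes: 64 * 10**12, i.e. 64 trillion GB.
internet_gb = internet_bytes // GB

# Hypothetical 20 TB hard drives needed to hold one copy.
drive_bytes = 20 * TB
drives_needed = internet_bytes // drive_bytes

if __name__ == "__main__":
    print(f"{internet_gb:,} GB")
    print(f"{drives_needed:,} drives of 20 TB each")
```

That works out to 3.2 billion drives for a single copy, before any redundancy.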

Second is the rate of information being produced. These estimates vary wildly, but the rate of growth is increasing exponentially. We will soon be writing more data per day to the internet than is currently there from the very beginning until now.

So maybe we don’t need every product page from every store website around the world. Maybe we don’t need the tens of millions of pages of corporate training manuals. Maybe we need curation rather than SELECT * FROM INTERNET.

That’s what things like Gutenberg and the Internet Archive do. They’re very limited in what they catch, of course. It’s also sort of what Wikipedia does, although curation here includes summarization. It’s also a feature of historical archives from existing media - like New York Times records that go back a century (or wherever they’re at now), or back issues of Nature and Science going back to when they started publication. Those are obviously doable - people are doing them - but each alone is a microscopic piece of the puzzle.

So, given that those exist, alongside the rest of the internet, what value are we creating? Storing something digitally doesn’t give it permanence, and I have an 8” floppy disk for a cash register POS created by an unknown OS to prove it.

Someday (hopefully soon) PDFs will go away and nothing will read them. Hell, the concept of “file” could go away in 50 years. There are written texts from thousands of years ago that we cannot read, and others we’ve deciphered only very recently and imperfectly. All of that archived stuff will have to be ported over, and again that’s going to mean yet more curation. At the rate information is growing you’re going to make Sisyphus look like he’s on a vacation in Tahiti.

Does that mean it’s all one big library of Alexandria? Not necessarily.

Rather than thinking of all those data as a library, think of them as an ecosystem of knowledge. Once Amazon goes out of business, no one’s going to care about that one page of theirs with the nose hair trimmer. We will still have a copy of the NYT when we landed on the moon, or when Nazi Germany was defeated. We’ll also have other information about space programs and 20th century history. We probably won’t have my mom’s recipes or all those pictures I’ve taken of my pets over the years, and my MySpace page is thankfully gone forever. I even deleted all of my Reddit content before moving on.

Maybe my scientific publications will end up archived someplace, but there we get into the tree falling in the forest problem. If no one reads them from now to the end of time, are they really there? Maybe physically, but they’ve sort of passed out of the ecosystem of human knowledge and are now part of the fossil record, if anything.

We’ve also researched how to communicate over millennia. There’s the (kind of silly but a little cool) Long Now project. We’ve also tried to invent symbology that will allow us to put warning signs outside hazardous/nuclear waste storage facilities that will continue to communicate “Danger - Do Not Enter” for tens of thousands of years.

In short, I think that the problem you’re trying to solve is impermanence or entropy, which both Buddhists and physicists will tell you aren’t things we’re going to solve.

[–] [email protected] 4 points 11 months ago

Pretty sure this stuff already exists in some form. /r/datahoarder people would probably be able to steer you in the right direction though you may need to lay out several thousand for enough HDDs to hold it.

[–] [email protected] 2 points 11 months ago

Sci-Hub + LibGen archives (torrent list) come pretty close, even though they're not quite what you're asking about (100TB? total)

[–] [email protected] 2 points 11 months ago (1 children)

"All" is impossible. You're going to miss something. And it's a lot of work. Maybe have a look at the datasets people/researchers use to train Artificial Intelligence. I think some people put in the effort to compile large datasets with just freely licensed data.

[–] [email protected] 2 points 11 months ago (1 children)

> it’s a lot of work

So, per your suggestion, using for example the zlibrary book/paper repo and OpenAI's training sets as a starting point, one could maybe avoid the brunt of the work.

[–] [email protected] 2 points 11 months ago* (last edited 11 months ago)

ZLibrary isn't something that pays attention to licensing. It's mainly copyrighted and pirated material.

I meant something like the dumps of Wikipedia and Project Gutenberg, plus whatever archive.org has available that is tagged with a favorable license.

I think there are datasets compiled from sources like those. I'm not an expert on this, but something like RedPajama, just without the random web scraping.

https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
