Looks like it may be a combo of subs going dark then coming back plus the 1000 indexing limit as mentioned in https://kbin.social/m/RedditMigration/t/47320/PSA-If-you-have-more-than-1000-posts-more-than
I like the solution someone gave: instead of deleting the comments, we overwrite everything with garbage, because garbage data is worse than no data.
Here's the link: https://lemmy.world/comment/286942
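For anyone wondering what "overwrite with garbage" could look like, here's a rough sketch of just the text-generation part. It doesn't talk to reddit at all; `garbage_like` is a made-up helper name, and how you actually submit the edit is left out on purpose since that depends on whatever tool you use.

```python
import random
import string


def garbage_like(text, seed=None):
    """Return random lowercase letters shaped like `text`.

    Whitespace is kept in place so the result has the same length and
    word boundaries as the original, but none of the original content.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isspace():
            out.append(ch)  # keep spaces and newlines where they were
        else:
            out.append(rng.choice(string.ascii_lowercase))
    return "".join(out)


original = "I like the solution someone gave."
garbage = garbage_like(original, seed=1)
print(garbage)
```

Keeping the length and word shape is only one option; any replacement text would do, as long as the original wording is gone.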
Remember though most automated solutions can't overcome the 1000 index limit, even when overwriting. Even doing it manually may not do the job.
Is this a daily limit, or a request limit? Because we could just split up the changes if necessary.
Neither. It's an indexing limit. Basically you can only see the 1000 most recent posts, the 1000 most upvoted posts, and the 1000 most downvoted posts in a sub at best. (But there may also be overlap, so the total number of unique posts through all three methods can be less than 3000.) So you could do part of the job on different days, have others help you by splitting the requests up, etc. None of it would help bypass that limit. It's like a limit on what you can see in the table of contents, except the book also has no page numbers and you can't get to a specific page unless you either find it in the TOC or have memorized its 19 digit access number.
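If it helps, here's a toy simulation of that limit. It's not reddit's actual code and it uses made-up data: the post fields (`id`, `created`, `score`) and the helper names are assumptions for illustration only. It just shows that three listings capped at 1000 each can never expose more than 3000 posts, and usually fewer because of overlap.

```python
import random

LISTING_CAP = 1000  # the per-listing limit described above


def make_posts(n, seed=0):
    """Build n fake posts with an id, a creation order, and a random score."""
    rng = random.Random(seed)
    return [
        {"id": i, "created": i, "score": rng.randint(-50, 500)}
        for i in range(n)
    ]


def visible_posts(posts, cap=LISTING_CAP):
    """Return the set of post ids reachable through the three capped listings."""
    newest = sorted(posts, key=lambda p: p["created"], reverse=True)[:cap]
    top = sorted(posts, key=lambda p: p["score"], reverse=True)[:cap]
    bottom = sorted(posts, key=lambda p: p["score"])[:cap]
    # A set removes posts that show up in more than one listing.
    return {p["id"] for p in newest + top + bottom}


posts = make_posts(10_000)
reachable = visible_posts(posts)
print(len(reachable), "of", len(posts), "posts reachable")
```

Running the listings on different days or from different accounts doesn't change the result, because the cap applies to what each listing returns, not to how often you ask.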
I wrote about how to overcome it (see https://kbin.social/m/RedditMigration/t/65260/PSA-Here-s-exactly-what-to-do-if-you-hit-the ) but this only works for past comments and posts. Now that pushshift has been shut down, we won't have access to such data going forward.
Editing comments doesn't devalue the data for reddit, since they still have all the original data. It's only a problem for people trying to scrape the data from the public UI, who are the exact people reddit wants to charge big bucks for API access, so idk if this is hurting them in the way people seem to think.
The word is that reddit deletes are soft-deletes and a copy is still saved in reddit's database (but not publicly accessible to anyone outside of reddit). However, the same word is that overwriting does in fact destroy reddit's copy of the original data.
As for people who want to get my posts and comments - this is exactly why I saved a copy of everything before overwriting. They can still find it, they will just have to scrape lemmy/kbin - or use the lemmy or kbin API - to get at it.
Also, the internet archive (who is registered as an actual library iirc) has a copy of the pushshift torrents covering all reddit posts and comments from 2005 to March 2023. So the librarians and historians who want to research this stuff will be fine.