Thanks for the link to Common Crawl; I didn't know about that project but it looks interesting.
That's also an interesting point about heavily curated datasets. Could something like that overcome some of the bias in current models? For example, if you were training a facial recognition model, you could use a curated, open-source dataset with representative samples of all races and genders to try to reduce racial bias. Anyone training a facial recognition model, for any purpose, would then have a training set that can be peer reviewed for accuracy.
“It is not humanitarian at all because it only serves one segment of the population there. The hostages there do not receive any humanitarian aid.”
They mean the civilians in Gaza held hostage by the Israeli military, right?