One of their motivations was to have a corpus large enough that research results on it would be taken seriously by search engine companies. To my mind, this raises the question of whether academics should try to innovate in web search at all, given that it's a research area so dependent on really large, expensive-to-acquire datasets. And what's the point? To slightly improve Google someday? Don't they do that pretty well themselves?
On the other hand, having a billion web pages around sounds like a lot of fun. Someone should get Amazon to add this to the AWS Public Datasets program. Then, to process the data, instead of paying to have 5 TB of data shipped to you, you pay Amazon to rent virtual machines that can access the data directly. This is cheaper only up to a point, of course.
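To make that "only up to a point" concrete, here's a minimal back-of-envelope sketch. The instance rate, cluster size, and shipping cost are all made-up placeholder numbers, not real AWS prices, but the breakeven logic works the same way whatever the actual rates are.

```python
# Back-of-envelope: renting compute next to the data vs. shipping the data once.
# All dollar figures below are hypothetical placeholders, not actual AWS prices.

INSTANCE_COST_PER_HOUR = 0.40   # assumed on-demand rate per virtual machine
INSTANCES = 20                  # assumed size of the rented cluster
HOURS_PER_EXPERIMENT = 24       # assumed time for one pass over the corpus
SHIP_DATA_ONCE = 500.0          # assumed one-time cost to ship 5 TB on drives

cost_per_experiment = INSTANCE_COST_PER_HOUR * INSTANCES * HOURS_PER_EXPERIMENT
breakeven_runs = SHIP_DATA_ONCE / cost_per_experiment

print(f"Rented cluster, per experiment: ${cost_per_experiment:.2f}")
print(f"Shipping the data once:         ${SHIP_DATA_ONCE:.2f}")
print(f"Renting is cheaper for roughly the first {breakeven_runs:.1f} experiments")
# (This ignores the cost of the local hardware you'd need after shipping,
# which pushes the breakeven point further out in favor of renting.)
```

The point is just that renting wins for occasional experiments, while a heavy, long-running research program eventually makes owning the data and the hardware cheaper.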
It always seemed to me that a problem with Amazon's public datasets program is that it wants data genuinely large enough that you need to rent lots of computing power to work on it, but very few public datasets are that large. (For example, they have Freebase up there, but I think it's a bit too small to justify that: all of Freebase fits just fine on my laptop, and I can grep through it in about five minutes flat.) But a billion web pages arguably is big enough to warrant this treatment.
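For a sense of what "run a grep on it" amounts to, here's a minimal sketch of scanning a local dump for a pattern. The file name and pattern are placeholders, and I'm assuming a line-oriented, possibly gzipped dump; because it streams line by line, memory use stays flat no matter how big the file is, which is exactly why a laptop handles it fine.

```python
import gzip
import re
import sys

def grep_dump(path, pattern):
    """Stream a line-oriented dump file and print lines matching a regex."""
    regex = re.compile(pattern)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if regex.search(line):
                sys.stdout.write(line)

if __name__ == "__main__":
    # Placeholder path and pattern; point this at whatever dump you have locally.
    grep_dump("freebase-dump.tsv.gz", r"Douglas Adams")
```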
The bigger problem with big-data research initiatives is that the organizations holding petabyte-scale data are always going to keep it private: giant corporations (Walmart's retail purchase records, the Facebook friend graph, Google's search query logs) or else governments, of course. Maybe biology and computational genetics are the big exception to this tendency. At least the public data situation for web research just got a lot better.