This is fun: Jamie Callan's group at CMU LTI just finished a crawl of 1 billion web pages. It's 5 terabytes compressed, big enough that they have to send it to you by mailing hard drives.
One of their motivations was to have a corpus large enough that research results on it would be taken seriously by search engine companies. To my mind, this raises the question of whether academics should even try to innovate in web search, a research area incredibly dependent on very large, expensive-to-acquire datasets. And what's the point? To slightly improve Google someday? Don't they do that pretty well themselves?
On the other hand, having a billion web pages around sounds like a lot of fun. Someone should get Amazon to add this to the AWS Public Datasets program. Then, instead of paying to have 5 TB of data shipped to you, you pay Amazon to rent virtual machines that can access the data in place. That's cheaper only up to a point, of course.
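Roughly, with made-up round numbers (placeholder figures, not real AWS or CMU prices), the tradeoff looks like this:

    # Back-of-envelope: mailing hard drives vs. renting machines in the cloud.
    # Every number here is a placeholder assumption, not a real AWS or CMU price.
    SHIP_COST = 750.0            # assumed one-time cost of the mailed drives
    RATE_PER_MACHINE_HR = 0.40   # assumed hourly rate for one rented machine

    def cloud_cost(n_machines, hours):
        """Total rental cost for n_machines running for the given number of hours."""
        return n_machines * hours * RATE_PER_MACHINE_HR

    print(cloud_cost(10, 24))                      # 96.0: one day's pass over the data
    print(SHIP_COST / (10 * RATE_PER_MACHINE_HR))  # 187.5: hours of 10-machine time
                                                   # you get for the price of the drives

Past that crossover, i.e. once you're rerunning experiments over the whole crawl for weeks, owning the disks wins.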
It always seemed to me that a problem with Amazon's public datasets program is that it wants data that's genuinely large enough that you need to rent lots of computing power to work on it, but there are very few public datasets that big. (For example, they have Freebase up there, but I think it's slightly too small to justify that; I can fit all of Freebase just fine on my laptop and run a grep over it in about five minutes flat.) But 1 billion web pages is arguably more appropriate for this treatment.
The bigger problem with big-data research initiatives is that the organizations with petabyte-scale data are always going to keep it private: giant corporations (Walmart's retail purchase records, the Facebook friend graph, Google's search query logs) and, of course, governments. Maybe biology and computational genetics are the big exception to this tendency. At least the public data situation for web research just got a lot better.
hadoop dfs -put
The dataset is costly, but about 80% of that is the four 1.5 TB disks that come with it (judging by what Best Buy currently has on sale). The remaining $150 or so strikes me as fairly inexpensive for a crawl that took an enormous amount of bandwidth, computing power, planning, and maintenance. Jamie's group is currently working on providing the link graph with the dataset as well, another huge chunk of processing that researchers using the dataset won't need to perform.
Jon: yes, it’s great.
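Spelling out the back-of-envelope (the per-drive price is an assumption based on Jon's Best Buy remark):

    # Rough cost breakdown per Jon's comment; the per-drive price is assumed.
    DRIVE_PRICE = 150.0    # assumed retail price of one 1.5 TB disk
    N_DRIVES = 4
    REMAINDER = 150.0      # "the remaining $150 or so"

    drives = DRIVE_PRICE * N_DRIVES    # 600.0
    total = drives + REMAINDER         # 750.0
    print(drives / total)              # 0.8, i.e. the ~80% Jon mentions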
Totally agree with you on this one. Outside of bioinformatics and maybe physics, it is tough to get large datasets that can be publicly redistributed and used for commercial purposes.
Compared to the datasets you get access to in industry, most public datasets are tiny. It is like pulling teeth to get any large research, government, or industry datasets released so that they are publicly redistributable.
Some of the concerns I’ve run into:
* copyright (web crawls, image datasets, audio, product metadata)
* privacy concerns (collaborative filtering, government data, geo data, legal/pacer)
* competitive reasons (financial data, retail data, commerce, website stats)
I think we need a data stimulus plan :) Like the Enron email dataset on a larger scale: we should have open access to factual data from the financial markets, geodata, bankrupt companies, and other locked-down data that the government has ties to. Pumping that government-subsidized data out on the net could spur a lot of innovation, which is something we need right now.
I need a dataset in which each record is a web page. How can I get it and use it?
eelnazz: go to the link I posted in the blog post.
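For anyone with the same question: the crawl is distributed as web archive files, with one response record per fetched page. A minimal sketch of iterating over one file, assuming the files are standard gzipped WARC and using the third-party warcio Python library (the filename is a placeholder):

    # Iterate over the page records in one crawl file.
    # Assumes a standard gzipped WARC file and the warcio library (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    with open('crawl-part-00.warc.gz', 'rb') as stream:   # placeholder filename
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':             # one fetched web page
                url = record.rec_headers.get_header('WARC-Target-URI')
                html = record.content_stream().read()     # raw page bytes
                print(url, len(html))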