Comments on: 1 billion web page dataset from CMU

By: brendano

brendano — Mon, 01 Aug 2011 17:33:42 +0000

eelnazz: go to the link I posted in the blog post.

By: eelnazz

eelnazz — Tue, 26 Jul 2011 04:06:49 +0000

I need a dataset in which each record is a web page. How can I get it and use it?

By: Pete Skomoroch

Pete Skomoroch — Sat, 18 Apr 2009 01:17:02 +0000

Totally agree with you on this one. Outside of bioinformatics and maybe physics, it is tough to get large datasets that can be publicly redistributed and used for commercial purposes.

Compared to the datasets you get access to in industry, most public datasets are tiny. It is like pulling teeth to get any large research, government, or industry datasets released so that they are publicly redistributable.

Some of the concerns I’ve run into:

* copyright (web crawls, image datasets, audio, product metadata)
* privacy concerns (collaborative filtering, government data, geo data, legal/PACER)
* competitive reasons (financial data, retail data, commerce, website stats)

I think we need a data stimulus plan :) Like the Enron email dataset on a larger scale, we should have open access to factual data from the financial markets, geodata, bankrupt companies, and other locked-down data that the government has ties to. Pumping that government-subsidized data out on the net could spur a lot of innovation, which is something we need right now.

By: brendano

brendano — Fri, 17 Apr 2009 17:51:35 +0000

Jon: yes, it’s great.

By: Jon

Jon — Fri, 17 Apr 2009 16:02:50 +0000

The dataset is costly, but 80% of that is the 1.5 TB x 4 disks that come with it (this is judging by what Best Buy currently has on sale). The remaining $150 or so strikes me as fairly inexpensive for a crawl that took an enormous amount of bandwidth, computing power, planning & maintenance. Jamie’s group is currently working on providing the link-graph with the dataset, another huge chunk of processing that researchers using the dataset won’t need to perform.

By: roddy

roddy — Fri, 17 Apr 2009 07:17:27 +0000

hadoop dfs -put