<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: 1 billion web page dataset from CMU</title>
	<atom:link href="https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/feed/" rel="self" type="application/rss+xml" />
	<link>https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/</link>
	<description>cognition, language, social systems; statistics, visualization, computation</description>
	<lastBuildDate>Tue, 25 Nov 2025 13:11:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: brendano</title>
		<link>https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/#comment-72450</link>
		<dc:creator>brendano</dc:creator>
		<pubDate>Mon, 01 Aug 2011 17:33:42 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=482#comment-72450</guid>
		<description><![CDATA[eelnazz: go to the link I posted in the blog post.]]></description>
		<content:encoded><![CDATA[<p>eelnazz: go to the link I posted in the blog post.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: eelnazz</title>
		<link>https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/#comment-71891</link>
		<dc:creator>eelnazz</dc:creator>
		<pubDate>Tue, 26 Jul 2011 04:06:49 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=482#comment-71891</guid>
		<description><![CDATA[I need a dataset in which each record is a web page. How can I get it and use it?]]></description>
		<content:encoded><![CDATA[<p>I need a dataset in which each record is a web page. How can I get it and use it?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pete Skomoroch</title>
		<link>https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/#comment-5224</link>
		<dc:creator>Pete Skomoroch</dc:creator>
		<pubDate>Sat, 18 Apr 2009 01:17:02 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=482#comment-5224</guid>
		<description><![CDATA[Totally agree with you on this one.  Outside of bioinformatics and maybe physics, it is tough to get large datasets that can be publicly redistributed and used for commercial purposes.  

Compared to the datasets you get access to in industry, most public datasets are tiny.  It is like pulling teeth to get any large research, government, or industry datasets released so that they are publicly redistributable. 

Some of the concerns I&#039;ve run into:

* copyright (web crawls, image datasets, audio, product metadata)
* privacy concerns (collaborative filtering, government data, geo data, legal/pacer)
* competitive reasons (financial data, retail data, commerce, website stats)

I think we need a data stimulus plan :)   Like the Enron email dataset on a larger scale, we should have open access to factual data from the financial markets, geodata, bankrupt companies, and other locked-down data that the government has ties to.  Pumping that government-subsidized data out on the net could spur a lot of innovation, which is something we need right now.]]></description>
		<content:encoded><![CDATA[<p>Totally agree with you on this one.  Outside of bioinformatics and maybe physics, it is tough to get large datasets that can be publicly redistributed and used for commercial purposes.  </p>
<p>Compared to the datasets you get access to in industry, most public datasets are tiny.  It is like pulling teeth to get any large research, government, or industry datasets released so that they are publicly redistributable. </p>
<p>Some of the concerns I&#8217;ve run into:</p>
<p>* copyright (web crawls, image datasets, audio, product metadata)<br />
* privacy concerns (collaborative filtering, government data, geo data, legal/pacer)<br />
* competitive reasons (financial data, retail data, commerce, website stats)</p>
<p>I think we need a data stimulus plan :)   Like the Enron email dataset on a larger scale, we should have open access to factual data from the financial markets, geodata, bankrupt companies, and other locked-down data that the government has ties to.  Pumping that government-subsidized data out on the net could spur a lot of innovation, which is something we need right now.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brendano</title>
		<link>https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/#comment-5213</link>
		<dc:creator>brendano</dc:creator>
		<pubDate>Fri, 17 Apr 2009 17:51:35 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=482#comment-5213</guid>
		<description><![CDATA[Jon: yes, it&#039;s great.]]></description>
		<content:encoded><![CDATA[<p>Jon: yes, it&#8217;s great.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/#comment-5210</link>
		<dc:creator>Jon</dc:creator>
		<pubDate>Fri, 17 Apr 2009 16:02:50 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=482#comment-5210</guid>
		<description><![CDATA[The dataset is costly, but 80% of that is the four 1.5 TB disks that come with it (judging by what Best Buy currently has on sale).  The remaining $150 or so strikes me as fairly inexpensive for a crawl that took an enormous amount of bandwidth, computing power, planning &amp; maintenance.  Jamie&#039;s group is currently working on providing the link graph with the dataset -- another huge chunk of processing that researchers using the dataset won&#039;t need to perform.]]></description>
		<content:encoded><![CDATA[<p>The dataset is costly, but 80% of that is the four 1.5 TB disks that come with it (judging by what Best Buy currently has on sale).  The remaining $150 or so strikes me as fairly inexpensive for a crawl that took an enormous amount of bandwidth, computing power, planning &amp; maintenance.  Jamie&#8217;s group is currently working on providing the link graph with the dataset &#8212; another huge chunk of processing that researchers using the dataset won&#8217;t need to perform.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: roddy</title>
		<link>https://brenocon.com/blog/2009/04/1-billion-web-page-dataset-from-cmu/#comment-5191</link>
		<dc:creator>roddy</dc:creator>
		<pubDate>Fri, 17 Apr 2009 07:17:27 +0000</pubDate>
		<guid isPermaLink="false">http://anyall.org/blog/?p=482#comment-5191</guid>
		<description><![CDATA[hadoop dfs -put]]></description>
		<content:encoded><![CDATA[<p>hadoop dfs -put</p>
]]></content:encoded>
	</item>
</channel>
</rss>

