Common Crawl

Posted by Marcus Zillman

Common Crawl
http://commoncrawl.org/

The Common Crawl Foundation is a California 501(c)(3) registered non-profit founded by Gil Elbaz. Its goal is to democratize access to web information by producing and maintaining an open repository of web crawl data that anyone can access and analyze. Their vision is of a truly open web that allows open access to information and enables greater innovation in research, business and education. They level the playing field by making wholesale extraction, transformation and analysis of web data cheap and easy. The Common Crawl corpus contains petabytes of data collected over the last 7 years, including raw web page data, extracted metadata and text extractions. The dataset lives on Amazon S3 as part of the Amazon Public Datasets program, and the files can be downloaded entirely free using HTTP or S3. As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves. Here are years of free web page data to help you change the world! This will be added to Bot Research Subject Tracer™, World Wide Web Reference Subject Tracer™, Business Intelligence Resources Subject Tracer™, Entrepreneurial Resources Subject Tracer™, the tools section of Research Resources Subject Tracer™, and the Web Data Extractors white paper.
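For readers who want to try the HTTP or S3 access mentioned above, here is a minimal Python sketch. It rests on assumptions not stated in this post: that the public HTTPS endpoint is `https://data.commoncrawl.org/`, that the S3 bucket is named `commoncrawl`, and that each crawl publishes a gzipped `warc.paths.gz` listing of its files. The helper names are purely illustrative, so check the Common Crawl site for the current layout before relying on any of it.

```python
import gzip
import urllib.request

# Assumptions (not stated in the post): public HTTPS endpoint and S3 bucket name.
HTTP_BASE = "https://data.commoncrawl.org/"
S3_BUCKET = "commoncrawl"


def http_url(key: str) -> str:
    """Build the HTTPS download URL for an object key in the corpus."""
    return HTTP_BASE + key.lstrip("/")


def s3_uri(key: str) -> str:
    """Build the equivalent S3 URI for the same object key."""
    return f"s3://{S3_BUCKET}/{key.lstrip('/')}"


def parse_paths_listing(gz_bytes: bytes) -> list[str]:
    """Decompress a gzipped listing (one object key per line) into a list of keys."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_paths_listing(crawl_id: str, timeout: float = 30.0) -> list[str]:
    """Download and parse the list of WARC file keys for one crawl.

    Assumes the listing lives at crawl-data/<crawl_id>/warc.paths.gz.
    """
    key = f"crawl-data/{crawl_id}/warc.paths.gz"
    with urllib.request.urlopen(http_url(key), timeout=timeout) as resp:
        return parse_paths_listing(resp.read())
```

A call such as `fetch_paths_listing("CC-MAIN-2023-50")` (the crawl identifier here is only an example) would return object keys that can be passed to `http_url` or `s3_uri`. Individual files are large, so it is worth downloading one at a time.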
