0 votes
3 views
in Big Data Hadoop & Spark by (11.4k points)

Do you know any large datasets to experiment with Hadoop which is free/low cost? Any pointers/links related is appreciated.

Preferences:

  • At least 1 GB of data.
  • Web server production log data.

A few that I have found so far:

  1. Wikipedia dump

  2. http://wiki.freebase.com/wiki/Data_dumps

  3. http://aws.amazon.com/publicdatasets/

Also, can we run our own crawler to gather data from sites such as Wikipedia? Any pointers on how to do this are appreciated as well.
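
To make the crawler idea concrete, below is a rough sketch of what I have in mind, using only the Python standard library. The seed page, page limit, and output file are placeholders for illustration, and the crawl is deliberately throttled; for Wikipedia specifically the official dumps are probably the better source, so treat this only as a starting point.

# Minimal polite crawler sketch (illustrative only): fetches pages breadth-first,
# respects robots.txt, and appends raw HTML to a local file for later upload to HDFS.
import time
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

SEED = "https://en.wikipedia.org/wiki/Apache_Hadoop"  # placeholder seed page
MAX_PAGES = 50                                        # small limit for a test run
USER_AGENT = "hadoop-dataset-test-crawler/0.1"        # identify the crawler politely

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed(url, _cache={}):
    """Check robots.txt for the URL's host, caching one parser per host."""
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = _cache.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
        try:
            rp.read()
        except OSError:
            return False
        _cache[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def crawl(seed, max_pages, out_path="pages.txt"):
    seen, queue, fetched = {seed}, deque([seed]), 0
    with open(out_path, "w", encoding="utf-8") as out:
        while queue and fetched < max_pages:
            url = queue.popleft()
            if not allowed(url):
                continue
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            try:
                html = urllib.request.urlopen(req, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue
            out.write(html + "\n")
            fetched += 1
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                link = urljoin(url, href)
                if link.startswith("https://en.wikipedia.org/wiki/") and link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(1)  # throttle requests so the crawl stays polite

if __name__ == "__main__":
    crawl(SEED, MAX_PAGES)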

1 Answer

0 votes
by (32.3k points)

First of all, regarding running your own crawler to gather data and the Wikipedia dump that you have already found, I would like to add a point:

Since you have already linked to the Wikipedia data dumps, you can use the Bespin project to work with this data in Hadoop.
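
If you end up processing the dump text yourself rather than through Bespin, Hadoop Streaming with two small Python scripts is a common low-effort approach. Below is a minimal word-count sketch, assuming the dump has already been extracted to plain-text files on HDFS; the HDFS paths and the streaming jar location are placeholders for your own installation.

#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: emits "word<TAB>1" for every token read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word.lower() + "\t1")

#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: sums the counts for each word.
# Streaming sorts mapper output by key, so all lines for one word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A typical invocation looks like the following (the exact path of the streaming jar depends on your Hadoop distribution, and the input/output paths are placeholders):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /data/wikipedia-text \
    -output /data/wikipedia-wordcount \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py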

Now, I would suggest this pool of large datasets to experiment with Hadoop; you can choose whichever category of dataset suits your needs:

http://www.open-bigdata.com/category/big-data-datasets-experiment/

If you want to know more about Hadoop, you can also look up a video tutorial on the topic.
