Do you know of any large datasets for experimenting with Hadoop that are free/low cost? Any related pointers/links are appreciated.

Preferences:

  • At least 1 GB of data.
  • Production log data from a web server.

A few that I have found so far:

  1. Wikipedia dump

  2. http://wiki.freebase.com/wiki/Data_dumps

  3. http://aws.amazon.com/publicdatasets/

Also, can we run our own crawler to gather data from sites such as Wikipedia? Any pointers on how to do this are appreciated as well.

1 Answer


First, regarding running your own crawler and the Wikipedia dump you have already found, I would like to add a point:

Since you linked to the Wikipedia data dumps, you can use the Bespin project to work with this data in Hadoop, which also saves you from having to crawl the live site yourself.
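
To give a concrete idea of what working with the dump in Hadoop can look like, here is a minimal Hadoop Streaming sketch in Python that simply counts the pages in an extracted pages-articles dump. The file names (mapper.py, reducer.py, enwiki-latest-pages-articles.xml), the HDFS paths, and the streaming jar location are my own assumptions for illustration; they are not part of the Bespin project.

```python
#!/usr/bin/env python
# mapper.py -- emit a count of 1 for every <title> element in the
# Wikipedia pages-articles XML dump (one wiki page per <title> line).
import sys

for line in sys.stdin:
    line = line.strip()
    if line.startswith("<title>") and line.endswith("</title>"):
        print("pages\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum the per-page counts emitted by mapper.py.
import sys

total = 0
for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 2:
        total += int(parts[1])
print("pages\t%d" % total)

# Example invocation (the streaming jar path depends on your install):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -input /data/enwiki-latest-pages-articles.xml \
#   -output /data/wiki-page-count \
#   -mapper mapper.py -reducer reducer.py \
#   -file mapper.py -file reducer.py
```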

Now, I would suggest this pool of large datasets to experiment with Hadoop; you can choose whichever category of dataset suits you:

http://www.open-bigdata.com/category/big-data-datasets-experiment/

...