Do you know of any large datasets, free or low cost, for experimenting with Hadoop? Any related pointers or links are appreciated.


  • At least one GB of data.
  • Production log data from a web server.

A few that I have found so far:

  1. Wikipedia dump



Also, can we run our own crawler to gather data from sites, e.g. Wikipedia? Any pointers on how to do this are appreciated as well.

1 Answer

First, regarding running your own crawler and the Wikipedia dump dataset you have found, I would like to add a point:

Since you have already linked to the Wikipedia data dumps, you can use the Bespin project to work with this data in Hadoop.
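If you would rather look at the dump yourself before picking a tool, here is a minimal sketch (not Bespin's API, just the Python standard library) that streams pages out of a MediaWiki-style XML export and prints tab-separated lines, which is the key/value format Hadoop Streaming reads by default. The sample XML, its namespace version, and the function names `local_name` and `iter_pages` are my own assumptions for illustration; a real dump is far larger and would be read from a file instead of an in-memory string.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element, one page at a time."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory


# Tiny hand-written sample standing in for a real dump file (assumption).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Hadoop</title><revision><text>Apache Hadoop is a framework.</text></revision></page>
  <page><title>MapReduce</title><revision><text>A programming model.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    # One "key<TAB>value" line per page: title and a rough word count.
    print(f"{title}\t{len(text.split())}")
```

With a real dump you would pass an opened file (or a decompressing stream) to `iter_pages` instead of `io.BytesIO(SAMPLE)`; parsing incrementally avoids loading the whole file into memory.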

Now, I would suggest this pool of large datasets for experimenting with Hadoop; you can choose a dataset from whichever category suits you:

