
Do you know of any large datasets, free or low cost, that I can use to experiment with Hadoop? Any related pointers or links are appreciated.


  • At least one GB of data.
  • Production log data from a web server.

A few that I have found so far:

  1. Wikipedia dump

  2. http://wiki.freebase.com/wiki/Data_dumps

  3. http://aws.amazon.com/publicdatasets/

Also, can we run our own crawler to gather data from sites, e.g. Wikipedia? Any pointers on how to do this are appreciated as well.

1 Answer


First, regarding running your own crawler and the Wikipedia dump that you found, I would like to add one point:

Since you have already linked to the Wikipedia data dumps, you can use the Bespin project to work with this data in Hadoop.
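If you only want to look at the dump before setting anything up in Hadoop, a minimal Python sketch like the one below may help. It is not how Bespin does it; it is just a plain streaming reader. It assumes the dump is a bz2-compressed XML file in the MediaWiki export format, and the tab-separated output (title, text length) is an arbitrary choice made here so that each page becomes one line that a Hadoop Streaming job could consume. The file path is whatever dump file you downloaded.

```python
import bz2
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element in a MediaWiki XML export."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


def main(path):
    # bz2.open reads the compressed file as a stream, so the dump
    # does not have to be decompressed on disk first.
    with bz2.open(path, "rb") as dump:
        for title, text in iter_pages(dump):
            # One tab-separated record per page (output format is an assumption).
            print(f"{title}\t{len(text)}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

You would run it as a script with the path to the downloaded dump file as the only argument, then copy the resulting text into HDFS.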

Beyond that, I would suggest this pool of large datasets for experimenting with Hadoop; you can pick whichever category of dataset suits you:

