| Works with | Hadoop & MapReduce | Apache Spark |
Mahout supports four main data science use cases:
The Mahout project was started by several people involved in the Apache Lucene (open source search) community with an active interest in machine learning and a desire for robust, well-documented, scalable implementations of common machine-learning algorithms for clustering and categorization. The community was initially driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore” (see Resources) but has since evolved to cover much broader machine-learning approaches. Mahout also aims to:
Although relatively young in open source terms, Mahout already has a large amount of functionality, especially in relation to clustering and collaborative filtering (CF). Mahout’s primary features are:
Below is a current list of machine learning algorithms exposed by Mahout.
The next major version, Mahout 1.0, will contain major changes to the underlying architecture of Mahout, including:
At the same time, Hadoop MapReduce is a much more mature framework than Spark, so if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.
Getting Mahout to scale effectively isn’t as straightforward as simply adding more nodes to a Hadoop cluster. Factors such as algorithm choice, number of nodes, feature selection, and sparseness of data, as well as the usual suspects of memory, bandwidth, and processor speed, all play a role in determining how effectively Mahout can scale. To motivate the discussion, I’ll work through an example of running some of Mahout’s algorithms on a publicly available data set of mail archives from the Apache Software Foundation (ASF) using Amazon’s EC2 computing infrastructure and Hadoop, where appropriate.
Each of the subsections after the Setup takes a look at some of the key issues in scaling out Mahout and explores the syntax of running the example on EC2.
Setup
The setup for the examples involves two parts: a local setup and an EC2 (cloud) setup. To run the examples, you need:
To get set up locally, run the following on the command line:
This should get all the code you need compiled and properly installed. Separately, download the sample data, save it in the scaling_mahout/data/sample directory, and unpack it (tar -xf scaling_mahout.tar.gz). For testing purposes, this is a small subset of the data you’ll use on EC2.
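As a rough sketch of the unpacking step, the helper below extracts the archive into the sample directory named above. The function name unpack_sample is hypothetical (it is not part of Mahout), and the sketch assumes the archive has already been downloaded.

```shell
# unpack_sample: hypothetical helper, not part of Mahout.
# Extracts a tar archive into the sample-data directory.
unpack_sample() {
  local archive="$1"                              # path to the downloaded archive
  local dest="${2:-scaling_mahout/data/sample}"   # target directory named in the text
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  mkdir -p "$dest"                # create the directory if it is missing
  tar -xf "$archive" -C "$dest"   # unpack inside the target directory
}

# Usage, assuming the archive was saved in the sample directory:
#   unpack_sample scaling_mahout/data/sample/scaling_mahout.tar.gz
```

Checking for the archive first gives a clear message when the download step was skipped, rather than a less obvious tar error.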
To get set up on Amazon, you need an Amazon Web Services (AWS) account (noting your secret key, access key, and account ID) and a basic understanding of how Amazon’s EC2 and Elastic Block Store (EBS) services work. Follow the documentation on the Amazon website to obtain the necessary access.
With the prerequisites out of the way, it’s time to launch a cluster. It is probably best to start with a single node and then add nodes as necessary. And do note, of course, that running on EC2 costs money. Therefore, make sure you shut down your nodes when you are done running.
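To make the cost point concrete, here is a small back-of-the-envelope sketch. The helper estimate_cost is hypothetical, and the hourly rate of 0.50 is a placeholder, not an actual AWS price; substitute the current rate for your instance type.

```shell
# estimate_cost: hypothetical helper for a rough EC2 bill estimate.
# The rate you pass in is an assumption; check current AWS pricing.
estimate_cost() {
  local instances="$1"   # total EC2 instances in the cluster
  local hours="$2"       # hours the cluster stays up
  local rate="$3"        # assumed price per instance-hour, in USD
  awk -v n="$instances" -v h="$hours" -v r="$rate" \
    'BEGIN { printf "%.2f\n", n * h * r }'
}

estimate_cost 3 4 0.50    # 3 instances for 4 hours at the placeholder rate
estimate_cost 11 4 0.50   # the same run on 11 instances

# When you are done, terminate the cluster from the hadoop-ec2 bin directory:
#   ./hadoop-ec2 terminate-cluster mahout-clustering
```

With the placeholder rate, the two calls print 6.00 and 22.00. The cost grows linearly with the number of instances, which is why starting small and terminating promptly matters.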
To bootstrap a cluster for use with the examples in the article, follow these steps:
1. Download Hadoop 0.20.203.0 from an ASF mirror and unpack it locally.
2. cd hadoop-0.20.203.0/src/contrib/ec2/bin
3. Open hadoop-ec2-env.sh in an editor and:
4. Open hadoop-ec2-init-remote.sh in an editor and:
Note: If you want to run classification, you need to use a larger instance and more memory. I used double X-Large instances and 12GB of heap.
5. Launch your cluster:
./hadoop-ec2 launch-cluster mahout-clustering X
X is the number of nodes you wish to launch (for example, 2 or 10). I suggest starting with a small value and then adding nodes as your comfort level grows. This will help control your costs.
6. Create an EBS volume for the ASF Public Data Set (Snapshot: snap-17f7f476) and attach it to your master node instance (this is the instance in the mahout-clustering-master security group) on /dev/sdh. (See Resources for links to detailed instructions in the EC2 online documentation.)
a. If using the EC2 command line APIs (see Resources), you can do:
b. Otherwise, you can do this via the AWS web console.
7. Upload the setup-asf-ec2.sh script (see Download) to the master instance:
./hadoop-ec2 push mahout-clustering $PATH/setup-asf-ec2.sh
8. Log in to your cluster:
./hadoop-ec2 login mahout-clustering
9. Execute the shell script to update your system, install Git and Mahout, and clean up some of the archives to make it easier to run:
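Steps 5 through 8 above can be reviewed together as a dry run before anything is sent to AWS. The sketch below only prints the commands. It assumes the legacy EC2 API tools (ec2-create-volume and ec2-attach-volume) for step 6; verify the flags against the tools' own help output. ZONE, VOLUME_ID, INSTANCE_ID, and SCRIPT_DIR are placeholders you must supply. SCRIPT_DIR stands in for the directory holding setup-asf-ec2.sh, to avoid clashing with the shell's own PATH variable.

```shell
# Dry-run sketch of steps 5 through 8: run() prints each command
# instead of executing it, so nothing here touches AWS.
CLUSTER="mahout-clustering"
NODES="${NODES:-2}"                                     # X in step 5
ZONE="${ZONE:-YOUR_ZONE}"                               # placeholder availability zone
VOLUME_ID="${VOLUME_ID:-YOUR_VOLUME_ID}"                # placeholder: ID of the new volume
INSTANCE_ID="${INSTANCE_ID:-YOUR_MASTER_INSTANCE_ID}"   # placeholder: master instance ID
SCRIPT_DIR="${SCRIPT_DIR:-/path/to}"                    # placeholder: where the script lives

# Print a command line rather than running it.
run() { echo "+ $*"; }

bootstrap_plan() {
  # Step 5: launch the cluster.
  run ./hadoop-ec2 launch-cluster "$CLUSTER" "$NODES"
  # Step 6: create a volume from the ASF snapshot, then attach it to the master.
  run ec2-create-volume --snapshot snap-17f7f476 -z "$ZONE"
  run ec2-attach-volume "$VOLUME_ID" -i "$INSTANCE_ID" -d /dev/sdh
  # Step 7: upload the setup script to the master.
  run ./hadoop-ec2 push "$CLUSTER" "$SCRIPT_DIR/setup-asf-ec2.sh"
  # Step 8: log in to the master.
  run ./hadoop-ec2 login "$CLUSTER"
}

bootstrap_plan
```

Once the printed commands look right, you can run them one at a time from the hadoop-ec2 bin directory, filling in the volume ID that the create step reports.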
With the setup details out of the way, the next step is to see what it means to put some of Mahout’s more popular algorithms into production and scale them up. I’ll focus primarily on the actual tasks of scaling up, but along the way I’ll cover some questions about feature selection and why I made certain choices.