Top Answers to Mahout Interview Questions
|Works with||Hadoop & MapReduce||Apache Spark|
Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes.
Once big data is stored on the Hadoop Distributed File System (HDFS), Mahout provides the data science tools to automatically find meaningful patterns in those big data sets. The Apache Mahout project aims to make it faster and easier to turn big data into big information.
Mahout supports four main data science use cases:
- Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations)
- Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other
- Classification – learns from existing categorizations and then assigns unclassified items to the best category
- Frequent item-set mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
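To make the last use case concrete, here is a minimal sketch of frequent item-set mining restricted to pairs. It is plain Python, not Mahout's API, and it runs on a single machine, whereas Mahout's implementation is distributed. The function name `frequent_pairs`, the `min_support` parameter, and the sample baskets are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair is counted once per basket
        # and always appears in the same (a, b) order.
        items = sorted(set(basket))
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping carts, not real data.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
]
result = frequent_pairs(baskets, min_support=2)
print(result)
```

With these sample baskets, only the pairs (bread, milk) and (eggs, milk) reach the support threshold of 2. The same co-occurrence counting idea also underlies simple item-based recommendations.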
The Mahout project was started by several people involved in the Apache Lucene (open source search) community with an active interest in machine learning and a desire for robust, well-documented, scalable implementations of common machine-learning algorithms for clustering and categorization. The community was initially driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore” (see Resources) but has since evolved to cover much broader machine-learning approaches. Mahout also aims to:
- Build and support a community of users and contributors such that the code outlives any particular contributor’s involvement or any particular company or university’s funding.
- Focus on real-world, practical use cases as opposed to bleeding-edge research or unproven techniques.
- Provide quality documentation and examples.
Although relatively young in open source terms, Mahout already has a large amount of functionality, especially in relation to clustering and collaborative filtering (CF). Mahout’s primary features are:
- Taste CF. Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
- Several MapReduce-enabled clustering implementations, including k-Means, fuzzy k-Means, Canopy, Dirichlet, and Mean-Shift.
- Distributed Naive Bayes and Complementary Naive Bayes classification implementations.
- Distributed fitness function capabilities for evolutionary programming.
- Matrix and vector libraries.
- Examples of all of the above algorithms.
Unless you are highly proficient in Java, the coding itself is a big overhead. There’s no way around it: if you don’t already know Java, you are going to need to learn it, and it’s not a language that flows. For R users who are used to seeing their thoughts realized immediately, the endless declaration and initialization of objects is going to seem like a drag. For that reason, I would recommend sticking with R for any kind of data exploration or prototyping and switching to Mahout as you get closer to production.
Below is a current list of machine learning algorithms exposed by Mahout.
- Collaborative Filtering
- Item-based Collaborative Filtering
- Matrix Factorization with Alternating Least Squares
- Matrix Factorization with Alternating Least Squares on Implicit Feedback
- Naive Bayes
- Complementary Naive Bayes
- Random Forest
- Canopy Clustering
- k-Means Clustering
- Fuzzy k-Means
- Streaming k-Means
- Spectral Clustering
- Dimensionality Reduction
- Lanczos Algorithm
- Stochastic SVD
- Principal Component Analysis
- Topic Models
- Latent Dirichlet Allocation
- Frequent Pattern Matching
The next major version, Mahout 1.0, will contain major changes to the underlying architecture of Mahout, including:
- Scala: In addition to Java, Mahout users will be able to write jobs using the Scala programming language. Scala makes programming math-intensive applications much easier as compared to Java, so developers will be much more effective.
- Spark & H2O: Mahout 0.9 and below relied on MapReduce as an execution engine. With Mahout 1.0, users can choose to run jobs on either Spark or H2O, resulting in a significant performance increase.
The main difference comes from the underlying frameworks: Hadoop MapReduce in the case of Mahout, and Spark in the case of MLlib. More specifically, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the main difference is only the startup overhead, which is dozens of seconds for Hadoop MR and, let’s say, 1 second for Spark. For model training, that is not very important.
Things are different if your algorithm maps to many jobs. In that case, the same overhead difference applies to every iteration, and it can be a game changer.
Let’s assume that we need 100 iterations, each needing 5 seconds of cluster CPU.
- On Spark, it will take 100*5 + 100*1 = 600 seconds.
- On Hadoop MR (Mahout), it will take 100*5 + 100*30 = 3500 seconds.
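The arithmetic above can be sketched as a small Python function. It is a rough model only: it assumes one job per iteration with a fixed startup overhead, and it ignores scheduling, data loading, and caching effects. The function name `total_runtime` is hypothetical, and the figures are the illustrative ones from the text (30 seconds standing in for "dozens of seconds" on Hadoop MR).

```python
def total_runtime(iterations, compute_seconds, overhead_seconds):
    """Total seconds when each iteration runs as a separate job."""
    return iterations * (compute_seconds + overhead_seconds)


# Figures from the text: 100 iterations, 5 s of compute each,
# about 1 s of job overhead on Spark and 30 s on Hadoop MapReduce.
spark_seconds = total_runtime(100, 5, 1)
hadoop_seconds = total_runtime(100, 5, 30)

print(spark_seconds, hadoop_seconds)               # 600 3500
print(round(hadoop_seconds / spark_seconds, 2))    # 5.83
```

With a single iteration the gap is just the startup difference, 6 seconds against 35. It is the repeated per-job overhead across 100 iterations that produces the nearly sixfold difference.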
At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.
- Adobe AMP uses Mahout’s clustering algorithms to increase video consumption by better user targeting.
- Accenture uses Mahout as a typical example for their Hadoop Deployment Comparison Study
- AOL uses Mahout for shopping recommendations. See slide deck
- Booz Allen Hamilton uses Mahout’s clustering algorithms. See slide deck
- Buzzlogic uses Mahout’s clustering algorithms to improve ad targeting
- Cull.tv uses modified Mahout algorithms for content recommendations
- DataMine Lab uses Mahout’s recommendation and clustering algorithms to improve our clients’ ad targeting.
- Drupal uses Mahout to provide open source content recommendation solutions.
- Evolv uses Mahout for its Workforce Predictive Analytics platform.
- Foursquare uses Mahout for its recommendation engine.
- Idealo uses Mahout’s recommendation engine.
- InfoGlutton uses Mahout’s clustering and classification for various consulting projects.
- Intel ships Mahout as part of their Distribution for Apache Hadoop Software.
- Intela has implementations of Mahout’s recommendation algorithms to select new offers to send to customers, as well as to recommend potential customers to current offers. We are also working on enhancing our offer categories by using the clustering algorithms.
- iOffer uses Mahout’s Frequent Pattern Mining and Collaborative Filtering to recommend items to users.
- Kauli, a Japanese ad network, uses Mahout’s clustering to handle click stream data for predicting audience’s interests and intents.
- LinkedIn: “Historically, we have used R for model training. We have recently started experimenting with Mahout for model training and are excited about it.” Also see Hadoop World slides.
- LucidWorks Big Data uses Mahout for clustering, duplicate document detection, phrase extraction and classification.
- Mendeley uses Mahout to power Mendeley Suggest, a research article recommendation service.
- Mippin uses Mahout’s collaborative filtering engine to recommend news feeds
- Mobage uses Mahout in their analysis pipeline
- Myrrix is a recommender system product built on Mahout.
- NewsCred uses Mahout to generate clusters of news articles and to surface the important stories of the day
- Next Glass uses Mahout
- Predixion Software uses Mahout’s algorithms to build predictive models on big data
- Radoop provides a drag-n-drop interface for big data analytics, including Mahout clustering and classification algorithms
- ResearchGate, the professional network for scientists and researchers, uses Mahout’s recommendation algorithms.
- Sematext uses Mahout for its recommendation engine
- SpeedDate.com uses Mahout’s collaborative filtering engine to recommend member profiles
- Twitter uses Mahout’s LDA implementation for user interest modeling
- Yahoo! Mail uses Mahout’s Frequent Pattern Set Mining.
- 365Media uses Mahout’s Classification and Collaborative Filtering algorithms in its Real-time system named UPTIME and 365Media/Social.
- Dicode project uses Mahout’s clustering and classification algorithms on top of HBase.
- The course Large Scale Data Analysis and Data Mining at TU Berlin uses Mahout to teach students about the parallelization of data mining problems with Hadoop and MapReduce
- Mahout is used at Carnegie Mellon University, as a comparable platform to GraphLab
- The ROBUST project, co-funded by the European Commission, employs Mahout in the large scale analysis of online community data.
- Mahout is used for research and data processing at Nagoya Institute of Technology, in the context of a large-scale citizen participation platform project, funded by the Ministry of Interior of Japan.
- Several researchers within Digital Enterprise Research Institute NUI Galway use Mahout for e.g. topic mining and modeling of large corpora.
- Mahout is used in the NoTube EU project.
Getting Mahout to scale effectively isn’t as straightforward as simply adding more nodes to a Hadoop cluster. Factors such as algorithm choice, number of nodes, feature selection, and sparseness of data, as well as the usual suspects of memory, bandwidth, and processor speed, all play a role in determining how effectively Mahout can scale. To motivate the discussion, I’ll work through an example of running some of Mahout’s algorithms on a publicly available data set of mail archives from the Apache Software Foundation (ASF) using Amazon’s EC2 computing infrastructure and Hadoop, where appropriate.
Each of the subsections after the Setup takes a look at some of the key issues in scaling out Mahout and explores the syntax of running the example on EC2.
Setup
The setup for the examples involves two parts: a local setup and an EC2 (cloud) setup. To run the examples, you need:
- Apache Maven 3.0.2 or higher.
- Git version-control system (you may also wish to have a GitHub account).
- A *NIX-based operating system such as Linux or Apple OS X. Cygwin may work for Windows®, but I haven’t tested it.
To get set up locally, run the following on the command line:
- mkdir -p scaling_mahout/data/sample
- git clone git://github.com/lucidimagination/mahout.git mahout-trunk
- cd mahout-trunk
- mvn install (add a -DskipTests if you wish to skip Mahout’s tests, which can take a while to run)
- cd bin
- ./mahout (you should see a listing of items you can run, such as kmeans)
This should get all the code you need compiled and properly installed. Separately, download the sample data, save it in the scaling_mahout/data/sample directory, and unpack it (tar -xf scaling_mahout.tar.gz). For testing purposes, this is a small subset of the data you’ll use on EC2.
To get set up on Amazon, you need an Amazon Web Services (AWS) account (noting your secret key, access key, and account ID) and a basic understanding of how Amazon’s EC2 and Elastic Block Store (EBS) services work. Follow the documentation on the Amazon website to obtain the necessary access.
With the prerequisites out of the way, it’s time to launch a cluster. It is probably best to start with a single node and then add nodes as necessary. And do note, of course, that running on EC2 costs money. Therefore, make sure you shut down your nodes when you are done running.
To bootstrap a cluster for use with the examples in the article, follow these steps:
1. Download Hadoop 0.20.203.0 from an ASF mirror and unpack it locally.
2. cd hadoop-0.20.203.0/src/contrib/ec2/bin
3. Open hadoop-ec2-env.sh in an editor and:
- Fill in your AWS_ACCOUNT_ID, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, EC2_KEYDIR, KEY_NAME, and PRIVATE_KEY_PATH. See the Mahout Wiki’s “Use an Existing Hadoop AMI” page for more information (see Resources).
- Set the HADOOP_VERSION to 0.20.203.0.
- Set S3_BUCKET to 490429964467.
- Set ENABLE_WEB_PORTS=true.
- Set INSTANCE_TYPE to m1.xlarge at a minimum.
4. Open hadoop-ec2-init-remote.sh in an editor and:
- In the section that creates hadoop-site.xml, add the following property:
Note: If you want to run classification, you need to use a larger instance and more memory. I used double X-Large instances and 12GB of heap.
- Change mapred.output.compress to false.
5. Launch your cluster:
./hadoop-ec2 launch-cluster mahout-clustering X
X is the number of nodes you wish to launch (for example, 2 or 10). I suggest starting with a small value and then adding nodes as your comfort level grows. This will help control your costs.
6. Create an EBS volume for the ASF Public Data Set (Snapshot: snap-17f7f476) and attach it to your master node instance (this is the instance in the mahout-clustering-master security group) on /dev/sdh. (See Resources for links to detailed instructions in the EC2 online documentation.)
a. If using the EC2 command line APIs (see Resources), you can do:
- i. ec2-create-volume --snapshot snap-17f7f476 -z ZONE
- ii. ec2-attach-volume $VOLUME_NUMBER -i $INSTANCE_ID -d /dev/sdh, where $VOLUME_NUMBER is output by the create-volume step and the $INSTANCE_ID is the ID of the master node that was launched by the launch-cluster command
b. Otherwise, you can do this via the AWS web console.
7. Upload the setup-asf-ec2.sh script (see Download) to the master instance:
./hadoop-ec2 push mahout-clustering $PATH/setup-asf-ec2.sh
8. Log in to your cluster:
./hadoop-ec2 login mahout-clustering
9. Execute the shell script to update your system, install Git and Mahout, and clean up some of the archives to make it easier to run:
With the setup details out of the way, the next step is to see what it means to put some of Mahout’s more popular algorithms into production and scale them up. I’ll focus primarily on the actual tasks of scaling up, but along the way I’ll cover some questions about feature selection and why I made certain choices.