+1 vote
in Big Data Hadoop & Spark by (1k points)

Consider a MySQL products database with 10 million products for an e-commerce website.

I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.

I wanted to use Mahout on top of it as the machine learning framework, using one of its classification algorithms, and then I ran into Spark, which ships with MLlib.

  • So what is the difference between the two frameworks?
  • Mainly, what are the advantages, drawbacks, and limitations of each?

2 Answers

0 votes
by (13.2k points)

The main difference lies in the engine each one runs on. For Mahout it is Hadoop MapReduce; for MLlib it is Spark.

Mahout has proven capabilities that Spark's MLlib lacks. Apache Mahout is mature and comes with many ML algorithms to choose from. However, it is built atop MapReduce, so it is constrained by disk access: every MapReduce job reads its input from disk and writes its output back to disk. Machine learning algorithms typically run many iterations, and each iteration pays that disk cost, so Mahout runs them slowly. MLlib, by contrast, is built on top of Spark, which can keep the working data in memory between iterations, and that makes it much faster than Mahout for this kind of job. Still, Mahout is the more stable and mature framework and is highly recommended if the size of the data is huge.
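
To make the point about iterations concrete, here is a minimal sketch in plain Python (not Mahout or MLlib code). It trains a one-feature logistic regression with batch gradient descent and counts how many full passes over the data it makes. The function name `train_logistic`, the toy data, and the learning rate are all assumptions made up for illustration.

```python
import math

def train_logistic(points, labels, iterations=50, lr=0.5):
    """Batch gradient descent for a one-feature logistic regression.

    Returns (weight, bias, passes); passes counts full scans of the data.
    """
    w, b = 0.0, 0.0
    passes = 0
    n = len(points)
    for _ in range(iterations):
        grad_w = 0.0
        grad_b = 0.0
        # Every iteration scans the whole dataset once to build the gradient.
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        # Update the model with the averaged gradient.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
        passes += 1
    return w, b, passes

# Toy data (hypothetical): negative x -> class 0, positive x -> class 1.
points = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]
w, b, passes = train_logistic(points, labels)
print(f"weight={w:.3f} bias={b:.3f} full passes over the data={passes}")
```

With 50 iterations the data is scanned 50 times. On an engine that rereads its input from disk for every job, each of those scans is a disk read; an engine that caches the data in memory only pays the read once.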


0 votes
by (32.3k points)
MLlib is a loose collection of high-level algorithms that runs on Spark. This is what Mahout used to be, except that the Mahout of old ran on Hadoop MapReduce. In 2014 Mahout announced it would no longer accept Hadoop MapReduce code and switched all new development to Spark (with other engines, such as H2O, possibly in the offing).

The most important thing to come out of this is a Scala-based, generalized, distributed, optimized linear algebra engine and environment, including an interactive Scala shell. Perhaps the most relevant word is "generalized". Since it runs on Spark, anything available in MLlib can be used together with the linear algebra engine of Mahout-Spark.

If you need a general engine that does much of what tools like R do, but on really big data, look at Mahout. If you need a particular algorithm, check each library to see what it offers. For instance, k-means runs in MLlib, but if you need to cluster A'A (a co-occurrence matrix used in recommenders) you'll need both, because when this was written MLlib had no matrix transpose or A'A operation.
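
For readers unfamiliar with A'A, here is a small sketch of what that product means, in plain Python on a toy matrix. It only illustrates the math; it is not how Mahout or MLlib implement it, and the function name `cooccurrence` and the data are made up for the example. Rows are users, columns are items, and a 1 means the user interacted with the item.

```python
def cooccurrence(a):
    """Return A'A for a row-major matrix a (list of rows: users x items).

    A'A is accumulated as a sum of per-row outer products, so the
    transpose of A is never built explicitly.
    """
    n_items = len(a[0])
    ata = [[0] * n_items for _ in range(n_items)]
    for row in a:
        for i in range(n_items):
            if row[i] == 0:
                continue  # this user never touched item i
            for j in range(n_items):
                ata[i][j] += row[i] * row[j]
    return ata

# Toy interaction matrix (hypothetical): 3 users x 3 items.
A = [
    [1, 1, 0],  # user 0 interacted with items 0 and 1
    [1, 0, 1],  # user 1 interacted with items 0 and 2
    [1, 1, 0],  # user 2 interacted with items 0 and 1
]
C = cooccurrence(A)
for row in C:
    print(row)
# prints:
# [3, 2, 1]
# [2, 2, 0]
# [1, 0, 1]
```

Entry C[i][j] counts the users who interacted with both item i and item j, and the diagonal counts users per item. Here items 0 and 1 co-occur for two users, while items 1 and 2 never co-occur.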
