0 votes
2 views
in Machine Learning by (19k points)

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:

spark.mllib, built on top of RDDs.

spark.ml, built on top of Dataframes.

According to this and this question on StackOverflow, Dataframes are better than RDDs and should be used whenever possible.

The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.), and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms.

If Dataframes are better than RDDs and the referred guide recommends the use of spark.ml, why aren't common machine learning methods implemented in that lib?

What am I missing here?

1 Answer

0 votes
by (33.1k points)

This changed with Spark 2.0.0.

Spark is moving strongly towards the DataFrame API, and the RDD-based MLlib APIs (spark.mllib) were switched to maintenance mode in Spark 2.0. The number of native spark.ml algorithms is growing, but an algorithm that has not been ported yet may still be available only in spark.mllib, and internally many spark.ml stages are still implemented directly on top of RDDs.

For more details, see "Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0".

Hope this answer helps you! For more details, study the Apache Spark Tutorial.
