I noticed there are two LinearRegressionModel classes in Spark's machine learning library: one in the ml package and another in the mllib package.

The two are implemented quite differently. For example, the one from mllib implements Serializable, while the other one does not.

By the way, the same is true of RandomForestModel.

Why are there two packages?

1 Answer

org.apache.spark.mllib is the older of the two Spark machine learning APIs, while org.apache.spark.ml is the newer one.

  • spark.mllib carries the original API built on top of RDDs.

  • spark.ml contains a higher-level API built on top of DataFrames for constructing ML pipelines.
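
To make the RDD versus DataFrame distinction concrete, here is a minimal PySpark sketch that trains a random forest regressor through each package. The toy data and the helper names train_with_mllib and train_with_ml are hypothetical and chosen only for illustration; the Spark classes themselves come from pyspark.mllib and pyspark.ml. It assumes a working PySpark installation and an existing SparkSession.

```python
# Hypothetical toy data: (label, feature values). Not from the original post.
SAMPLE_ROWS = [
    (1.0, [0.0, 1.0]),
    (2.0, [1.0, 2.0]),
    (3.0, [2.0, 3.0]),
    (4.0, [3.0, 4.0]),
]


def train_with_mllib(sc, rows=SAMPLE_ROWS):
    """RDD-based API: the input is an RDD of LabeledPoint objects."""
    # Imported inside the function so this file loads even without PySpark.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    rdd = sc.parallelize(
        [LabeledPoint(label, features) for label, features in rows]
    )
    # Training is a static method call that returns a RandomForestModel.
    return RandomForest.trainRegressor(
        rdd, categoricalFeaturesInfo={}, numTrees=3, seed=42
    )


def train_with_ml(spark, rows=SAMPLE_ROWS):
    """DataFrame-based API: the input is a DataFrame, the estimator a Pipeline stage."""
    from pyspark.ml import Pipeline
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import RandomForestRegressor

    df = spark.createDataFrame(
        [(label, Vectors.dense(features)) for label, features in rows],
        ["label", "features"],
    )
    # By default the estimator reads the "features" and "label" columns.
    rf = RandomForestRegressor(numTrees=3, seed=42)
    # fit() returns a PipelineModel; its transform() adds a "prediction" column.
    return Pipeline(stages=[rf]).fit(df)
```

Given a session created with SparkSession.builder.master("local[1]").getOrCreate(), calling train_with_mllib(spark.sparkContext) and train_with_ml(spark) should return the model objects of the respective packages. Note that the two packages also use separate vector types (pyspark.mllib.linalg versus pyspark.ml.linalg), which is part of why their model classes are not interchangeable.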

spark.ml is the recommended package, because the DataFrame-based API is more versatile and flexible. Spark will keep supporting spark.mllib alongside the development of spark.ml, and users should feel comfortable using spark.mllib features, since not all functionality of the existing algorithms has been ported over to the new spark.ml API. More features are expected to be ported over time.

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode, and the DataFrame-based API in the spark.ml package is now the primary machine learning API for Spark. spark.mllib is gradually being deprecated (this has already happened for linear regression) and will most probably be removed in a future major release.
