in Big Data Hadoop & Spark by (11.4k points)

I noticed there are two LinearRegressionModel classes in Spark's machine learning library: one in the ml package and another in the mllib package.

These two are implemented quite differently - e.g. the one from MLLib implements Serializable, while the other one does not.

By the way, the same is true of RandomForestModel.

Why are there two packages?


1 Answer

by (32.3k points)

org.apache.spark.mllib is the first of the two Spark APIs, while org.apache.spark.ml is the new API.

  • spark.mllib carries the original API built on top of RDDs.

  • spark.ml contains a higher-level API built on top of DataFrames for constructing ML pipelines.

However, spark.ml is the recommended package, because the DataFrame-based API is more versatile and flexible. Spark will keep supporting spark.mllib alongside the development of spark.ml. Users should be comfortable using spark.mllib features, because not all of the functionality of the existing algorithms has been ported to the new spark.ml API yet, though more features are expected over time.

In Spark 2.0, the RDD-based APIs in the spark.mllib package entered maintenance mode, and the DataFrame-based API in the spark.ml package became the primary machine learning API for Spark. spark.mllib is slowly being deprecated (this has already happened for linear regression) and will most probably be removed in the next major release.
