How to handle categorical features with spark-ml?

Question

asked Jul 9, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

How do I handle categorical data with spark-ml rather than spark-mllib?

Though the documentation is not very clear, it seems that classifiers such as RandomForestClassifier and LogisticRegression have a featuresCol argument, which specifies the name of the features column in the DataFrame, and a labelCol argument, which specifies the name of the column of class labels in the DataFrame.

Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?

1 Answer

Amit Rawat · Answer 1 · 2019-07-09T11:02:50+0000

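Since Spark 2.3.0, OneHotEncoder has been deprecated, and it will be removed in 3.0.0. I therefore suggest using OneHotEncoderEstimator instead. It expects numeric category indices rather than strings, so first convert each string column with StringIndexer, then encode the indices and pass the resulting vectors to VectorAssembler along with your numeric columns.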

In Scala:
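A minimal sketch of such a pipeline, written for spark-shell on Spark 2.3 or 2.4. The DataFrame and its column names (color, size, label) are hypothetical and only serve as an illustration. On Spark 3.0 and later, OneHotEncoderEstimator was renamed to OneHotEncoder, so the class name would need to change there.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("categorical-features")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: one string (categorical) column, one numeric column, one label.
val df = Seq(
  ("red", 1.0, 0.0),
  ("blue", 2.5, 1.0),
  ("green", 0.7, 0.0),
  ("red", 3.1, 1.0)
).toDF("color", "size", "label")

// Step 1: map each distinct string to a numeric category index.
val indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")

// Step 2: one-hot encode the indices into a sparse vector column.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("colorIndex"))
  .setOutputCols(Array("colorVec"))

// Step 3: combine the encoded vector with the numeric column into featuresCol.
val assembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "size"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
val prepared = pipeline.fit(df).transform(df)

prepared.select("features", "label").show(false)
```

The resulting features column can then be passed as featuresCol to a classifier such as RandomForestClassifier or LogisticRegression, either directly or as a further pipeline stage.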
