
How do I handle categorical data with spark-ml and not spark-mllib?

Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.

Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?

1 Answer


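Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0, so I suggest using OneHotEncoderEstimator instead. It takes numeric category indices as input, so index each string column with StringIndexer first, one-hot encode the indices, and then pass the encoded vectors to the VectorAssembler along with your numeric columns.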

In Scala: [code originally posted as two images; not preserved]
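A minimal sketch of the approach the answer names, assuming Spark 2.3 or 2.4 with spark-sql and spark-mllib on the classpath. It is not the code from the lost images: the toy data, the column names (color, size, label, colorIndex, colorVec), and the helper assembleFeatures are all hypothetical and only illustrate how a string column can end up inside the features vector.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object CategoricalFeaturesSketch {

  // Hypothetical helper: expects a string column "color" and a numeric column "size".
  def assembleFeatures(df: DataFrame): DataFrame = {
    // Step 1: map each distinct string to a numeric index (0.0, 1.0, ...).
    val indexer = new StringIndexer()
      .setInputCol("color")
      .setOutputCol("colorIndex")

    // Step 2: one-hot encode the index column into a sparse vector column.
    val encoder = new OneHotEncoderEstimator()
      .setInputCols(Array("colorIndex"))
      .setOutputCols(Array("colorVec"))

    // Step 3: combine the encoded vector with the numeric column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("colorVec", "size"))
      .setOutputCol("features")

    val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
    pipeline.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("categorical-features-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data, made up for illustration only.
    val df = Seq(
      ("red", 1.0, 0.0),
      ("blue", 2.5, 1.0),
      ("red", 0.5, 0.0),
      ("green", 3.0, 1.0)
    ).toDF("color", "size", "label")

    assembleFeatures(df).select("color", "features", "label").show(false)
    spark.stop()
  }
}
```

The resulting features column can then be passed to a classifier through featuresCol, with label as labelCol. On Spark 3.0 and later, OneHotEncoderEstimator was renamed to OneHotEncoder, so the import and constructor would need to change accordingly.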
