
Using Spark ML transformers I arrived at a DataFrame where each row looks like this:

Row(object_id, text_features, color_features, type_features)

where text_features is a sparse vector of term weights, color_features is a small 20-element one-hot-encoded dense vector of colors, and type_features is likewise a one-hot-encoded dense vector of types.

What would be a good approach (using Spark's facilities) to merge these features into a single large vector, so that I can measure things like the cosine distance between any two objects?

1 Answer


Use VectorAssembler. It concatenates the input columns you list, in the order given, into a single vector column.

For example:

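import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Placeholder: your DataFrame containing the three feature columns.
val df: DataFrame = ???

// Concatenate the three feature columns into one vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("text_features", "color_features", "type_features"))
  .setOutputCol("features")

val transformed = assembler.transform(df)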
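
Since the question asks about cosine distance, here is a minimal sketch of how it could be computed for two vectors taken from the assembled "features" column. The helper name cosineDistance is hypothetical and not part of Spark's API; it only assumes that the column holds org.apache.spark.ml.linalg.Vector values, which is what VectorAssembler produces.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical helper (not a Spark API): cosine distance = 1 - cosine similarity.
def cosineDistance(a: Vector, b: Vector): Double = {
  require(a.size == b.size, "vectors must have the same dimension")
  val normA = Vectors.norm(a, 2.0)
  val normB = Vectors.norm(b, 2.0)
  require(normA > 0.0 && normB > 0.0, "cosine distance is undefined for zero vectors")

  // Dot product that only visits the explicitly stored entries of `a`,
  // which keeps it cheap when `a` is a large, mostly empty vector.
  val sa = a.toSparse
  var dot = 0.0
  var i = 0
  while (i < sa.indices.length) {
    dot += sa.values(i) * b(sa.indices(i))
    i += 1
  }
  1.0 - dot / (normA * normB)
}

// Usage with two small hand-made vectors (illustrative values only).
val u = Vectors.sparse(3, Array(0), Array(2.0))
val v = Vectors.dense(4.0, 0.0, 0.0)
println(cosineDistance(u, v)) // same direction, so the distance is 0.0
```

The helper works on single pairs of vectors. To compare many objects you would still need to decide how to pair up rows (for example with a self-join), which is outside the scope of this sketch.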

For more details on VectorAssembler, see the Spark tutorial.

Hope this answer helps you!
