in Data Science by (17.6k points)

I have data containing 5 categories (A, B, C, D, E) and a dataset of customers, where each customer can belong to one, many, or none of the categories. How can I take a dataset like this:

id, categories

1 , [A,C]

2 , [B]

3 , []

4 , [D,E]

and transform the categories column into one-hot encoded vectors, like this:

id, categories, encoded

1 , [A,C]     , [1,0,1,0,0]

2 , [B]       , [0,1,0,0,0]

3 , []        , [0,0,0,0,0]

4 , [D,E]     , [0,0,0,1,1]

Has anyone found a simple way to do this in Spark?

1 Answer

0 votes
by (41.4k points)

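The code below produces this kind of encoding by using CountVectorizerModel with a vocabulary built from the data. Because the vocabulary holds only the categories that actually occur, this sample data (which never uses C) yields 4-element vectors.

import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._ // already in scope in spark-shell; needed for flatMap and $"..."

val data = spark.createDataFrame(Seq(
  (0L, Seq("A", "B")),
  (1L, Seq("B")),
  (2L, Seq.empty[String]),
  (3L, Seq("D", "E"))
)).toDF("id", "categories")

// Get the distinct tags as a sorted array; this becomes the vocabulary
val tags = data
  .flatMap(r => r.getAs[Seq[String]]("categories"))
  .distinct()
  .collect()
  .sortWith(_ < _)

// Apply a CountVectorizerModel with the fixed vocabulary; the output is a sparse vector
val cvmData = new CountVectorizerModel(tags)
  .setInputCol("categories")
  .setOutputCol("sparseFeatures")
  .transform(data)

// Convert the sparse vectors to dense ones so every position is shown
val asDense = udf((v: Vector) => v.toDense)

cvmData
  .withColumn("features", asDense($"sparseFeatures"))
  .select("id", "categories", "features")
  .show(false)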
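
To make the meaning of each vector position concrete, here is a minimal plain-Scala sketch (no Spark dependency) of the same encoding applied to the rows from the question. The MultiHot object, the encode function, and the hard-coded vocabulary are hypothetical names introduced only for illustration; the sketch assumes the full category list is known up front.

```scala
// Plain-Scala illustration of the encoding; not part of the Spark API.
object MultiHot {
  // Assumption: the complete category list, in the order the output positions follow.
  val vocabulary: Vector[String] = Vector("A", "B", "C", "D", "E")

  // Position i is 1 when vocabulary(i) occurs in the row's categories, else 0.
  def encode(categories: Seq[String], vocab: Seq[String] = vocabulary): Vector[Int] = {
    val present = categories.toSet
    vocab.map(c => if (present.contains(c)) 1 else 0).toVector
  }

  def main(args: Array[String]): Unit = {
    // The four customers from the question.
    val customers = Seq(
      1 -> Seq("A", "C"),
      2 -> Seq("B"),
      3 -> Seq.empty[String],
      4 -> Seq("D", "E")
    )
    customers.foreach { case (id, cats) =>
      val shown = cats.mkString("[", ",", "]")
      val encoded = encode(cats).mkString("[", ",", "]")
      println(s"$id, $shown, $encoded") // first row prints: 1, [A,C], [1,0,1,0,0]
    }
  }
}
```

In the Spark code, the same effect should follow from passing the full list, for example Array("A", "B", "C", "D", "E"), to the CountVectorizerModel constructor instead of the collected tags, so that every category keeps its position even if it never appears in the data.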

