
I have data containing 5 categories (A, B, C, D, E) and a dataset of customers where each customer can belong to one, many, or none of the categories. How can I take a dataset like this:

id, categories
1 , [A,C]
2 , [B]
3 , []
4 , [D,E]

and transform the categories column into one-hot encoded vectors, like this:

id, categories, encoded
1 , [A,C]     , [1,0,1,0,0]
2 , [B]       , [0,1,0,0,0]
3 , []        , [0,0,0,0,0]
4 , [D,E]     , [0,0,0,1,1]

Has anyone found a simple way to do this in Spark?

1 Answer


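The following code produces the desired output. It builds the vocabulary from the categories that actually occur in the data and applies a CountVectorizerModel to the categories column.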

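import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// Sample data matching the question
val data = spark.createDataFrame(Seq(
  (1L, Seq("A", "C")),
  (2L, Seq("B")),
  (3L, Seq.empty[String]),
  (4L, Seq("D", "E"))
)).toDF("id", "categories")

// Get the distinct tags, sorted so that vector positions are stable
val tags = data
  .flatMap(r => r.getAs[Seq[String]]("categories"))
  .distinct()
  .collect()
  .sortWith(_ < _)

// Use the tags as a fixed vocabulary; the output is a sparse vector
val cvmData = new CountVectorizerModel(tags)
  .setInputCol("categories")
  .setOutputCol("sparseFeatures")
  .transform(data)

// Convert the sparse vector to a dense one for display
val asDense = udf((v: Vector) => v.toDense)

cvmData
  .withColumn("features", asDense($"sparseFeatures"))
  .select("id", "categories", "features")
  .show()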
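Note that a vocabulary derived from the data only contains categories that occur at least once, so a category that no customer has would not get a position in the vector. Since the question states that the five categories are known up front, one option is to pass them to CountVectorizerModel directly. The sketch below assumes that, and also calls setBinary(true) so that a category repeated within one list still yields 1 rather than a count. The session settings, the application name, and the names vocabulary and encoded are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell,
// getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multi-hot-sketch") // hypothetical application name
  .getOrCreate()
import spark.implicits._

val data = Seq(
  (1L, Seq("A", "C")),
  (2L, Seq("B")),
  (3L, Seq.empty[String]),
  (4L, Seq("D", "E"))
).toDF("id", "categories")

// Fixed vocabulary taken from the question: position i in the output
// vector corresponds to vocabulary(i), whether or not it occurs in the data.
val vocabulary = Array("A", "B", "C", "D", "E")

val encoded = new CountVectorizerModel(vocabulary)
  .setInputCol("categories")
  .setOutputCol("encoded")
  .setBinary(true) // emit 1.0 for presence instead of a term count
  .transform(data)

encoded.show(false)
```

The encoded column holds sparse vectors of length 5. The asDense UDF from the answer above can be applied to it in the same way if a dense representation is needed.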
