
I have data containing 5 categories (A, B, C, D, E) and a dataset of customers where each customer can belong to one, many, or none of the categories. How can I take a dataset like this:

id, categories
1 , [A,C]
2 , [B]
3 , []
4 , [D,E]

and transform the categories column into one-hot encoded vectors, like this:

id, categories, encoded
1 , [A,C]     , [1,0,1,0,0]
2 , [B]       , [0,1,0,0,0]
3 , []        , [0,0,0,0,0]
4 , [D,E]     , [0,0,0,1,1]

Has anyone found a simple way to do this in Spark?

1 Answer


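The code below builds a sorted vocabulary from the distinct categories in the data, applies a CountVectorizerModel with that fixed vocabulary, and converts the resulting sparse vectors to dense ones. Note that the vocabulary is derived from the data, so a category that never appears in any row gets no slot in the vector.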

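import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for flatMap's encoder and the $"..." syntax

// Sample data matching the question (ids 1 to 4, all five categories present)
val data = spark.createDataFrame(Seq(
  (1L, Seq("A", "C")),
  (2L, Seq("B")),
  (3L, Seq.empty[String]),
  (4L, Seq("D", "E"))
)).toDF("id", "categories")

// Collect the distinct categories and sort them so the vector positions are stable
val tags = data
  .flatMap(r => r.getAs[Seq[String]]("categories"))
  .distinct()
  .collect()
  .sortWith(_ < _)

// Use the sorted categories as a fixed vocabulary; setBinary(true) keeps
// each entry at 1 even if a category is repeated within a row
val cvmData = new CountVectorizerModel(tags)
  .setInputCol("categories")
  .setOutputCol("sparseFeatures")
  .setBinary(true)
  .transform(data)

// Convert the sparse output vectors to dense ones for display
val asDense = udf((v: Vector) => v.toDense)

cvmData
  .withColumn("features", asDense($"sparseFeatures"))
  .select("id", "categories", "features")
  .show()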
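
To make explicit what the fixed-vocabulary encoding computes for each row, here is a minimal plain-Scala sketch that runs without Spark. The names MultiHotSketch, vocabulary and multiHot are illustrative only and are not part of any Spark API; the vocabulary is assumed to be the five categories from the question.

```scala
// Illustrative helper (not a Spark API): encode one row's categories
// against a fixed, ordered vocabulary.
object MultiHotSketch {
  // Assumed vocabulary: the five categories named in the question.
  val vocabulary: Seq[String] = Seq("A", "B", "C", "D", "E")

  // Emit 1 for every vocabulary entry present in the row, 0 otherwise.
  def multiHot(categories: Seq[String], vocab: Seq[String] = vocabulary): Seq[Int] = {
    val present = categories.toSet
    vocab.map(c => if (present.contains(c)) 1 else 0)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      1 -> Seq("A", "C"),
      2 -> Seq("B"),
      3 -> Seq.empty[String],
      4 -> Seq("D", "E")
    )
    // Print each row as: id, [categories], [encoded]
    rows.foreach { case (id, cats) =>
      val encoded = multiHot(cats)
      println(s"$id, ${cats.mkString("[", ",", "]")}, ${encoded.mkString("[", ",", "]")}")
    }
  }
}
```

Because the vocabulary here is written out by hand rather than derived from the data, a category that no customer has still keeps its position in the vector.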
