I have five categories (A, B, C, D, E) and a dataset of customers, where each customer can belong to one, many, or none of the categories. How can I take a dataset like this:
id, categories
1 , [A,C]
2 , [B]
3 , []
4 , [D,E]
and transform the categories column into one-hot encoded vectors, like this:
id, categories, encoded
1 , [A,C] , [1,0,1,0,0]
2 , [B] , [0,1,0,0,0]
3 , [] , [0,0,0,0,0]
4 , [D,E] , [0,0,0,1,1]
Has anyone found a simple way to do this in Spark?
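For reference, here is a minimal PySpark sketch of the kind of transformation I mean. It assumes the categories column is an array of strings and that the full category list is known up front. The names CATEGORIES, encode, and add_encoded_column are my own placeholders, not anything provided by Spark, and a UDF is only one possible route.

```python
# Fixed category order; position i in the output vector corresponds to CATEGORIES[i].
# (Assumption: the full list of categories is known in advance.)
CATEGORIES = ["A", "B", "C", "D", "E"]


def encode(categories):
    """Return a 0/1 list marking which of CATEGORIES appear in `categories`."""
    present = set(categories or [])  # treat a null column value like an empty list
    return [1 if c in present else 0 for c in CATEGORIES]


def add_encoded_column(df, column="categories"):
    """Hypothetical helper: append an `encoded` array column to a Spark DataFrame."""
    # Imported here so that `encode` can be used and tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    encode_udf = F.udf(encode, ArrayType(IntegerType()))
    return df.withColumn("encoded", encode_udf(F.col(column)))
```

Calling add_encoded_column(df) on the sample data above should give the encoded column shown in the second table, but a Python UDF is probably not the most efficient option, so I would welcome a built-in alternative.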