I have one DataFrame with a list of tokens per row:
data1 = [(1,  ["This","is", "category", "A"]),
    (2,  ["This", "is", "category", "B","This", "is", "category", "B"]),
    (3,  ["This", "is", "category", "F","This", "is", "category", "C"])]


I have another DataFrame with tokens and their vector representations. Here is the schema of the second one:


word  vector
you   [0.04986, 0.5678]

I want to look up each list of tokens in the DataFrame of vector representations and compute the per-row mean vector.

Please let me know how I can do this efficiently in PySpark.

The equivalent logic in Python/pandas is as below:

return np.array([np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                         or [np.zeros(self.dim)], axis=0)
                 for words in X])
