I have one DataFrame with a list of tokens per row:
data1 = [(1,  ["This","is", "category", "A"]),
    (2,  ["This", "is", "category", "B","This", "is", "category", "B"]),
    (3,  ["This", "is", "category", "F","This", "is", "category", "C"])]


I have another DataFrame with tokens and their vector representations. Here is the schema of the second one:


word  vector
you   [0.04986, 0.5678]

I want to look up each list of tokens in the DataFrame of vector representations and compute the per-row mean vector.

Please let me know how I can do this efficiently in PySpark.

The equivalent logic in Python/pandas is as below:

return np.array([np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                         or [np.zeros(self.dim)], axis=0)
                 for words in X])
