
I am trying to build a naive Bayes classifier with Spark's MLLib which takes as input a set of documents.

I'd like to use several things as features (e.g. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like

LabeledPoint[Double, List[Pair[Double, Double]]].

Instead what I have as output from the rest of my code would be something like

LabeledPoint[Double, List[Pair[String, Double]]].

I could make up my own conversion, but it seems odd. How am I supposed to handle this using MLLib?

I believe the answer is the HashingTF class (i.e. feature hashing), but I don't understand how it works. It appears to take some sort of capacity value, but my list of keywords and topics is effectively unbounded (or rather, unknown at the outset).

1 Answer


You can simply use HashingTF. It uses the hashing trick to map a potentially unbounded number of features to a vector of bounded size. Feature collisions are possible, but you can make them less likely by choosing a larger number of features in the constructor.
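To make the hashing trick concrete, here is a minimal plain-Scala sketch. The object HashingSketch and its methods are hypothetical helpers written for illustration, not part of Spark, and Spark's own hash function may differ. The point is only that any string lands in one of numFeatures slots, so the vector size is fixed up front even when the vocabulary is not.

```scala
// Hypothetical helper for illustration only; this is not Spark's implementation.
object HashingSketch {
  // Map any feature string to a slot in [0, numFeatures).
  def slot(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw // hashCode can be negative
  }

  // Count how often each slot is hit. The result always has numFeatures
  // entries, no matter how many distinct terms appear in the input.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Array[Double] = {
    val counts = Array.fill(numFeatures)(0.0)
    terms.foreach(t => counts(slot(t, numFeatures)) += 1.0)
    counts
  }
}
```

Two different terms can share a slot, which is the collision mentioned above; a larger numFeatures spreads terms over more slots and makes that less likely.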

If you want to create features based not only on the content of a feature but also on some metadata, you could feed HashingTF something like 'tag:cats', so that the tag 'cats' hashes to a different slot than the plain word 'cats'.
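One possible way to build such prefixed tokens is sketched below. The Doc case class, its field names, and the prefixes are assumptions made for this example; adapt them to whatever your own code produces.

```scala
// Hypothetical document shape; the field names are assumptions for this sketch.
case class Doc(authors: Seq[String], tags: Seq[String], category: String, words: Seq[String])

// Prefix each metadata value with its kind so that, for example, the tag "cats"
// and the plain word "cats" become different strings (and so usually different slots).
def toTokens(d: Doc): Seq[String] =
  (d.words ++
    d.authors.map(a => "author:" + a) ++
    d.tags.map(t => "tag:" + t)) :+ ("category:" + d.category)
```

The resulting sequence of strings is the kind of input that HashingTF's transform method accepts for a single document.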

Once you have created feature count vectors using HashingTF, you can turn them into binary bag-of-words features by setting any count above zero to 1. You can also create TF-IDF vectors using the IDF class like so:

val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)

Here featureCounts holds the per-document word counts produced by HashingTF.
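Putting the pieces together, a rough end-to-end sketch with the RDD-based MLlib API might look like the following. The function name trainOnTokens, the input shape (label, tokens), and the value 1 << 18 for the number of features are assumptions chosen for illustration, not recommendations.

```scala
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: each input element is (label, tokens for one document).
def trainOnTokens(docs: RDD[(Double, Seq[String])]): NaiveBayesModel = {
  // Assumed vector size; a larger value means fewer collisions but more memory.
  val hashingTF = new HashingTF(1 << 18)

  val labels: RDD[Double] = docs.map(_._1)

  // HashingTF does the counting itself: pass raw tokens, not precomputed counts.
  val featureCounts: RDD[Vector] = hashingTF.transform(docs.map(_._2))
  featureCounts.cache() // used twice: once by fit, once by transform

  val tfIdf: RDD[Vector] = new IDF().fit(featureCounts).transform(featureCounts)

  // Both RDDs derive from docs through element-wise maps, so zip lines them up.
  val training = labels.zip(tfIdf).map { case (label, features) =>
    LabeledPoint(label, features)
  }

  NaiveBayes.train(training, lambda = 1.0)
}
```

This keeps strings out of LabeledPoint entirely: the String keys exist only until HashingTF turns them into vector indices.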

Hope this answer helps.
