in Machine Learning by (19k points)

I am trying to build a naive Bayes classifier with Spark's MLLib which takes as input a set of documents.

I'd like to put some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e it looks like

LabeledPoint[Double, List[Pair[Double, Double]]]

Instead what I have as output from the rest of my code would be something like

LabeledPoint[Double, List[Pair[String, Double]]]

I could make up my own conversion, but it seems odd. How am I supposed to handle this using MLLib?

I believe the answer is in the HashingTF class (i.e. hashing features) but I don't understand how that works, it appears that it takes some sort of capacity value, but my list of keywords and topics is effectively unbounded (or better, unknown at the beginning).

1 Answer

by (33.1k points)

You can simply use HashingTF. It uses the hashing trick to map a potentially unbounded number of features to a vector of bounded size. Feature collisions are possible, but they can be made rarer by choosing a larger number of features in the constructor (the numFeatures argument).
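To see why an unbounded vocabulary is not a problem, here is a plain-Python sketch of the hashing trick (illustrative only; Spark's HashingTF uses its own hash function and sparse vectors, not Python's `hash`):

```python
def hash_features(tokens, num_features=1 << 20):
    """Map an unbounded set of string features to a fixed-size count vector.

    Collisions (two strings landing on the same index) are possible, but
    become rarer as num_features grows.
    """
    counts = {}  # sparse representation: index -> count
    for token in tokens:
        idx = hash(token) % num_features
        counts[idx] = counts.get(idx, 0) + 1
    return counts

doc = ["cats", "dogs", "cats", "spark"]
vec = hash_features(doc, num_features=1000)
print(sum(vec.values()))  # total count equals the number of tokens: 4
```

No dictionary of known keywords is ever needed: any string, seen before or not, deterministically maps into the fixed-size vector.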

If you want features that depend not only on a value but also on the field it came from (author, tag, category, and so on), you could feed the HashingTF class strings like 'tag:cats', so that a tagged word hashes to a different slot than the bare word.

Once you have created feature-count vectors with HashingTF, you can turn them into bag-of-words features by setting every count above zero to 1. You can also create TF-IDF vectors using the IDF class like so:

val tfIdf = new IDF().fit(featureCounts).transform(featureCounts)

Here featureCounts is the RDD of per-document term-count vectors produced by HashingTF.
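The reweighting that IDF performs can be sketched in plain Python. MLlib's documented smoothing is idf = log((m + 1) / (df + 1)), with m the number of documents and df the number of documents containing the term, and that is what this sketch uses; the dict-based vectors are an illustration, not Spark's sparse-vector types:

```python
import math

def tf_idf(docs):
    """docs: list of {term: count} dicts, one per document.
    Returns the same shape with counts reweighted by idf = log((m+1)/(df+1))."""
    m = len(docs)
    df = {}  # document frequency of each term
    for counts in docs:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    idf = {t: math.log((m + 1) / (d + 1)) for t, d in df.items()}
    return [{t: c * idf[t] for t, c in counts.items()} for counts in docs]

docs = [{"cats": 2, "dogs": 1}, {"cats": 1}]
weighted = tf_idf(docs)
# "cats" appears in every document, so its idf is log(3/3) = 0
print(weighted[0]["cats"])  # 0.0

# Bag-of-words instead: clamp every nonzero count to 1
bow = [{t: 1 for t in counts} for counts in docs]
print(bow[0])  # {'cats': 1, 'dogs': 1}
```

Terms that occur in every document get weight zero, while rarer terms keep more of their count, which is exactly why TF-IDF often helps a naive Bayes text classifier.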

Hope this answer helps.

For more details on this particular domain, study the Spark MLlib Tutorial.
