I am trying to build a naive Bayes classifier with Spark's MLlib that takes a set of documents as input.
I'd like to use several things as features (authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like
LabeledPoint[Double, List[Pair[Double, Double]]].
Instead, what I have as output from the rest of my code is something like
LabeledPoint[Double, List[Pair[String, Double]]].
I could write my own conversion, but that seems odd. How am I supposed to handle this using MLlib?
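To make "my own conversion" concrete, here is a minimal sketch of what I have in mind, written in plain Python rather than against the Spark API. The helper names (build_index, to_numeric) and the sample feature strings are made up for illustration; the idea is just to assign each distinct string an integer index and replace the string with it.

```python
from typing import Dict, List, Tuple

Features = List[Tuple[str, float]]


def build_index(docs: List[Features]) -> Dict[str, int]:
    """Assign each distinct feature name an integer index, in order of first appearance."""
    index: Dict[str, int] = {}
    for features in docs:
        for name, _ in features:
            if name not in index:
                index[name] = len(index)
    return index


def to_numeric(features: Features, index: Dict[str, int]) -> List[Tuple[int, float]]:
    """Replace feature names with their indices; names missing from the index are dropped."""
    return sorted((index[name], weight) for name, weight in features if name in index)


# Hypothetical sample data: two documents with (feature name, weight) pairs.
docs = [
    [("author:smith", 1.0), ("tag:spark", 2.0)],
    [("tag:spark", 1.0), ("category:ml", 1.0)],
]
index = build_index(docs)
numeric_docs = [to_numeric(d, index) for d in docs]
```

The drawback is that the whole data set has to be scanned first to build the index, and any feature that shows up later has no index at all.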
I believe the answer is in the HashingTF class (i.e. hashing features), but I don't understand how that works. It appears to take some sort of capacity value, but my list of keywords and topics is effectively unbounded (or rather, unknown at the beginning).
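For context, my rough mental model of feature hashing in general is the sketch below, again in plain Python. This is an assumption about the general technique, not a description of what HashingTF does internally: the function name hash_features, the num_buckets parameter, and the use of CRC32 as the hash are all my own choices for illustration. What I can't tell is whether the capacity value plays the same role as num_buckets here.

```python
import zlib
from typing import Dict, List, Tuple


def hash_features(features: List[Tuple[str, float]], num_buckets: int) -> Dict[int, float]:
    """Map (name, weight) pairs into a sparse vector with a fixed number of buckets.

    The bucket for a name is its hash modulo num_buckets, so no vocabulary has to
    be known in advance. Distinct names may collide, in which case their weights
    are added together.
    """
    vec: Dict[int, float] = {}
    for name, weight in features:
        bucket = zlib.crc32(name.encode("utf-8")) % num_buckets
        vec[bucket] = vec.get(bucket, 0.0) + weight
    return vec


# Hypothetical sample document; the feature strings are made up.
doc = [("author:smith", 1.0), ("tag:spark", 2.0), ("category:ml", 1.0)]
hashed = hash_features(doc, num_buckets=16)
```

If that model is right, the fixed size would only bound the length of the vector, not the number of distinct strings that can be fed in, at the cost of occasional collisions.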