in Big Data Hadoop & Spark by (11.4k points)

I'm trying to do document classification in Spark. I'm not sure what the hashing in HashingTF does; does it sacrifice any accuracy? I doubt it, but I don't know. The Spark docs say it uses the "hashing trick"... just another example of the really bad/confusing naming used by engineers (I'm guilty as well). CountVectorizer also requires setting the vocabulary size, but it has another parameter, a threshold that can be used to exclude words or tokens that appear below some threshold in the text corpus. I don't understand the difference between these two Transformers.

1 Answer

by (32.3k points)

I think the following differences between HashingTF and CountVectorizer should be enough to help you:

  • Reversibility - CountVectorizer is partially reversible, whereas HashingTF is irreversible. Because hashing is not reversible, you cannot restore the original input from a hashed vector. On the other hand, a count vector together with the fitted model (the vocabulary index) can be used to restore the unordered input. As a consequence, models built on hashed input are much harder to interpret and monitor.

  • Memory and computational overhead - HashingTF requires no additional memory beyond the original input and the output vector, and it needs only a single scan of the data. CountVectorizer needs an additional scan over the data to build the model, plus additional memory to store the vocabulary (index). With a unigram language model this is usually not a problem, but with higher-order n-grams it can be prohibitively expensive or infeasible.

  • Dependencies - The output of hashing depends on the size of the vector, the hashing function, and the document; it does not depend on any training data. The output of counting depends on the size of the vector, the training corpus, and the document.

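  • Source of information loss - With HashingTF, the loss comes from dimensionality reduction with possible collisions, where different tokens land in the same bucket. CountVectorizer instead discards infrequent tokens (those cut by the vocabulary size or the frequency threshold). How either affects downstream models depends on the particular use case and data.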
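
To make these points concrete, here is a minimal pure-Python sketch of the two ideas. It is not Spark's implementation: the function names (hashing_tf, fit_vocabulary, count_vectorize, decode) are hypothetical, and zlib.crc32 is only a stand-in for the hash function Spark uses. The vocab_size and min_df arguments are modelled on CountVectorizer's vocabulary size and document-frequency threshold.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Hashing trick: bucket index = hash(token) mod num_features.

    No vocabulary is stored, so there is nothing to fit and nothing
    to map a bucket back to a token. crc32 is an assumption here,
    chosen only because it is deterministic and in the stdlib.
    """
    vec = [0] * num_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1  # colliding tokens simply add up in one bucket
    return vec


def fit_vocabulary(corpus, vocab_size, min_df=1):
    """Extra pass over the corpus that builds the vocabulary (the model).

    Keeps tokens seen in at least min_df documents, then the
    vocab_size most frequent of those (ties broken alphabetically).
    """
    term_freq = Counter()
    doc_freq = Counter()
    for doc in corpus:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]


def count_vectorize(tokens, vocab):
    """Count tokens against a fitted vocabulary; unknown tokens are dropped."""
    index = {t: i for i, t in enumerate(vocab)}
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1
    return vec


def decode(vec, vocab):
    """Partial reversal: recover the unordered token counts from a count vector."""
    return {vocab[i]: c for i, c in enumerate(vec) if c}


corpus = [
    ["spark", "hadoop", "spark"],
    ["spark", "hive"],
    ["hadoop", "pig"],
]

vocab = fit_vocabulary(corpus, vocab_size=3, min_df=2)
counts = count_vectorize(corpus[0], vocab)
print(vocab)                  # ['spark', 'hadoop']
print(counts)                 # [2, 1]
print(decode(counts, vocab))  # {'spark': 2, 'hadoop': 1}

# Four distinct tokens squeezed into two buckets must collide somewhere.
print(hashing_tf(["spark", "hadoop", "hive", "pig"], num_features=2))
```

In this sketch "hive" and "pig" never make it into the vocabulary because each appears in only one document, which is the CountVectorizer kind of information loss. The hashed vector keeps every token's count but cannot tell you which tokens shared a bucket, which is the HashingTF kind.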
