Explore Courses Blog Tutorials Interview Questions
+4 votes
in Machine Learning by (19k points)

Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.

If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace the word by its ID giving the vector <3,1,2,3,4>.

Given these IDs we could also create a binary vector by giving each word four possible slots, and setting the slot corresponding to a specific word with to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, is commonly referred to as the bag-of-words-method.

Now for my question, what is the best binarization method when it comes to describing features for natural language processing in general, and transition-based dependency parsing (with Nivres algorithm) in particular?

In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example, the top word on the stack in the first word in the input queue. Since the order is highly relevant, this rules out the bag-of-words-method.

With best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example, I don't want a word bigram to use 400 million features for 20000 unique words if only 2% of the bigrams actually exist.

Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.

1 Answer

+4 votes
by (33.1k points)
edited by

Your question is quite unclear to understand which technology you’re working. You should decide whether to lemmatize your input tokens (your words). By implementing this, you can decrease your type count, and syntax parsing will get less complicated. It is computationally expensive to lemmatize a token. If you implement a computer language like python, this task will get greatly reduced, as most languages separate keywords or variable names with a well-defined set of symbols.

The "bag-of-words" method, in the binary format you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care about the frequency of the words appear. If, on the other hand, you're building something like a compiler or parser, the order is important. You can use the token-vector approach, or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.

If you need to build parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it sub-phrase ids for every word in the bag-of-words vector, but it’s easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.

Study Natural Language Processing comprehensively with the help of this video tutorial:

Browse Categories