Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace the word by its ID giving the vector <3,1,2,3,4>.
Given these IDs we could also create a binary vector by giving each word four possible slots, and setting the slot corresponding to a specific word with to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, is commonly referred to as the bag-of-words-method.
Now for my question, what is the best binarization method when it comes to describing features for natural language processing in general, and transition-based dependency parsing (with Nivres algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example, the top word on the stack in the first word in the input queue. Since the order is highly relevant, this rules out the bag-of-words-method.
With best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example, I don't want a word bigram to use 400 million features for 20000 unique words if only 2% of the bigrams actually exist.
Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.