I'm looking for some general tips on how data should be pre-processed before feeding it into a machine learning algorithm. I'm trying to deepen my understanding of why we make different decisions at preprocessing time. If someone could walk through the different things we need to consider when cleaning up data, removing superfluous data, and so on, I would find it very informative; I have searched the web a lot for canonical answers or rules of thumb, and there don't seem to be any.
I have a set of data in a .tsv file available here. The training set amounts to 7,000 rows, the test set to 3,000. What strategies should I use for handling badly-formed data if 100 rows in each set are unreadable? 500? 1,000? Any guidelines to help me reason about this would be very much appreciated.
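To make the question concrete, here is a minimal sketch of what I mean by "unreadable" rows; it assumes Python with just the standard library, and the sample data and field count are placeholders standing in for my real file:

```python
import csv
import io

# Hypothetical sample standing in for the real .tsv file: rows with the
# wrong number of fields are what I'm calling "badly formed".
SAMPLE_TSV = (
    "id\tfeature\tlabel\n"
    "1\t0.5\tA\n"
    "2\t0.7\n"                  # malformed: missing a field
    "3\t0.1\tB\n"
    "bad line without tabs\n"   # malformed: not tab-separated at all
)

def load_rows(text, n_fields=3):
    """Return (header, good_rows, n_bad), skipping rows with the wrong field count."""
    good, bad = [], 0
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    for row in reader:
        if len(row) == n_fields:
            good.append(row)
        else:
            bad += 1
    return header, good, bad

header, rows, n_bad = load_rows(SAMPLE_TSV)
bad_fraction = n_bad / (len(rows) + n_bad)
print(f"kept {len(rows)} rows, skipped {n_bad} ({bad_fraction:.0%} bad)")
```

My instinct is to just skip bad rows like this and track the fraction skipped, but I don't know at what fraction that stops being reasonable, which is the heart of my question.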
Sample code would be great to see, but it's not necessary if you don't feel like it; I just want to understand what I should be doing! :)