
I'd like some general tips on how data should be pre-processed before feeding it into a machine learning algorithm. I'm trying to understand why we make different decisions at preprocessing time, and it would be very informative if someone could walk through the things we need to consider when cleaning up data, removing superfluous data, and so on. I've searched the net a lot for canonical answers or rules of thumb here, and there don't seem to be any.

I have a set of data in a .tsv file available here. The training set amounts to 7,000 rows, the test set to 3,000. What strategies should I use for handling badly-formed data if 100 rows in each are unreadable? 500? 1,000? Any guidelines to help me reason about this would be very much appreciated.

Sample code would be great to see, but it isn't necessary if you don't feel like it; I just want to understand what I should be doing! :)


1 Answer


These are the typical steps of data preprocessing for machine learning:

Step 1: Import Libraries. The first step is usually importing the libraries that will be needed in the program.
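A minimal set of imports for the steps below might look like this (assuming pandas and scikit-learn are the tools of choice; the answer doesn't mandate any particular library):

```python
import numpy as np          # numerical arrays
import pandas as pd         # tabular data loading and manipulation
from sklearn.impute import SimpleImputer                          # missing values (Step 3)
from sklearn.preprocessing import OneHotEncoder, StandardScaler   # Steps 4 and 6
from sklearn.model_selection import train_test_split              # Step 5
```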

Step 2: Import the Dataset.
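For a .tsv file like the one in the question, loading with `pandas.read_csv` and a tab separator is one option; `on_bad_lines="skip"` (pandas 1.3+) is one way to handle the badly-formed rows the question asks about. The data here is an inline string so the sketch runs on its own; in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Illustrative TSV content with one malformed row (too many fields).
tsv_data = (
    "age\tcity\tlabel\n"
    "25\tLondon\t1\n"
    "30\tParis\t0\n"
    "99\tx\ty\tz\textra\n"   # malformed: 5 fields instead of 3
    "40\tBerlin\t1\n"
)

df = pd.read_csv(
    io.StringIO(tsv_data),   # in practice: a path such as "train.tsv"
    sep="\t",                # .tsv files are tab-separated
    on_bad_lines="skip",     # drop malformed rows instead of raising
)
print(df.shape)  # (3, 3): the malformed row was skipped
```

Whether skipping is acceptable depends on how many rows are affected: losing 100 of 7,000 rows (~1.4%) is usually tolerable, while losing 1,000 (~14%) is a sign you should repair the rows or investigate the source instead.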

Step 3: Taking care of missing data in the dataset.
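A common approach, sketched here on toy data, is to replace missing values with a column statistic such as the mean using scikit-learn's `SimpleImputer`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric matrix with one missing value (illustrative, not the asker's data)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # NaN in column 0 becomes (1 + 7) / 2 = 4.0
```

Other strategies (`"median"`, `"most_frequent"`, or simply dropping rows) may be more appropriate depending on how much data is missing and why.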

Step 4: Encoding categorical data.
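Most algorithms need numeric inputs, so string categories must be encoded. One simple option, shown on an illustrative column, is one-hot encoding with `pandas.get_dummies`:

```python
import pandas as pd

# Illustrative frame with one categorical and one numeric column
df = pd.DataFrame({"city": ["London", "Paris", "London"],
                   "age": [25, 30, 40]})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())  # ['age', 'city_London', 'city_Paris']
```

`sklearn.preprocessing.OneHotEncoder` does the same job inside a pipeline, which is preferable when the encoding must be reused on the test set.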

Step 5: Splitting the Dataset into the Training set and Test Set.
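With scikit-learn this is a one-liner; the 70/30 split below mirrors the 7,000/3,000 proportion in the question (the data itself is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 illustrative samples, 2 features
y = np.arange(10)

# Hold out 30% for testing; fix random_state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```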

Step 6: Feature Scaling.
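A standard choice is `StandardScaler`, which rescales each feature to zero mean and unit variance. The key point, shown below on toy data, is to fit the scaler on the training set only and then apply the same transform to the test set, so test-set statistics never leak into training:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative features on very different scales
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.0, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
print(X_train_scaled.mean(axis=0))  # approximately [0, 0]
```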

For a deeper walkthrough, refer to the following link: https://medium.com/datadriveninvestor/data-preprocessing-for-machine-learning-188e9eef1d2c
