in Data Science by (17.6k points)

I am building an automated cleaning process that cleans null values from a dataset. I found a few functions, such as mode, median, and mean, that can be used to fill NaN values in the data. But which one should I select? If the data is categorical, it has to be either the mode or the median, while for continuous data it has to be the mean or the median. So, to determine whether a column is categorical or continuous, I decided to build a machine learning classification model.
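As a minimal sketch of the filling step described above, assuming pandas is used and that the categorical/continuous decision is already available as a flag (the function name `fill_missing` and the `is_continuous` parameter are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd


def fill_missing(series: pd.Series, is_continuous: bool) -> pd.Series:
    """Fill NaN with the mean (continuous) or the mode (categorical)."""
    if is_continuous:
        fill_value = series.mean()  # NaN values are skipped by default
    else:
        # mode() can return several values on a tie; take the first one
        fill_value = series.mode(dropna=True).iloc[0]
    return series.fillna(fill_value)


continuous = pd.Series([1.0, 2.0, np.nan, 3.0])
categorical = pd.Series([0, 1, 1, np.nan])

filled_continuous = fill_missing(continuous, is_continuous=True)
filled_categorical = fill_missing(categorical, is_continuous=False)
```

Swapping `series.mean()` for `series.median()` gives the median variant for either type.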

I used a few features:
1) standard deviation of the data
2) number of unique values in the data
3) total number of rows in the data
4) ratio of the number of unique values to the total number of rows
5) minimum value of the data
6) maximum value of the data
7) number of data points between the median and the 75th percentile
8) number of data points between the median and the 25th percentile
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker
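The twelve features above could be computed for one numeric column roughly as follows. This is only a sketch under stated assumptions: the whiskers are taken to follow the usual box-plot convention of 1.5 times the interquartile range beyond the quartiles, missing values are dropped before counting rows, and the function name `column_features` and the dictionary keys are hypothetical.

```python
import numpy as np


def column_features(values) -> dict:
    """Compute the 12 descriptive features for a single numeric column."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # assumption: ignore missing entries

    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr  # assumed lower whisker limit
    upper = q3 + 1.5 * iqr  # assumed upper whisker limit

    n_rows = len(values)
    n_unique = len(np.unique(values))
    return {
        "std": float(np.std(values)),
        "n_unique": n_unique,
        "n_rows": n_rows,
        "unique_ratio": n_unique / n_rows,
        "min": float(values.min()),
        "max": float(values.max()),
        "n_median_to_q3": int(np.sum((values > median) & (values <= q3))),
        "n_q1_to_median": int(np.sum((values >= q1) & (values < median))),
        "n_q3_to_upper": int(np.sum((values > q3) & (values <= upper))),
        "n_lower_to_q1": int(np.sum((values >= lower) & (values < q1))),
        "n_above_upper": int(np.sum(values > upper)),
        "n_below_lower": int(np.sum(values < lower)),
    }


features = column_features([1, 2, 3, 4, 5, 6, 7, 8, 9, 100, np.nan])
```

Each column of a dataset would yield one such feature row, and those rows form the training set for the classifier.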

First, with these 12 features and around 55 training samples, I used a logistic regression model on the normalized features to predict the label: 1 (continuous) or 0 (categorical).

The fun part is that it worked!

But did I do it the right way? Is this a correct method for predicting the nature of the data? Please advise me on how I could improve it further.
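For reference, the training step described in the question might look like the sketch below. It assumes scikit-learn, interprets "normalized" as standardization with `StandardScaler` (min-max scaling would be an equally plausible reading), and uses purely synthetic data as a stand-in for the real 55 rows of 12 features. With so few samples, cross-validation gives a more honest picture than a single fit, so the sketch scores the model with 5-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 55 columns, each described by 12 features.
X = rng.normal(size=(55, 12))
# Synthetic labels: 1 = continuous, 0 = categorical.
y = np.arange(55) % 2
# Shift one feature by class so the toy problem is learnable.
X[:, 3] += 3.0 * y

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Keeping the scaler inside the pipeline avoids leaking statistics from the held-out fold into training.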

1 Answer

by (39.1k points)

Below is a better approach that can help you take this system forward, although it is somewhat time-consuming.

For each column with missing data, find the nearest neighbor of every row that has a missing value and fill the gap with that neighbor's value. Suppose you have k columns, excluding the target: treat each column in turn as the dependent variable and the remaining k-1 columns as the independent variables.

Then find the nearest neighbor of the row in question using those k-1 columns; the neighbor's value in the dependent column is the desired value for the missing attribute.
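The nearest-neighbor idea above can be sketched with NumPy alone. This is an illustration under assumptions, not a definitive implementation: all columns are numeric, distance is the mean squared difference over the other columns that both rows have observed, and the function name `nearest_neighbor_impute` is hypothetical. In practice the columns would usually be scaled first so that no single column dominates the distance. scikit-learn's `KNNImputer` offers a ready-made variant of the same idea.

```python
import numpy as np


def nearest_neighbor_impute(data) -> np.ndarray:
    """Fill each NaN with the value from the nearest row that has it observed."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    n_cols = data.shape[1]

    for col in range(n_cols):
        missing_rows = np.flatnonzero(np.isnan(data[:, col]))
        donor_rows = np.flatnonzero(~np.isnan(data[:, col]))
        if missing_rows.size == 0 or donor_rows.size == 0:
            continue

        # The remaining k-1 columns act as the independent variables.
        others = np.delete(np.arange(n_cols), col)
        for row in missing_rows:
            diff = data[donor_rows][:, others] - data[row, others]
            shared = ~np.isnan(diff)  # columns observed in both rows
            n_shared = shared.sum(axis=1)
            sq_sum = np.where(shared, diff ** 2, 0.0).sum(axis=1)
            # Donors sharing no observed column get infinite distance.
            dist = np.where(n_shared > 0, sq_sum / np.maximum(n_shared, 1), np.inf)
            filled[row, col] = data[donor_rows[np.argmin(dist)], col]
    return filled


data = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [1.1, np.nan, 110.0],
    [np.nan, 19.0, 190.0],
])
filled = nearest_neighbor_impute(data)
```

Here the third row borrows its missing value from the first row, and the fourth row borrows from the second, because those are the closest donors on the remaining columns.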
