How to understand different dataset distributions when a regression problem is concerned? [closed]

Question

asked Feb 12, 2020 in Data Science by blackindya (18.4k points)

Consider a regression model on an arbitrary dataset. I have a dataset named ‘a’ with 10k rows, 10 features, and a target variable, on which I train and test a model. Now consider another dataset named ‘b’ with 100k rows and no target variable. I want to use the model trained on ‘a’ to predict on ‘b’, but I do not know whether ‘b’ follows a distribution similar to that of ‘a’. Even once the regression model is trained, my concern is confidence: whether the predictions on dataset ‘b’ are good enough or not.

I am aware of the KS test and Earth Mover’s distance, but they compare only individual features, not an entire dataset.
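
For illustration, the per-feature comparison mentioned above could look like the following minimal sketch. It assumes both datasets are numeric NumPy arrays with the same columns in the same order. The function name `compare_features` and the synthetic data are hypothetical and only stand in for ‘a’ and ‘b’. It shows the limitation as well: each column is tested on its own, so relationships between features are not checked.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance


def compare_features(X_a, X_b):
    """Compare two datasets one column at a time.

    Returns a list of (column index, KS statistic, KS p-value,
    Earth Mover's / Wasserstein distance) tuples.
    """
    results = []
    for j in range(X_a.shape[1]):
        ks = ks_2samp(X_a[:, j], X_b[:, j])
        emd = wasserstein_distance(X_a[:, j], X_b[:, j])
        results.append((j, ks.statistic, ks.pvalue, emd))
    return results


# Synthetic stand-ins for the features of 'a' and 'b' (assumption, not real data).
rng = np.random.default_rng(0)
X_a = rng.normal(size=(1000, 3))
X_b = rng.normal(size=(5000, 3))
X_b[:, 2] += 1.0  # deliberately shift the last feature in 'b'

report = compare_features(X_a, X_b)
for j, stat, pvalue, emd in report:
    print(f"feature {j}: KS={stat:.3f}, p={pvalue:.3g}, EMD={emd:.3f}")
```

A small p-value or a large distance for a column suggests that this feature is distributed differently in the two datasets.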

1 Answer

supriya · Answer 1 · 2020-02-12T13:21:42+0000

The important point here is to understand what you want to solve, and why.

Because dataset ‘b’ has no target variable, one option is unsupervised learning such as clustering, which groups the rows into two or more clusters (you choose how many) and labels each row with a cluster identifier. Alternatively, you can group the rows manually, or classify them based on the patterns you see across the dataset, and automate that task later. Once that is done, you can use the model trained on dataset ‘a’ to predict on dataset ‘b’.
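
One possible reading of the clustering suggestion, as a rough sketch only: fit clusters on the features of ‘a’, assign every row of ‘b’ to the nearest of those clusters, and compare how the rows are spread across clusters. This uses scikit-learn's `KMeans` and `StandardScaler`; the helper name `cluster_shares`, the choice of three clusters, and the synthetic data are assumptions made for the example and do not come from the answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_shares(X_a, X_b, n_clusters=3, seed=0):
    """Fit k-means on X_a, then return the fraction of rows per cluster
    for X_a and for X_b (both arrays of length n_clusters)."""
    scaler = StandardScaler().fit(X_a)  # scale using statistics of 'a' only
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels_a = km.fit_predict(scaler.transform(X_a))
    labels_b = km.predict(scaler.transform(X_b))
    share_a = np.bincount(labels_a, minlength=n_clusters) / len(labels_a)
    share_b = np.bincount(labels_b, minlength=n_clusters) / len(labels_b)
    return share_a, share_b


# Synthetic stand-ins (assumption): 'a' has three equal groups,
# while 'b' contains rows from only one of them.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_a = np.vstack([c + rng.normal(size=(300, 2)) for c in centers])
X_b = centers[0] + rng.normal(size=(2000, 2))

share_a, share_b = cluster_shares(X_a, X_b)
print("share of 'a' per cluster:", share_a.round(2))
print("share of 'b' per cluster:", share_b.round(2))
```

If the two sets of shares differ a lot, ‘b’ covers the feature space differently from ‘a’, and predictions for the under-represented regions deserve less trust.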

