+2 votes
2 views
in Machine Learning by (4.2k points)

I have a dataset from sklearn and I plotted the distribution of the load_diabetes.target data (i.e. the values of the regression that the load_diabetes.data are used to predict).

I used this dataset because it has the fewest variables/attributes of the regression datasets in sklearn.datasets.

Using Python 3, how can I get the distribution type and parameters of the distribution this most closely resembles?

All I know is that the target values are all positive and skewed (positive skew/right skew). Is there a way in Python to provide a few distributions and then get the best fit for the target data/vector? Or to actually suggest a fit based on the data that's given? That would be really useful for people who have theoretical statistical knowledge but little experience applying it to "real data".

Bonus: Would it make sense to use this type of approach to figure out what your posterior distribution would be with "real data"? If not, why not?

from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

#Get Data
data = load_diabetes()
X, y_ = data.data, data.target

#Organize Data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

#Plot Data
fig, ax = plt.subplots()
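# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(SR_y, bins=25, color="g", kde=True, stat="density", ax=ax)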
plt.show()


1 Answer

+2 votes
by (6.8k points)

To the best of my knowledge, there is no automatic way of obtaining the distribution type and parameters of a sample (as inferring the distribution of a sample is a statistical problem by itself).

In my opinion, the best you can do is:

(for each attribute)

  • Try to fit each attribute to a reasonably large list of possible distributions (e.g. see Fitting empirical distribution to theoretical ones with Scipy (Python)? for an example with Scipy)
  • Evaluate all your fits and pick the best one. This can be done by performing a Kolmogorov-Smirnov test between your sample and each of the distributions of the fit (you have an implementation in Scipy, again), and picking the one that minimizes D, the test statistic (a.k.a. the difference between the sample and the fit).
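The two steps above can be sketched roughly as follows. This is only an illustration, not a definitive recipe: the helper name `rank_by_ks`, the shortened candidate list, and the synthetic right-skewed sample (a stand-in for the `y_` vector from the question) are all assumptions made for the example. Because the parameters are estimated from the same sample, the p-values returned by the test are not reliable; only D is used here, as a relative ranking.

```python
import numpy as np
import scipy.stats as st


def rank_by_ks(sample, dist_names):
    """Fit each candidate by maximum likelihood and rank by the KS statistic D.

    Hypothetical helper for illustration; returns (name, D, params) tuples
    sorted from smallest D (closest fit) to largest.
    """
    results = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)  # shape parameter(s), then loc and scale
        d_stat, _p_value = st.kstest(sample, name, args=params)
        results.append((name, d_stat, params))
    return sorted(results, key=lambda r: r[1])


# Synthetic right-skewed data, standing in for the real target vector (assumption)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=40.0, size=500)

ranking = rank_by_ks(sample, ["gamma", "rayleigh", "norm"])
for name, d_stat, params in ranking:
    print(f"{name:10s} D = {d_stat:.4f}")
```

To try it on the question's data, pass `y_` in place of `sample` and extend the candidate list as needed.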

Bonus: It would make sense - as you'll be building a model on each of the variables as you pick a fit for each one - although the goodness of your prediction would depend on the quality of your data and the distributions you are using for fitting. You are building a model, after all.

You can use the following code to fit different distributions to your data by maximum likelihood:

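import scipy.stats
from sklearn.datasets import load_diabetes

# Target vector from the question
y = load_diabetes().target

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    # Maximum-likelihood fit: returns shape parameter(s), then loc and scale
    param = dist.fit(y)
    print(dist_name, param)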


 You can see a sample snippet about how to use the parameters obtained here: Fitting empirical distribution to theoretical ones with Scipy (Python)?
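As a rough idea of what using the fitted parameters looks like: `fit` returns a tuple holding any shape parameters first, followed by `loc` and `scale`, and that tuple can be passed back to the distribution's `pdf`. The sketch below assumes a gamma candidate and a synthetic sample standing in for the real target vector; the variable names are illustrative only.

```python
import numpy as np
import scipy.stats as st

# Synthetic right-skewed data, standing in for the real target vector (assumption)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=40.0, size=500)

dist = st.gamma
params = dist.fit(sample)

# The last two entries are always loc and scale; anything before is a shape parameter
*shape_params, loc, scale = params

# Evaluate the fitted density on a grid spanning the observed data
x = np.linspace(sample.min(), sample.max(), 200)
pdf = dist.pdf(x, *shape_params, loc=loc, scale=scale)

print("shape:", shape_params, "loc:", loc, "scale:", scale)
```

The arrays `x` and `pdf` can then be drawn over a density-normalized histogram of the data (for example with `ax.plot(x, pdf)`) to compare the fit visually.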

Then you can pick the distribution with the highest log-likelihood. Because a distribution with more parameters will tend to reach a higher likelihood, there are also other criteria for choosing the "best" distribution, such as Bayesian posterior probability or AIC, BIC, or BICc values.
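One possible way to compute these scores is sketched below, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k counts the fitted parameters (including `loc` and `scale`) and n is the sample size. The helper name `score_fits`, the shortened candidate list, and the synthetic sample are assumptions made for the example.

```python
import numpy as np
import scipy.stats as st


def score_fits(sample, dist_names):
    """Fit each candidate by maximum likelihood and compute log-likelihood, AIC, BIC.

    Hypothetical helper for illustration; returns a list of dicts sorted by AIC
    (lower is better).
    """
    n = len(sample)
    rows = []
    for name in dist_names:
        dist = getattr(st, name)
        params = dist.fit(sample)
        log_lik = float(np.sum(dist.logpdf(sample, *params)))
        k = len(params)  # shape parameter(s) + loc + scale
        rows.append({
            "name": name,
            "k": k,
            "log_lik": log_lik,
            "aic": 2 * k - 2 * log_lik,
            "bic": k * np.log(n) - 2 * log_lik,
        })
    return sorted(rows, key=lambda r: r["aic"])


# Synthetic right-skewed data, standing in for the real target vector (assumption)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=40.0, size=500)

scores = score_fits(sample, ["gamma", "rayleigh", "norm"])
for row in scores:
    print(f"{row['name']:10s} logL = {row['log_lik']:.1f}  "
          f"AIC = {row['aic']:.1f}  BIC = {row['bic']:.1f}")
```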

For your bonus question, I think there is no generic answer. If your data set is sufficiently large and was obtained under the same conditions as the real-world data, you can do it.

For that reason, training in probabilistic graphical models is useful background to have.
