in Machine Learning by (19k points)

I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.

A classic example with the Iris dataset:

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)

This returns

In [42]: pca.explained_variance_ratio_

Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features account for these two explained variance ratios in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?

In [47]: print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

1 Answer

by (33.1k points)

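According to the PCA documentation, what you need is the components_ attribute. It is an array of shape (n_components, n_features): each row is one principal component, and each coefficient is the weight (loading) of one feature in that component, so the array shows how the components are linearly related to the original features. Note that PCA does not select a subset of features; every component is a linear combination of all of them, and the features with the largest absolute coefficients contribute most to that component.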

For example:

import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data (zero mean, unit variance per feature)
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump components relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416
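To turn that table into indices into iris.feature_names, one option is to look at the absolute value of each coefficient and take the largest per component. The following is only a minimal sketch under some assumptions: it uses StandardScaler for the scaling step, and the names loadings, top_idx, top_features and ranking are made up for illustration. The sign of a component can differ between scikit-learn versions, which is why the comparison is done on absolute values.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load and standardize the data (zero mean, unit variance per feature)
iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)

# Fit a 2-component PCA
pca = PCA(n_components=2).fit(X)

# Rows are components, columns are the original features
loadings = pd.DataFrame(
    pca.components_, columns=iris.feature_names, index=["PC-1", "PC-2"]
)

# For each component, position in iris.feature_names of the feature
# with the largest absolute coefficient
top_idx = np.abs(pca.components_).argmax(axis=1)
top_features = [iris.feature_names[i] for i in top_idx]

# Full ranking of features per component, strongest contribution first
ranking = {
    pc: row.abs().sort_values(ascending=False).index.tolist()
    for pc, row in loadings.iterrows()
}

print(top_idx)
print(top_features)
print(ranking)
```

Keep in mind that this only ranks contributions. The components are still combinations of all four features, so the top feature of a component is not a substitute for the component itself.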

Hope this answer helps.
