Back

Explore Courses Blog Tutorials Interview Questions
0 votes
1 view
in Machine Learning by (19k points)

I'm trying to recover from a PCA done with scikit-learn, which features are selected as relevant.

A classic example with IRIS dataset.

import pandas as pd

import pylab as pl

from sklearn import datasets

from sklearn.decomposition import PCA

# load dataset

iris = datasets.load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data

df_norm = (df - df.mean()) / df.std()

# PCA

pca = PCA(n_components=2)

pca.fit_transform(df_norm.values)

print pca.explained_variance_ratio_

This returns

In [42]: pca.explained_variance_ratio_

Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features allow these two explained variance among the dataset ? Said diferently, how can i get the index of this features in iris.feature_names ?

In [47]: print iris.feature_names

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

1 Answer

0 votes
by (33.1k points)

In PCA documentation, The output you need is the task of components_ attribute. It outputs an array of [n_components, n_features], so to get how components are linearly related to the different features and each coefficient represents the correlation between a particular pair of components and features.

For example:

import pandas as pd

import pylab as pl

from sklearn import datasets

from sklearn.decomposition import PCA

# load dataset

iris = datasets.load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data

from sklearn import preprocessing

data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns) 

# PCA

pca = PCA(n_components=2)

pca.fit_transform(data_scaled)

# Dump components relations with features:

print(pd.DataFrame(pca.components_,columns=data_scaled.columns,index = ['PC-1','PC-2']))

      sepal length (cm)  sepal width (cm) petal length (cm)  petal width (cm)

PC-1           0.522372   -0.263355   0.581254 0.565611

PC-2          -0.372318     -0.925556 -0.021095         -0.065416

Hope this answer helps.

Visit Pandas Tutorial to learn more.

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

28.4k questions

29.7k answers

500 comments

94.7k users

Browse Categories

...