Plotting decision boundary for High Dimension Data

Question

asked Jul 2, 2019 in Data Science by ParasSharma1 (19k points)

I am building a model for a binary classification problem where each data point has 300 dimensions (I am using 300 features). I am using a PassiveAggressiveClassifier from sklearn. The model is performing really well.

I wish to plot the decision boundary of the model. How can I do so?

To get a sense of the data, I am plotting it in 2D using TSNE. I reduced the dimensions of the data in 2 steps: from 300 to 50, then from 50 to 2 (this is a common recommendation). The code snippet is below:

from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
X_Train_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_train)
X_Train_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_Train_reduced)

# (code omitted) convert the embedded points from lists of lists into 2 dataframes
# (df_train_neg, df_train_pos) depending on the label

# plot the negative points and positive points
import matplotlib.pyplot as plt
plt.scatter(df_train_neg.val1, df_train_neg.val2, marker='o', c='red')
plt.scatter(df_train_pos.val1, df_train_pos.val2, marker='x', c='green')

I get a decent graph.

Is there a way that I can add a decision boundary to this plot which represents the actual decision boundary of my model in the 300 dim space?

1 Answer

Anurag · Answer 1 · 2019-07-02T10:54:22+0000

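For this problem, you can use scikit-learn's KNeighborsClassifier: fit a 1-nearest-neighbor model on the 2D embedded points, using your model's predictions as the labels, and use it to color the background of the plot. This gives an approximation of the model's decision regions in the embedded space, not the exact boundary in the 300-dimensional space.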

K Nearest Neighbors:

KNN is a non-parametric, lazy learning algorithm. It keeps the labeled training points and classifies a new sample point by finding the stored points nearest to it and predicting the majority class among them.
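As a minimal sketch of that idea, the snippet below fits a KNeighborsClassifier on a handful of made-up 2D points (the data and the choice of n_neighbors=3 are assumptions for illustration only) and classifies two new points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made up for illustration): two well-separated clusters
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# "Lazy" learning: fit() mostly just stores (and may index) the points;
# the real work happens at prediction time
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Each query point takes the majority label of its 3 nearest stored points
predictions = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(predictions)  # [0 1]
```

With n_neighbors=1, every location simply takes the label of the single closest stored point, which is what makes 1-NN usable for filling in a plot background.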

For example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression

# replace the below with your data and model
iris = load_iris()
X, y = iris.data, iris.target
X_Train_embedded = TSNE(n_components=2).fit_transform(X)
print(X_Train_embedded.shape)
model = LogisticRegression().fit(X, y)
y_predicted = model.predict(X)
# replace the above with your data and model

# create a meshgrid covering the 2D embedding
resolution = 100  # 100x100 background pixels
X2d_xmin, X2d_xmax = np.min(X_Train_embedded[:, 0]), np.max(X_Train_embedded[:, 0])
X2d_ymin, X2d_ymax = np.min(X_Train_embedded[:, 1]), np.max(X_Train_embedded[:, 1])
xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution),
                     np.linspace(X2d_ymin, X2d_ymax, resolution))

# approximate Voronoi tessellation on a resolution x resolution grid using 1-NN,
# fitted on the model's predictions so the background shows the model's decisions
background_model = KNeighborsClassifier(n_neighbors=1).fit(X_Train_embedded, y_predicted)
voronoiBackground = background_model.predict(np.c_[xx.ravel(), yy.ravel()])
voronoiBackground = voronoiBackground.reshape((resolution, resolution))

# plot: background = predicted regions, points = true labels
plt.contourf(xx, yy, voronoiBackground)
plt.scatter(X_Train_embedded[:, 0], X_Train_embedded[:, 1], c=y)
plt.show()
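A complementary view, offered only as a sketch: for a linear model, decision_function returns a signed score computed in the original feature space, so coloring the embedded points by that score shows how close each one sits to the real boundary. The synthetic data from make_classification and the LogisticRegression stand-in are assumptions; substitute your own 300-feature data and fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE

# Synthetic stand-in for a 300-feature binary problem (not the asker's data)
X, y = make_classification(n_samples=200, n_features=300,
                           n_informative=20, random_state=0)

# Stand-in linear model; any classifier exposing decision_function would do
model = LogisticRegression(max_iter=1000).fit(X, y)

# Signed score in the original 300-dimensional space: the sign gives the
# predicted class, the magnitude grows with distance from the boundary
scores = model.decision_function(X)

# Same two-step reduction as in the question: 300 -> 50 -> 2
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)
X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)

# Color each embedded point by its score, with the color scale centered on 0
fig, ax = plt.subplots()
limit = np.abs(scores).max()
points = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=scores,
                    cmap="coolwarm", vmin=-limit, vmax=limit)
fig.colorbar(points, ax=ax, label="decision_function score")
```

Call plt.show() afterwards to display the figure. Points with pale colors are the ones nearest the boundary in the full feature space, wherever t-SNE happens to place them in 2D.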

Hope this answer helps.
