2 views

I am building a model for binary classification problem where each of my data points is of 300 dimensions (I am using 300 features). I am using a PassiveAggressiveClassifier from sklearn. The model is performing really well.

I wish to plot the decision boundary of the model. How can I do so?

To get a sense of the data, I am plotting it in 2D using TSNE. I reduced the dimensions of the data in 2 steps - from 300 to 50, then from 50 to 2 (this is a common recommendation). Below is the code snippet for the same :

from sklearn.manifold import TSNE

from sklearn.decomposition import TruncatedSVD

X_Train_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_train)

X_Train_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_Train_reduced)

#some convert lists of lists to 2 dataframes (df_train_neg, df_train_pos) depending on the label -

#plot the negative points and positive points

scatter(df_train_neg.val1, df_train_neg.val2, marker='o', c='red')

scatter(df_train_pos.val1, df_train_pos.val2, marker='x', c='green')

I get a decent graph.

Is there a way that I can add a decision boundary to this plot which represents the actual decision boundary of my model in the 300 dim space?

by (33.1k points)

Simply impose a Voronoi tesselation on your 2D plot.

This is much easier than it sounds using a meshgrid and scikit's KNeighborsClassifier.

import numpy as np, matplotlib.pyplot as plt

from sklearn.neighbors.classification import KNeighborsClassifier

from sklearn.datasets.base import load_iris

from sklearn.manifold.t_sne import TSNE

from sklearn.linear_model.logistic import LogisticRegression

# replace the below by your data and model

X,y = iris.data, iris.target

X_Train_embedded = TSNE(n_components=2).fit_transform(X)

print X_Train_embedded.shape

model = LogisticRegression().fit(X,y)

y_predicted = model.predict(X)

# replace the above by your data and model

resolution = 100 # 100x100 background pixels

X2d_xmin, X2d_xmax = np.min(X_Train_embedded[:,0]), np.max(X_Train_embedded[:,0])

X2d_ymin, X2d_ymax = np.min(X_Train_embedded[:,1]), np.max(X_Train_embedded[:,1])

xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution), np.linspace(X2d_ymin, X2d_ymax, resolution))

# approximate Voronoi tesselation on resolution x resolution grid using 1-NN

background_model = KNeighborsClassifier(n_neighbors=1).fit(X_Train_embedded, y_predicted)

voronoiBackground = background_model.predict(np.c_[xx.ravel(), yy.ravel()])

voronoiBackground = voronoiBackground.reshape((resolution, resolution))

plt.contourf(xx, yy, voronoiBackground)

plt.scatter(X_Train_embedded[:,0], X_Train_embedded[:,1], c=y)

plt.show()

You should precisely plot your decision boundary, this will just give you an estimate of roughly where the boundary should lie. It will draw a line between two data points belonging to different classes but will place it in the middle.

For more details on this, go through the Machine Learning Course.

Hope this answer helps.