
Scikit-Learn Cheat Sheet: Python Machine Learning


Python Scikit-Learn Cheat Sheet

If you find it hard to remember all the commands for the different operations in Scikit-learn, don’t worry. You are not alone; it happens more often than you would think.


At Intellipaat, we want our learners to get the most out of our e-learning services. That is why we put together this sklearn cheat sheet: a handy reference for learners who are getting started with Scikit-learn in Python.

This cheat sheet assumes that you have a basic knowledge of Python and machine learning and need a quick reference for looking up Scikit-learn commands.


What is Scikit Learn?

Scikit-learn, or “sklearn”, is a free, open-source machine learning library for the Python programming language. It is a simple yet efficient tool for data mining, data analysis, and machine learning. It features various machine learning algorithms and works together with Python’s scientific and numerical libraries, that is, SciPy and NumPy respectively.

Import Convention

Scikit-learn is a Python library, so you need to import it before you can use it. To do that, type the following command:

import sklearn 

Preprocessing

Preprocessing is the process of converting a raw data set into a clean, meaningful one. It is a must-follow step before you feed your data set to a machine learning algorithm. Preprocessing consists of three main steps, listed below:

1. Data Loading:

You need your data in numeric form, stored in numeric arrays. The two most common ways to load data are shown below; you can also use another kind of numeric array.

Using NumPy:

import numpy as np
a = np.array([(1, 2, 3, 4), (7, 8, 9, 10)], dtype=int)
data = np.loadtxt('file_name.csv', delimiter=',')

Using Pandas:

import pandas as pd
df = pd.read_csv('file_name.csv', header=0)
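
As a minimal sketch of the two loading routes above, the snippet below reads the same small CSV text with NumPy and with Pandas. To stay self-contained it uses an in-memory `io.StringIO` buffer instead of a file on disk; the CSV content and the column names `f1`, `f2`, and `label` are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for 'file_name.csv'
csv_text = "f1,f2,label\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"

# NumPy: skip the header row because loadtxt expects numeric values only
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Pandas: header=0 tells read_csv that the first row holds the column names
df = pd.read_csv(io.StringIO(csv_text), header=0)

# Split the DataFrame into a feature matrix X and a target vector y
X = df[["f1", "f2"]].to_numpy()
y = df["label"].to_numpy()

print(data.shape)        # (rows, columns) of the NumPy array
print(X.shape, y.shape)  # features and targets taken from the DataFrame
```

Separating the features `X` from the target `y` at load time is one convenient convention, since most Scikit-learn calls take the two as separate arguments.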

 

2. Train-Test data:

The next step is to split your data into a training data set and a testing data set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)
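
To see what the split produces, here is a small sketch on made-up data. The array contents, the explicit `test_size=0.25`, and the `stratify=y` argument are illustrative choices, not part of the original snippet.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 20 samples with 2 features each, plus binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.25 is also the default when no size is given;
# stratify=y keeps the class ratio similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in the same subset on every run.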


3. Data Preparation:

Standardization: It rescales each feature to zero mean and unit variance, which makes training better behaved by improving the numerical conditioning of the optimization problem.

from sklearn.preprocessing import StandardScaler
get_names = df.columns
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=get_names)
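
The sketch below checks what `StandardScaler` computes by comparing it with the manual formula z = (x - mean) / std. The DataFrame and its column names `age` and `income` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative DataFrame with two features on very different scales
df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "income": [1000.0, 2000.0, 6000.0]})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Manual version: z = (x - mean) / std, using the population std (ddof=0)
values = df.to_numpy()
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

In a train/test workflow the scaler is usually fitted on the training split only and then applied to the test split with `transform`, so that no information from the test data leaks into training.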

 

Normalization: It rescales each sample (row) to unit norm, which makes training less sensitive to the scale of the features and makes the data better conditioned for convergence.

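import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer
df = pd.read_csv("File_name.csv")
x_array = np.array(df['Column1'])  # values of Column1
# Normalizer scales each row to unit norm, so Column1 is passed as a single row
normalized_X = Normalizer().fit_transform([x_array])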
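
Because `Normalizer` works on rows rather than columns, a tiny worked example may help. The two sample rows below are made up; the first one, (3, 4), has Euclidean length 5, so it becomes (0.6, 0.8).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two illustrative samples (rows) with two features each
X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Normalizer works row by row; the default norm is 'l2'
normalized_X = Normalizer().fit_transform(X)

print(normalized_X)
print(np.linalg.norm(normalized_X, axis=1))  # length of each row after scaling
```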

 


Working on a model

After making all the necessary transformations to make the dataset algorithm-ready, we need to work on the model. This means choosing a model or algorithm that suits the dataset and the kind of predictions we want from it, and then fitting that model.

Model Choosing:

  • Supervised Learning Estimator:

Supervised learning, as the name suggests, is the kind of machine learning where we supervise the outcome by training the model on well-labeled data, that is, data in which the training examples are already tagged with the correct answers.

a. Linear Regression:

from sklearn.linear_model import LinearRegression
new_lr = LinearRegression()

b. Support Vector Machine:

from sklearn.svm import SVC
new_svc = SVC(kernel='linear')

c. Naive Bayes:

from sklearn.naive_bayes import GaussianNB
new_gnb = GaussianNB()

d. KNN:

from sklearn import neighbors
knn=neighbors.KNeighborsClassifier(n_neighbors=1)

 

  • Unsupervised Learning Estimator:

Unlike supervised learning, unsupervised learning trains the model on unlabeled, unclassified data and lets the algorithm find structure in that dataset without any assistance.

a. Principal Component Analysis (PCA):

from sklearn.decomposition import PCA
new_pca= PCA(n_components=0.95)
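
Passing a float such as `0.95` to `n_components` asks PCA to keep just enough components to explain at least that fraction of the variance. The sketch below illustrates this on synthetic data; the data generation (two hidden directions mixed into five features, plus a little noise) is an assumption made purely for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Illustrative data: 200 samples, 5 features, but almost all of the
# variance lives in 2 underlying directions
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# A float between 0 and 1 selects the number of components automatically
new_pca = PCA(n_components=0.95)
X_reduced = new_pca.fit_transform(X)

print(new_pca.n_components_)                    # components actually kept
print(new_pca.explained_variance_ratio_.sum())  # variance they explain
```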

b. K Means:

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=5, random_state=0)

Model Fitting:

Model fitting means training the chosen model on a dataset so that it learns the parameters that best describe that data. The aim is a model that generalizes well to data similar to the data it was trained on; a better-fitted model generally produces more accurate outcomes.

  • Supervised:

new_lr.fit(X, y)
knn.fit(X_train, y_train)
new_svc.fit(X_train, y_train)

  • Unsupervised:

k_means.fit(X_train)
X_train_pca = new_pca.fit_transform(X_train)
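
To make the difference between the two kinds of `fit` concrete, here is a hedged sketch on toy data. The data points, the choice of two clusters, and the explicit `n_init=10` are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: illustrative data that follows y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

new_lr = LinearRegression()
new_lr.fit(X, y)  # learns coef_ and intercept_ from (X, y)
print(new_lr.coef_, new_lr.intercept_)

# Unsupervised: two well-separated groups of points, no labels passed to fit
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
k_means = KMeans(n_clusters=2, random_state=0, n_init=10)
k_means.fit(points)  # learns cluster_centers_ and labels_
print(k_means.labels_)
```

The supervised estimator needs both `X` and `y`, while the unsupervised one receives only the feature matrix and has to discover the grouping itself.
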

Post-Processing

After getting comfortable with the dataset and the model, the next step is to pursue the main goal of machine learning algorithms: forecasting outcomes and making predictions.

Prediction:

Once you have chosen and fitted the model, you can make predictions on your dataset.

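  • Supervised:

y_predict = new_svc.predict(np.random.random((3,5)))  # 3 random samples, assuming 5 features
y_predict = new_lr.predict(X_test)  # predicted target values
y_predict = knn.predict_proba(X_test)  # class probabilities per sample

  • Unsupervised:

y_pred = k_means.predict(X_test)  # index of the closest cluster per sample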
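
The following sketch contrasts `predict` and `predict_proba` for a KNN classifier. The one-dimensional training data and the choice of `n_neighbors=3` are made up so that the result is easy to follow by eye.

```python
import numpy as np
from sklearn import neighbors

# Illustrative 1-D training data with two classes
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[1.5], [10.5]])

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# predict returns one class label per test sample
labels = knn.predict(X_test)

# predict_proba returns one row per sample and one column per class;
# with uniform weights each entry is the share of neighbours in that class
probabilities = knn.predict_proba(X_test)

print(labels)
print(probabilities)
```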

Evaluate Performance:

Evaluating the predictive performance of your model is necessary. Machine learning offers several metrics for assessing classification, regression, and clustering models. The most common ones are listed below.

Classification Metrics:

a. Confusion Matrix:

from sklearn.metrics import confusion_matrix
y_pred = knn.predict(X_test)  # class labels predicted by the fitted classifier
print(confusion_matrix(y_test, y_pred))

b. Accuracy Score:

knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
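
As a small worked example of how the two classification metrics relate, the sketch below builds a confusion matrix from made-up labels and derives the accuracy from its diagonal.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative true and predicted labels for a binary classifier
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total
manual_accuracy = cm.trace() / cm.sum()
print(manual_accuracy, accuracy_score(y_true, y_pred))
```

Here three of the five predictions are correct, so both ways of computing the accuracy give 0.6.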

Regression Metrics:

a. Mean Absolute Error:

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]  # true target values
# y_predict must hold one prediction per entry in y_true
mean_absolute_error(y_true, y_predict)

b. Mean Squared Error:

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_predict)

c. R² Score:

from sklearn.metrics import r2_score
r2_score(y_true, y_predict)
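
The three regression metrics can be reproduced by hand, which makes their definitions easier to remember. The sketch below uses made-up targets and predictions and compares the manual formulas with the Scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_predict = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_predict

mae_manual = np.mean(np.abs(errors))  # mean of |error|
mse_manual = np.mean(errors ** 2)     # mean of error squared
# R² = 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae_manual, mean_absolute_error(y_true, y_predict))
print(mse_manual, mean_squared_error(y_true, y_predict))
print(r2_manual, r2_score(y_true, y_predict))
```

For MAE and MSE lower is better, while an R² close to 1 indicates that the predictions explain most of the variance in the targets.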

Clustering Metrics:

a. Homogeneity:

from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_predict)

b. V-measure:

from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_predict)
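
Both clustering metrics compare groupings rather than label values, so renaming the clusters does not change the score. The sketch below shows this on made-up labels and also shows a case where homogeneity and V-measure disagree; the variable names are chosen for illustration.

```python
from sklearn.metrics import homogeneity_score, v_measure_score

# Illustrative ground-truth classes and cluster assignments
labels_true = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]  # same grouping, cluster ids swapped
over_split = [0, 1, 2, 3]  # every point in its own cluster

# Swapping the cluster ids does not change either score
h_same = homogeneity_score(labels_true, relabelled)
v_same = v_measure_score(labels_true, relabelled)
print(h_same, v_same)

# Each tiny cluster is pure, so homogeneity stays perfect, but
# V-measure drops because it also accounts for completeness
h_split = homogeneity_score(labels_true, over_split)
v_split = v_measure_score(labels_true, over_split)
print(h_split, v_split)
```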

Cross-Validation:

from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(new_lr, X, y, cv=2))
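
Here is a runnable sketch of cross-validation, assuming the Iris data set that ships with Scikit-learn as a stand-in for your own data. The choice of `n_neighbors=5` is arbitrary.

```python
from sklearn import neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

# Iris is bundled with scikit-learn, so no download is needed
X, y = load_iris(return_X_y=True)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# cv=4 splits the data into 4 folds; each fold is used once for testing
scores = cross_val_score(knn, X, y, cv=4)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

Reporting the mean of the fold scores gives a more stable estimate than a single train/test split.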

Model Tuning:

Model tuning is the final step in a machine learning workflow before presenting the final outcomes. Models are parameterized so that their behavior can be tuned for a given problem. Tuning means searching for the right set of parameters, and there are two main ways of doing that:

  • Grid Search:

Grid search tunes parameters methodically: it builds and evaluates a model for every combination of the parameter values specified in a grid.

from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
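
An end-to-end sketch of grid search follows, again assuming the bundled Iris data set in place of your own data. The parameter ranges and `cv=4` are illustrative choices.

```python
import numpy as np
from sklearn import neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = neighbors.KNeighborsClassifier()
params = {"n_neighbors": np.arange(1, 6), "metric": ["euclidean", "cityblock"]}

# 5 values x 2 metrics = 10 candidates, each scored with 4-fold CV
grid = GridSearchCV(estimator=knn, param_grid=params, cv=4)
grid.fit(X_train, y_train)

print(grid.best_params_)           # best combination found
print(grid.best_score_)            # its mean cross-validated accuracy
print(grid.score(X_test, y_test))  # accuracy on the held-out test set
```

Keeping a separate test split means the final score is measured on data that played no part in choosing the parameters.
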
  • Randomized Parameter Optimization:

In randomized search, a fixed number of parameter settings is sampled at random from the specified values or distributions. The number of settings that are tried is given by n_iter.

from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params,cv=4,n_iter=8, random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)
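
Randomized search can also sample from a distribution instead of a fixed list. The sketch below assumes the bundled Iris data set and uses `scipy.stats.randint` for `n_neighbors`; the range 1 to 20 is an arbitrary choice for illustration.

```python
from scipy.stats import randint
from sklearn import neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

knn = neighbors.KNeighborsClassifier()

# randint(1, 21) draws integers from 1 to 20; lists are sampled uniformly
params = {"n_neighbors": randint(1, 21), "weights": ["uniform", "distance"]}

# n_iter=8 means 8 parameter settings are sampled and evaluated
rsearch = RandomizedSearchCV(
    estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5
)
rsearch.fit(X, y)

print(rsearch.best_params_)
print(rsearch.best_score_)
```

Compared with grid search, the cost is controlled directly through `n_iter`, which helps when the parameter space is too large to enumerate.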


With this, we come to the end of this sklearn cheat sheet. For detailed, in-depth knowledge, you can enroll in the Python Certification Training provided by Intellipaat. This training program guides you step by step and gives you the skills to master Python, one of the most popular and widely used languages. You will also learn the important Python libraries and concepts, such as SciPy, NumPy, Matplotlib, Scikit-learn, Pandas, and lambda functions. Intellipaat will also assist you with free Python developer interview questions prepared by experts. You will have 24/7 technical support and assistance from experts in the respective technologies at Intellipaat throughout the certification period.

About the Author

Senior Consultant Analytics & Data Science

Sahil Mattoo, a Senior Software Engineer at Eli Lilly and Company, is an accomplished professional with 14 years of experience in languages such as Java, Python, and JavaScript. Sahil has a strong foundation in system architecture, database management, and API integration.