Python Scikit-Learn Cheat Sheet
If you are finding it hard to remember all the different commands for performing different operations in Scikit-Learn, don't worry, you are not alone; it happens more often than you would think.
At Intellipaat, we make sure that our learners get the best out of our e-learning services, and that is exactly why we have come up with this Sklearn cheat sheet: to support our learners with a handy reference as they get started with Scikit-learn in their Python training.
This cheat sheet has been designed assuming that you have a basic knowledge of Python and machine learning but need a quick reference to turn to when you need to look up the commands in Scikit-learn.
What is Scikit Learn?
Scikit-Learn, or "sklearn", is a free, open-source machine learning library for the Python programming language. It is a simple yet efficient tool for data mining, data analysis, and machine learning. It features a wide range of machine learning algorithms and builds on Python's scientific and numerical libraries, SciPy and NumPy.
Import Convention
Before you can start using Scikit-learn, remember that it is a Python library, so you need to import it. To do that, all you have to do is type the following command:
import sklearn
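In practice, you rarely import the whole package at once; instead, you import the specific estimators and helpers you need, for example:
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score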
Preprocessing
The process of converting a raw data set into a meaningful, clean data set is referred to as preprocessing. This is a 'must-follow' step before you can feed your data set to a machine learning algorithm. There are mainly three steps that you need to follow while preprocessing the data:
1. Data Loading:
You need your data in numeric form, stored in numeric arrays. Following are two common ways to load the data; you can also build the arrays with any other tool that produces numeric arrays.
Using NumPy:
>>> import numpy as np
>>> a = np.array([(1,2,3,4),(7,8,9,10)], dtype=int)
>>> data = np.loadtxt('file_name.csv', delimiter=',')
Using Pandas:
>>> import pandas as pd
>>> df = pd.read_csv('file_name.csv', header=0)
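Scikit-learn estimators expect a feature matrix X and a target vector y, which you typically slice out of the DataFrame. A minimal sketch, assuming a hypothetical target column named 'label':
>>> X = df.drop('label', axis=1).values  # every feature column ('label' is a placeholder name)
>>> y = df['label'].values  # the target column itself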
2. Train-Test Split:
The next step is to split your data into a training data set and a testing data set:
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
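By default, train_test_split holds out 25% of the samples for testing; the test_size parameter controls this ratio, and random_state makes the split reproducible. A minimal sketch on toy arrays (the data here is made up purely for illustration):
>>> import numpy as np
>>> X = np.arange(20).reshape(10, 2)  # toy data: 10 samples, 2 features
>>> y = np.arange(10)  # toy targets, one per sample
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> X_train.shape, X_test.shape
((7, 2), (3, 2))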
3. Data Preparation:
Standardization: Standardization rescales each feature to zero mean and unit variance. This makes the training process better behaved by improving the numerical condition of the underlying optimization problem.
>>> from sklearn.preprocessing import StandardScaler
>>> get_names = df.columns
>>> scaler = StandardScaler()
>>> scaled_df = scaler.fit_transform(df)
>>> scaled_df = pd.DataFrame(scaled_df, columns=get_names)
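As a quick sanity check, every scaled column should now have (approximately) zero mean and unit variance:
>>> scaled_df.mean(axis=0).round(6)  # ~0 for every column
>>> scaled_df.std(axis=0).round(6)  # ~1 for every column (pandas uses ddof=1, so not exactly 1)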
Normalization: Normalization rescales each sample to unit norm. This makes training less sensitive to the scale of the features and leaves the data better conditioned for convergence.
>>> from sklearn import preprocessing
>>> df = pd.read_csv('file_name.csv')
>>> x_array = np.array(df['Column1'])  # normalize Column1
>>> normalized_X = preprocessing.normalize([x_array])
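The Normalizer class offers the same row-wise rescaling through the usual fit/transform API, which is convenient inside pipelines. A minimal sketch on made-up data:
>>> from sklearn.preprocessing import Normalizer
>>> X = np.array([[4.0, 3.0], [1.0, 1.0]])  # toy data: two samples
>>> Normalizer(norm='l2').fit_transform(X)  # each row rescaled to unit Euclidean length
array([[0.8       , 0.6       ],
       [0.70710678, 0.70710678]])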
Working on a model
After making all the necessary transformations to our dataset to make it algorithm-ready, we need to work on our model. That means choosing a suitable model or algorithm, one that represents our dataset and will help us make the kind of predictions we want from it, and then fitting that model.
Model Choosing:
- Supervised Learning Estimator:
Supervised learning, as the name suggests, is the kind of machine learning where we supervise the outcome by training the model with well-labeled data, which means that the examples in the dataset are already tagged with the correct answers.
a. Linear Regression:
>>> from sklearn.linear_model import LinearRegression
>>> new_lr = LinearRegression()
b. Support Vector Machine:
>>> from sklearn.svm import SVC
>>> new_svc = SVC(kernel='linear')
c. Naive Bayes:
>>> from sklearn.naive_bayes import GaussianNB
>>> new_gnb = GaussianNB()
d. KNN:
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=1)
- Unsupervised Learning Estimator:
Unlike supervised learning, unsupervised learning is where we train the model with unlabeled, unclassified data and let the algorithm do all the work on that dataset without any assistance.
a. Principal Component Analysis (PCA):
>>> from sklearn.decomposition import PCA
>>> new_pca = PCA(n_components=0.95)
b. K Means:
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=5, random_state=0)
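A note on the PCA example above: passing a float between 0 and 1 as n_components tells PCA to keep however many components are needed to explain that fraction of the variance. A quick sketch on random toy data:
>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> X_toy = rng.rand(100, 10)  # toy data: 100 samples, 10 features
>>> PCA(n_components=0.95).fit(X_toy).n_components_  # number of components kept (data-dependent)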
Model Fitting:
Model fitting means training your chosen estimator on the dataset so that it learns the underlying patterns. The goal is a model that generalizes well, that is, one that also produces accurate outcomes on data similar to, but distinct from, the data it was trained on.
>>> new_lr.fit(X, y)
>>> knn.fit(X_train, y_train)
>>> new_svc.fit(X_train, y_train)
>>> k_means.fit(X_train)
>>> pca_model_fit = new_pca.fit_transform(X_train)
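For context, here is a minimal end-to-end sketch that ties these steps together on scikit-learn's built-in iris dataset (our choice of dataset, purely for illustration):
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(X_train, y_train)  # fitting = learning from the training data
>>> knn.score(X_test, y_test)  # accuracy on data the model has never seen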
Post-Processing
After getting comfortable with our dataset and model, the next step is to pursue the main goal of machine learning: forecasting outcomes and making predictions.
Prediction:
Once you are done with choosing and fitting the model, you can make predictions on your dataset.
Supervised:
>>> y_predict = new_svc.predict(np.random.random((3,5)))
>>> y_predict = new_lr.predict(X_test)
>>> y_predict = knn.predict_proba(X_test)
Unsupervised:
>>> y_pred = k_means.predict(X_test)
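Keep in mind that predict returns hard class labels, while predict_proba returns one probability per class for each sample; continuing with the fitted knn from above:
>>> knn.predict(X_test[:3])  # hard labels for the first three test samples
>>> knn.predict_proba(X_test[:3])  # shape (3, n_classes): one probability per class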
Evaluate Performance:
Evaluating the predictive performance of your model is necessary. There are multiple techniques in machine learning that can be used to assess models and visualize their performance. Following are some of them.
Classification:
a. Confusion Matrix:
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
b. Accuracy Score:
>>> knn.score(X_test, y_test)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
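For a fuller picture than a single accuracy number, classification_report summarizes precision, recall, and F1 for every class:
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_pred))  # precision, recall, F1, support per class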
Regression:
a. Mean Absolute Error:
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> y_predict = [2.5, 0.0, 2]
>>> mean_absolute_error(y_true, y_predict)
b. Mean Squared Error:
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_predict)
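The root mean squared error (RMSE) is simply the square root of the MSE and is often easier to read because it is in the same units as the target:
>>> import numpy as np
>>> np.sqrt(mean_squared_error(y_test, y_predict))  # RMSE, in the target's own units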
c. R² Score:
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_predict)
Clustering:
a. Homogeneity:
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_predict)
b. V-measure:
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_predict)
Cross-Validation:
>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(new_lr, X, y, cv=2))
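cross_val_score returns one score per fold, so it is common to summarize the folds with their mean and standard deviation:
>>> scores = cross_val_score(knn, X_train, y_train, cv=4)
>>> scores.mean(), scores.std()  # average score across the 4 folds, plus its spread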
Model Tuning:
This is the final step of implementing machine learning before presenting the final outcomes. In model tuning, models are parameterized so that their behavior can be tuned to a given problem. This is done by searching for the right set of parameters, and we have mainly two ways of doing that:
- Grid Search:
In grid search, parameter tuning is done methodically: the model is evaluated for each combination of parameters specified in a grid.
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3), "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
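After fitting, grid.best_params_ holds the whole winning parameter combination in one dictionary (the exact values depend on your data; the ones shown here are hypothetical):
>>> print(grid.best_params_)  # e.g. {'metric': 'euclidean', 'n_neighbors': 2}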
- Randomized Parameter Optimization:
In randomized search, a random search is performed over a fixed set of parameters. The number of parameter settings that are tried is given by n_iter.
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
With this, we come to the end of this Sklearn cheat sheet. You can enroll in the Python Certification Training provided by Intellipaat for detailed and in-depth knowledge. This training program will guide you step by step and will equip you with all the right skills to master one of the most popular and widely used languages, Python. You will also gain knowledge of all the important libraries and modules in Python, such as SciPy, NumPy, Matplotlib, Scikit-learn, Pandas, lambda functions, and more. Intellipaat will also assist you with free Python developer interview questions prepared by experts. You will have 24/7 technical support and assistance from experts in the respective technologies at Intellipaat throughout the certification period.