Introduction to Python Scikit-learn
Python Scikit-learn is a free Machine Learning library for Python. It’s a very useful tool for data mining and data analysis and can be used for personal as well as commercial use.
Python Scikit-learn lets users perform various Machine Learning tasks and provides a means to implement Machine Learning in Python. It is built on top of Python’s numerical and scientific libraries, namely, NumPy and SciPy, respectively. It’s basically a SciPy toolkit that features various Machine Learning algorithms.
Scikit-learn has small standard datasets that we don’t need to download from any external website. We can just import these datasets directly from Python Scikit-learn. Following is the list of the datasets that come with Scikit-learn:
1. Boston House Prices Dataset (deprecated in Scikit-learn 1.0 and removed in version 1.2)
2. Iris Plants Dataset
3. Diabetes Dataset
4. Digits Dataset
5. Wine Recognition Dataset
6. Breast Cancer Dataset
Here, we are going to use the Iris Plants Dataset throughout this tutorial. This dataset consists of four fields, namely, sepal length, sepal width, petal length, and petal width. It also contains a class label that takes one of three values: Iris setosa, Iris versicolor, and Iris virginica. These are the species of Iris plants, and every record in the dataset belongs to one of these three classes.
We are going to show how to import this dataset and then apply Machine Learning algorithms to it. Any of the other datasets listed above can be imported in the same way as shown in this tutorial.
There are some Python libraries that we will have to install before we can install Scikit-learn, since Scikit-learn is built on top of the scientific and numerical libraries of Python. The minimum versions listed below apply to older Scikit-learn releases; current releases require Python 3 and newer versions of NumPy and SciPy. Following are the tools and libraries that we need preinstalled before using Scikit-learn:
- Python (2.7 or above)
- NumPy (1.6.1 or above)
- SciPy (0.9 or above)
Before getting started with the tutorial, have a quick overview of all that we are going to cover in this tutorial:
- Why Python Scikit-Learn?
- Installation and configuration
- Operations and Computations
- Building a model and choosing a classifier
- Who is using Python Scikit-Learn?
Why Python Scikit-learn?
There are few discussions on the Internet that spell out why Scikit-learn has become popular among Data Scientists, but it has some clear benefits that explain why organizations use and value it. Some of those benefits are listed below.
Benefits of Scikit-learn
- BSD license: Scikit-learn has a BSD license; hence, there is minimal restriction on the use and distribution of the software, making it free to use for everyone.
- Easy to use: Scikit-learn owes much of its popularity to its ease of use; its estimators share a consistent interface for fitting models and making predictions.
- Documentation: The API is documented in detail on the Scikit-learn website, where users can consult it at any time, helping them integrate Machine Learning into their own platforms.
- Extensive use in the industry: Scikit-learn is used extensively by various organizations to predict consumer behavior, identify suspicious activities, and much more.
- Machine Learning algorithms: Scikit-learn covers most of the commonly used Machine Learning algorithms.
- Huge community support: Being able to perform Machine Learning tasks using Python has been one of the most important reasons behind the popularity of Scikit-learn, since Python is easy to learn and use and already has a huge community of users who can now perform Machine Learning on a platform that they are comfortable with.
- Algorithms flowchart: Users often face the problem of having to choose among multiple competing algorithms for the same task; Scikit-learn provides an algorithm cheat sheet (flowchart) to help them choose.
Installation and Configuration
Setting up Scikit-learn Environment
As we have already seen in the prerequisites section, there is a set of other tools and libraries that we need to install before installing Scikit-learn. So, let’s go through the installation of all these libraries, step by step.
In case some or all of these libraries are already installed, we can jump directly to the relevant step:
- Installing Python
- Installing NumPy
- Installing SciPy
- Installing Scikit learn Python
We will also learn how to use pip to install all these libraries, individually, for those who are not familiar with Python Pip (Pip is a package management system. It is used to manage packages written in Python or with Python dependencies).
Step 1: Installing Python
- We can easily install Python by visiting the following link:
- Make sure to install the latest version (recent Scikit-learn releases require Python 3)
- After installing Python, we will need to check if Python is available for us to use on the command line. For that, open the terminal by searching for ‘cmd’ on our system. In the command line, type:
If Python is installed successfully, then it should display the Python Version that we are using. This command will open Python Interpreter.
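The same check can also be made from inside the interpreter. The following is a minimal sketch that uses only the standard library; the `MINIMUM` value is an illustrative assumption taken from the prerequisites list in this tutorial, not an official requirement of any particular Scikit-learn release.

```python
import sys

# Illustrative minimum, taken from the prerequisites list in this tutorial.
MINIMUM = (2, 7)

# sys.version_info behaves like a tuple: (major, minor, micro, ...).
current = tuple(sys.version_info[:2])
print("Running Python %d.%d" % current)

# Tuples compare element by element, so (3, 11) >= (2, 7) is True.
is_supported = current >= MINIMUM
print("Meets the minimum version:", is_supported)
```

Adjust `MINIMUM` to whatever the Scikit-learn release being installed actually requires.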
Step 2: Installing NumPy
- NumPy is a fundamental package or library for Python that provides support to perform numerical computations
- Download the installer for NumPy by visiting the following link and then run the installer:
- We can also install NumPy by running the following command in our terminal:
pip install numpy
- If NumPy is already installed, pip reports ‘Requirement already satisfied’
Step 3: Installing SciPy
- SciPy is an open-source library for Python to perform scientific computations and technical computations
- Download the SciPy installer using the following link and then run it:
- We can use pip to install SciPy by typing the following command in the terminal:
pip install scipy
- If SciPy is already installed, pip reports ‘Requirement already satisfied’
Step 4: Installing Scikit-learn
- Use pip to install Scikit-learn using the following command:
pip install scikit-learn
- If Scikit-learn is already installed, pip reports ‘Requirement already satisfied’
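Once the four steps are done, a quick way to confirm that everything is importable is to print the installed versions. This is a small sketch, assuming all three packages were installed into the Python environment that runs it; note that Scikit-learn is imported under the name `sklearn`.

```python
import numpy
import scipy
import sklearn

# Each package exposes its installed version as a string in __version__.
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(name, version)
```

If any of the imports fails with an ImportError, the corresponding installation step needs to be repeated.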
Operations and Computations
As we have mentioned earlier, the dataset we are going to use in this tutorial is the Iris Plants Dataset. Scikit-learn comes with this dataset, so we don’t need to download it from any other source. We will import the dataset directly, but before we do that, we need to import Scikit-learn and Pandas using the following commands:
import sklearn
import pandas as pd
After importing sklearn, we can easily import the dataset from it, using the following command:
from sklearn.datasets import load_iris
We have successfully imported the loader for the Iris Plants Dataset from sklearn. Since we have already imported Pandas, we can now load the data into a Pandas DataFrame and use the head() and tail() functions of Pandas to display the content of the DataFrame. Let’s see how to convert this dataset into a Pandas DataFrame.
iriss = load_iris()
df_iris = pd.DataFrame(iriss.data, columns=iriss.feature_names)
Now, we have a DataFrame named df_iris that contains the Iris Plants Dataset imported from Scikit-learn in a tabular form. We will be performing all operations of Machine Learning on this DataFrame.
Let’s display the records from this DataFrame using the head() function:
The head() function, when used with no arguments, displays the first five rows of the DataFrame. However, we can pass any integer argument to display the same number of rows from the DataFrame. The output of the above command would be:
sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
Now, let’s see how to display the records from the DataFrame, using the tail() function:
The tail() function, when used without any argument, displays the last five rows of the DataFrame. Similar to the head() function, we can pass any integer as an argument to display the same number of records from the end. The output of the above command would be:
sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
Since the tail() function displays the last records of the DataFrame, we can see that the index number of the last row is 149. When we use the head() function, on the other hand, the index number of the first row is 0, i.e., the total number of entries is 150 or a total of 150 records are present in the Iris Plants Dataset.
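The head() and tail() behavior described above can be sketched as follows. The block loads the data itself so that it runs on its own; the variable names `df`, `first_rows`, and `last_three` are illustrative and not part of the tutorial’s code.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

first_rows = df.head()    # no argument: the first five rows (index 0 to 4)
last_three = df.tail(3)   # integer argument: the last three rows (index 147 to 149)

print(first_rows)
print(last_three)
print("Number of records:", len(df))
```

Because the index starts at 0 and ends at 149, len(df) reports 150 records.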
Now, let’s see how we can check the data types of the fields present in the DataFrame.
sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
dtype: object
So, using dtypes, we can list different columns in the DataFrame, along with their respective Python data types.
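As a minimal, self-contained sketch of this check, the dtypes attribute can be read directly from the DataFrame. The names `df` and `column_types` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# dtypes is an attribute (no parentheses); it returns a Series that maps
# each column name to the data type of that column.
column_types = df.dtypes
print(column_types)
```

The trailing `dtype: object` in the printed output describes the Series that holds the type information, not the columns themselves.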
Having performed data exploration with our dataset, now let’s create some plots to visually represent the data in our dataset which will help us uncover more stories hidden in it.
Python has many libraries that provide functions to perform data visualizations on datasets. We can use the scatter_matrix function from pandas.plotting to create scatterplots of the features or fields of our dataset against each other. We also need to import Matplotlib, which provides an object-oriented API for creating plots and embedding them into applications.
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

scatter_matrix(df_iris, figsize=(10, 10))
plt.show()
We can also use the Seaborn library to create pairplots of all features in the dataset against each other. To use Seaborn, we first need to import it. Note that sns.load_dataset("iris") downloads Seaborn’s own copy of the Iris dataset, which uses column names such as sepal_length and includes a species column. Let’s see how it is done and how to create a Seaborn pairplot.
import seaborn as sns

sns.set(style="ticks", color_codes=True)
dfiris = sns.load_dataset("iris")
sns.pairplot(dfiris, hue="species")
We can also use a different color palette, using the palette argument of pairplot, as shown below:
import seaborn as sns

sns.set(style="ticks", color_codes=True)
dfiris = sns.load_dataset("iris")
sns.pairplot(dfiris, hue="species", palette="husl")
Learning and Predicting
The scatterplot that we created is useful only to a limited extent. It’s evident that the Iris plants fall into groups, and the plot also shows that there are some relationships between the fields or features. But it’s hard to tell which group corresponds to which species, or which datapoint represents which flower species, because all datapoints in the scatterplot are drawn in a single color.
Luckily, we can rectify and overcome this problem by using the Seaborn module for data visualization in Python. This is exactly what we did by creating a pairplot of the given dataset using Seaborn. We have created two different Seaborn pairplots with two different color palettes. We can refer to any one of them to draw the conclusions and predictions, whichever makes it easier for us.
Now that we have become comfortable with the data and have made data visualizations, let’s further decide which features or fields in the dataset we are going to use to implement Machine Learning and make predictions. We have to select features that make most sense for the Machine Learning model.
But why select features at all? We might reasonably ask why we can’t just use all features for our Machine Learning model and let the model figure out which ones are the most relevant. The answer is that not all features carry useful information. Adding features just for the sake of having more data makes the model unnecessarily slow and less efficient, and the model may try to fit the irrelevant data, which only hurts its predictions.
That is why we need to select the features that are going to be used in the Machine Learning model.
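One way to put numbers on the idea that some features are more informative than others is a univariate score. The sketch below uses the ANOVA F-test from `sklearn.feature_selection.f_classif`; this is an illustrative technique chosen for this example and is not the method this tutorial follows, which selects features visually. A higher F-score suggests that a feature separates the classes better when considered on its own.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()

# f_classif returns one F-score and one p-value per feature.
f_scores, p_values = f_classif(iris.data, iris.target)

# Rank the features from most to least discriminative.
ranked = sorted(zip(iris.feature_names, f_scores),
                key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print("%-20s F = %.1f" % (name, score))

top_two = [name for name, _ in ranked[:2]]
print("Top two features:", top_two)
```

A univariate score ignores interactions between features, so it is best treated as a quick sanity check rather than a final selection.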
In the pairplot that we created using the Seaborn module, we can notice that the features petal length (cm) and petal width (cm) are clustered in fairly well-defined groups.
Let’s take a better look at them:
It is also noticeable that the boundary between Iris versicolor and Iris virginica seems fuzzy, which might be a problem for some classifiers. We will have to keep that in mind. But, these features still give the most noticeable grouping between the species; hence, we will be using these two features further in our tutorial for our Machine Learning model.
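To look at just these two features, a single scatterplot colored by species is enough. The following is a minimal sketch using Matplotlib and the copy of the dataset that ships with Scikit-learn (rather than Seaborn’s copy); the output file name `petal_scatter.png` is a hypothetical choice for this example.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # third column: petal length (cm)
petal_width = iris.data[:, 3]    # fourth column: petal width (cm)

fig, ax = plt.subplots(figsize=(6, 4))

# Draw one scatter series per species so that each gets its own color.
for class_index, species in enumerate(iris.target_names):
    mask = iris.target == class_index
    ax.scatter(petal_length[mask], petal_width[mask], label=species)

ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.legend(title="species")

# Hypothetical file name; in an interactive session, plt.show() works too.
fig.savefig("petal_scatter.png")
```

In the resulting plot, setosa forms a clearly separate cluster, while versicolor and virginica sit close to each other.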
Right now, we have the data in a Pandas DataFrame. Scikit-learn works natively with NumPy arrays (it also accepts Pandas DataFrames, but converts them internally), so before we start with the Machine Learning model, we will convert the data into NumPy arrays explicitly.
For the class labels, this can be done using the following commands:
import numpy as np

labels = np.asarray(dfiris.species)
Sklearn comes with a tool, LabelEncoder(), that can encode label strings into numeric representations. It finds the unique label strings, sorts them, and assigns 0 to the first, 1 to the next, and so on. Let’s see how to use it:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(labels)                  # learn the sorted set of unique labels
labels = le.transform(labels)   # replace each label with its integer code
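The mapping that LabelEncoder learns can be inspected through its `classes_` attribute and reversed with `inverse_transform`. The following is a small sketch that rebuilds the string labels from Scikit-learn’s copy of the dataset, so it does not depend on the Seaborn DataFrame used in this tutorial; the names `species`, `encoder`, `encoded`, and `decoded` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()

# Rebuild string labels ("setosa", ...) from the integer targets.
species = np.asarray(iris.target_names)[iris.target]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)   # fit and transform in one step

# classes_ holds the unique labels in sorted order; the position of a
# label in this array is the integer code assigned to it.
print(encoder.classes_)
print(encoded[:5])

# inverse_transform maps integer codes back to the original strings.
decoded = encoder.inverse_transform(encoded[:5])
print(decoded)
```

Keeping the fitted encoder around is useful later, when numeric predictions need to be turned back into species names.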
Now, we will remove all features that we don’t need from our DataFrame using the drop() method:
df_selected1 = dfiris.drop(['sepal_length', 'sepal_width', "species"], axis=1)
After this, the only features we are left with are petal length and petal width. We then convert these features into a NumPy array using DictVectorizer:
from sklearn.feature_extraction import DictVectorizer

df_features = df_selected1.to_dict(orient='records')   # one dict per row
vec = DictVectorizer()
features = vec.fit_transform(df_features).toarray()    # dense NumPy array
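Since the selected columns are already numeric, DictVectorizer is not strictly required here; `DataFrame.to_numpy()` produces the same kind of array directly. The sketch below compares the two routes. It assumes the features come from Scikit-learn’s `load_iris` rather than from Seaborn’s copy, so the column names differ from those in the tutorial’s `dfiris`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_extraction import DictVectorizer

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
petals = df[["petal length (cm)", "petal width (cm)"]]

# Route 1: the DictVectorizer approach used in this tutorial.
vec = DictVectorizer()
via_dicts = vec.fit_transform(petals.to_dict(orient="records")).toarray()

# Route 2: direct conversion of the numeric columns.
direct = petals.to_numpy()

# DictVectorizer orders its output columns by sorted feature name.
print(vec.feature_names_)
print(via_dicts.shape, direct.shape)
print("Same values:", (via_dicts == direct).all())
```

DictVectorizer becomes more useful when some of the fields are strings, because it one-hot encodes them automatically.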
Training Set and Test Set
Using the last command, we have converted the selected features into a NumPy array. The next step is splitting up the data into training and test sets. Again, sklearn has a tool to do that called train_test_split. All we have to do is to import it and use it as follows:
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0)
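To see what the split produces, it helps to print the shapes of the resulting arrays. The sketch below is self-contained and takes the two petal columns from Scikit-learn’s copy of the dataset. The `stratify=labels` argument is an addition that this tutorial does not use; it asks train_test_split to keep the class proportions the same in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features = iris.data[:, 2:4]   # petal length and petal width
labels = iris.target

# stratify=labels is an assumption of this sketch, not part of the tutorial.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.20, random_state=0, stratify=labels)

print("Train:", X_train.shape, "Test:", X_test.shape)

# Number of test samples per class.
print("Test samples per class:", np.bincount(y_test))
```

With test_size=0.20, 30 of the 150 records go to the test set and the remaining 120 are used for training.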
Our test and training sets are ready. Now, let’s perform classification using Machine Learning algorithms or approaches, and then we will compare test accuracy of all classifiers on the test data.
Building a Model and Choosing a Classifier
As we have already discussed in the section on the benefits of Scikit-learn, it comes with a flowchart to help users decide which Machine Learning algorithm will suit their dataset the best. We are also going to use it as a reference to identify which algorithm we should use on our data. The flowchart is available on Scikit-learn’s official website.
Using the following list, let’s see which category we fall into:
- Number of samples: Our number of samples is more than 50 and less than 100,000
- Whether the data is labeled: We have labeled data
- Are we predicting a category?: Yes, we are predicting the category (species) of the Iris plants
So, going through the flowchart, we can try out the following algorithms on our dataset:
- SVM (Support Vector Machine)
- K-Nearest Neighbors Classifier
SVM (Support Vector Machine)
In Machine Learning, SVM or support vector machine is a supervised learning algorithm that analyzes the training data and builds a model, which is used mainly for classification or regression.
Here, in our case, we are using the SVM model for classification.
Computing accuracy using the test set:
from sklearn.svm import SVC

# Fit a linear SVM on the training data
svm_model_linear = SVC(kernel='linear', C=1).fit(features_train, labels_train)
svm_predictions = svm_model_linear.predict(features_test)

# Mean accuracy on the held-out test data
accuracy = svm_model_linear.score(features_test, labels_test)
print("Test accuracy:", accuracy)
Test accuracy: 1.0
Computing accuracy using the train set:
from sklearn.svm import SVC

# Fit a linear SVM on the training data
svm_model_linear = SVC(kernel='linear', C=1).fit(features_train, labels_train)
svm_predictions = svm_model_linear.predict(features_train)

# Mean accuracy on the same data the model was trained on
accuracy = svm_model_linear.score(features_train, labels_train)
print("Train accuracy:", accuracy)
Train accuracy: 0.9583333333333334
Now, we can compare the train set accuracy and the test set accuracy that we have computed to find out whether our model is over-fitting.
Model over-fitting is a modeling error where the function fits too closely to a limited set of data points. It typically shows up as a train accuracy that is much higher than the test accuracy.
As we can see, there is not much difference between our test accuracy and train accuracy, and the train accuracy is not higher than the test accuracy; hence, our model is not over-fitting.
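The comparison described above can be sketched as a single script that computes both accuracies and their difference. The block prepares its own data from Scikit-learn’s copy of the dataset; the variable name `gap` is illustrative. The exact numbers depend on the split and on the Scikit-learn version, so they may differ slightly from the values quoted in this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
features = iris.data[:, 2:4]   # petal length and petal width

X_train, X_test, y_train, y_test = train_test_split(
    features, iris.target, test_size=0.20, random_state=0)

model = SVC(kernel="linear", C=1).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# A large positive gap (train much higher than test) hints at over-fitting.
gap = train_accuracy - test_accuracy

print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
print("Gap (train - test):", gap)
```

A gap close to zero, or a negative gap as with a small and easy test set, gives no sign of over-fitting.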
K-Nearest Neighbors Classifier
KNN or K-nearest neighbors is a non-parametric learning method in Machine Learning, mainly used for classification and regression. It is considered one of the simplest algorithms in Machine Learning.
Computing accuracy using the test set:
from sklearn.neighbors import KNeighborsClassifier

# Fit a KNN classifier that votes among the 7 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=7).fit(features_train, labels_train)
accuracy = knn.score(features_test, labels_test)
print("Test accuracy:", accuracy)
Test accuracy: 1.0
Computing accuracy using Train set:
from sklearn.neighbors import KNeighborsClassifier

# Fit a KNN classifier that votes among the 7 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=7).fit(features_train, labels_train)
accuracy = knn.score(features_train, labels_train)
print("Train accuracy:", accuracy)
Train accuracy: 0.958
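Since a test set of 30 records is small, a single split can make two classifiers look identical. A hedged sketch of a more stable comparison is shown below. It uses 5-fold cross-validation through `cross_val_score`, which is an addition not covered in this tutorial, and it reuses the same two classifiers and hyperparameters as above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

iris = load_iris()
features = iris.data[:, 2:4]   # petal length and petal width
labels = iris.target

classifiers = {
    "SVM (linear)": SVC(kernel="linear", C=1),
    "KNN (k=7)": KNeighborsClassifier(n_neighbors=7),
}

mean_scores = {}
for name, classifier in classifiers.items():
    # cv=5 trains and evaluates each classifier on five different splits.
    scores = cross_val_score(classifier, features, labels, cv=5)
    mean_scores[name] = scores.mean()
    print("%s: mean accuracy %.3f" % (name, mean_scores[name]))
```

Both estimators expose the same fit and score interface, which is why they can be swapped in and out of the same loop.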
Who Is Using Python Scikit-learn?
Scikit-learn is being extensively used by some big names in the industry. Some of them are listed below:
- Spotify: Spotify has been using Scikit-learn for a long time because of the features and models it provides. They use Scikit-learn mainly for music recommendations.
- Change.org: Scikit-learn’s random forest classifier is used at Change.org to drive targeted emails. Scikit-learn is easy to use and offers a variety of classifiers, which makes it one of the top choices to implement Machine Learning algorithms.
- Bestofmedia Group: Scikit-learn is used for various tasks at Bestofmedia, such as click prediction, spam fighting, and more.
- Data Publica: Data Publica is yet another big organization using Scikit-learn for building models and using them to identify potential future customers by performing predictive analysis.
- Entry-level and advanced-level Programmers in Python who want to widen their skill set
- Data Analysts and Professionals who work specifically in the field of data and datasets in the real world
- Professionals who want to learn Python and start a career in Big Data
- Professionals who want a career in Artificial Intelligence
- Some experience in Python would be useful
- Prior knowledge of Machine Learning is recommended. For this, take a look at the Machine Learning tutorial by Intellipaat
Scikit-learn has proven its worth by helping professionals with the problems they face when they implement predictive models. Scikit-learn is not limited to the IT industry; it has applications in a variety of sectors. It can be used to implement Machine Learning and can be paired with data visualizations, making Machine Learning even more interesting. With all the benefits it has, we can easily say that Scikit-learn has a wide scope.
That would be all for this module in Python Tutorial. I hope it was helpful and informative!