Scikit-Learn is a free, open-source machine learning library for Python. It is widely used for data mining and data analysis, and it can be used for personal as well as commercial projects.
Scikit-Learn provides ready-made implementations of many machine learning algorithms for Python. It is designed to work with Python’s scientific and numerical libraries, namely SciPy and NumPy respectively. It is essentially a SciPy toolkit that features various machine learning algorithms.
Scikit-Learn ships with a few small standard datasets, so you don’t need to download them from an external website. You can load these datasets directly from Scikit-Learn. The following datasets come with Scikit-Learn:
1. Boston house prices Dataset (removed from Scikit-Learn in version 1.2)
2. Iris plants Dataset
3. Diabetes Dataset
4. Digits Dataset
5. Wine recognition Dataset
6. Breast cancer Dataset
Here, we are going to use the Iris plants Dataset throughout this tutorial. This dataset consists of four feature fields: sepal length, sepal width, petal length, and petal width. It also contains a target field named class, which takes one of three values: Iris-Setosa, Iris-Versicolour, and Iris-Virginica. These are the species of iris plants, and every record in the dataset belongs to exactly one of these three classes.
We are going to show how to import this dataset and then apply machine learning algorithms to it. You can import any of the other datasets in the same way as we do in this tutorial.
Scikit-Learn builds on several other Python tools to support scientific and numerical computing, so you will have to install those libraries before you can install Scikit-Learn itself.
You need the following tools and libraries preinstalled before using Scikit-Learn: Python, NumPy, and SciPy.
Discussions of why Scikit-Learn has become so popular among data scientists are hard to find online, but it has some obvious benefits that explain why organisations have come to use and admire it. Some of those benefits are listed below:
As we saw in the prerequisites, there is a set of other tools and libraries that you need to install before installing Scikit-Learn. So let’s start by installing these libraries step by step, since the main motivation behind this tutorial is to give you enough information about Scikit-Learn to get you started with it, and then some more.
If you already have some or all of these libraries, you can skip ahead to the step for the library you still need. The steps below follow the installation sequence.
We will also show how to use pip to install each of these libraries individually, for those who are not familiar with pip.
Pip is a package management system. It is used to install and manage packages written in Python or with Python dependencies.
Step 1: Installing Python
In the command line, type python and press Enter.
This command opens the Python interpreter. If Python is installed successfully, the interpreter displays the Python version that you are using.
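The same check can also be made from inside the interpreter. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original tutorial.

```python
import sys

# sys.version_info holds the version of the running interpreter.
major, minor = sys.version_info[:2]
print(f"Running Python {major}.{minor}")
```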
Step 2: Installing Numpy
Step 3: Installing SciPy
Step 4: Installing Scikit-Learn
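With pip, each of these libraries is typically installed by running pip install numpy, pip install scipy, and pip install scikit-learn in the command line. As a quick sanity check afterwards, the sketch below reports which of them can be imported and in which version. The helper name installed_version is hypothetical and chosen only for this illustration.

```python
import importlib


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


# Import names can differ from package names: scikit-learn is imported as sklearn.
for name in ("numpy", "scipy", "sklearn"):
    print(name, installed_version(name))
```

A result of None for any of the three means that library still needs to be installed.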
As mentioned earlier, the dataset that we are going to use in this tutorial is the Iris Plants Dataset. Scikit-Learn comes with this dataset, so we don’t need to download it from any other source. We will import the dataset directly, but before we do that we need to import Scikit-Learn and pandas using the following commands:
After importing sklearn, we can easily import the dataset from it, using the following command.
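A minimal sketch of these two steps, assuming a standard installation of pandas and Scikit-Learn. The variable name iris is our own choice; load_iris is the loader that Scikit-Learn provides for this dataset.

```python
import pandas as pd
import sklearn
from sklearn.datasets import load_iris

print("pandas", pd.__version__, "| scikit-learn", sklearn.__version__)

# load_iris() returns a dictionary-like object holding the data and its metadata.
iris = load_iris()

print(iris.feature_names)  # names of the four measurement fields
print(iris.target_names)   # names of the three species
print(iris.data.shape)     # (number of records, number of features)
```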
We have successfully imported the Iris Plants Dataset from sklearn. We need pandas because we are going to load the imported data into a pandas dataframe and use the dataframe’s head() and tail() methods to display its contents. Let’s see how to convert this dataset into a pandas dataframe.
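One way to do the conversion is sketched below. It assumes the dataframe is called df_iris and that the columns are named after the dataset's own feature names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# One row per flower, one column per measurement, named after iris.feature_names.
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df_iris.shape)
print(list(df_iris.columns))
```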
Now we have a dataframe named df_iris that contains the Iris plants Dataset imported from Scikit-Learn in tabular form. We will perform all the machine learning operations on this dataframe.
Let’s display the records from this dataframe using head() function:
When used with no argument, the head() function displays the first five rows of the dataframe. You can also pass an integer argument to display that many rows from the top. The output of the above command would be:
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
Using tail() function to display the records from the dataframe:
When used without any argument, the tail() function displays the last five rows of the dataframe. As with head(), you can pass an integer argument to display that many records from the end. The output of the above command would be:
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
Since the tail() function displays the last records of the dataframe, we can see that the index of the last row is 149, while head() showed that the index of the first row is 0. This means that the iris dataset contains a total of 150 records.
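The head() and tail() calls described above can be sketched as follows. The dataframe is rebuilt here so the example runs on its own, and the variable names first_rows and last_three are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

first_rows = df_iris.head()   # first five rows by default
last_three = df_iris.tail(3)  # an integer argument selects that many rows

print(first_rows)
print(last_three)

# The index runs from 0 to 149, so the dataframe holds 150 records.
print(df_iris.index.min(), df_iris.index.max(), len(df_iris))
```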
Let’s see how we can check the datatypes of the fields present in the dataframe:
So, using dtypes, we can list different columns in the dataframe along with their respective datatypes.
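A short sketch of the dtypes attribute, again rebuilding the dataframe so the snippet is self-contained:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# dtypes is an attribute, not a method: it returns one datatype per column.
column_types = df_iris.dtypes
print(column_types)
```

All four measurement columns hold floating-point numbers.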
Having explored our dataset, let’s now create some plots to represent the data visually, which will help us uncover more of the stories hidden in it.
Python has many libraries for data visualization. We can use the .plot accessor of pandas to create a scatterplot of the features, or fields, of our dataset against each other. We also need to import matplotlib, which provides an object-oriented API for embedding plots into applications.
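A minimal sketch of such a scatterplot. The choice of sepal length against sepal width is an assumption made for illustration; any pair of columns can be passed as x and y.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# kind="scatter" plots one column against another and returns the matplotlib Axes.
ax = df_iris.plot(kind="scatter", x="sepal length (cm)", y="sepal width (cm)")
ax.set_title("Iris: sepal length vs. sepal width")

plt.show()
```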
We can also use the seaborn library to create a pairplot of all the features in the dataset against each other. To use seaborn, we first need to import it. Let’s see how that is done and how to create a seaborn pairplot.
You can also use a different color palette through the palette parameter of pairplot, as shown below:
The scatterplot that we created was useful only up to a point. It shows that the iris plants fall into groups and that some relationship exists between the features, but because every datapoint has the same color, it is hard to tell which class is which and which datapoint represents which species.
Luckily for us, we can overcome this problem by using the seaborn module for data visualisation in Python. This is exactly what we did by creating a pairplot of the dataset using seaborn. We created two seaborn pairplots with two different color palettes. You can refer to either of them to draw conclusions and make predictions, whichever makes the observations easier for you.
Now that we have become comfortable with the data and have visualized it, let’s decide which features, or fields, in the dataset we are going to use to implement machine learning and make predictions. We have to select the features that make the most sense for our machine learning model.
But why select features at all? You might reasonably ask why we can’t just use all the features and let the model figure out which one is the most relevant. The answer is that not all features carry information. Adding features that are data just for the sake of data makes the model unnecessarily slow and less efficient. The model will also try to fit these uninformative features, which can hurt its predictions.
That is why we need to select the features that are going to be used in the machine learning model.
In the pairplot that we created using the seaborn module, we can see that the features petal length (cm) and petal width (cm) are clustered in fairly well-defined groups.
Let’s take a closer look at them:
It is also noticeable that the boundary between iris-versicolor and iris-virginica seems fuzzy. That might be a problem for some classifiers, so we will have to keep it in mind for later. Still, these two features give the most noticeable grouping between the species among all the features, so we are going to use them for our machine learning model in the rest of this tutorial.
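The same observation can be checked numerically. The sketch below is not part of the original walkthrough: it assumes a species column holding the species names and prints the minimum and maximum of both petal features for each species.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris["species"] = iris.target_names[iris.target]  # assumed column name

# Smallest and largest value of each petal feature within each species.
petal_cols = ["petal length (cm)", "petal width (cm)"]
ranges = df_iris.groupby("species")[petal_cols].agg(["min", "max"])
print(ranges)
```

If the ranges of two species do not overlap, that pair is cleanly separated by the feature. Overlapping ranges correspond to the fuzzy boundary seen in the plot.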
Right now, our data is in a pandas dataframe. Before we start with the machine learning models, we will convert the data into NumPy arrays, because sklearn is built around NumPy arrays. (Recent versions of sklearn also accept pandas dataframes directly, but explicit arrays keep the following steps simple.)
This can be done using the following command:
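A minimal sketch of this conversion for the label column, assuming the species names are stored in a column called species and the resulting array is called labels:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris["species"] = iris.target_names[iris.target]  # assumed column name

# np.asarray turns the pandas column into a plain one-dimensional NumPy array.
labels = np.asarray(df_iris["species"])

print(type(labels), labels.shape)
```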
Sklearn comes with a tool that encodes label strings as numeric values. It collects the unique labels, sorts them, and maps the first to 0, the next to 1, and so on. The tool is LabelEncoder(). Let’s see how to use it:
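A short sketch of LabelEncoder applied to the iris species names. The variable names encoder and encoded are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
labels = iris.target_names[iris.target]  # species names as strings

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # learn the mapping and apply it

print(encoder.classes_)    # unique labels; position i is encoded as integer i
print(np.unique(encoded))  # the integer codes in use
```

The original strings can be recovered later with encoder.inverse_transform(encoded).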
Now we will remove the features that we don’t want from our dataframe using the drop() method, as follows:
After this, the only features that we are left with are petal length and petal width.
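A sketch of this step, assuming the dataframe also carries a species column and that the result is stored under the hypothetical name df_selected. The last line converts the remaining columns into a NumPy array.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris["species"] = iris.target_names[iris.target]  # assumed column name

# drop() returns a new dataframe without the listed columns.
df_selected = df_iris.drop(
    columns=["sepal length (cm)", "sepal width (cm)", "species"]
)

# Two-dimensional array: one row per flower, one column per kept feature.
features = df_selected.to_numpy()

print(list(df_selected.columns))
print(features.shape)
```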
Using the last command, we have converted the selected numerical features into arrays. The next step is splitting the data into training and test sets. Again, sklearn has a tool for that. All we have to do is import it and use it as follows:
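A minimal sketch of the split using train_test_split. The 80/20 ratio, the fixed random_state, and stratification by label are assumptions made for this example, not values taken from the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features = iris.data[:, 2:4]  # petal length and petal width
labels = iris.target          # species encoded as 0, 1, 2

# Assumptions: hold out 20% for testing and fix the seed for repeatable results.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(features_train.shape, features_test.shape)
```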
Our training and test sets are ready. Now let’s perform classification using different machine learning algorithms, and at the end we will compare the accuracy of all the classifiers on the test data.
As discussed in the benefits of Scikit-Learn, it comes with a flowchart that helps users decide which machine learning algorithm suits their dataset best. We are going to use it as a reference to identify which algorithms to use on our data. The flowchart is available on Scikit-Learn’s official website.
Using the following list, let’s see what category we fall into
So, going through the flowchart, we can try out the following algorithms on our dataset:
In machine learning, an SVM, or support vector machine, is a learning algorithm that analyses the data and builds a model used mainly for classification or regression.
Here, in our case, we are using SVM model for classification.
Computing accuracy using test set:
Computing accuracy using Train set:
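A sketch of SVM classification with both accuracy computations. The linear kernel, C=1.0, and the split settings are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
features = iris.data[:, 2:4]  # petal length and petal width
labels = iris.target

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Assumption: a linear kernel with the default regularisation strength.
svm_model = SVC(kernel="linear", C=1.0)
svm_model.fit(features_train, labels_train)

# Accuracy is the fraction of samples whose predicted class matches the true class.
svm_test_accuracy = accuracy_score(labels_test, svm_model.predict(features_test))
svm_train_accuracy = accuracy_score(labels_train, svm_model.predict(features_train))

print(f"test accuracy:  {svm_test_accuracy:.3f}")
print(f"train accuracy: {svm_train_accuracy:.3f}")
```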
Now we can compare the train accuracy and the test accuracy that we have computed to find out how much our model is over-fitting.
Over-fitting is a modelling error in which the model fits a limited set of data points too closely, so it performs much better on the training data than on unseen data.
Since there is not much difference between our test accuracy and train accuracy, our model is not over-fitting.
KNN, or k-nearest neighbours, is a non-parametric learning method used mainly for classification and regression. It is considered one of the simplest algorithms in machine learning.
Computing accuracy using Test set:
Computing accuracy using Train set:
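A sketch of the same procedure with a KNN classifier. The number of neighbours (5, the library default) and the split settings are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
features = iris.data[:, 2:4]  # petal length and petal width
labels = iris.target

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Each prediction is a majority vote among the 5 closest training samples.
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(features_train, labels_train)

knn_test_accuracy = accuracy_score(labels_test, knn_model.predict(features_test))
knn_train_accuracy = accuracy_score(labels_train, knn_model.predict(features_train))

print(f"test accuracy:  {knn_test_accuracy:.3f}")
print(f"train accuracy: {knn_train_accuracy:.3f}")
# A large positive gap between train and test accuracy would suggest over-fitting.
print(f"gap: {knn_train_accuracy - knn_test_accuracy:.3f}")
```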
Again, we can use train set accuracy and test set accuracy to find out if the model is over-fitting.
NOTE: Don’t worry if you get slightly different end results. The accuracy of these classifiers is expected to vary from run to run, for example because the data is split into training and test sets at random.
Scikit-Learn is used extensively by some of the biggest names in the industry. Some of them are listed below:
Scikit-Learn has proven its worth by helping professionals with the problems they face when implementing predictive models. It is not limited to the IT industry and has applications in a variety of sectors. It can be used to implement machine learning and can be paired with data visualisation, which makes machine learning even more interesting. With all these benefits, Scikit-Learn is likely to stay relevant, so learning it should be near the top of your list if you want to broaden your career options.
Looking to dive into the depths of machine learning using Scikit-Learn? You need not look any further. Check out the Python course for certification by Intellipaat, where you will learn not only Scikit-Learn but also all the other Python modules that we have used alongside it in this tutorial.
That is all for this tutorial. We hope that you found it helpful and that you learned something.