Scikit-learn is a free Machine Learning library for Python. It is a very useful tool for data mining and data analysis, and it can be used for both personal and commercial purposes.
Scikit-learn lets users perform various Machine Learning tasks and provides a means to implement Machine Learning in Python. It is built on top of Python's scientific and numerical libraries, SciPy and NumPy; it is essentially a SciPy toolkit that features various Machine Learning algorithms.
Scikit-learn has small standard datasets that we don’t need to download from any external website. We can just import these datasets directly from Python Scikit-learn. Following is the list of the datasets that come with Scikit-learn:
1. Boston House Prices Dataset
2. Iris Plants Dataset
3. Diabetes Dataset
4. Digits Dataset
5. Wine Recognition Dataset
6. Breast Cancer Dataset
Here, we are going to use the Iris Plants Dataset throughout this tutorial. This dataset has four features, namely, sepal length, sepal width, petal length, and petal width. Each record is also labeled with one of three target classes: Iris setosa, Iris versicolor, and Iris virginica. These are the species of Iris plants, and every record in the dataset belongs to one of these three classes.
We are going to show how to import this dataset and then run Machine Learning algorithms on it. Any of the other bundled datasets can be imported in the same way.
There are some Python libraries that we have to install before we can get started with installing Scikit-learn, since Scikit-learn is built on these tools to support scientific and numerical computing in Python. Following are the tools and libraries that we need preinstalled before using Scikit-learn:
1. Python
2. NumPy
3. SciPy
There are not many threads on the Internet explaining why Scikit-learn has become popular among Data Scientists, but it has some obvious benefits that justify why organizations use and admire it.
As we have already seen in the prerequisites section, there is a whole set of other tools and libraries that we need to install before diving into the installation of Scikit-learn. So, let's start off by discussing the installation of all these libraries, step by step, since the main goal of this tutorial is to get you started with Scikit-learn.
In case some or all of these libraries are already installed, we can skip ahead to the installation step of the library we still need.
We will also learn how to use pip to install each of these libraries individually, for those who are not familiar with it (pip is a package management system used to install and manage packages written in Python or with Python dependencies).
Step 1: Installing Python
python --version
If Python is installed successfully, this command displays the version of Python that we are using. Running the python command without arguments opens the Python interpreter.
Step 2: Installing NumPy
pip install numpy
Step 3: Installing SciPy
pip install scipy
Step 4: Installing Scikit-learn
pip install scikit-learn
As mentioned earlier, the dataset we are going to use in this tutorial is the Iris Plants Dataset. Scikit-learn comes with this dataset, so we don't need to download it from any external source. We will import the dataset directly, but before that we need to import Scikit-learn and Pandas using the following commands:
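The imports can be sketched as follows (assuming scikit-learn and pandas are already installed as described above):

```python
# Scikit-learn's bundled datasets live in the sklearn.datasets module
from sklearn import datasets
# Pandas will be used to hold the data in a DataFrame
import pandas as pd
```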
After importing sklearn, we can easily import the dataset from it, using the following command:
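A minimal sketch of loading the bundled dataset with sklearn's load_iris() helper:

```python
from sklearn import datasets

# load_iris() returns a Bunch object holding the data and its metadata
iris = datasets.load_iris()

print(iris.data.shape)          # (150, 4): 150 samples, 4 features
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```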
We have successfully imported the Iris Plants Dataset from sklearn. We need to import Pandas now, because we are going to load the imported data into a Pandas DataFrame and use head() and tail() functions of Python Pandas to display the content of the DataFrame. Let’s see how to convert this dataset into a Pandas DataFrame.
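The conversion can be done like this (the DataFrame name df_iris matches the one used in the rest of this tutorial):

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
# Put the feature matrix into a DataFrame, labelling the columns
# with the dataset's own feature names
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
```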
Now, we have a DataFrame named df_iris that contains the Iris Plants Dataset imported from Scikit-learn in a tabular form. We will be performing all operations of Machine Learning on this DataFrame.
Let’s display the records from this DataFrame using the head() function:
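Repeating the setup from above, a call to head() looks like this:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Show the first five rows; head(10) would show the first ten
print(df_iris.head())
```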
The head() function, when used with no arguments, displays the first five rows of the DataFrame. We can also pass an integer argument to display that many rows. The output of the above command would be:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.3
4                5.0               3.6                1.4               0.2
Now, let’s see how to display the records from the DataFrame, using the tail() function:
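Again repeating the setup, the tail() call is:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Show the last five rows of the DataFrame
print(df_iris.tail())
```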
The tail() function, when used without any argument, displays the last five rows of the DataFrame. Similar to the head() function, we can pass an integer argument to display that many records from the end. The output of the above command would be:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8
Since the tail() function displays the last records of the DataFrame, we can see that the index of the last row is 149. With the head() function, on the other hand, the index of the first row is 0. Hence, the Iris Plants Dataset contains a total of 150 records.
Now, let’s see how we can check the data types of the fields present in the DataFrame.
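The dtypes attribute of the DataFrame does this (setup repeated for completeness):

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# List every column together with its data type
print(df_iris.dtypes)
```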
So, using dtypes, we can list different columns in the DataFrame, along with their respective Python data types.
Having performed data exploration with our dataset, now let’s create some plots to visually represent the data in our dataset which will help us uncover more stories hidden in it.
Python has many libraries that provide functions for visualizing datasets. We can use the .plot method of Pandas to create a scatterplot of the features or fields of our dataset against each other; for this, we also need to import Matplotlib, which provides an object-oriented API for embedding plots into applications.
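A sketch of this approach (the Agg backend line is only needed in a headless environment; drop it and call plt.show() for an on-screen plot):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a desktop session
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Scatter one feature against another via pandas' .plot interface
ax = df_iris.plot(kind="scatter",
                  x="sepal length (cm)",
                  y="sepal width (cm)")
```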
We can also use Seaborn library to create pairplots of all features in the dataset against each other. To use Seaborn, we need to import Seaborn library, first. Let’s see how it is done and how to create a Seaborn pairplot.
We can also use a different color palette, using palette attribute of the pairplot, as shown below:
The scatterplot that we created is useful only to a limited extent. It is evident that the species of Iris plants group into various classes and that some relationships exist between the fields or features. But it is hard to tell which class represents which species, or which data point belongs to which flower, because all the data points in the scatterplot share the same color.
Luckily, we can rectify and overcome this problem by using the Seaborn module for data visualization in Python. This is exactly what we did by creating a pairplot of the given dataset using Seaborn. We have created two different Seaborn pairplots with two different color palettes. We can refer to any one of them to draw the conclusions and predictions, whichever makes it easier for us.
Now that we have become comfortable with the data and have made data visualizations, let’s further decide which features or fields in the dataset we are going to use to implement Machine Learning and make predictions. We have to select features that make most sense for the Machine Learning model.
But why select features at all? We might reasonably ask why we can't just use all the features and let the model figure out which ones are most relevant. The answer is that not all features carry information. Adding features just for the sake of having more data makes the model unnecessarily slow and less efficient, since the model also tries to fit the useless data, which only adds noise.
That is why we need to select the features that are going to be used in the Machine Learning model.
In the pairplot that we created using the Seaborn module, we can notice that the features petal length (cm) and petal width (cm) are clustered in fairly well-defined groups.
Let’s take a better look at them:
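One way to zoom in on just these two features is to plot them against each other, colored by species (a sketch; columns 2 and 3 of the feature matrix are petal length and petal width):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a desktop session
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# One scatter series per species, using only the petal features
for target, name in enumerate(iris.target_names):
    mask = iris.target == target
    plt.scatter(X[mask, 2], X[mask, 3], label=name)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
```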
It is also noticeable that the boundary between Iris versicolor and Iris virginica seems fuzzy, which might be a problem for some classifiers. We will have to keep that in mind. But, these features still give the most noticeable grouping between the species; hence, we will be using these two features further in our tutorial for our Machine Learning model.
Right now, we have the data in a Pandas DataFrame. Before we start building the Machine Learning model, we will convert the data into NumPy arrays, because sklearn estimators are designed to work on NumPy arrays (recent versions of sklearn also accept DataFrames directly, but we will use arrays here).
This can be done using the following command:
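A sketch of the conversion (on older pandas versions, df_iris.values does the same thing as to_numpy()):

```python
from sklearn import datasets
import pandas as pd
import numpy as np

iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Convert the DataFrame's contents into a plain NumPy array
data = df_iris.to_numpy()
```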
Sklearn comes with a tool, LabelEncoder(), that can encode label strings into numeric representations. It sorts the unique labels and maps the first one to 0, the next to 1, and so on. Let's see how to use it:
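Applied to the species names of our dataset, a sketch looks like this:

```python
from sklearn import datasets
from sklearn.preprocessing import LabelEncoder

iris = datasets.load_iris()
# An array of label strings such as 'setosa', 'versicolor', 'virginica'
species = iris.target_names[iris.target]

le = LabelEncoder()
# fit_transform maps the sorted unique labels to 0, 1, 2, ...
y = le.fit_transform(species)

print(list(le.classes_))  # ['setosa', 'versicolor', 'virginica']
```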
Now, we will remove all features that we don’t need from our DataFrame using the drop() method:
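For our DataFrame, that means dropping the two sepal columns (setup repeated for completeness):

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Remove the sepal features; only petal length and petal width remain
df_iris = df_iris.drop(["sepal length (cm)", "sepal width (cm)"], axis=1)
print(list(df_iris.columns))
```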
After this, the only features we are left with are petal length and petal width.
With the label encoder, we have converted the string class labels into a numeric label array. The next step is to split the data into training and test sets. Again, sklearn has a tool for that, called train_test_split. All we have to do is import it and use it as follows:
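A sketch of the split (test_size=0.25 and random_state=0 are illustrative choices; random_state just makes the shuffle reproducible):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, 2:4]   # the two petal features selected earlier
y = iris.target

# Hold out 25% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)  # (112, 2) (38, 2)
```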
Our test and training sets are ready. Now, let’s perform classification using Machine Learning algorithms or approaches, and then we will compare test accuracy of all classifiers on the test data.
As we have already discussed in the benefits section, Scikit-learn comes with a flowchart to help users decide which Machine Learning algorithm suits their dataset best. We will use it as a reference to identify which algorithms to try on our data. The flowchart is available on Scikit-learn's official website.
Using the following list, let’s see which category we fall into:
So, going through the flowchart, we can try out following algorithms on our test set:
In Machine Learning, SVM, or Support Vector Machine, is a supervised learning algorithm that analyzes the data and builds a model, used mainly for classification and regression tasks.
Here, in our case, we are using the SVM model for classification.
Computing accuracy using the test set:
Computing accuracy using the train set:
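Putting the two steps above together, a sketch with sklearn's SVC classifier (default hyperparameters; the split mirrors the one made earlier):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, 2:4], iris.target   # the two petal features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a support vector classifier with default hyperparameters
svm = SVC()
svm.fit(X_train, y_train)

# score() returns the mean accuracy on the given data
print("Test accuracy: ", svm.score(X_test, y_test))
print("Train accuracy:", svm.score(X_train, y_train))
```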
Now, we can use train set accuracy and test set accuracy that we have computed to find out how much our model is over-fitting by comparing both these accuracies.
Model over-fitting is a condition or a modeling error where the function is fitting too closely to a limited set of data points.
As we can see, there is not much difference between our test accuracy and train accuracy, i.e., our model is not over-fitting.
KNN, or K-Nearest Neighbors, is a non-parametric learning method in Machine Learning, used mainly for classification and regression. It is considered one of the simplest algorithms in Machine Learning.
Computing accuracy using the test set:
Computing accuracy using Train set:
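The same steps with a KNN classifier can be sketched as follows (the default of 5 neighbors is used; the split mirrors the one made earlier):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 2:4], iris.target   # the two petal features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a KNN classifier with the default of 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Test accuracy: ", knn.score(X_test, y_test))
print("Train accuracy:", knn.score(X_train, y_train))
```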
Again, we can use train set accuracy and test set accuracy to find out if the model is over-fitting.
Note: We don't have to worry if the end results are slightly different; the accuracies of these classifiers are expected to vary between runs.
Scikit-learn is used extensively by some big names in the industry. Some of them are listed below:
Scikit-learn has proven its worth by assisting professionals with the problems they face when implementing predictive models. Scikit-learn is not limited to the IT industry; it has applications in a variety of sectors. It can be used to implement Machine Learning and can be paired with data visualizations, making Machine Learning even more interesting. With all the benefits it offers, we can safely say that Scikit-learn has a wide scope.
That would be all for this module in Python Tutorial. I hope it was helpful and informative!
Looking forward to diving deep into Machine Learning using Scikit-learn?
No need to search any further; check out the Python certification by Intellipaat, where not only Scikit-learn but all concepts in Python are covered.
Also, check out what questions corporates ask during interviews in our Python interview questions listed by experts.