What is Scikit-learn in Python?
Scikit-learn is a free, open-source machine learning library for Python. It is a useful tool for data mining and data analysis, and it can be used for both personal and commercial purposes.
Scikit-learn provides ready-made implementations of many machine learning algorithms for Python. It is built on top of Python’s numerical and scientific libraries, NumPy and SciPy, respectively; the name comes from its origin as a SciPy toolkit (a “SciKit”).
Scikit-learn has small standard datasets that we don’t need to download from any external website. We can just import these datasets directly from Scikit-learn. Following is a list of the datasets that come with Scikit-learn:
1. Boston House Prices Dataset (removed from the library in Scikit-learn 1.2)
2. Iris Plants Dataset
3. Diabetes Dataset
4. Digits Dataset
5. Wine Recognition Dataset
6. Breast Cancer Dataset
In this tutorial, we will use the Iris Plants Dataset that ships with Scikit-learn. The dataset has four fields (features): sepal length, sepal width, petal length, and petal width. Each record also carries a class label identifying one of three species: Iris setosa, Iris versicolor, and Iris virginica. The dataset contains 150 records, 50 for each species.
We will demonstrate how to import this dataset and apply machine learning techniques to it. Any of the other bundled datasets can be imported in the same way.
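As a minimal sketch of how the bundled datasets are loaded, the snippet below calls three of the loader functions from sklearn.datasets and prints the shape of each feature matrix. The dictionary names (loaders, shapes) are only illustrative.

```python
from sklearn.datasets import load_iris, load_wine, load_digits

# Each loader returns a Bunch object with .data (features) and .target (labels).
loaders = {"iris": load_iris, "wine": load_wine, "digits": load_digits}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()
    shapes[name] = bunch.data.shape  # (number of samples, number of features)
    print(name, bunch.data.shape, "classes:", len(bunch.target_names))
```

No download is involved here, since these small datasets are installed together with the library.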
Software prerequisites in Scikit-learn in Python:
Scikit-learn is built on top of Python’s scientific and numerical libraries, so we have to install them before installing Scikit-learn itself. Following are the tools and libraries that we need installed before using Scikit-learn:
- Python (2.7 or above for old releases; current Scikit-learn releases require Python 3)
- NumPy (1.6.1 or above)
- SciPy (0.9 or above)
- Scikit-learn
Why Python Scikit-learn?
It is hard to find a single place on the Internet that explains why Scikit-learn has become popular among Data Scientists, but it has some obvious benefits that explain why organizations use and admire it. Some of those benefits are listed below.
Benefits of Scikit-learn in Python
- BSD license: Scikit-learn has a BSD license; hence, there is minimal restriction on the use and distribution of the software, making it free to use for everyone.
- Easy to use: The popularity of Scikit-learn is because of its ease of use.
- Detailed documentation: Scikit-learn offers detailed API documentation that users can access at any time on its website, helping them integrate Machine Learning into their own platforms.
- Extensive use in the industry: Scikit-learn is used extensively by various organizations to predict consumer behavior, identify suspicious activities, and much more.
- Machine Learning algorithms: Scikit-learn covers most of the machine learning algorithms
- Huge community support: Being able to perform machine learning tasks using Python has been one of the reasons behind the popularity of Scikit-learn, since Python is easy to learn and use and already has a huge community of users who can now perform Machine Learning on a platform that they are comfortable with.
- Algorithms flowchart: Instead of leaving users to choose among many competing implementations of the same algorithm, Scikit-learn provides an algorithm cheat sheet, or flowchart, that helps them pick a suitable algorithm.
Installation and Configuration
How to Install Scikit-learn?
As we have already seen in the prerequisites section, there is a set of other tools and libraries that we need to install before installing Scikit-learn. So, let’s go through the installation of all these libraries, step by step.
In case some or all of these libraries are already installed, we can jump directly to the step for the library we still need:
- Installing Python
- Installing NumPy
- Installing SciPy
- Installing Scikit-learn
We will use pip to install these libraries individually. For those who are not familiar with it, pip is a package management system used to install and manage packages written in Python or with Python dependencies.
Step 1: Installing Python
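- Download the Python installer from the following link and run it:
https://www.python.org/downloads/
- Make sure to install the latest version, or at least a version supported by the Scikit-learn release in use (old releases accepted Python 2.7 or above; current releases require Python 3)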
- After installing Python, we need to check that Python is available on the command line. For that, open a terminal (on Windows, search for ‘cmd’) and type:
python
If Python is installed successfully, this command opens the Python interpreter and displays the Python version that we are using.
Step 2: Installing NumPy
- NumPy is a fundamental package or library for Python that provides support to perform numerical computations
- Download the installer for NumPy by visiting the following link and then run the installer:
http://sourceforge.net/projects/numpy/files/NumPy/1.10.2/
- We can also install NumPy by running the following command in our terminal:
pip install numpy
- If we already have NumPy, pip displays the message ‘Requirement already satisfied’
Step 3: Installing SciPy
- SciPy is an open-source library for Python to perform scientific and technical computations
- Download the SciPy installer using the following link and then run it:
http://sourceforge.net/projects/scipy/files/scipy/0.16.1/
- We can use pip to install SciPy by typing the following command in the terminal:
pip install scipy
- If we already have SciPy, pip displays the message ‘Requirement already satisfied’
Step 4: Installing Scikit-learn
- Use pip to install Scikit-learn using the following command:
pip install scikit-learn
- If we already have Scikit-learn, pip displays the message ‘Requirement already satisfied’
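Once the four installation steps are done, a quick way to confirm that everything is importable is to print the installed versions from Python. This is a minimal sketch; the dictionary name versions is only illustrative.

```python
import sys

import numpy
import scipy
import sklearn

# Collect the version string reported by the interpreter and by each library.
versions = {
    "python": sys.version.split()[0],
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails with ModuleNotFoundError, the corresponding installation step needs to be repeated.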
Scikit-learn Operations and Computations
Importing Dataset in Scikit-learn
As we mentioned earlier, the dataset we are going to use in this tutorial is the Iris Plants Dataset. Scikit-learn comes with this dataset, so we don’t need to download it from any other source. Before we load it, we need to import Scikit-learn and Pandas using the following commands:
import sklearn
import pandas as pd
After importing sklearn, we can easily import the dataset from it using the following command:
from sklearn.datasets import load_iris
We have successfully imported the loader for the Iris Plants Dataset from Sklearn. We imported Pandas because we are going to load the data into a Pandas DataFrame and use the head() and tail() functions of Pandas to display its content. Let’s see how to convert this dataset into a Pandas DataFrame.
iriss = load_iris()
df_iris = pd.DataFrame(iriss.data, columns=iriss.feature_names)
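The DataFrame built above holds only the four measurements. As a minimal sketch, the species names can be attached as an extra column by mapping the numeric codes in the target array through target_names. The column name species and the variable names are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# iris.target holds the codes 0, 1, 2; iris.target_names maps them to species names.
df["species"] = iris.target_names[iris.target]

print(df["species"].value_counts())
```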
Data Exploration
Now, we have a DataFrame named df_iris that contains the Iris Plants Dataset imported from Scikit-learn in tabular form. We will explore this DataFrame before applying machine learning to the data.
Let’s display the records from this DataFrame using the head() function:
df_iris.head()
The head() function, when used with no arguments, displays the first five rows of the DataFrame. However, we can pass an integer n as an argument to display the first n rows instead. The output of the above command would be:
|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
Now, let’s see how to display the records from the DataFrame using the tail() function:
df_iris.tail()
The tail() function, when used without any argument, displays the last five rows of the DataFrame. Similar to the head() function, we can pass an integer n as an argument to display the last n rows instead. The output of the above command would be:
|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |
Since the tail() function displays the last records of the DataFrame, we can see that the index number of the last row is 149. The head() function, on the other hand, shows that the index of the first row is 0. Because the index runs from 0 to 149, a total of 150 records are present in the Iris Plants Dataset.
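The record count can also be read directly from the DataFrame instead of being inferred from the index. The sketch below rebuilds the DataFrame so that it runs on its own, then reads the shape attribute and passes integer arguments to head() and tail().

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# shape is a (rows, columns) tuple.
n_rows, n_cols = df_iris.shape
print("rows:", n_rows, "columns:", n_cols)

print(df_iris.head(3))  # first three rows
print(df_iris.tail(2))  # last two rows
```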
Now, let’s see how we can check the data types of the fields present in the DataFrame.
df_iris.dtypes
Output:
sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
dtype: object
So, using dtypes, we can list the columns in the DataFrame along with their respective data types. Here, all four features are stored as 64-bit floating-point numbers (float64).
Data Visualization
Having explored our dataset, let’s now create some plots to represent the data visually, which will help us uncover patterns hidden in it.
Python has many libraries that provide functions for visualizing datasets. We can use the scatter_matrix function from pandas.plotting to create scatterplots of all features of our dataset against each other, and we also need to import Matplotlib, which Pandas uses to draw the plots and which provides the show() function to display them.
Input:
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
scatter_matrix(df_iris,figsize=(10,10))
plt.show()
Output:
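We can also use the Seaborn library to create pairplots of all features in the dataset against each other. To use Seaborn, we first need to import it. Note that sns.load_dataset("iris") fetches Seaborn’s own copy of the Iris data from its online example repository, so it needs an Internet connection the first time it runs; this copy names its columns sepal_length, sepal_width, petal_length, petal_width, and species. Let’s see how to create a Seaborn pairplot.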
Input:
import seaborn as sns
sns.set(style="ticks", color_codes=True)
dfiris = sns.load_dataset("iris")
sns.pairplot(dfiris, hue="species")
We can also use a different color palette, using the palette parameter of pairplot, as shown below:
import seaborn as sns
sns.set(style="ticks", color_codes=True)
dfiris = sns.load_dataset("iris")
sns.pairplot(dfiris, hue="species", palette="husl")
Output:
Learning and Predicting
The scatterplot matrix that we created with Pandas is useful only to a limited extent. It shows that the iris plants fall into groups and that there are some relationships between the fields or features. However, because all data points are drawn in a single color, it is hard to tell which group corresponds to which species.
Luckily, we can overcome this problem by using the Seaborn module for data visualization in Python. This is exactly what we did by creating a pairplot of the dataset with the species as the hue. We created two Seaborn pairplots with two different color palettes, and we can refer to whichever one makes it easier for us to draw conclusions.
Selecting Features/Fields:
Now that we are familiar with the data and have created some visualizations, let’s pick which features or fields of the dataset we will use to perform machine learning and generate predictions. We should choose the features that make the most sense for our machine learning model.
Then again, why choose features at all? It is reasonable to ask why we can’t simply use all features and let the model decide which ones are most pertinent to our problem. The answer is that not all features carry useful information, and including features merely for the sake of including data can make the model needlessly slow and less accurate.
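The tutorial picks features by looking at the plots. As a complementary sketch, and not part of the original workflow, a univariate ANOVA F-test (f_classif in sklearn.feature_selection) gives a rough numeric score of how well each feature on its own separates the three species. A higher F value suggests a more informative feature.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()

# f_classif returns one F statistic and one p-value per feature.
f_scores, p_values = f_classif(iris.data, iris.target)

# Sort features from most to least discriminative according to the F statistic.
ranked = sorted(zip(iris.feature_names, f_scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: F = {score:.1f}")
```

A univariate test ignores interactions between features, so it is only a first indication and not a replacement for looking at the data.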
Looking again at the pairplot, petal length and petal width separate the three species most clearly. It is also noticeable that the boundary between Iris versicolor and Iris virginica seems fuzzy, which might be a problem for some classifiers. We will have to keep that in mind. Still, these two features give the most noticeable grouping between the species; hence, we will use them for our machine learning model in the rest of this tutorial.
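To look at the two petal features on their own, the sketch below draws petal length against petal width with one color per species, using the copy of the dataset bundled with Scikit-learn. The Agg backend and the output file name iris_petals.png are assumptions made so that the script runs without a display; in an interactive session, plt.show() can be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]
petal_width = iris.data[:, 3]

fig, ax = plt.subplots(figsize=(6, 4))
# Draw one scatter series per species so that each gets its own color and legend entry.
for code, species in enumerate(iris.target_names):
    mask = iris.target == code
    ax.scatter(petal_length[mask], petal_width[mask], label=species)

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend()
fig.savefig("iris_petals.png")  # hypothetical output file name
```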
Preparing Data
Right now, we have the data in a Pandas DataFrame. Before we start with the machine learning model, we convert the data into NumPy arrays. Sklearn can also accept a Pandas DataFrame directly, but it works with NumPy arrays internally, so converting makes the inputs explicit. We start with the species labels, which can be converted using the following commands:
import numpy as np
labels = np.asarray(dfiris.species)
Sklearn comes with a tool, LabelEncoder(), that can encode label strings into numeric representations. It collects the unique labels, sorts them, and assigns 0 to the first, 1 to the next, and so on. Let’s see how to use it:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)
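To see what the encoder does, the following sketch fits it on a small hand-written label array (an assumption made for illustration, not the tutorial’s data) and then inspects the learned classes and the reverse mapping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A tiny, deliberately unsorted label array used only for this illustration.
labels = np.asarray(["virginica", "setosa", "versicolor", "setosa"])

le = LabelEncoder()
encoded = le.fit_transform(labels)  # fit and transform in one step

print(list(le.classes_))                        # sorted unique labels
print(encoded.tolist())                         # integer code for each input label
print(le.inverse_transform(encoded).tolist())   # codes mapped back to strings
```

The classes_ attribute shows that the codes follow the sorted order of the labels, not the order in which they first appear, and inverse_transform recovers the original strings when predictions need to be reported by name.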
Now, we will remove all features that we don’t need from our DataFrame using the drop() method:
df_selected1 = dfiris.drop(['sepal_length', 'sepal_width', "species"], axis=1)
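After this, the only features we are left with are petal length and petal width. Next, we convert these remaining columns into a NumPy array, here by turning each row into a dictionary and passing the records through DictVectorizer: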
df_features = df_selected1.to_dict(orient='records')
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
features = vec.fit_transform(df_features).toarray()
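DictVectorizer is mainly intended for dictionaries that mix numeric and string-valued fields. For columns that are already numeric, a direct conversion gives the same array. The sketch below compares the two routes; it builds the DataFrame from Scikit-learn’s bundled copy of the data, with column names chosen by hand to match the Seaborn version, so that it runs without a download.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_extraction import DictVectorizer

iris = load_iris()
# Column names are assumptions chosen to mirror the Seaborn copy of the dataset.
df = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
df_selected = df.drop(["sepal_length", "sepal_width"], axis=1)

# Route 1: list of per-row dictionaries passed through DictVectorizer.
vec = DictVectorizer()
features_dv = vec.fit_transform(df_selected.to_dict(orient="records")).toarray()

# Route 2: direct conversion of the numeric columns.
features_np = df_selected.to_numpy()

print(vec.feature_names_)                      # column order used by DictVectorizer
print(np.allclose(features_dv, features_np))   # whether both routes agree
```

DictVectorizer orders its output columns by sorted feature name, which here happens to match the DataFrame’s column order.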
Training Set and Test Set
Using the last command, we have converted the selected features into a NumPy array. The next step is splitting up the data into training and test sets. Again, Sklearn has a tool to do that called train_test_split. All we have to do is to import it and use it as follows:
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
features, labels, test_size=0.20, random_state=0)
Our test and training sets are ready. Now, let’s perform classification using Machine Learning algorithms or approaches, and then we will compare test accuracy of all classifiers on the test data.
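A quick check of the split sizes can catch mistakes early. The sketch below repeats the split with the same test_size and random_state, taking the two petal columns straight from Scikit-learn’s copy of the dataset (an assumption that keeps the example self-contained), and prints the resulting shapes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features = iris.data[:, 2:4]  # petal length and petal width
labels = iris.target

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0)

# With 150 records and test_size=0.20, 30 records go to the test set.
print(features_train.shape, features_test.shape)
print(labels_train.shape, labels_test.shape)
```

Fixing random_state makes the split reproducible; passing stratify=labels is an option when the class proportions should be preserved in both parts.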
Building a Model and Choosing a Classifier
As we have already discussed in the benefits section, Scikit-learn comes with a flowchart to help users decide which Machine Learning algorithm will suit their dataset best. We are going to use it as a reference to identify which algorithms we should try on our data. The flowchart is available on Scikit-learn’s official website.
Using the following list, let’s see which category we fall into:
- Number of samples: Our number of samples is more than 50 and less than 100,000
- Whether the data is labeled: We have labeled data
- Is a category being predicted?: Yes, we are predicting the category (species) of the Iris plants
So, going through the flowchart, we can try out the following algorithms on our dataset:
- SVM (Support Vector Machine)
- K-Nearest Neighbors Classifier
SVM (Support Vector Machine)
In Machine Learning, an SVM, or support vector machine, is a supervised learning algorithm that analyzes labeled data and builds a model that can be used for classification or regression.
Here, in our case, we are using the SVM model for classification.
Computing accuracy using the test set:
from sklearn.svm import SVC
svm_model_linear = SVC(kernel = 'linear', C = 1).fit(features_train, labels_train)
svm_predictions = svm_model_linear.predict(features_test)
accuracy = svm_model_linear.score(features_test, labels_test)
print("Test accuracy:",accuracy)
Output:
Test accuracy: 1.0
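A single accuracy number does not show which species get confused with which. As a minimal sketch, a confusion matrix breaks the test predictions down per class. The example rebuilds the features from Scikit-learn’s copy of the data so that it is self-contained, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

model = SVC(kernel="linear", C=1).fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are the true classes, columns the predicted classes (0, 1, 2).
cm = confusion_matrix(y_test, predictions, labels=[0, 1, 2])
print(cm)
```

Off-diagonal entries count misclassified samples, so a matrix with non-zero values only on the diagonal corresponds to perfect test accuracy.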
Computing accuracy using Train set:
from sklearn.svm import SVC
svm_model_linear = SVC(kernel = 'linear', C = 1).fit(features_train, labels_train)
svm_predictions = svm_model_linear.predict(features_train)
accuracy = svm_model_linear.score(features_train, labels_train)
print("Train accuracy:", accuracy)
Output:
Train accuracy: 0.9583333333333334
Now, we can compare the train set accuracy and the test set accuracy that we have computed to find out whether our model is over-fitting.
Model over-fitting is a modeling error where the function fits too closely to a limited set of data points, so the model scores well on the training data but noticeably worse on unseen data.
As we can see, the test accuracy is not lower than the train accuracy, so there is no sign that our model is over-fitting.
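With only 30 test records, a single train/test comparison can be noisy. As a complementary sketch, and not part of the original workflow, k-fold cross-validation repeats the evaluation on several different splits and reports one accuracy per fold. The choice of five folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width
y = iris.target

# Five-fold cross-validation: train on four fifths, score on the remaining fifth, five times.
scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=5)

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

If the fold accuracies are close to each other and to the training accuracy, that supports the conclusion that the model is not over-fitting.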
K-Nearest Neighbors Classifier
KNN, or K-nearest neighbors, is a non-parametric learning method in Machine Learning, mainly used for classification and regression. It is considered one of the simplest algorithms in Machine Learning.
Computing accuracy using the test set:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 7).fit(features_train, labels_train)
accuracy = knn.score(features_test, labels_test)
print("Test accuracy:", accuracy)
Output:
Test accuracy: 1.0
Computing accuracy using Train set:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 7).fit(features_train, labels_train)
accuracy = knn.score(features_train, labels_train)
print("Train accuracy:", accuracy)
Output:
Again, we can use train set accuracy and test set accuracy to find out if the model is over-fitting.
Note: We don’t have to worry if the end results are slightly different; the accuracy of these classifiers can vary slightly, for example between library versions.
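The value n_neighbors = 7 used above is one choice among many. The sketch below, which rebuilds the same split from Scikit-learn’s copy of the data, tries a few values of k and records the train and test accuracy for each, so that the two can be compared side by side. The list of k values is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

results = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for this value of k.
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k}: train={results[k][0]:.3f}, test={results[k][1]:.3f}")
```

A train accuracy far above the test accuracy for small k would hint at over-fitting, while similar values suggest that the model generalizes.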
Who Is Using Python Scikit-learn?
Scikit-learn is being extensively used by some big names in the industry. Some of them are listed below:
- Spotify: Spotify has been using Scikit-learn for a long time because of the features and models it provides. They use Scikit-learn mainly for music recommendations.
- Change.org: Scikit-learn’s random forest classifier is used at Change.org to drive email targeting. Scikit-learn is easy to use and supports a variety of classifiers, which makes it one of the top choices for implementing Machine Learning algorithms.
- Bestofmedia Group: Scikit-learn is used for various tasks at Bestofmedia, such as click prediction, spam fighting, and more.
- Data Publica: Data Publica is yet another big organization using Scikit-learn for building models and using them to identify potential future customers by performing predictive analysis.
Recommended Audience
- Entry-level and advanced-level Programmers in Python who want to widen their skill set
- Data Analysts and Professionals who work specifically in the field of data and datasets in the real world
- Professionals who want to learn Python and start a career in Big Data
- Professionals who want a career in Artificial Intelligence
Prerequisites in Scikit-Learn Tutorial
Knowledge prerequisites:
- Some experience in Python would be useful
- Prior knowledge of Machine Learning is recommended. For this, take a look at the Machine Learning tutorial by Intellipaat
Conclusion
Scikit-learn has shown its value by helping professionals implement predictive models. It is used outside of the IT sector as well and has numerous applications across a wide range of industries. It can be used to implement machine learning and can be combined with data visualizations to make the results easier to interpret. With all of its advantages, it is clear why Scikit-learn is so widely used. That would be all for this module in Python Tutorial. I hope it was helpful and informative!