Introduction to Spark MLlib

Apache Spark ships with a library named MLlib for performing machine learning tasks on the Spark framework. Since Apache Spark has a Python API, PySpark, we can use this machine learning library from PySpark as well. MLlib contains many algorithms and machine learning utilities.

In this tutorial, we will see how to use machine learning in PySpark. We are going to use a Fortune 500 dataset that contains information about the top 5 companies ranked by Fortune 500 in the year 2017, and we will be using its first five fields. You can download the dataset by clicking HERE.

The dataset looks like:

Rank | Title | Website | Employees | Sector
1 | Walmart | http://www.walmart.com | 2300000 | Retailing
2 | Berkshire Hathaway | http://www.berkshirehathaway.com | 367700 | Financials
3 | Apple | http://www.apple.com | 116000 | Technology
4 | Exxon Mobil | http://www.exxonmobil.com | 72700 | Energy
5 | McKesson | http://www.mckesson.com | 68000 | Wholesalers

In this Spark ML tutorial, we will use machine learning to determine which of these fields is the most important for predicting the rankings of the above-mentioned companies in the coming years. Since we will be using dataframes to implement machine learning, let's start off with the basics of dataframes in PySpark, so as to get us started with machine learning in PySpark.

What is Machine Learning?

Machine learning is one of the many applications of artificial intelligence (AI) whose primary aim is to enable computers to learn automatically without human assistance. With the help of machine learning, computers can tackle tasks that were, until now, handled and carried out only by people. It is essentially the process of teaching a system to make accurate predictions when fed the right data, giving it the ability to learn and improve from experience without being explicitly programmed for the task. Machine learning focuses on developing computer programs and algorithms that learn from the provided data and make predictions.

Dataframes

What are dataframes?

A dataframe is the newer data API for Apache Spark. It is a distributed collection of data organized into named columns. A dataframe is equivalent to what a table is for a relational database, except that it comes with richer optimizations under the hood.

How to create dataframes

There are multiple ways to create dataframes in Apache Spark:

  • Dataframes can be created using an existing RDD
  • You can create a dataframe by loading a CSV file directly
  • You can programmatically specify a schema to create a dataframe as well

In this tutorial, we are going to use a dataframe created directly from an existing CSV file; a quick sketch of the RDD approach is shown below for reference.
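Here is a minimal sketch of the first approach, building a dataframe from an existing RDD. It assumes the SparkContext sc and SQLContext sqlContext initialized in the loading step further below; the column names and values are purely illustrative:

# Build a dataframe from an existing RDD of tuples by naming its columns
rdd = sc.parallelize([(1, 'Walmart'), (2, 'Berkshire Hathaway')])
df = sqlContext.createDataFrame(rdd, ['Rank', 'Title'])
df.show()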

What is PySpark MLlib?

Basic intro to PySpark MLlib

Spark MLlib is short for the Spark machine learning library. Machine learning in PySpark is easy to use and scalable, and it works on distributed systems. Through PySpark MLlib, we get access to a variety of machine learning algorithms, such as regression and classification, for data analysis.

Parameters in PySpark MLlib

Some of the main parameters of PySpark MLlib, as used by its ALS recommendation algorithm, are listed below (a short sketch follows the list):

  • Ratings: an RDD of ratings, given as rows or tuples of (user, product, rating)
  • Rank: the number of latent features (factors) to compute
  • Lambda: the regularization parameter
  • Blocks: the number of blocks used to parallelize the computation; the default value of -1 lets Spark choose automatically
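Here is a minimal sketch of where these parameters appear in practice, using the ALS API from pyspark.mllib.recommendation; the rating values are made up for illustration:

from pyspark.mllib.recommendation import ALS, Rating

# Ratings: an RDD of (user, product, rating) tuples
ratings = sc.parallelize([
    Rating(0, 0, 4.0),
    Rating(0, 1, 2.0),
    Rating(1, 1, 3.0),
])

model = ALS.train(
    ratings,        # the ratings RDD described above
    rank=10,        # number of latent features to compute
    iterations=5,
    lambda_=0.01,   # regularization parameter
    blocks=-1       # -1 lets Spark pick the degree of parallelism
)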

Performing Linear Regression on a Real-World Dataset

Let's understand machine learning better by implementing full-fledged linear regression on the dataset of the top 5 Fortune 500 companies in the year 2017.

Loading the data:

As mentioned above, we are going to use a dataframe that we have created directly from a CSV file. Following are the commands to load the data into a dataframe and to view the loaded data.
Input:
In [1]:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
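Note that on Spark 2.x and later, the usual entry point is a SparkSession, which bundles the SparkContext and SQLContext. A minimal equivalent sketch (the application name here is arbitrary):

from pyspark.sql import SparkSession

# SparkSession is the unified entry point in Spark 2.x+
spark = SparkSession.builder.appName('Fortune500ML').getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext, if you still need it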

In [2]:

company_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('C:/Users/intellipaat/Downloads/spark-2.3.2-bin-hadoop2.7/Fortune5002017.csv')
company_df.take(1)

You can choose the number of rows you want to view while displaying the data of a dataframe. I have displayed the first row only.
Output:
Out[2]:

[Row(Rank=1, Title='Walmart', Website='http://www.walmart.com', Employees=2300000, Sector='Retailing')]
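take(n) returns the rows as a list of Row objects; if you would rather see the rows printed as a formatted table, show() does that. A quick sketch:

# show(n) prints the first n rows of the dataframe as a formatted table
company_df.show(5)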

Data exploration:

To check the datatype of every column of a dataframe and print the schema of the dataframe in a tree format, you can use the following commands respectively.
Input:
In[3]:

company_df.cache()
company_df.printSchema()

Output:
Out [3]:

DataFrame[Rank: int, Title: string, Website: string, Employees: int, Sector: string]
root
 |-- Rank: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Website: string (nullable = true)
 |-- Employees: integer (nullable = true)
 |-- Sector: string (nullable = true)

Performing Descriptive Analysis:

Input:
In [4]:

company_df.describe().toPandas().transpose()

Output:
Out [4]:

          | 0     | 1        | 2                  | 3             | 4
summary   | count | mean     | stddev             | min           | max
Rank      | 5     | 3.0      | 1.5811388300841898 | 1             | 5
Title     | 5     | None     | None               | Apple         | Walmart
Website   | 5     | None     | None               | www.apple.com | www.walmart.com
Employees | 5     | 584880.0 | 966714.2168190142  | 68000         | 2300000
Sector    | 5     | None     | None               | Energy        | Wholesalers

Finding correlation between independent variables:

To find out whether any of the variables, i.e., the fields, have correlations or dependencies between them, we can plot a scatter matrix. Plotting a scatter matrix is one of the best ways in machine learning to identify linear correlations, if any.
You can plot a scatter matrix on your dataframe using the following code:
Input:
In [5]:

import pandas as pd

numeric_features = [t[0] for t in company_df.dtypes if t[1] == 'int' or t[1] == 'double']
sampled_data = company_df.select(numeric_features).sample(False, 0.8).toPandas()
# On newer pandas versions, scatter_matrix lives under pd.plotting
axs = pd.plotting.scatter_matrix(sampled_data, figsize=(10, 10))
n = len(sampled_data.columns)
for i in range(n):
    v = axs[i, 0]
    v.yaxis.label.set_rotation(0)
    v.yaxis.label.set_ha('right')
    v.set_yticks(())
    h = axs[n-1, i]
    h.xaxis.label.set_rotation(90)
    h.set_xticks(())

Output:
Out [5]:
[Scatter matrix plot of the numeric columns, "Rank" and "Employees"]
Here we can conclude that, in our dataset, the "Rank" and "Employees" columns are correlated. Let's dig a little deeper into finding the correlation specifically between these two columns.

Correlation between independent variables:

Input:
In [6]:

import six

for i in company_df.columns:
    if not isinstance(company_df.select(i).take(1)[0][0], six.string_types):
        print("Correlation to Employees for", i, company_df.stat.corr('Employees', i))

Output:
Out [6]:

Correlation to Employees for Rank -0.778372714650932
Correlation to Employees for Employees 1.0

The value of correlation ranges from -1 to 1. The closer it is to 1, the stronger the positive correlation between the fields; the closer it is to -1, the stronger the negative correlation. You can now analyze your output to see whether there is a correlation, and if there is, whether it is strongly positive or negative.
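If you only need the correlation for one specific pair of columns, you can also query it directly; a one-line sketch using the same dataframe:

# Pearson correlation between the 'Rank' and 'Employees' columns
print(company_df.stat.corr('Rank', 'Employees'))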

Preparing Data:

In [7]:

from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=['Rank', 'Employees'], outputCol='features')
tcompany_df = vectorAssembler.transform(company_df)
tcompany_df = tcompany_df.select(['features', 'Employees'])
tcompany_df.show(3)

Out [7]:
[Output: the first three rows of tcompany_df, showing the assembled "features" vector alongside the "Employees" column]
In [8]:

splits = tcompany_df.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]
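A quick note: randomSplit shuffles the data randomly, so the 70/30 split will differ between runs. If you want a reproducible split, you can pass a seed; a minimal sketch (the seed value is arbitrary):

# Passing a seed makes the 70/30 split reproducible across runs
splits = tcompany_df.randomSplit([0.7, 0.3], seed=42)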

Linear Regression:

Input:
In [10]:

from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol='features', labelCol='Employees', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Output:

Out [10]:

Coefficients: [-32251.88812374517, 0.9255193858709874]
Intercept: 140317.88600801243

After performing linear regression on our dataset, we come to the conclusion that "Employees" is the most important field or feature in our dataset for predicting the ranks of the companies in the future. "Rank" has a linear correlation with "Employees", indicating that the number of employees in a particular year has a direct impact on a company's rank.
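As a follow-up, not part of the original walkthrough, one would typically check how well the fitted model generalizes by evaluating it on the held-out test split. A minimal sketch using the model and test dataframe from above (RMSE and r2 are standard regression metrics):

from pyspark.ml.evaluation import RegressionEvaluator

# Metrics on the training data come for free from the fitted model's summary
print("Training RMSE: " + str(lr_model.summary.rootMeanSquaredError))
print("Training r2: " + str(lr_model.summary.r2))

# Evaluate on the held-out test split
predictions = lr_model.transform(test_df)
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='Employees', metricName='rmse')
print("Test RMSE: " + str(evaluator.evaluate(predictions)))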

Machine learning in Industry

Computer systems that can learn from given data, make predictions, and improve themselves without having to be reprogrammed used to be only a dream, but in recent years this has been made possible by machine learning. Machine learning is now one of the most widely used branches of artificial intelligence and is being adopted by big industries to benefit their businesses.
Following are some of the organisations where machine learning has various use cases:

  • PayPal: PayPal uses machine learning to detect suspicious activity.
  • IBM: IBM has patented a machine learning technology that helps decide when to hand over control of a self-driving vehicle between the vehicle's control processor and a human driver.
  • Google: Machine learning is used to gather information from users, which is then used to improve the search engine's results.
  • Walmart: Machine learning is used at Walmart to improve operational efficiency.
  • Amazon: Machine learning is used to design and implement personalized product recommendations.
  • Facebook: Machine learning is used to filter out poor-quality content.

Conclusion

Machine learning represents a step forward in how computers can learn and make predictions. It has applications in various sectors and is used extensively. Knowledge of machine learning will not only open multiple doors of opportunity for you but also help keep you in demand, since the field has been gaining popularity ever since it came into the picture and shows no sign of stopping. So, without any further ado, check out the Machine Learning certification by Intellipaat and get started with machine learning today!

That's all for this tutorial. We hope you learned something here.
