MLlib is a machine learning library that bundles learning algorithms and utilities so that programmers can practice and apply machine learning easily. To work with machine learning, you first need to know its basic concepts and the algorithms to start with.
This cheat sheet walks through the basic MLlib concepts, packages and algorithms you need to know. It is useful for beginners and experienced practitioners alike.

MLlib: Apache Spark's scalable machine learning library, consisting of popular algorithms and utilities
Observations: The items or data points used for learning and evaluation
Features: The characteristics or attributes of an observation
Labels: The values assigned to an observation
Training or test data: The observations a learning algorithm uses for training (fitting the model) and for testing (evaluating it)
Data Source: MLlib can access HDFS and HBase, which lets it plug into Hadoop workflows
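
The terms above can be made concrete with a small plain-Python sketch. It does not use Spark: the `Observation` type and the `split_train_test` helper are hypothetical names introduced only for illustration (in MLlib itself, a class such as `LabeledPoint` plays the role of a labeled observation), and the data is made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Observation:
    """One data point: a label plus its feature values (hypothetical helper type)."""
    label: float
    features: Tuple[float, ...]


def split_train_test(data: List[Observation], test_every: int = 4):
    """Deterministic split: every `test_every`-th observation goes to the test set."""
    train = [obs for i, obs in enumerate(data) if (i + 1) % test_every != 0]
    test = [obs for i, obs in enumerate(data) if (i + 1) % test_every == 0]
    return train, test


# Eight made-up observations, each with two features and a binary label.
observations = [
    Observation(label=float(i % 2), features=(float(i), float(i * i)))
    for i in range(8)
]
train_data, test_data = split_train_test(observations)
print(len(train_data), len(test_data))
```

A real project would normally split at random rather than by position; the fixed rule here just keeps the sketch reproducible.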

MLlib Packages:

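MLlib contains two packages:

  • mllib: the original RDD-based API (spark.mllib)
  • ml: the newer DataFrame-based API (spark.ml), which also provides Pipelines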

To use MLlib, import the classes you need, for example:

    • In Scala:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
    • In Java:
    • In Python:
from pyspark.mllib.regression import LabeledPoint

Spark MLlib Tools:

  • ML Algorithms: Common learning algorithms such as classification, regression, clustering and collaborative filtering. These algorithms form the core of MLlib
  • Featurization: Feature extraction, transformation, dimensionality reduction and selection
  • Pipelines: Tools for constructing, evaluating and tuning ML pipelines
  • Persistence: Saving and loading algorithms, models and pipelines
  • Utilities: Utilities for linear algebra, statistics and data handling

MLlib algorithms:

MLlib includes the following popular algorithms and utilities:

  • Basic Statistics: The most basic machine learning techniques, such as:
    • Summary statistics
    • Correlation
    • Stratified sampling
    • Hypothesis testing
  • Regression: A statistical approach to estimating the relationship among variables. It is widely used for prediction and forecasting
  • Classification: Identifies which of a set of categories a new observation belongs to
    • K-means: Strictly speaking a clustering algorithm rather than a classifier. It assigns every observation (vector) to one of k clusters
  • Recommendation system: A subclass of information filtering systems that seeks to predict the preference or rating a person would give to an item. This can be done in two ways:
    • Collaborative filtering: Builds a model from a user's past behavior as well as similar decisions made by other users. The model is then used to predict items the user might be interested in
    • Content-based filtering: Uses a series of discrete characteristics of an item to recommend more items with similar properties
  • Clustering: The task of grouping a set of objects so that objects in the same group are more similar to each other than to objects in other groups. K-means is a common example
  • Dimensionality Reduction: The process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into two types:
    • Feature selection: Finds a subset of the original variables (also called attributes)
    • Feature extraction: Transforms the data from a high-dimensional space to a space of fewer dimensions
  • Feature extraction: Starts from an initial set of measured data and builds derived values (features)
  • Optimization: The selection of the best element, with respect to some criterion, from a set of available alternatives
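
To illustrate how k-means groups observations into clusters, here is a minimal plain-Python sketch of the standard iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not MLlib's implementation and does not use Spark; the function names, the sample points and the caller-supplied initial centers are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def mean_point(points: List[Point]) -> Point:
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points: List[Point], centers: List[Point], max_iter: int = 20):
    """Alternate assignment and center update until the centers stop moving."""
    assignments: List[int] = []
    for _ in range(max_iter):
        assignments = [nearest(p, centers) for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_point(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignments


# Two visually separate groups of made-up 2-D points.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centers, labels = kmeans(data, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

Note that no labels are given to the algorithm: the cluster indices come out of the data alone, which is what distinguishes clustering from classification.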

Main concepts in Pipeline:

MLlib standardizes its APIs so that multiple algorithms can easily be combined into a single pipeline, or workflow

  • DataFrame: The ML API uses the DataFrame from Spark SQL as its dataset, which can hold a variety of data types
  • Transformer: An algorithm that transforms one DataFrame into another DataFrame. Examples are:
    • HashingTF (hashing term frequency): Maps each document's terms to a fixed-length vector of term frequencies, that is, how often each term occurs
    • Logistic Regression Model: The model that results from fitting logistic regression to a dataset
    • Binarizer: Thresholds numeric values, turning values above a given threshold into 1 and the rest into 0
  • Estimator: An algorithm that can be fit on a DataFrame to produce a Transformer. Examples are:
    • Logistic Regression: Processes the DataFrame to determine the weights of the resulting Logistic Regression Model
    • StandardScaler: Computes summary statistics such as the standard deviation of each feature, producing a model that rescales the features
    • Pipeline: Calling fit on a Pipeline produces a PipelineModel, which contains only Transformers and no Estimators
  • Pipeline: Chains multiple Transformers and Estimators together to specify the ML workflow
  • Parameters: Transformers and Estimators share a common API for specifying parameters
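
The relationship between Transformers, Estimators and Pipelines can be sketched in plain Python. This is only an illustration of the fit/transform contract, not Spark code: the classes below are simplified stand-ins that work on lists of numbers instead of DataFrames, `Scaler` and `ScalerModel` are hypothetical names, and the scaling rule (subtract the mean, divide by the sample standard deviation) is an assumption chosen for the example.

```python
from statistics import mean, stdev
from typing import List


class Binarizer:
    """Transformer: maps values above `threshold` to 1.0 and the rest to 0.0."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def transform(self, values: List[float]) -> List[float]:
        return [1.0 if v > self.threshold else 0.0 for v in values]


class ScalerModel:
    """Transformer produced by fitting Scaler; holds the learned statistics."""

    def __init__(self, mu: float, sigma: float):
        self.mu, self.sigma = mu, sigma

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mu) / self.sigma for v in values]


class Scaler:
    """Estimator: fit() learns mean and standard deviation, returns a ScalerModel."""

    def fit(self, values: List[float]) -> ScalerModel:
        return ScalerModel(mean(values), stdev(values))


class PipelineModel:
    """Fitted pipeline: a chain made up of transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


class Pipeline:
    """Estimator that chains stages; fit() replaces each estimator with its model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            # An estimator has fit(); a plain transformer is used as-is.
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return PipelineModel(fitted)


pipeline = Pipeline([Scaler(), Binarizer(threshold=0.0)])
model = pipeline.fit([2.0, 4.0, 6.0, 8.0])
print(model.transform([1.0, 5.0, 9.0]))
```

After fitting, the estimator stage has been replaced by the model it produced, so the fitted pipeline can be applied to new data with `transform` alone.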

With this, we come to the end of the MLlib cheat sheet.
