MLlib Cheat Sheet

Introduction

MLlib is Apache Spark's machine learning library. It includes learning algorithms and utilities that help programmers practice and use machine learning easily. To work with machine learning, you need to know its basic concepts and the algorithms required to get started.
This cheat sheet covers the basic MLlib concepts and packages you need to know. It is useful for beginners and experienced practitioners alike as a quick reference to what MLlib offers.

 

MLlib: Apache Spark's scalable machine learning library; it consists of popular algorithms and utilities
Observations: The items or data points used for learning and evaluation
Features: The characteristics or attributes of an observation
Labels: The values assigned to observations
Training or test data: The observations used to train a learning algorithm (training data) or to evaluate it (test data)
Data Source: MLlib can read from HDFS and HBase, which allows it to be plugged into Hadoop workflows
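
The terms above can be illustrated with a small plain-Python sketch. It does not use Spark; the `Observation` type and the `train_test_split` helper are hypothetical names introduced only for this example, and the toy data is made up.

```python
import random
from typing import NamedTuple


class Observation(NamedTuple):
    """One data point: a label plus its feature values (hypothetical type)."""
    label: float
    features: tuple


def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle a copy of the data and split it into (training, test) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Eight toy observations: features are (height_cm, weight_kg), label is 0.0 or 1.0.
observations = [
    Observation(0.0, (150.0, 50.0)),
    Observation(0.0, (155.0, 53.0)),
    Observation(0.0, (160.0, 58.0)),
    Observation(0.0, (162.0, 60.0)),
    Observation(1.0, (175.0, 75.0)),
    Observation(1.0, (180.0, 82.0)),
    Observation(1.0, (185.0, 88.0)),
    Observation(1.0, (190.0, 95.0)),
]

training, test = train_test_split(observations)
print(len(training), len(test))  # 6 2
```

In MLlib itself, the `LabeledPoint` class plays roughly the role of `Observation` here: it pairs a label with a feature vector.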

MLlib Packages

MLlib contains two packages:

  • mllib: the original API, built on top of RDDs
  • ml: the newer, higher-level API, built on top of DataFrames

To use MLlib, import the classes you need:

    • In Scala:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
    • In Java:
import org.apache.spark.mllib.linalg.Vector;
    • In Python:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
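
A sparse vector stores only the non-zero entries of a vector, as a size plus parallel lists of indices and values. The following is a minimal plain-Python sketch of that layout, not the Spark class itself; `sparse_to_dense` and `dense_to_sparse` are hypothetical helper names used only for illustration.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse layout into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


dense = sparse_to_dense(4, [1, 3], [1.0, 5.5])
print(dense)                   # [0.0, 1.0, 0.0, 5.5]
print(dense_to_sparse(dense))  # (4, [1, 3], [1.0, 5.5])
```

The sparse form pays off when most entries are zero, which is common for text features and one-hot encoded categories.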


Spark MLlib Tools

  • ML Algorithms: Common learning algorithms such as classification, regression, clustering, and collaborative filtering. These algorithms form the core of MLlib
  • Featurization: It includes feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: Pipelines provide tools for constructing, evaluating, and tuning ML pipelines
  • Persistence: It helps in saving and loading algorithms, models, and pipelines
  • Utilities: It provides utilities for linear algebra, statistics, and data handling


MLlib algorithms

MLlib includes the following popular algorithms and utilities:

  • Basic Statistics: It includes the most basic machine learning techniques, such as:
    • Summary statistics
    • Correlation
    • Stratified sampling
    • Hypothesis testing
  • Regression: It is a statistical approach to estimating the relationship among variables. It is widely used for prediction and forecasting
  • Classification: It identifies which of a set of categories a new observation belongs to
  • K-means: Although it is sometimes loosely described as classification, k-means is a clustering algorithm. It assigns every observation (vector) to one of k clusters
  • Recommendation system: It is a subclass of information filtering systems that seeks to predict the preference or rating a person would give to an item. This can be done in two ways:
    • Collaborative filtering: It builds a model from a user's past behavior as well as similar decisions made by other users. The model is then used to predict the items in which the user might have an interest
    • Content-based filtering: It uses a series of discrete characteristics of an item to recommend more items with similar properties
  • Clustering (for example, k-means): It is the task of grouping a set of objects so that the objects in the same group are more similar to each other than to the objects in other groups
  • Dimensionality Reduction: It is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into two types:
    • Feature selection: It finds a subset of the original variables (also called attributes)
    • Feature extraction: It transforms the data from a high-dimensional space to a space of fewer dimensions
  • Feature extraction: It starts from an initial set of measured data and builds derived values (features)
  • Optimization: It is the selection of the best element from a set of available alternatives
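
To make the clustering entry concrete, here is a minimal plain-Python sketch of k-means (Lloyd's algorithm). It is an illustration of the idea only and does not use Spark; the `kmeans` function, its parameters, and the toy points are assumptions made for this example. In Spark, the corresponding estimator is `KMeans` in `pyspark.ml.clustering`.

```python
import math


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = [tuple(c) for c in centers]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its member points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignments = kmeans(points, centers=[points[0], points[3]])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The sketch takes the initial centers as an argument to stay deterministic; real implementations usually pick them with a randomized initialization scheme.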


MLlib components

Main concepts in Pipeline

MLlib standardizes its APIs so that multiple algorithms can easily be combined into a single pipeline, or workflow

  • DataFrame: The ML API uses the DataFrame from Spark SQL as its dataset, which can hold a variety of data types
  • Transformer: It transforms one DataFrame into another DataFrame. Examples are:
    • HashingTF (hashing term frequency): It maps a sequence of terms to a fixed-length vector of term frequencies, counting how often each term occurs
    • LogisticRegressionModel: The model that results from fitting logistic regression to a dataset
    • Binarizer: It compares each value with a given threshold and changes it to 1 (above the threshold) or 0 (otherwise)

  • Estimator: It is an algorithm that can be fit on a DataFrame to produce a Transformer. Examples are:
    • LogisticRegression: It processes the DataFrame to determine the weights for the resulting LogisticRegressionModel
    • StandardScaler: It computes the mean and standard deviation of each feature and produces a model that standardizes the features
    • Pipeline: Calling fit on a Pipeline produces a PipelineModel, which contains only Transformers and no Estimators
  • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify the ML workflow
  • Parameters: Transformers and Estimators share a common API for specifying parameters
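
The relationship between Estimators, Transformers, and Pipelines can be sketched in plain Python. This is a simplified illustration, not the Spark API: the classes below borrow MLlib's names but are hypothetical stand-ins, and a plain list of floats stands in for a DataFrame column. For simplicity, this `StandardScaler` both centers and scales the data.

```python
import statistics


class Binarizer:
    """Transformer: maps each value to 1.0 if it exceeds the threshold, else 0.0."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, column):
        return [1.0 if x > self.threshold else 0.0 for x in column]


class StandardScalerModel:
    """Transformer produced by fitting a StandardScaler."""

    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def transform(self, column):
        return [(x - self.mean) / self.std for x in column]


class StandardScaler:
    """Estimator: fit() learns the mean and standard deviation of the data."""

    def fit(self, column):
        return StandardScalerModel(statistics.mean(column), statistics.stdev(column))


class PipelineModel:
    """Fitted pipeline: holds Transformers only and applies them in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, column):
        for t in self.transformers:
            column = t.transform(column)
        return column


class Pipeline:
    """Chains Estimators and Transformers; fit() returns a PipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, column):
        fitted = []
        for stage in self.stages:
            # An Estimator is fit and replaced by the Transformer it produces.
            model = stage.fit(column) if hasattr(stage, "fit") else stage
            column = model.transform(column)
            fitted.append(model)
        return PipelineModel(fitted)


data = [2.0, 4.0, 6.0, 8.0]
model = Pipeline([StandardScaler(), Binarizer(threshold=0.0)]).fit(data)
result = model.transform(data)
print(result)  # [0.0, 0.0, 1.0, 1.0]
```

Note how the fitted pipeline contains a `StandardScalerModel` rather than the `StandardScaler` it was built from: fitting replaces every Estimator with the Transformer it produced.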




With this, we come to the end of the MLlib cheat sheet.



About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Aakash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.