Introduction:
MLlib is Apache Spark's machine learning library. It includes learning algorithms and utilities that help programmers practice and apply machine learning easily. To work with machine learning, you need to know the basic concepts and the algorithms required to get started.
This cheat sheet guides you through the basic concepts and libraries of machine learning you need to know. It helps beginners and experienced practitioners alike understand what machine learning is and which libraries support it.
MLlib: Apache Spark's scalable machine learning library, consisting of popular algorithms and utilities
Observations: The items or data points used for learning and evaluation
Features: The characteristics or attributes of an observation
Labels: The values assigned to an observation
Training or test data: The observations used to train a learning algorithm (training data) or to evaluate it (test data)
Data Source: MLlib can read data from HDFS and HBase, which lets it plug into Hadoop workflows
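The terms above can be illustrated with a small plain-Python sketch. It does not use Spark: the Observation class and the train_test_split helper are hypothetical names made up for this example, and in PySpark a class such as LabeledPoint plays a similar role of pairing a label with its features.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    """One data point: a label plus its feature values (hypothetical class)."""
    label: float
    features: List[float]


def train_test_split(data: List[Observation], test_fraction: float = 0.25,
                     seed: int = 42) -> Tuple[List[Observation], List[Observation]]:
    """Shuffle a copy of the data and split it into (training, test) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Eight observations, each with two features and a 0/1 label.
observations = [
    Observation(label=float(i % 2), features=[float(i), float(i * i)])
    for i in range(8)
]
training, test = train_test_split(observations)
print(len(training), "training observations,", len(test), "test observations")
```

The training list is what a learning algorithm would be fitted on, and the test list is held back to evaluate it.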
MLlib Packages:
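MLlib contains two packages:
 mllib: The original API, built on RDDs
 ml: The newer API, built on DataFrames and used for constructing pipelines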
To use MLlib, import the classes you need:

 In Scala:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

 In Java:
import org.apache.spark.mllib.linalg.Vector;

 In Python:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
Spark MLlib Tools:
 ML Algorithms: Common learning algorithms such as classification, regression, clustering and collaborative filtering. These algorithms form the core of MLlib
 Featurization: Feature extraction, transformation, dimensionality reduction and selection
 Pipelines: Tools for constructing, evaluating and tuning ML pipelines
 Persistence: Saving and loading algorithms, models and pipelines
 Utilities: Linear algebra, statistics and data handling
MLlib algorithms:
MLlib includes the following popular algorithms and utilities:
 Statistics: This covers the most basic machine learning techniques, such as:
 Summary statistics
 Correlation
 Stratified sampling
 Hypothesis testing
 Regression: A statistical approach to estimating the relationships among variables. It is widely used for prediction and forecasting
 Classification: Identifies which of a set of categories a new observation belongs to
 K-means: Available in MLlib from Java, among other languages. It assigns every observation, experiment or vector to one of k clusters. Strictly speaking this is clustering rather than classification, since no labels are used
 Recommendation system: A subclass of information filtering systems that seeks to predict the rating or preference a person would give to an item. This can be done in two ways:
 Collaborative filtering: Builds a model from a user's past behavior as well as similar decisions made by other users. The model is then used to predict items the user might be interested in
 Content-based filtering: Uses a series of discrete characteristics of an item to recommend more items with similar properties
 Clustering (K-means): The task of grouping a set of objects so that the objects in the same group are more similar to each other than to the objects in other groups
 Dimensionality Reduction: The process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into two types:
 Feature selection: Finds a subset of the original variables (also called attributes)
 Feature extraction: Transforms the data from a high-dimensional space to a space of fewer dimensions
 Feature extraction: Starts from an initial set of measured data and builds derived values (features)
 Optimization: The selection of the best element from a set of available alternatives
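To make the clustering idea from the list above concrete, here is a minimal plain-Python sketch of the basic K-means loop (Lloyd's algorithm). It is not Spark's implementation, which differs, for example, in how it picks initial centers and distributes the work. The function names and the choice of the first k points as starting centers are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center nearest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty group of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points: Sequence[Point], k: int,
           iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Basic K-means; the first k points serve as initial centers (assumption)."""
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest center.
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        # A center whose group is empty stays where it is.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    assignments = [closest(p, centers) for p in points]
    return centers, assignments


points = [(0.0, 0.0), (9.0, 9.0), (0.5, 0.0), (9.5, 9.0), (0.0, 0.5), (9.0, 9.5)]
centers, assignments = kmeans(points, k=2)
print(centers)
print(assignments)
```

Points near the origin end up in one cluster and points near (9, 9) in the other, which is the "more similar to each other than to other groups" property described above.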
MLlib components
Main concepts in Pipeline:
MLlib standardizes its APIs so that multiple algorithms can easily be combined into a single pipeline or workflow
 DataFrame: The ML API uses the DataFrame from Spark SQL as its dataset, which can hold a variety of data types
 Transformer: An algorithm that transforms one DataFrame into another DataFrame. Examples are:
 Hashing Term Frequency (HashingTF): Maps terms to fixed-length vectors of term frequencies, that is, how often each word occurs
 Logistic Regression Model: The model that results from fitting logistic regression to a dataset
 Binarizer: Converts numerical values to 1 or 0 according to a given threshold
 Estimator: An algorithm that can be fit on a DataFrame to produce a Transformer. Examples are:
 Logistic Regression: Processes the DataFrame to determine the weights of the resulting Logistic Regression Model
 StandardScaler: Computes the standard deviation of each feature and produces a model that rescales the features accordingly
 Pipeline: Calling fit on a Pipeline produces a PipelineModel, and the PipelineModel contains only Transformers, not Estimators
 Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow
 Parameters: Transformers and Estimators share a common API for specifying parameters
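The relationship between Transformers, Estimators and Pipelines can be sketched in plain Python. This is only an illustration of the pattern, not the Spark API: a list of dictionaries stands in for a DataFrame, and the classes below (Binarizer, StdScaler, ScalerModel, Pipeline, PipelineModel) are simplified, hypothetical versions written for this example.

```python
from math import sqrt
from typing import Dict, List

Rows = List[Dict[str, float]]  # hypothetical stand-in for a DataFrame


class Binarizer:
    """Transformer: adds a 0/1 column by thresholding an input column."""

    def __init__(self, input_col: str, output_col: str, threshold: float):
        self.input_col, self.output_col = input_col, output_col
        self.threshold = threshold

    def transform(self, rows: Rows) -> Rows:
        return [{**r, self.output_col: 1.0 if r[self.input_col] > self.threshold else 0.0}
                for r in rows]


class ScalerModel:
    """Transformer produced by fitting StdScaler: divides by the learned std."""

    def __init__(self, input_col: str, output_col: str, std: float):
        self.input_col, self.output_col, self.std = input_col, output_col, std

    def transform(self, rows: Rows) -> Rows:
        return [{**r, self.output_col: r[self.input_col] / self.std} for r in rows]


class StdScaler:
    """Estimator: learns a column's standard deviation and returns a ScalerModel."""

    def __init__(self, input_col: str, output_col: str):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows: Rows) -> ScalerModel:
        values = [r[self.input_col] for r in rows]
        mean = sum(values) / len(values)
        # Sample standard deviation (n - 1 denominator); an assumption of this sketch.
        std = sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
        return ScalerModel(self.input_col, self.output_col, std)


class PipelineModel:
    """Fitted pipeline: holds only transformers and applies them in order."""

    def __init__(self, transformers: list):
        self.transformers = transformers

    def transform(self, rows: Rows) -> Rows:
        for t in self.transformers:
            rows = t.transform(rows)
        return rows


class Pipeline:
    """Chains estimators and transformers; fit() returns a PipelineModel."""

    def __init__(self, stages: list):
        self.stages = stages

    def fit(self, rows: Rows) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            # An estimator is fitted first; a transformer is used as it is.
            transformer = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = transformer.transform(rows)
            fitted.append(transformer)
        return PipelineModel(fitted)


rows = [{"x": 2.0}, {"x": 4.0}, {"x": 6.0}]
pipeline = Pipeline([StdScaler("x", "scaled"), Binarizer("scaled", "flag", threshold=1.5)])
model = pipeline.fit(rows)
result = model.transform(rows)
print(result)
```

Note how fitting replaces the StdScaler estimator with a ScalerModel, so the fitted model contains transformers only, which mirrors the Pipeline and PipelineModel distinction described above.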
MLlib work process
With this, we come to the end of the MLlib cheat sheet.