Introduction to Spark Components
The following sections give a clear picture of the different Spark components.
Apache Spark Core
Spark Core is the general execution engine for the Spark platform; all other Spark functionality is built on top of it. It provides in-memory computing and the ability to reference datasets stored in external storage systems.
Spark allows developers to write code quickly with the help of a rich set of operators. A computation that takes many lines of code in other frameworks takes only a few lines in Spark. The following Scala word count will help you understand the way programming is done with Spark:
val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
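For comparison, here is what those transformations compute, sketched in plain Python without Spark. The sample lines are made up for illustration:

```python
from collections import Counter

# Sample input standing in for lines read from a file (made up for illustration).
lines = ["to be or not to be", "to be is to do"]

# flatMap(line => line.split(" ")): split every line into words and flatten.
words = [word for line in lines for word in line.split(" ")]

# map(word => (word, 1)) followed by reduceByKey(_ + _): sum counts per word.
counts = Counter(words)

print(counts["to"])   # "to" appears four times across both lines
```

The difference in Spark is that the lines and the intermediate pairs are distributed across a cluster, and `reduceByKey` shuffles pairs so that all counts for a word land on the same node.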
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for both structured and semi-structured data.
Below is an example of a Hive compatible query:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
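Running the query above requires a live SparkContext and a Hive metastore, but the shape of the create/load/select flow can be illustrated with Python's built-in sqlite3 module. This is a stand-in for the Hive table, not Spark SQL itself, and the sample rows are made up:

```python
import sqlite3

# An in-memory table standing in for the Hive table `src` (rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS src (key INT, value TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, "one"), (2, "two")])

# Equivalent of: sqlContext.sql("FROM src SELECT key, value").collect()
for row in conn.execute("SELECT key, value FROM src ORDER BY key"):
    print(row)
```

In Spark SQL the same query would be planned and executed in parallel across the cluster instead of against a single local database file.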
Spark Streaming
This Spark component allows Spark to process real-time streaming data. It provides an API for manipulating data streams that closely matches the RDD API, which makes it easy for programmers to move between applications that process stored data and those that process data in real time. Like Spark Core, Spark Streaming strives to make the system fault-tolerant and scalable.
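Spark Streaming works by dividing the input stream into micro-batches and running RDD-style operations on each one, optionally keeping running state across batches. A toy simulation of that model in plain Python, with made-up batches, shows the idea:

```python
from collections import Counter

# Two micro-batches standing in for a live stream (made up for illustration).
batches = [["error warn error"], ["warn warn info"]]

running_totals = Counter()
for batch in batches:
    # Per batch: flatMap the lines into words, then count per key
    # (the reduceByKey step of a streaming word count).
    batch_counts = Counter(word for line in batch for word in line.split(" "))
    # Keep a running total across batches, in the spirit of updateStateByKey.
    running_totals.update(batch_counts)

print(running_totals["warn"])   # 1 from the first batch + 2 from the second = 3
```

In real Spark Streaming each batch is an RDD processed in parallel, and the running state is itself distributed and fault-tolerant.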
RDD API Examples
In this example, we use a few transformations to build a dataset of (String, Int) pairs called counts in Python and then save it to a file.
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
MLlib (Machine Learning Library)
Apache Spark is equipped with a rich machine learning library known as MLlib. This library contains a wide array of machine learning algorithms: classification, clustering, collaborative filtering, and so on. It also includes a few lower-level primitives. All of these functionalities help Spark scale out across a cluster.
Prediction with Logistic Regression
In this example, we take a dataset of labels and feature vectors and learn to predict the labels from the feature vectors using the logistic regression algorithm in Python:
from pyspark.ml.classification import LogisticRegression
# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
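Under the hood, fitting a logistic regression model means learning weights for the sigmoid function. A minimal gradient-descent sketch in plain Python, on a tiny made-up one-feature dataset, shows the idea (this is not MLlib's actual solver, which uses more sophisticated optimizers):

```python
import math

# Tiny made-up dataset: (feature, label) pairs, separable around x = 0.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b, step = 0.0, 0.0, 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient descent on the logistic loss:
# the gradient for one point is (prediction - label) * feature.
for _ in range(1000):
    for x, y in data:
        pred = sigmoid(w * x + b)
        w -= step * (pred - y) * x
        b -= step * (pred - y)

# Predict each point's label, as model.transform(df) would.
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x, _ in data]
print(predictions)   # matches the labels: [0, 0, 1, 1]
```

MLlib performs the same kind of iterative optimization, but distributes the gradient computation over the partitions of the DataFrame.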
GraphX
Spark also comes with GraphX, a library for manipulating graphs and performing graph computations. Just like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, in this case with a directed graph whose vertices and edges carry properties. It also contains numerous operators for manipulating graphs, along with common graph algorithms.
For example, to model users and products as a bipartite graph, we might define:
class VertexProperty()
case class UserProperty(val name: String) extends VertexProperty
case class ProductProperty(val name: String, val price: Double) extends VertexProperty
// The graph might then have the type:
var graph: Graph[VertexProperty, String] = null
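The same vertex-and-edge-property idea can be sketched in plain Python with dataclasses. This is an illustration of the data model only, not GraphX, and the user, product, and edge label are made up:

```python
from dataclasses import dataclass

# Vertex property types mirroring the Scala case classes (illustration only).
@dataclass
class UserProperty:
    name: str

@dataclass
class ProductProperty:
    name: str
    price: float

# A tiny bipartite graph: vertices keyed by id, edges as (src, dst, label) triples.
vertices = {1: UserProperty("alice"), 2: ProductProperty("widget", 9.99)}
edges = [(1, 2, "purchased")]

src, dst, label = edges[0]
summary = f"{vertices[src].name} {label} {vertices[dst].name}"
print(summary)   # alice purchased widget
```

In GraphX the vertex and edge collections are themselves RDDs, so operators over them run in parallel across the cluster.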