Introduction to Spark Components
Apache Spark is made up of several components: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. Each of them is described below.
Apache Spark Core
Apache Spark Core is the general execution engine of the Spark platform, on top of which all other functionality is built. It provides in-memory computing and can reference datasets stored in external storage systems.
Spark lets developers write code quickly with the help of a rich set of operators. A task that takes many lines of code in other frameworks can often be expressed in just a few lines of Spark Scala. The following word-count program shows the way programming is done with Spark:
sparkContext.textFile("hdfs://...")          // read input lines from HDFS
  .flatMap(line => line.split(" "))          // split each line into words
  .map(word => (word, 1))                    // pair each word with a count of 1
  .reduceByKey(_ + _)                        // sum the counts per word
  .saveAsTextFile("hdfs://...")              // write the word counts back to HDFS
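The in-memory computing mentioned above becomes visible when a dataset is cached and reused across actions. The following is a minimal sketch; the HDFS path is a placeholder and the variable names are illustrative:

val lines = sparkContext.textFile("hdfs://...")             // read input lines from HDFS
val words = lines.flatMap(line => line.split(" ")).cache()  // keep the words in memory
words.count()                                               // first action computes and caches the dataset
words.distinct().count()                                    // later actions reuse the cached data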
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (later renamed DataFrame), which provides support for both structured and semi-structured data.
Below is an example of a Hive-compatible query:
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
Spark Streaming
Spark Streaming is the Apache Spark component that allows Spark to process real-time streaming data. It provides an API for manipulating data streams that closely matches the RDD API, so programmers can move between batch and streaming applications that manipulate data and produce results in real time. Like Spark Core, Spark Streaming strives to make the system fault-tolerant and scalable.
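For illustration, a minimal Spark Streaming word count over a network socket might look like the following sketch; the host, port, and batch interval are placeholder values, and sc is an existing SparkContext:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))      // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // DStream of incoming text lines
val counts = lines.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                // same operators as the RDD API
counts.print()                                       // print each batch's word counts
ssc.start()                                          // start receiving and processing
ssc.awaitTermination()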
RDD API Example:
In this example, a few transformations are applied to build a dataset of (String, Int) pairs called counts, which is then saved to a file.
text_file = sc.textFile("hdfs://...")
counts = (text_file.flatMap(lambda line: line.split(" "))  # split each line into words
          .map(lambda word: (word, 1))                     # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))                # sum the counts per word
counts.saveAsTextFile("hdfs://...")
MLlib (Machine Learning Library)
Apache Spark is equipped with a rich machine learning library known as MLlib. It contains a wide array of machine learning algorithms for classification, clustering, and collaborative filtering, along with a few lower-level primitives. All of these functionalities help Spark scale out across a cluster.
Prediction with Logistic Regression
In this example, the dataset consists of labels and feature vectors. Using logistic regression in Python, you will learn to predict the labels from the feature vectors:
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector; "data" is assumed to be a list
# of (label, feature-vector) rows defined elsewhere.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, the number of iterations is limited to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
GraphX
Spark also comes with a library for manipulating graphs and performing graph-parallel computation, called GraphX. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, in this case with a directed graph whose vertices and edges carry user-defined properties. It also provides numerous operators for manipulating graphs, along with a collection of graph algorithms.
Consider the following example, which models users and products as a bipartite graph:
class VertexProperty()
case class UserProperty(val name: String) extends VertexProperty
case class ProductProperty(val name: String, val price: Double) extends VertexProperty
// The graph might then have the type:
var graph: Graph[VertexProperty, String] = null
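Building on the property classes and the graph variable declared above, a tiny graph can be constructed and queried as follows. This is a minimal sketch with made-up vertices and edges, assuming sc is an existing SparkContext:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Illustrative vertices: one user and one product, keyed by vertex ID.
val vertices: RDD[(VertexId, VertexProperty)] = sc.parallelize(Seq(
  (1L, UserProperty("alice")),
  (2L, ProductProperty("book", 9.99))
))
// Illustrative edge: the string attribute describes the relationship.
val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "bought")))
graph = Graph(vertices, edges)       // fill in the graph declared above
println(graph.vertices.count())      // number of vertices
println(graph.edges.count())         // number of edges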