Introduction to Spark Components

The following sections give a clear picture of the different Spark components.

[Figure: Components of Spark]

Apache Spark Core

Spark Core is the general execution engine for the Spark platform; all of the other functionality is built on top of it. It provides in-memory computing and the ability to reference datasets stored in external storage systems.
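For instance, the sketch below (with sc as an existing SparkContext and a placeholder HDFS path) shows a dataset referenced from external storage being cached for in-memory reuse:

// sc is an existing SparkContext; the HDFS path is a placeholder.
val data = sc.textFile("hdfs://...")       // reference a dataset in external storage
data.cache()                               // keep it in memory across actions
println(data.count())                      // the first action reads the data and caches it
println(data.filter(_.nonEmpty).count())   // later actions reuse the in-memory data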


Spark allows developers to write code quickly with the help of a rich set of high-level operators. A computation that takes many lines of code in other frameworks can often be written in just a few lines of Spark Scala. The following word-count program will help you understand the way programming is done with Spark:

sparkContext.textFile("hdfs://...")
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts for each word
  .saveAsTextFile("hdfs://...")

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for both structured and semi-structured data.


Below is an example of a Hive-compatible query:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL.
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
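Spark SQL can also load semi-structured data directly. Here is a minimal sketch, assuming the same sqlContext created above; the JSON path is a placeholder:

// Infer a schema from JSON records (semi-structured data); the path is a placeholder.
val people = sqlContext.read.json("examples/src/main/resources/people.json")
people.printSchema()                  // inspect the inferred schema
people.registerTempTable("people")    // expose it to SQL queries
sqlContext.sql("SELECT * FROM people").show()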

Spark Streaming

This Spark component allows Spark to process real-time streaming data. It provides an API for manipulating data streams that matches the RDD API, which makes it easy for programmers to move between applications that process stored data and applications that process live data in real time. Similar to Spark Core, Spark Streaming strives to make the system fault-tolerant and scalable.
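To illustrate, below is a minimal sketch of a streaming word count; the socket source, host, and port are illustrative assumptions, and sc is an existing SparkContext:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a streaming context with 1-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(1))

// Receive lines of text from a socket source (host and port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)

// The familiar RDD-style operators apply to each micro-batch.
val wordCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.print()      // print a sample of the counts for each batch
ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // run until stopped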


RDD API Examples

In this example, we use a few transformations to build a dataset of (String, Int) pairs called counts and then save it to a file.

text_file = sc.textFile("hdfs://...")

counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs://...")

MLlib (Machine Learning Library)

Apache Spark is equipped with a rich machine learning library known as MLlib. This library contains a wide array of machine learning algorithms for classification, clustering, collaborative filtering, and more. It also includes a few lower-level primitives. All of these functionalities help Spark scale out across a cluster.

Prediction with Logistic Regression

In this example, we take a dataset of labels and feature vectors and learn to predict the labels from the feature vectors using the logistic regression algorithm, in Python:

from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)

# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()


GraphX

Spark also comes with a library for manipulating graphs and performing graph-parallel computation, called GraphX. Just like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, in this case with a directed graph whose vertices and edges carry user-defined properties. It also contains numerous operators for manipulating graphs, along with a collection of common graph algorithms.

Consider the following example: to model users and products as a bipartite graph, we might define the vertex properties as follows:

import org.apache.spark.graphx.Graph

class VertexProperty()
case class UserProperty(val name: String) extends VertexProperty
case class ProductProperty(val name: String, val price: Double) extends VertexProperty

// The graph might then have the type:
var graph: Graph[VertexProperty, String] = null
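To show the operators and algorithms in action, here is a minimal sketch that builds a tiny graph and runs PageRank; the vertex data, edges, and tolerance value are illustrative assumptions, and sc is an existing SparkContext:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Tiny illustrative dataset: two users and one "follows" relationship.
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows")))

val graph = Graph(users, relationships)

// Run PageRank, one of GraphX's built-in graph algorithms,
// until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach(println)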

