
Programming with RDD in Spark


Resilient Distributed Datasets (RDDs)

RDDs are the main logical data units in Spark. They are a distributed collection of objects, which are stored in memory or on disks of different machines of a cluster. A single RDD can be divided into multiple logical partitions so that these partitions can be stored and processed on different machines of a cluster.

RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, such as transformations, on an existing RDD.


An RDD in Spark can be cached and used again for future transformations, which is a huge benefit for users. RDDs are said to be lazily evaluated, i.e., they delay the evaluation until it is really needed. This saves a lot of time and improves efficiency.


Features of an RDD in Spark

Here are some features of RDD in Spark:

  • Resilience: RDDs track data lineage information so that lost data can be recovered automatically on failure. This property is also known as fault tolerance.
  • Distributed: The data in an RDD resides on multiple nodes; it is distributed across different nodes of a cluster.
  • Lazy evaluation: Data does not get loaded into an RDD even when you define it. Transformations are actually computed only when you call an action, such as count() or collect(), or save the output to a file system.


  • Immutability: Data stored in an RDD is read-only; you cannot edit the data present in an RDD. However, you can create new RDDs by performing transformations on the existing RDDs.
  • In-memory computation: An RDD stores intermediate data in memory (RAM) rather than on disk, which provides faster access.
  • Partitioning: Any existing RDD can be divided into logical partitions, and you can create new RDDs with a different set of partitions by applying transformations to the existing ones.

Differentiation: RDD vs Datasets vs DataFrame

  • Inception year: RDDs came into existence in 2011, DataFrames were introduced in 2015 (Spark 1.3), and Datasets followed in 2016 (Spark 1.6).
  • Meaning: An RDD is a distributed collection of data elements without any schema. A DataFrame is a distributed collection in which the data elements are organized into named columns. A Dataset is essentially an extension of the DataFrame API that adds compile-time type safety.
  • Optimization: With RDDs, developers need to write optimization code manually. DataFrames and Datasets both use the Catalyst optimizer.
  • Defining the schema: In RDDs, the schema needs to be defined manually; in DataFrames and Datasets, the schema is inferred automatically.


Operations on RDDs

There are two basic operations that can be performed on RDDs: transformations and actions.


Transformations

These are functions that accept the existing RDDs as input and output one or more RDDs. However, the data in the existing RDD in Spark does not change as it is immutable. Some of the transformation operations are provided in the table below:

Function Description
map() Returns a new RDD by applying the function to each data element
filter() Returns a new RDD formed by selecting those elements of the source on which the function returns true
reduceByKey() Aggregates the values of each key using the given function
groupByKey() Converts a (key, value) pair into a (key, <iterable of values>) pair
union() Returns a new RDD that contains all the elements of the source RDD and the argument RDD
intersection() Returns a new RDD that contains only the elements common to both RDDs

Actions

Actions in Spark are functions that return the end result of RDD computations. Spark uses a lineage graph to load data into the RDD in a particular order. After all the transformations are done, actions return the final result to the Spark driver. Actions are operations that produce non-RDD values. Some of the common actions used in Spark are given below:

Function Description
count() Gets the number of data elements in an RDD
collect() Gets all the data elements in an RDD as an array
reduce() Aggregates the data elements of an RDD by taking two arguments and returning one
take(n) Fetches the first n elements of an RDD
foreach(operation) Executes the operation for each data element in an RDD
first() Retrieves the first data element of an RDD

Creating an RDD

An RDD can be created in three ways. Let’s discuss them one by one.

By Loading an External Dataset

You can load an external file into an RDD. The types of files you can load include CSV, TXT, JSON, etc. Here is an example of loading a text file into an RDD:

Loading an External Dataset

By Parallelizing the Collection of Objects

When Spark’s parallelize method is applied to a group of elements, a new distributed dataset is created. This dataset is an RDD.

Below, you can see how to create an RDD by applying the parallelize method to a collection that consists of six elements:

Parallelizing the Collection of Objects

By Performing Transformations on the Existing RDDs

One or more new RDDs can be created by performing transformations on existing RDDs, as mentioned earlier in this tutorial. The figure below shows how the map() function can be used to create an RDD:

Performing Transformations on the Existing RDDs

However, the data inside RDDs is not always organized or structured, since it is collected from different sources.

In a further section of this Apache Spark tutorial, you will learn about Spark SQL, which organizes data into rows and columns. You will also learn about Spark SQL libraries that provide APIs to connect to Spark SQL through JDBC/ODBC connections and perform queries (table operations) on structured data, which is not possible with an RDD in Spark. Stay tuned!

Advantages of RDD

There are multiple advantages of using RDDs in Spark. A few of the important ones are covered below:

  • RDDs help increase the execution speed of Spark.
  • RDDs are the basic unit of parallelism, and their immutability helps in achieving consistency of data.
  • RDDs let transformations be defined separately from the actions that save or return results.
  • They are persistent, as they can be cached and reused repeatedly.

Limitations of RDD

  • There is no built-in input optimization available for RDDs; developers must optimize their code manually.
  • One of the biggest limitations of RDDs is that execution does not start instantly: because RDDs are evaluated lazily, nothing is computed until an action is called.
  • No changes can be made to an RDD once it is created.
  • RDDs can run short of storage memory when the data does not fit in RAM.
  • Run-time type safety is absent in RDDs.


About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big Data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.