
Resilient Distributed Datasets (RDDs)

RDDs are the main logical data unit in Spark. They are a distributed collection of objects, which are stored in memory or on disks of different machines of a cluster. A single RDD can be divided into multiple logical partitions so that these partitions can be stored and processed on different machines of a cluster.

RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, like transformations, on an existing RDD.


RDDs in Spark can be cached and reused for future transformations, which is a huge benefit for users. They are lazily evaluated, i.e., evaluation is delayed until the result is really needed. This saves time and improves efficiency.



Features of RDD in Spark

  • Resilient: RDDs track lineage information to rebuild lost data automatically on failure. This property is also called fault tolerance.
  • Distributed: Data in an RDD resides on multiple nodes; it is distributed across the different nodes of a cluster.
  • Lazy Evaluation: Data is not loaded into the RDD when it is defined. Transformations are computed only when you call an action, such as count() or collect(), or save the output to a file system.


  • Immutability: Data stored in an RDD is read-only; you cannot modify the data present in the RDD. However, you can create new RDDs by performing transformations on the existing RDDs.
  • In-memory Computation: RDDs keep intermediate data in memory (RAM) rather than on disk, which provides faster access.
  • Partitioning: An RDD is divided into logical partitions that can be processed in parallel on different nodes; applying transformations on existing RDDs produces new RDDs with their own partitions.

Operations on RDDs


There are two basic operations which can be done on RDDs. They are:

  • Transformations
  • Actions


Transformations: These are functions that accept existing RDDs as input and output one or more new RDDs. The data in the existing RDD does not change, as RDDs are immutable. Some of the transformation operations are listed below:

  • map(): Returns a new RDD by applying the function to each data element
  • filter(): Returns a new RDD formed by selecting those elements of the source on which the function returns true
  • reduceByKey(): Aggregates the values of a key using a function
  • groupByKey(): Converts a (key, value) pair into a (key, <iterable of values>) pair
  • union(): Returns a new RDD that contains all elements of the source RDD and of the argument RDD
  • intersection(): Returns a new RDD that contains only the elements common to both datasets

These transformations are not executed immediately; Spark records them in the lineage and computes them only when an action is called. Every time a transformation is applied, a new RDD is created.


Actions: Actions in Spark are functions that return the end result of RDD computations; they produce non-RDD values. When an action is called, Spark uses the lineage graph to load and transform the data in the required order and returns the final result to the Spark driver. Some of the common actions used in Spark are:

  • count(): Gets the number of data elements in the RDD
  • collect(): Gets all data elements in the RDD as an array
  • reduce(): Aggregates the data elements of the RDD by taking two arguments and returning one
  • take(n): Fetches the first n elements of the RDD
  • foreach(operation): Executes the operation on each data element of the RDD
  • first(): Retrieves the first data element of the RDD

Creating an RDD

An RDD can be created in three ways:

  • By loading an external dataset

You can load an external file into an RDD. The types of files you can load include CSV, TXT, JSON, etc. Here is an example of loading a text file into an RDD.


  • By parallelizing the collection of objects

When Spark’s parallelize method is applied on a group of elements, a new distributed dataset is created. This is called an RDD.


Here, we are creating an RDD by applying the parallelize method on a collection which consists of six elements.


  • By performing transformations on existing RDDs

One or more RDDs can be created by performing transformations on the existing RDDs. For example, the map() function can be applied to an existing RDD to produce a new one.


The data inside RDDs is not always organized or structured, since it is ingested from various sources. So, in the coming sections, we will talk about Spark SQL, which organizes the data into rows and columns. Moreover, the Spark SQL libraries provide APIs to connect to Spark SQL through JDBC/ODBC connections and perform queries (table operations) on the structured data, which is not possible with an RDD in Spark.

