+1 vote
in Big Data Hadoop & Spark by (1.1k points)

Definition says:

RDD is immutable distributed collection of objects

I don't quite understand what it means. Is it like data (partitioned objects) stored on a hard disk? If so, how can RDDs contain user-defined classes (such as Java, Scala, or Python classes)?

From this link, it mentions:

Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program

I am really confused about RDDs in general and how they relate to Spark and Hadoop.

Can someone please help?

2 Answers

0 votes
by (13.2k points)

RDD, i.e. Resilient Distributed Dataset, is an immutable distributed collection of objects. It is the fundamental data structure of Spark. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster. RDDs can contain objects of any Python, Java, or Scala type, including user-defined classes.

An RDD is a read-only, partitioned collection of records. It is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs −

1. Parallelizing an existing collection in your driver program, or

2. Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Spark is built around the RDD abstraction. Because RDDs can be cached in memory and operated on in parallel, Spark can run MapReduce-style workloads faster and more efficiently than disk-based MapReduce.

+1 vote
by (33.1k points)

An RDD is a logical reference to a dataset that is partitioned across many server machines in the cluster. RDDs are immutable and recover automatically in case of failure.

A dataset consists of the data loaded externally by the user. It could be a JSON file, a CSV file, or a text file with no specific data structure.

Hope this answer helps you!

