+1 vote
in Big Data Hadoop & Spark by (1.1k points)

Definition says:

RDD is immutable distributed collection of objects

I don't quite understand what it means. Is it like data (partitioned objects) stored on a hard disk? If so, how can RDDs contain user-defined classes (such as Java, Scala, or Python classes)?

From this link, it mentions:

Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program

I am really confused about RDDs in general and how they relate to Spark and Hadoop.

Can someone please help?

2 Answers

0 votes
by (13.2k points)

RDD, i.e. Resilient Distributed Dataset, is an immutable distributed collection of objects. It is the fundamental data structure of Spark. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster. RDDs can contain objects of any Python, Java, or Scala type, including user-defined classes.

An RDD is a read-only, partitioned collection of records. It is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs −

1. Parallelizing an existing collection in your driver program, or

2. Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Spark is built around the RDD abstraction. Because RDDs can be cached in memory and operated on in parallel, Spark can run MapReduce-style workloads faster and more efficiently than disk-based MapReduce.

+1 vote
by (33.1k points)

An RDD is a logical reference to a dataset that is partitioned across many server machines in the cluster. RDDs are immutable and recover automatically in case of failure.

A dataset consists of the data loaded externally by the user. It could be a JSON file, a CSV file, or a text file with no specific data structure.

Hope this answer helps you!

