Can anyone tell me the need for RDD in Spark?

1 Answer

RDD (Resilient Distributed Dataset) is the basic data structure in Spark. It lets Spark run MapReduce-style operations faster and more efficiently.

Data sharing in MapReduce takes a lot of time because of replication, serialization, and disk I/O. Hadoop applications spend over 90 percent of their time on read-write operations. To address this, researchers came up with the RDD concept, which supports in-memory computation: intermediate results are kept in memory and shared between operations instead of being written to disk each time. Sharing data in memory is 10 to 100 times faster than sharing it over the network or disk.
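To make the idea concrete, here is a rough sketch in plain Python. It is not Spark's API: `ToyRDD`, `drop_cache`, and `compute_count` are hypothetical names made up for this illustration, and everything runs in one process with no partitions or cluster. It only shows the two properties the answer relies on, assuming a dataset that remembers how it was built: results are kept in memory after the first computation, and a lost in-memory copy can be rebuilt from that recipe.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark class)."""

    def __init__(self, compute: Callable[[], List[Any]]):
        self._compute = compute            # recipe (lineage) to rebuild the data
        self._cached: Optional[List[Any]] = None
        self._cache_enabled = False
        self.compute_count = 0             # how often the recipe actually ran

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Lazy transformation: nothing runs until collect() is called.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def cache(self) -> "ToyRDD":
        # Ask for the result to be kept in memory after the first computation.
        self._cache_enabled = True
        return self

    def collect(self) -> List[Any]:
        if self._cached is not None:
            return self._cached            # served from memory, no recomputation
        self.compute_count += 1
        data = self._compute()
        if self._cache_enabled:
            self._cached = data
        return data

    def drop_cache(self) -> None:
        # Simulate losing the in-memory copy (for example, a failed node).
        self._cached = None


source = ToyRDD(lambda: list(range(5)))
squares = source.map(lambda x: x * x).cache()

first = squares.collect()    # computed from the source, then kept in memory
second = squares.collect()   # reused from memory
squares.drop_cache()
third = squares.collect()    # rebuilt from the recipe after the loss
```

Here `squares.compute_count` ends at 2: once for the first `collect()` and once after the cache was dropped, while the second `collect()` is answered from memory. In real Spark the corresponding calls would be along the lines of `rdd.map(...)`, `rdd.cache()`, and `rdd.collect()`, with the data split across the machines of a cluster.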

