Can anyone tell me the methods to create an RDD in Spark?

1 Answer


In Spark, an RDD can be created in three ways: by parallelizing an existing collection, by referencing an external dataset, or by transforming an existing RDD into a new one.

Here is an example of how to create an RDD using the parallelize() method in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()

words = spark.sparkContext.parallelize(["Spark", "is", "easy", "and", "awesome"])

count_words = words.count()

print("Number of elements in RDD:", count_words)

Here is an example of loading external datasets to create RDDs (Scala):

val dataRDD = spark.read.csv("path_of_csv/file").rdd // For a CSV file

val dataRDD = spark.read.json("path_of_json/file").rdd // For a JSON file

val dataRDD = spark.read.textFile("path_of_text/file").rdd // For a text file
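As a quick sanity check after loading (the paths above are placeholders, so point them at a real file first), you can pull a few records back to the driver and count the total:

dataRDD.take(5).foreach(println) // print the first five records on the driver

println(s"Total records: ${dataRDD.count()}") // total number of records in the RDD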

Here is an example of creating a new RDD from an existing RDD:

val rdd1 = spark.sparkContext.parallelize(Seq("Spark", "is", "easy", "and", "awesome"))

val rdd_new = rdd1.map(w => (w.charAt(0), w))

rdd_new.foreach(println)
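Since map pairs each word with its first character, this prints key-value tuples such as (S,Spark), (i,is), (e,easy), (a,and), (a,awesome); the order may vary because foreach runs on the partitions in parallel.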

If you are interested in learning Spark, I recommend this Spark Certification by Intellipaat.

