Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark so this question is based around theory and may not be 100% correct.

Let's say I create an RDD: val rdd = sc.textFile(file)

Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?

Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example: => x / rdd.size)

Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with rdd.size as 10 or 100? Because, overall, the RDD is size 100 but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation?

1 Answer

0 votes
by (32.3k points)

The file or parts of the file(in case the file is too big) is usually replicated to N number of nodes in the cluster (by default on HDFS N=3). By default it is not intended to split every file between all available nodes.

However, for a client working with file using Spark should be transparent - you should not see any difference in rdd.size, no matter on how many nodes it's split and/or replicated. There are methods (at least, in Hadoop) to find out on which nodes (parts of the) file can be located at the moment. However, in simple cases you most probably won't need to use this functionality.

For reference I would strongly suggest you to read this article on RDD-internals:

Browse Categories