I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark so this question is based around theory and may not be 100% correct.
Let's say I create an RDD: val rdd = sc.textFile(file)
Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with rdd.size as 10 or 100? Because, overall, the RDD is size 100 but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation?