0 votes
in Big Data Hadoop & Spark by (11.4k points)

I'm using Spark with Java, and I have an RDD of 5 million rows. Is there a way to calculate the number of rows in my RDD? I've tried RDD.count(), but it takes a lot of time. I've seen that I can use the fold function, but I couldn't find Java documentation for it. Could you please show me how to use it, or show me another way to get the number of rows in my RDD?

Here is my code:

JavaPairRDD<String, String> lines = getAllCustomers(sc).cache();
JavaPairRDD<String, String> CFIDNotNull = lines.filter(notNull()).cache();
JavaPairRDD<String, Tuple2<String, String>> join = lines.join(CFIDNotNull).cache();

// I want to get the counts of these three RDDs
double count_ctid = (double) join.count();
double all = (double) lines.count();
double count_cfid = all - CFIDNotNull.count();
System.out.println("********** :" + count_cfid * 100 / all + "% and now : " + count_ctid * 100 / all + "%");

1 Answer

0 votes
by (32.3k points)

In my experience, rdd.count() is the best way to count the rows (elements) of an RDD. There is no faster way to get an exact count.

Since count() is the best option available, the real question is why count() is so slow in your case.

So, let me break it down for you.

rdd.count() is an "action": it is an eager operation, because it has to return an actual number. The RDD operations you performed before count() were "transformations": they lazily turned one RDD into another. In effect, the transformations were not actually executed; they were only queued up. So when you call count(), you force all the previous lazy operations to run. The input files have to be loaded, the map()s and filter()s executed, the shuffles performed, and so on, until the data finally exists and Spark can say how many rows it has.

Keep in mind that if you call count() twice, all of this happens again. After the count is returned, all the data is discarded! To avoid this, call cache() on the RDD, so that the first action stores the computed partitions and later actions reuse them.
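The recompute-versus-cache behaviour described above can be illustrated without a cluster. The following is only an analogy in plain Java using java.util.stream, not Spark code: the class LazyCountDemo and its methods are hypothetical names invented for this sketch, a Supplier of a stream stands in for an uncached RDD lineage, and collecting into a List stands in for cache().

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Plain-Java analogy (NOT Spark): lazy pipelines run only when a terminal
// operation is called, and run again on every call unless materialized.
class LazyCountDemo {

    // Stand-in for an uncached lineage: each get() rebuilds the lazy pipeline.
    // The counter records how often the filter predicate really executes.
    static Supplier<Stream<Integer>> lazyEvens(int n, AtomicInteger evaluations) {
        return () -> IntStream.range(0, n).boxed().filter(x -> {
            evaluations.incrementAndGet();
            return x % 2 == 0;
        });
    }

    // Building the pipeline without a terminal operation executes nothing.
    static int evaluationsBeforeAction(int n) {
        AtomicInteger evaluations = new AtomicInteger();
        Stream<Integer> pending = lazyEvens(n, evaluations).get();
        pending.close();
        return evaluations.get();
    }

    // Counting twice without materializing runs the filter twice over the input.
    // Returns {first count, second count, predicate evaluations}.
    static long[] countTwiceUncached(int n) {
        AtomicInteger evaluations = new AtomicInteger();
        Supplier<Stream<Integer>> evens = lazyEvens(n, evaluations);
        long first = evens.get().count();
        long second = evens.get().count();
        return new long[] {first, second, evaluations.get()};
    }

    // Materializing once (the analogue of cache()) lets both counts reuse the result.
    // Returns {first count, second count, predicate evaluations}.
    static long[] countTwiceCached(int n) {
        AtomicInteger evaluations = new AtomicInteger();
        List<Integer> cached = lazyEvens(n, evaluations).get().collect(Collectors.toList());
        long first = cached.size();
        long second = cached.size();
        return new long[] {first, second, evaluations.get()};
    }

    public static void main(String[] args) {
        System.out.println("evaluations before any action: " + evaluationsBeforeAction(1000));
        System.out.println("uncached, two counts: " + countTwiceUncached(1000)[2] + " evaluations");
        System.out.println("cached, two counts:   " + countTwiceCached(1000)[2] + " evaluations");
    }
}
```

With 1000 input rows, the uncached variant evaluates the filter 2000 times for two counts, while the materialized variant evaluates it 1000 times. In Spark the same idea applies at the scale of loading files and shuffling data, which is why a repeated count() on an uncached RDD is so expensive.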

Also, I would suggest trying rdd.countApprox() as an experiment. countApprox() is an approximate version of count(): it takes a timeout in milliseconds and returns a potentially incomplete result within that timeout, even if not all tasks have finished.

It may be faster on the first attempt, but on a second attempt it brings no benefit and is often even slower than count().
