Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

Is it possible in Spark to implement '.combinations' function from scala collections?

   /** Iterates over combinations.
   *
   *  @return   An Iterator which traverses the possible n-element combinations of this $coll.
   *  @example  `"abbbc".combinations(2) = Iterator(ab, ac, bb, bc)`
   */


For example how can I get from RDD[X] to RDD[List[X]] or RDD[(X,X)] for combinations of size = 2. And lets assume that all values in RDD are unique.

1 Answer

0 votes
by (32.3k points)

Cartesian product and combinations are two different things, the cartesian product will create an RDD of size rdd.size() ^ 2 and combinations(defined as “combs” in the below code) will create an RDD of size rdd.size() choose 2

val rdd = spark.sparkContext.parallelize(1 to 5)

val combs = rdd.cartesian(rdd).filter{ case (a,b) => a < b }

combs.collect()

Note this will only work if an ordering is defined on the elements of the list, since we use <. This one only works for choosing two but can easily be extended by making sure the relationship a < b for all a and b in the sequence.

...