Using RDD in Spark

Working with Key Value Pairs


Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network.

Creating Pair RDDs

Pair RDDs can be created by running a map() function that returns key/value pairs. The procedure to build a key-value RDD differs by language. In Python, for the functions on keyed data to work, we need to return an RDD composed of tuples. Creating a pair RDD using the first word as the key in Python:

pairs = lines.map(lambda x: (x.split(" ")[0], x))

In Scala, for the functions on keyed data to be available, we also need to return tuples, as shown in the previous example. An implicit conversion on RDDs of tuples exists to provide the additional key/value functions.

Creating a pair RDD using the first word as the key in Scala:

val pairs = lines.map(x => (x.split(" ")(0), x))

Java doesn’t have a built-in tuple type, so Spark’s Java API has users create tuples using the scala.Tuple2 class. Java users can construct a new tuple by writing new Tuple2(elem1, elem2) and can then access its elements with the _1() and _2() methods.

Java users also need to call special versions of Spark’s functions when creating pair RDDs. For instance, the mapToPair() function should be used in place of the basic map() function.

Creating a pair RDD using the first word as the key in Java:

PairFunction<String, String, String> keyData =
  new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
      return new Tuple2<>(x.split(" ")[0], x);
    }
  };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);


Transformations on Pair RDDs


When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key. Spark has a set of operations that combine values that have the same key. These operations return RDDs and thus are transformations rather than actions:

  • reduceByKey(),
  • foldByKey(),
  • combineByKey().
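As a rough illustration of what these aggregations compute, here is a plain-Python sketch (no Spark required) of the semantics of reduceByKey(); in PySpark the equivalent would be rdd.reduceByKey(lambda a, b: a + b):

```python
# Plain-Python sketch of the per-key aggregation reduceByKey() performs.
# In PySpark: rdd.reduceByKey(lambda a, b: a + b)
def reduce_by_key(pairs, func):
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return sorted(result.items())

pairs = [("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)]
print(reduce_by_key(pairs, lambda a, b: a + b))
# [('panda', 1), ('pink', 7), ('pirate', 3)]
```

In Spark the same combining function is first applied locally on each partition and then across partitions, which is why reduceByKey() is usually much cheaper than grouping all values and reducing afterwards.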

Grouping Data

Grouping our data with respect to a key is a common use case: for example, viewing all of a customer’s orders together.

If our data is already keyed in the way we want, groupByKey() will group it using the key in our RDD. On an RDD consisting of keys of type K and values of type V, we get back an RDD of type [K, Iterable[V]].

groupBy() works on unpaired data, or on data where we want to use a condition other than equality on the current key. It takes a function that it applies to every element in the source RDD and uses the result to determine the key.
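The grouping behaviour can be sketched in plain Python (no Spark required); in PySpark this corresponds to rdd.groupByKey():

```python
from collections import defaultdict

# Plain-Python sketch of groupByKey(): all values with the same key are
# collected into one iterable per key.
def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

orders = [("alice", "order-1"), ("bob", "order-2"), ("alice", "order-3")]
print(group_by_key(orders))
# {'alice': ['order-1', 'order-3'], 'bob': ['order-2']}
```

Note that when the grouping is only a prelude to an aggregation, reduceByKey() or combineByKey() is preferable in Spark, since it avoids shipping every individual value across the network.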


Joins

Some of the most useful operations we get with keyed data come from using it together with other keyed data. Joining datasets together is probably one of the most common operations on a pair RDD.

  • Inner join : Only keys that are present in both pair RDDs appear in the output.
  • leftOuterJoin() : The resulting pair RDD has entries for each key in the source RDD. The value associated with each key in the result is a tuple of the value from the source RDD and an Option for the value from the other pair RDD.
  • rightOuterJoin() : Almost identical to leftOuterJoin(), except the key must be present in the other RDD and the tuple has an Option for the value from the source RDD rather than the other RDD.
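The three join flavours can be sketched in plain Python (no Spark required); in PySpark they are rdd1.join(rdd2), rdd1.leftOuterJoin(rdd2), and rdd1.rightOuterJoin(rdd2). Here None plays the role of Scala's Option when a key is missing:

```python
from collections import defaultdict

def _index(pairs):
    # Build key -> list-of-values for one side of the join.
    idx = defaultdict(list)
    for k, v in pairs:
        idx[k].append(v)
    return dict(idx)

def inner_join(left, right):
    # Only keys present in both sides appear in the output.
    r = _index(right)
    return [(k, (lv, rv)) for k, lv in left for rv in r.get(k, [])]

def left_outer_join(left, right):
    # Every key from the left side appears; missing right values become None.
    r = _index(right)
    out = []
    for k, lv in left:
        if k in r:
            out.extend((k, (lv, rv)) for rv in r[k])
        else:
            out.append((k, (lv, None)))
    return out

def right_outer_join(left, right):
    # Mirror image of leftOuterJoin: swap sides, then swap the value tuple.
    return [(k, (lv, rv)) for k, (rv, lv) in left_outer_join(right, left)]

pages = [("index.html", "Home"), ("about.html", "About")]
visits = [("index.html", "1.2.3.4"), ("contact.html", "5.6.7.8")]
print(inner_join(pages, visits))
# [('index.html', ('Home', '1.2.3.4'))]
print(left_outer_join(pages, visits))
# [('index.html', ('Home', '1.2.3.4')), ('about.html', ('About', None))]
```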


Sorting Data

We can sort an RDD of key/value pairs provided that there is an ordering defined on the keys. Once we have sorted our data, any subsequent call on the sorted data to collect() or save() will result in an ordered dataset.
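A plain-Python sketch of the sort (in PySpark: rdd.sortByKey(ascending=True)):

```python
# Plain-Python sketch of sortByKey(): order the pairs by key.
def sort_by_key(pairs, ascending=True):
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

pairs = [(3, "c"), (1, "a"), (2, "b")]
print(sort_by_key(pairs))                   # [(1, 'a'), (2, 'b'), (3, 'c')]
print(sort_by_key(pairs, ascending=False))  # [(3, 'c'), (2, 'b'), (1, 'a')]
```

In Spark, sortByKey() also accepts a custom comparison (via an implicit Ordering in Scala, or a keyfunc in PySpark), which is what "an ordering defined on the keys" refers to.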

Actions Available on Pair RDDs are as follows:

  • countByKey() : Count the number of elements for each key.
  • collectAsMap() : Collect the result as a map to provide easy lookup.
  • lookup(key) : Return all values associated with the provided key.
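Plain-Python sketches of what these three actions return (no Spark required):

```python
from collections import Counter

pairs = [("a", 1), ("b", 2), ("a", 3)]

count_by_key = Counter(k for k, _ in pairs)    # like rdd.countByKey()
as_map = dict(pairs)                           # like rdd.collectAsMap()
lookup = [v for k, v in pairs if k == "a"]     # like rdd.lookup("a")

print(count_by_key)  # Counter({'a': 2, 'b': 1})
print(as_map)        # {'a': 3, 'b': 2} -- later values overwrite earlier ones
print(lookup)        # [1, 3]
```

Note the collectAsMap() caveat visible in the sketch: because the result is a map, only one value survives per duplicate key, so use lookup() or groupByKey() when you need all of them.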

Data Partitioning (Advanced)

In a distributed program, communication is very expensive, so laying out data to minimize network traffic can greatly improve performance. Much like how a single-node program needs to choose the right data structure for a collection of records, Spark programs can choose how their RDDs are partitioned to reduce communication. Partitioning will not be helpful in all applications: if a given RDD is scanned only once, there is no point in partitioning it in advance. It is useful only when a dataset is reused multiple times in key-oriented operations such as joins.
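The core idea of hash partitioning can be sketched in plain Python: a key is assigned to a partition by hashing it modulo the number of partitions, which is essentially what Spark's HashPartitioner does. Two RDDs partitioned with the same scheme co-locate matching keys, so a join between them needs no shuffle:

```python
# Plain-Python sketch of hash partitioning: partition = hash(key) mod N.
# (Integer keys are used here because Python randomizes string hashes
# per process; Spark's hashing is deterministic across the cluster.)
def get_partition(key, num_partitions):
    return hash(key) % num_partitions

for key in [1, 2, 3, 4]:
    print(key, "->", get_partition(key, 2))
```

Because the assignment depends only on the key and the partition count, any two datasets partitioned with the same function and the same number of partitions place equal keys on the same node.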


Determining an RDD’s Partitioner

Example :

scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs: spark.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> pairs.partitioner
res0: Option[spark.Partitioner] = None
scala> val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))
partitioned: spark.RDD[(Int, Int)] = ShuffledRDD[1] at partitionBy at <console>:14
scala> partitioned.partitioner
res1: Option[spark.Partitioner] = Some(spark.HashPartitioner@...)

In this short session, we created an RDD of (Int, Int) pairs, which initially has no partitioning information (an Option with value None). We then created a second RDD by hash-partitioning the first. If we actually wanted to use partitioned in further operations, we should have appended persist() to the line in which partitioned is defined. This is for the same reason that we needed persist() for userData in the previous example: without persist(), subsequent RDD actions will evaluate the entire lineage of partitioned, which will cause pairs to be hash-partitioned over and over again.

Operations That Benefit from Partitioning

The operations that benefit from partitioning are

  • cogroup()
  • groupWith()
  • join()
  • leftOuterJoin()
  • rightOuterJoin()
  • groupByKey()
  • reduceByKey()
  • combineByKey(), and
  • lookup().


Operations That Affect Partitioning

All the operations that result in a partitioner being set on the output RDD:

  • cogroup(),
  • groupWith(),
  • join(),
  • leftOuterJoin(),
  • rightOuterJoin(),
  • groupByKey(),
  • reduceByKey(),
  • combineByKey(),
  • partitionBy(),
  • sort(),
  • mapValues() (if the parent RDD has a partitioner),
  • flatMapValues() (if parent has a partitioner), and
  • filter() (if parent has a partitioner).

Custom Partitioners

To implement a custom partitioner, you need to subclass the org.apache.spark.Partitioner class and implement three methods:

  • numPartitions : Int, which returns the number of partitions that you will create.
  • getPartition(key: Any) : Int, which returns the partition ID (0 to numPartitions-1) for a given key.
  • equals() : the standard Java equality method. This is important to implement because Spark will need to test your Partitioner object against other instances when it decides whether two of your RDDs are partitioned the same way.
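As an illustration of this three-method contract, here is a plain-Python version of a hypothetical domain-based partitioner that keeps all pages of one website in the same partition. (In Spark you would subclass org.apache.spark.Partitioner in Scala/Java; in PySpark you would instead pass a partition function to partitionBy().)

```python
# Illustrative plain-Python analogue of a custom Partitioner; the class
# name and the URL-parsing rule are hypothetical, for demonstration only.
class DomainNamePartitioner:
    def __init__(self, num_partitions):
        self.numPartitions = num_partitions      # numPartitions: Int

    def getPartition(self, key):                 # getPartition(key): Int
        # Hash only the domain part of a URL so that all pages of one
        # site land in the same partition.
        domain = key.split("/")[2] if "//" in key else key
        return abs(hash(domain)) % self.numPartitions

    def __eq__(self, other):                     # equals()
        return (isinstance(other, DomainNamePartitioner)
                and other.numPartitions == self.numPartitions)

p = DomainNamePartitioner(20)
print(p.getPartition("http://www.example.com/page1")
      == p.getPartition("http://www.example.com/page2"))  # True
```

The equals() implementation lets Spark recognize that two RDDs partitioned with equivalent DomainNamePartitioner instances are co-partitioned, so a join between them can skip the shuffle.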

