0 votes
1 view
in Big Data Hadoop & Spark by (11.4k points)

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.

Here is the default Spark behavior.

val numbersDf = Seq(
  ("123"),
  ("456"),
  (null),
  ("")
).toDF("numbers")

val lettersDf = Seq(
  ("123", "abc"),
  ("456", "def"),
  (null, "zzz"),
  ("", "hhh")
).toDF("numbers", "letters")

val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))


Here is the output of joinedDf.show():

+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|       |    hhh|
+-------+-------+


This is the output I would like:

+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|       |    hhh|
|   null|    zzz|
+-------+-------+

1 Answer

0 votes
by (32.2k points)
edited by

I would suggest you use a NULL-safe equal operator(<=>).

I tried the approach and it worked.

image

If you want to know more about Spark, then do check out this awesome video tutorial:

Welcome to Intellipaat Community. Get your technical queries answered by top developers !

Categories

...