in Big Data Hadoop & Spark by (11.4k points)

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.

Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?

The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness, hence the need to check for equivalence/equality on a meaningful test data set.

1 Answer

by (32.3k points)

Most of the standard approaches in the Apache Spark test suites involve collecting the data locally, so if you want to do equality testing on large DataFrames, they are likely not a suitable solution.
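For small test datasets, though, a collect-based check is perfectly reasonable. A minimal sketch (the helper name sameDataLocal is hypothetical; it counts each row's multiplicity locally so duplicate rows are handled):

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Collect-based equality check: only suitable for small test data,
// since both DataFrames are pulled onto the driver.
def sameDataLocal(df1: DataFrame, df2: DataFrame): Boolean = {
  // Count each distinct row's multiplicity so duplicate rows are handled.
  def counts(df: DataFrame): Map[Row, Int] =
    df.collect().groupBy(identity).map { case (row, rows) => (row, rows.length) }
  df1.schema == df2.schema && counts(df1) == counts(df2)
}
```

Note that this (and the approaches below) assumes the same column order on both sides. If columns may be reordered, you could first select each DataFrame's columns in a canonical order, e.g. df.select(df.columns.sorted.map(col): _*) with org.apache.spark.sql.functions.col.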

One option is to check the schemas first, then intersect df1 and df2 into df3 and verify that the counts of df1, df2, and df3 are all equal (however, this only works if there are no duplicate rows).
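A sketch of that check (the helper name sameData is hypothetical):

```scala
import org.apache.spark.sql.DataFrame

// Intersection-based check: if neither DataFrame has duplicate rows and
// df1 ∩ df2 has the same count as both, the row sets are identical.
def sameData(df1: DataFrame, df2: DataFrame): Boolean =
  df1.schema == df2.schema && {
    val n = df1.count()
    df2.count() == n && df1.intersect(df2).count() == n
  }
```

Note that intersect deduplicates its result, which is why duplicate rows would break the count comparison.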

Another option is to get the underlying RDDs of both DataFrames, map each row to (Row, 1), do a reduceByKey to count the occurrences of each row, cogroup the two resulting RDDs, and finally aggregate, returning false if any pair of count iterators differs. A sketch of this approach follows.
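Sketched out, again with a hypothetical helper name (here the final aggregate is expressed as a filter for mismatches plus an isEmpty check, which is equivalent):

```scala
import org.apache.spark.sql.{DataFrame, Row}

// RDD-based multiset comparison: counts each row's multiplicity on each
// side, then cogroups the counts and checks that they match everywhere.
def sameRows(df1: DataFrame, df2: DataFrame): Boolean = {
  val counts1 = df1.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
  val counts2 = df2.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
  counts1.cogroup(counts2)
    .filter { case (_, (c1, c2)) => c1.toSeq != c2.toSeq } // mismatched multiplicities
    .isEmpty()
}
```

Pair this with a schema check as above. Unlike the intersection approach, this handles duplicate rows correctly, and nothing is collected to the driver.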
