in Big Data Hadoop & Spark

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?

1 Answer


In PySpark, use countDistinct() and aggregate over every column, like this:

from pyspark.sql.functions import col, countDistinct

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
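As a sanity check, here is a plain-Python sketch (no Spark required) of what that aggregation computes: one distinct count per column. The sample rows mirror the Scala example further down and are purely illustrative.

```python
# Sample rows and column names, matching the Scala example in this answer
rows = [(1, 3, 4), (1, 2, 3), (2, 3, 4), (2, 3, 5)]
columns = ["col1", "col2", "col3"]

def distinct_counts(rows, columns):
    # Keep one set of seen values per column, then report each set's size
    seen = {c: set() for c in columns}
    for row in rows:
        for c, v in zip(columns, row):
            seen[c].add(v)
    return {c: len(s) for c, s in seen.items()}

# distinct_counts(rows, columns) -> {"col1": 2, "col2": 2, "col3": 3}
```

Spark evaluates the same logic distributed across partitions, which is why the single `agg` call is far cheaper than running one `SELECT COUNT(DISTINCT c)` per column.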

Similarly, in Scala:

import org.apache.spark.sql.functions.{col, countDistinct}

df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

Another approach is to use approx_count_distinct(), which speeds things up at the cost of some accuracy:

val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")

val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap

df.agg(exprs).show()


// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// |                          2|                          2|                          3|
// +---------------------------+---------------------------+---------------------------+

Note that the approx_count_distinct method relies on the HyperLogLog algorithm under the hood, so the counts are estimates with a bounded relative error rather than exact values.
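For intuition, here is a minimal, simplified HyperLogLog sketch in plain Python. This is not Spark's implementation; the register count (2**p) and the SHA-1 hash are arbitrary choices for illustration.

```python
import hashlib
import math

def hll_estimate(values, p=10):
    """Rough HyperLogLog cardinality estimate using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value (SHA-1 truncated, for illustration only)
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)      # low p bits choose a register
        w = h >> p             # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the (64 - p)-bit value w
        rank = (64 - p) - w.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction: fall back to linear counting when many
    # registers are still empty
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return est
```

Each register only remembers a maximum rank, so the whole sketch is a few kilobytes regardless of how many rows flow through it, and sketches from different partitions merge by taking element-wise maxima. That is what makes approx_count_distinct cheap to run distributed.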

