Spark DataFrame: count distinct values of every column

1 Answer

Amit Rawat · Answer 1 · 2019-07-26T04:53:06+0000

In PySpark, apply countDistinct() to each column inside a single agg() call:

from pyspark.sql.functions import col, countDistinct
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))

Similarly, in Scala:

import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

Another approach is approx_count_distinct() (the older name approxCountDistinct() has been deprecated since Spark 2.1), which speeds things up at the cost of some accuracy:

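// toDF on a Seq needs the session implicits; spark-shell imports them automatically.
// Assumes a SparkSession named spark.
import spark.implicits._
val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")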
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// |                          2|                          2|                          3|
// +---------------------------+---------------------------+---------------------------+

Note that approx_count_distinct relies on HyperLogLog (specifically the HyperLogLog++ variant) under the hood.
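
To give a rough feel for why a HyperLogLog-style estimate is cheap, here is a simplified pure-Python sketch of the basic algorithm. This is not Spark's implementation: the function name hll_estimate, the SHA-1 based hash, and the precision parameter p are all choices made for illustration, and the bias corrections of HyperLogLog++ are left out.

```python
import hashlib
import math


def hll_estimate(values, p=10):
    """Estimate the number of distinct values with a basic HyperLogLog.

    Illustrative only: uses 2**p registers and a 64-bit hash taken from SHA-1.
    """
    m = 1 << p                      # number of registers
    registers = [0] * m
    width = 64 - p                  # bits left after choosing a register

    for v in values:
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> width                           # top p bits pick the register
        rest = h & ((1 << width) - 1)              # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits
        # (width + 1 if they are all zero).
        rank = width - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    # Standard bias constant, valid for m >= 128.
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# Usage: 10,000 rows but only 1,000 distinct values.
print(round(hll_estimate(i % 1000 for i in range(10_000))))
```

The point is that memory stays fixed at 2**p small registers regardless of how many rows are scanned, and the expected relative error is roughly 1.04 / sqrt(2**p), which is the trade-off the rsd setting controls in Spark.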

