Fetching distinct values on a column using Spark DataFrame

Question

asked Jul 17, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

Using Spark 1.6.1 version I need to fetch distinct values on a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the call back to the driver program. Currently I am performing this task as below, is there a better approach?

import sqlContext.implicits._
preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
   val applicationId = x.getAs[String](ApplicationId)
   val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
   // DO SOME TASK PER applicationId
})
preProcessedData.unpersist()

1 Answer

Amit Rawat · Answer 1 · 2019-07-17T14:35:25+0000

In order to obtain all different values in a Dataframe you can use distinct. As by the documentation it is stated that method returns another DataFrame.

def
distinct(): DataFrame
Returns a new DataFrame that contains only the unique rows from this DataFrame.

Now, you can create a UDF in order to transform each record.

For example:

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// I obtain all different values. If you show you must see only {1, 3}
val distinctValuesDF = df.select(df("age")).distinct

// Define your udf. In this case a simple function is defined, but here things may get complicated.
val myTransformationUDF = udf(value => value / 10)

// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))

Fetching distinct values on a column using Spark DataFrame

1 Answer

Related questions

Browse Categories