
I am trying to build for each of my users a vector containing the average number of records per hour of day. Hence the vector has to have 24 dimensions.

My original DataFrame has userID and hour columns, and I am starting by doing a groupBy and counting the number of records per user per hour as follows:

val hourFreqDF = df.groupBy("userID", "hour").agg(count("*") as "hfreq")


Now, in order to generate a vector per user I am doing this:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"col" syntax, assuming a SparkSession named spark

val hours = (0 to 23).map(_.toString).toArray

val assembler = new VectorAssembler()
                     .setInputCols(hours)
                     .setOutputCol("hourlyConnections")

val exprs = hours.map(c => avg(when($"hour" === c, $"hfreq").otherwise(lit(0))).alias(c))

val transformed = assembler.transform(hourFreqDF.groupBy($"userID")
                           .agg(exprs.head, exprs.tail: _*))


When I run this, I get the following warning:

Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
 

I presume this is because the expression is too long?

My question is: can I safely ignore this warning?

1 Answer

I would say that if you are not interested in seeing the SQL schema logs, you can safely ignore this warning. Otherwise, you might want to set the property to a higher value, although that may add some overhead to your job:

spark.debug.maxToStringFields=100

The default value is DEFAULT_MAX_TO_STRING_FIELDS = 25.


The performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, we bound the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv.
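
For example, one way to raise the limit is to set the property when the SparkSession is created (just a minimal sketch, assuming Spark 2.x, where the value is read from the SparkConf that backs SparkEnv; the value 100 simply mirrors the example above):

import org.apache.spark.sql.SparkSession

// Set the limit before the SparkSession (and hence SparkEnv) is created,
// otherwise the new value will not be picked up.
val spark = SparkSession.builder()
  .config("spark.debug.maxToStringFields", "100")
  .getOrCreate()

Equivalently, you can pass it on the command line with spark-submit --conf spark.debug.maxToStringFields=100, or add it to spark-defaults.conf.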
