0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I am trying to build, for each of my users, a vector containing the average number of records per hour of the day, so the vector has 24 dimensions.

My original DataFrame has userID and hour columns, and I am starting by doing a groupBy and counting the number of records per user per hour, as follows:

import org.apache.spark.sql.functions._

val hourFreqDF = df.groupBy("userID", "hour").agg(count("*") as "hfreq")


Now, in order to generate a vector per user I am doing this:

import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._  // for the $"col" syntax used below

val hours = (0 to 23).map(_.toString).toArray

val assembler = new VectorAssembler()
                     .setInputCols(hours)
                     .setOutputCol("hourlyConnections")

val exprs = hours.map(c => avg(when($"hour" === c, $"hfreq").otherwise(lit(0))).alias(c))

val transformed = assembler.transform(hourFreqDF.groupBy($"userID")
                           .agg(exprs.head, exprs.tail: _*))


When I run this, I get the following warning:

Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
 

I presume this is because the expression is too long?

My question is: can I safely ignore this warning?

3 Answers

0 votes
by (32.3k points)

I would say that if you are not interested in seeing the full SQL schema in the logs, you can safely ignore this warning. Otherwise, you might want to set the property to a higher value, though that adds some string-building and logging overhead to your job for wide schemas:

spark.debug.maxToStringFields=100

Default value is: DEFAULT_MAX_TO_STRING_FIELDS = 25


As the Spark source comment explains: the performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, Spark bounds the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv.
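For example, the limit can be raised when the SparkSession is created. This is a minimal sketch for a standalone application; the application name is arbitrary:

import org.apache.spark.sql.SparkSession

// spark.debug.maxToStringFields is read from the conf registered in SparkEnv,
// so the simplest approach is to set it before the session is created.
val spark = SparkSession.builder()
  .appName("hourly-vectors")  // arbitrary name, for illustration only
  .config("spark.debug.maxToStringFields", "100")
  .getOrCreate()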

0 votes
by (37.3k points)

You can usually ignore this warning if your code is running fine. It is just a message saying that Spark’s debug output (the string form of the query plan) was too large to log in full.

To avoid the warning, set a higher limit for Spark’s debug output:

spark.debug.maxToStringFields=100

This avoids the warning and gives you more detailed plan output in the logs if you need it.
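If you submit the job with spark-submit rather than configuring it in code, the same property can be passed with --conf; the class and jar names below are placeholders:

# Raise the limit for this submission; replace the class and jar with your own.
spark-submit \
  --conf spark.debug.maxToStringFields=100 \
  --class com.example.HourlyVectors \
  your-app.jar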

0 votes
by (1.1k points)

This warning, "Truncated the string representation of a plan since it was too large," appears when the string form of a DataFrame's plan is too large to be printed in the logs in full; it is normal in Apache Spark. It usually shows up when many columns or complex transformations are involved, such as the 24 per-hour aggregation expressions here.

Can I Ignore this Warning?

  • Effects on Performance: The warning does not affect your job execution. It is simply a log message and does not indicate an error or failure that needs to be fixed. Your transformations and aggregations are performed as they should, even though the warning appears.

  • Understanding the Plan: If you want to debug something, or simply want a better picture of what the execution plan looks like, try simplifying the DataFrame or calling explain() on a version with fewer columns, so that the truncation does not get in the way in the first place.

  • Changing the Limit: If you want to see more of the plan in the log, the number of fields displayed can be increased by setting the following in your Spark session or job, as in the sketch below: spark.conf.set("spark.debug.maxToStringFields", "1000") // or any number greater than the default
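For illustration, here is a minimal sketch that reuses hourFreqDF and exprs from the question; note that the exact property name and whether it can be changed at runtime may differ between Spark versions:

// Raise the field limit as suggested above.
spark.conf.set("spark.debug.maxToStringFields", "1000")

// Inspect the aggregation on its own, before the VectorAssembler step,
// so the printed plan stays readable.
val perUser = hourFreqDF
  .groupBy($"userID")
  .agg(exprs.head, exprs.tail: _*)

perUser.explain()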

Conclusion

Do not worry about this message as long as the job completes and the results look right. If, however, you still wish to delve deeper into your transformations, raise the spark.debug.maxToStringFields limit for more detail.
