apache-spark - Spark : "Truncated the string representation of a plan since it was too large." Warning when using manually created aggregation expression

Tags: apache-spark spark-dataframe

I am trying to build, for each of my users, a vector containing the average number of records per hour of the day, so the vector must have 24 dimensions.

My original DataFrame has userID and hour columns, and I started by doing a groupBy and counting the number of records per user per hour as follows:

import org.apache.spark.sql.functions.count

val hourFreqDF = df.groupBy("userID", "hour").agg(count("*") as "hfreq")

Now, to generate a vector for each user, I did the following, based on the first suggestion in this answer:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{avg, lit, when}

val hours = (0 to 23).map(_.toString).toArray

val assembler = new VectorAssembler()
                     .setInputCols(hours)
                     .setOutputCol("hourlyConnections")

val exprs = hours.map(c => avg(when($"hour" === c, $"hfreq").otherwise(lit(0))).alias(c))

val transformed = assembler.transform(hourFreqDF.groupBy($"userID")
                           .agg(exprs.head, exprs.tail: _*))
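For completeness, the assembled output can be inspected directly; a minimal sketch, assuming the userID and hourlyConnections column names from the code above:

// Show a few of the assembled 24-dimensional vectors;
// "hourlyConnections" is the output column configured on the VectorAssembler above.
transformed.select("userID", "hourlyConnections").show(5, truncate = false)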

When I run this example, I get the following warning:
Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.

I assume this is because the expression is too long?

My question is: can I safely ignore this warning?

Best Answer

If you are not interested in seeing the sql schema logs, you can safely ignore it. Otherwise, you might want to set the property to a higher value, but doing so could affect the performance of your job:

spark.debug.maxToStringFields=100

The default value is: DEFAULT_MAX_TO_STRING_FIELDS = 25
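For example, the property can be set when building the session; a minimal sketch, using the Spark 2.x key quoted in the warning (newer Spark versions expose an equivalent spark.sql.debug.maxToStringFields setting), with the application name chosen here purely for illustration:

import org.apache.spark.sql.SparkSession

// Raise the truncation limit for plan/schema string rendering.
// The key below is the one named in the warning; "hourly-connection-vectors"
// is just a placeholder application name.
val spark = SparkSession.builder()
  .appName("hourly-connection-vectors")
  .config("spark.debug.maxToStringFields", "100")
  .getOrCreate()

The same value can also be passed on the command line, e.g. spark-submit --conf spark.debug.maxToStringFields=100.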

The performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, we bound the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv.



Taken from: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L90

Original question on Stack Overflow: https://stackoverflow.com/questions/43759896/
