apache-spark - Spark : "Truncated the string representation of a plan since it was too large." Warning when using manually created aggregation expression

Tags: apache-spark spark-dataframe

I am trying to build, for each of my users, a vector containing the average number of records per hour of the day, so the vector must have 24 dimensions.

My original DataFrame has userID and hour columns, and I started by doing a groupBy and counting the number of records per user per hour as follows:

import org.apache.spark.sql.functions.count

val hourFreqDF = df.groupBy("userID", "hour").agg(count("*") as "hfreq")

Now, to generate a vector for each user, I did the following, based on the first suggestion in this answer:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{avg, lit, when}

val hours = (0 to 23).map(_.toString).toArray

val assembler = new VectorAssembler()
                     .setInputCols(hours)
                     .setOutputCol("hourlyConnections")

val exprs = hours.map(c => avg(when($"hour" === c, $"hfreq").otherwise(lit(0))).alias(c))

val transformed = assembler.transform(hourFreqDF.groupBy($"userID")
                           .agg(exprs.head, exprs.tail: _*))
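For completeness, the assembled output can be inspected directly; a minimal sketch, assuming the userID and hourlyConnections column names from the code above:

// Show a few of the assembled 24-dimensional vectors;
// "hourlyConnections" is the output column configured on the VectorAssembler above.
transformed.select("userID", "hourlyConnections").show(5, truncate = false)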

When I run this example, I get the following warning:
Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.

I assume this is because the expression is too long?

My question is: can I safely ignore this warning?

Best Answer

If you are not interested in seeing the sql schema logs, you can safely ignore it. Otherwise, you might want to set the property to a higher value, but doing so could affect the performance of your job:

spark.debug.maxToStringFields=100

The default value is: DEFAULT_MAX_TO_STRING_FIELDS = 25
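For example, the property can be set when building the session; a minimal sketch, using the Spark 2.x key quoted in the warning (newer Spark versions expose an equivalent spark.sql.debug.maxToStringFields setting), with the application name chosen here purely for illustration:

import org.apache.spark.sql.SparkSession

// Raise the truncation limit for plan/schema string rendering.
// The key below is the one named in the warning; "hourly-connection-vectors"
// is just a placeholder application name.
val spark = SparkSession.builder()
  .appName("hourly-connection-vectors")
  .config("spark.debug.maxToStringFields", "100")
  .getOrCreate()

The same value can also be passed on the command line, e.g. spark-submit --conf spark.debug.maxToStringFields=100.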

The performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, we bound the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv.



Taken from: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L90

Original question on Stack Overflow: https://stackoverflow.com/questions/43759896/
