python - How do I print only a certain column of a DataFrame in PySpark?

Is it possible to use the actions collect or take to print only a given column of a DataFrame?

This:

df.col.collect()

raises the error:

TypeError: 'Column' object is not callable

And this:

df[df.col].take(2)

gives:

pyspark.sql.utils.AnalysisException: u"filter expression 'col' of type string is not a boolean.;"

Best answer

Use select and show:

df.select("col").show()

Or select, flatMap, and collect:

df.select("col").rdd.flatMap(list).collect()

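Bracket notation with a Column argument (df[df.col]) is used only for logical slicing, that is, filtering rows, which is why it expects a boolean column. A column by itself (df.col) is not a distributed data structure but a SQL expression, so it cannot be collected.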

Regarding python - How do I print only a certain column of a DataFrame in PySpark?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35913506/