I am working with a DataFrame that has two columns, mvv and count.
+---+-----+
|mvv|count|
+---+-----+
|  1|    5|
|  2|    9|
|  3|    3|
|  4|    1|
+---+-----+
I would like to obtain two lists, one with the mvv values and one with the count values. Something like:
mvv = [1,2,3,4]
count = [5,9,3,1]
So I tried the following code. The first line should return a Python list of Row objects. I wanted to look at the first value:
mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)
But I get an error message on the second line:
AttributeError: getInt
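Best answer
Let's see why your approach does not work. First, you are trying to get an integer from a Row object. The output of your collect looks like this: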
>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)
If you access the field by name like this:
>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1
you will get the mvv value. If you want all the values of the column as a list, you can do this:
>>> mvv_array = [int(row.mvv) for row in mvv_count_df.collect()]
>>> mvv_array
Out: [1,2,3,4]
However, if you try the same thing for the other column, you get:
>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
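This happens because count is a built-in method. Row is a tuple subclass, so row.count resolves to the tuple's count method rather than to the column of the same name. One workaround is to rename the column from count to _count: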
>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
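To see why attribute access picks up the method instead of the column, here is a minimal sketch using a hypothetical stand-in class, FakeRow. It is not PySpark's actual Row implementation; it only assumes the same general design, a tuple subclass that resolves field names through __getattr__.

```python
class FakeRow(tuple):
    """Hypothetical, simplified stand-in for a PySpark Row (illustration only)."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)  # remember the field names in order
        return row

    def __getitem__(self, key):
        # String keys are looked up by field name, other keys by position.
        if isinstance(key, str):
            return super().__getitem__(self.__dict__["_names"].index(key))
        return super().__getitem__(key)

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so tuple
        # methods such as count and index are found first and never get here.
        try:
            return self[name]
        except ValueError:
            raise AttributeError(name)


row = FakeRow(mvv=1, count=5)
print(row.mvv)        # field value, resolved through __getattr__
print(row.count)      # the bound tuple.count method, not the field
print(row["count"])   # field value, resolved through __getitem__
```

Any field whose name matches an existing tuple attribute (count, index) is shadowed in the same way, which is why renaming the column makes attribute access work again.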
But this workaround is not needed, because you can access the column using dictionary syntax:
>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]
And it will finally work!
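The two comprehensions above each call collect(), so the data is brought to the driver twice. As a rough sketch, both lists can be built from a single collect. The helper name columns_to_lists is made up for illustration, and it assumes df is a PySpark DataFrame small enough to fit in driver memory.

```python
def columns_to_lists(df, *names):
    """Collect df once and return one Python list per requested column.

    Hypothetical helper: df is assumed to be a PySpark DataFrame whose rows
    support row["column_name"] lookups.
    """
    rows = df.collect()  # pulls every row to the driver in one pass
    # Dictionary-style access avoids the clash with tuple methods like count.
    return [[row[name] for row in rows] for name in names]


# Usage with the DataFrame from the question (assumed to exist already):
# mvv, count = columns_to_lists(mvv_count_df, "mvv", "count")
```

For the sample data this should give the two lists from the question, mvv and count, in row order.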
Source: a similar question on Stack Overflow, "python - Convert Spark DataFrame column to Python list": https://stackoverflow.com/questions/38610559/