PySpark DataFrame from a Python dictionary without Pandas

I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF()  # Result not as expected
df_dict.show()

Is there a way to do this without using Pandas?

Best answer

Quoting myself:

I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.

So the easiest approach is to convert your dictionary into this format. You can do this easily with zip():

column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#+-------+-------+
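
To see what the two zip() calls above are doing, here is a minimal pure-Python sketch of the reshaping step on its own. It assumes Python 3.7+ (where dicts keep insertion order) and needs no Spark session; the name rows is only illustrative and is not part of the original answer.

```python
# Pure-Python walk-through of the reshaping step; no Spark session needed.
dict_lst = {'letters': ['a', 'b', 'c'],
            'numbers': [10, 20, 30]}

# items() yields (name, values) pairs; zip(*...) splits them into
# one tuple of column names and one tuple of column value lists.
column_names, data = zip(*dict_lst.items())

# zip(*data) transposes the column lists into row tuples,
# which is the row-oriented layout createDataFrame() expects.
rows = list(zip(*data))

print(column_names)  # ('letters', 'numbers')
print(rows)          # [('a', 10), ('b', 20), ('c', 30)]
```

Because the names and the values are split from the same items() call, each column name stays paired with its own list of values.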

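The above assumes that all of the lists have the same length. If they do not, zip() silently truncates to the shortest list, so you will have to use itertools.izip_longest (Python 2) or itertools.zip_longest (Python 3) instead.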

try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

dict_lst = {'letters': ['a', 'b', 'c'], 
             'numbers': [10, 20, 30, 40]}

column_names, data = zip(*dict_lst.items())

spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#|      a|     10|
#|      b|     20|
#|      c|     30|
#|   null|     40|
#+-------+-------+
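
The padding step can also be checked without Spark. The sketch below wraps it in a small helper; dict_to_rows and its fillvalue parameter are hypothetical names introduced here for illustration, and the code assumes Python 3. Missing entries are padded with None, which corresponds to the null shown in the output above.

```python
from itertools import zip_longest  # Python 3 name


def dict_to_rows(columns, fillvalue=None):
    """Hypothetical helper: turn a dict of column lists into
    (column_names, rows), padding shorter columns with fillvalue."""
    column_names = list(columns)
    # zip_longest transposes the columns and pads the short ones.
    rows = list(zip_longest(*columns.values(), fillvalue=fillvalue))
    return column_names, rows


column_names, rows = dict_to_rows({'letters': ['a', 'b', 'c'],
                                   'numbers': [10, 20, 30, 40]})

print(column_names)  # ['letters', 'numbers']
print(rows)          # [('a', 10), ('b', 20), ('c', 30), (None, 40)]
```

The resulting rows and column_names could then be passed to spark.createDataFrame(rows, column_names), in the same way as in the answer's own snippet.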

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/51554921/
