python - Reducing (key, value) pairs where the value is a dictionary in Spark

Tags: python apache-spark pyspark apache-spark-sql mapreduce

I'm new to Spark and still learning. I'm using a map function to create an RDD of (key, dict) pairs, like this: [(0, {'f_0':'-0.5'}), (0, {'f_1':'-0.67'}), (1, {'f_0':'-0.36'}), (1, {'f_1':'-1.5'})]

After reducing by key, the result should be: [(0, {'f_0':'-0.5','f_1':'-0.67'}), (1, {'f_0':'-0.36', 'f_1':'-1.5'})]

I'm using PySpark (Python) on Databricks.

Can anyone help?

Best answer

Based on your question, the output of your map function can be reproduced as the following DataFrame:

df = spark.createDataFrame([
  (0, {'f_0':-0.5}), 
  (0, {'f_1':-0.67}), 
  (1, {'f_0':-0.36}), 
  (1, {'f_1':-1.5})], ["key", "val"])

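Use reduceByKey with the code below to get the output you want. reduceByKey is lazy, so call an action such as collect() on the result to see the merged pairs: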

df.rdd.reduceByKey(lambda a,b:{**a,**b})
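To see what this merge function does per key without a Spark cluster, here is a minimal plain-Python sketch. The helper reduce_by_key is a hypothetical local stand-in written for illustration, not a Spark API. It only mimics the semantics of grouping by key and folding the values pairwise. The sample pairs are the ones from the question.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Sample (key, dict) pairs taken from the question.
pairs = [
    (0, {'f_0': '-0.5'}),
    (0, {'f_1': '-0.67'}),
    (1, {'f_0': '-0.36'}),
    (1, {'f_1': '-1.5'}),
]


def reduce_by_key(records, func):
    """Hypothetical local stand-in for RDD.reduceByKey (illustration only)."""
    # groupby needs its input sorted by key; sorted() is stable, so the
    # original order of values within each key is preserved.
    ordered = sorted(records, key=itemgetter(0))
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Same merge function as in the answer: build a new dict from both inputs.
merged = reduce_by_key(pairs, lambda a, b: {**a, **b})
print(merged)
# [(0, {'f_0': '-0.5', 'f_1': '-0.67'}), (1, {'f_0': '-0.36', 'f_1': '-1.5'})]
```

In real Spark the pairwise merges happen within and across partitions in no guaranteed order, which is harmless here because each key's dictionaries have distinct feature names.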

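Note that the code above runs on Python 3 but not on Python 2, because the {**a, **b} dictionary-unpacking syntax requires Python 3.5 or later. The Python version used by PySpark must therefore be 3.5 or higher.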

If your PySpark Python version is 2.7, use the following code instead:

def merge_two_dicts(x, y):
    z = x.copy()   
    z.update(y)    

    return z

merge = df.rdd.reduceByKey(merge_two_dicts)
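The two merge styles should behave the same way. As a small sketch, the check below uses a made-up conflicting value (9.9 is hypothetical, not from the question) to show that both build a new dictionary, both leave their inputs untouched, and both let the right-hand value win when a key appears twice.

```python
def merge_two_dicts(x, y):
    # Copy first so that neither input dictionary is mutated.
    z = x.copy()
    z.update(y)
    return z


a = {'f_0': -0.5, 'f_1': -0.67}
b = {'f_1': 9.9}  # hypothetical conflicting value for 'f_1'

unpacked = {**a, **b}           # Python 3.5+ syntax
copied = merge_two_dicts(a, b)  # also works on Python 2.7

print(unpacked == copied)  # True
print(unpacked['f_1'])     # 9.9 (the right-hand dict wins)
print(a)                   # {'f_0': -0.5, 'f_1': -0.67} (input unchanged)
```

This matters for reduceByKey only if the same feature name can occur more than once under one key. In that case the surviving value depends on the order in which Spark combines the records, so the result may not be deterministic.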

This Q&A on "python - Reducing (key, value) pairs where the value is a dictionary in Spark" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/57937045/
