apache-spark - Merging two RDDs in PySpark

Suppose I have the following RDDs:

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])

How can I combine these two RDDs into a single RDD of pairs, like this:

[('a', 1), ('c', 2), ('d', 5), ('e', 3)]

Using a.union(b) just concatenates them into a single list. Any ideas?

Best answer

You probably just want b.zip(a) on the two RDDs (note the reversed order, since you want the values of b to be the keys).

Just read the Python docs carefully:

zip(other)

Zips this RDD with another one, returning key-value pairs with the first element in each RDD second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
x.zip(y).collect()
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
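The partition requirement in the quoted docs can be illustrated with a small plain-Python model. This is only a sketch and does not call Spark: the helpers split_into_partitions and zip_partitions are hypothetical names, and the contiguous chunking is an assumption of the model rather than a statement about how sc.parallelize distributes data. It pairs elements partition by partition and fails when the layouts do not line up, which is the situation the docs warn about.

```python
from typing import List, Sequence, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def split_into_partitions(items: Sequence, num_partitions: int) -> List[list]:
    """Hypothetical helper: split items into contiguous chunks.

    Assumption of this model only; real Spark partitioning may differ.
    """
    n = len(items)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [list(items[bounds[i]:bounds[i + 1]]) for i in range(num_partitions)]


def zip_partitions(left: List[List[K]], right: List[List[V]]) -> List[Tuple[K, V]]:
    """Hypothetical helper: pair elements partition by partition.

    Raises ValueError when the partition counts or the per-partition
    element counts differ, mirroring the assumption stated in the docs.
    """
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    pairs: List[Tuple[K, V]] = []
    for left_part, right_part in zip(left, right):
        if len(left_part) != len(right_part):
            raise ValueError("partitions differ in element count")
        pairs.extend(zip(left_part, right_part))
    return pairs


# The data from the question, modeled as two partitions each.
a_parts = split_into_partitions([1, 2, 5, 3], 2)
b_parts = split_into_partitions(['a', 'c', 'd', 'e'], 2)

# b comes first because its values should become the keys.
result = zip_partitions(b_parts, a_parts)
print(result)  # [('a', 1), ('c', 2), ('d', 5), ('e', 3)]
```

In this model, passing a_parts split into 3 chunks against b_parts split into 2 raises ValueError. If two real RDDs were not built with matching layouts, zip may fail in a similar way, so it is safest when both RDDs come from parallelize with the same settings or one is derived from the other through map.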

This question, apache-spark - Merging two RDDs in PySpark, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/35085627/
