Suppose I have the following RDDs:
a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])
How can I combine these two RDDs into a single RDD like this:
[('a', 1), ('c', 2), ('d', 5), ('e', 3)]
Using a.union(b) just concatenates them into a single list. Any ideas?
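Best answer
You probably just want to call b.zip(a) on the two RDDs (note the reversed order, since you want the values of b to be the keys).
Read the Python docs carefully, though: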
zip(other)
Zips this RDD with another one, returning key-value pairs with the first element in each RDD second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
x.zip(y).collect()
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
Regarding apache-spark - merging two RDDs in pyspark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35085627/