I want to understand more about how PySpark partitions data. I need a function like this:
a = sc.parallelize(range(10), 5)
show_partitions(a)
#output:[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]] (or however it partitions)
Best Answer
The glom function is exactly what you are looking for:
glom(self): Return an RDD created by coalescing all elements within each partition into a list.
a = sc.parallelize(range(10), 5)
a.glom().collect()
#output:[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
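To see why the partitions come out as contiguous, evenly sized slices, here is a pure-Python sketch (an assumption, not Spark's actual source) of the slicing rule Spark uses when parallelizing an in-memory collection: partition i receives the elements from `i*n//num_slices` up to `(i+1)*n//num_slices`. The function name `show_partitions_local` is hypothetical; with a real RDD you would simply call `rdd.glom().collect()` as shown above.

```python
def show_partitions_local(data, num_slices):
    # Sketch of Spark's even-slicing behavior for sc.parallelize:
    # partition i gets data[i*n//num_slices : (i+1)*n//num_slices].
    # This runs without Spark; it only illustrates the expected layout.
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

print(show_partitions_local(list(range(10)), 5))
# -> [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Note that with this rule the partitions stay balanced even when the length is not divisible by the slice count, e.g. 10 elements over 4 slices yields sizes 2, 3, 2, 3.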
For "pyspark - How to see the contents of each partition in an RDD in pyspark?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34216390/