python - pyspark: "too many values" error after repartitioning

Tags: python apache-spark apache-spark-sql pyspark rdd

I have a DataFrame (converted to an RDD) and want to repartition it so that each key (the first column) gets its own partition. This is what I did:

# Repartition to # key partitions and map each row to a partition given their key rank
my_rdd = df.rdd.partitionBy(len(keys), lambda row: int(row[0]))

However, when I try to map it back to a DataFrame or save it, I get this error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
        process()
      File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",     line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1703, in add_shuffle_key
    for k, v in iterator:
ValueError: too many values to unpack

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

Further testing showed that even this produces the same error: my_rdd = df.rdd.partitionBy(x)  # x = 5, 100, etc.

Has anyone run into this? If so, how did you resolve it?

Best Answer

partitionBy requires a PairwiseRDD, which in Python means an RDD of length-2 tuples (or lists), where the first element is the key and the second is the value.
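For contrast, here is a minimal sketch of the shape partitionBy does expect, assuming an existing SparkContext named sc (the sample data is made up):

pairs = sc.parallelize([(1, "a"), (2, "b"), (1, "c"), (3, "d")])  # (key, value) tuples

# partitionFunc receives only the key, so unpacking succeeds and each
# record is routed to the partition its key maps to.
partitioned = pairs.partitionBy(2, partitionFunc=lambda key: key % 2)

print(partitioned.glom().collect())
# e.g. [[(2, 'b')], [(1, 'a'), (1, 'c'), (3, 'd')]]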

The partitionFunc takes the key and maps it to a partition number. When you use it on an RDD[Row], it tries to unpack each row into a key and a value, and fails:

from pyspark.sql import Row

row = Row(1, 2, 3)
k, v = row

## Traceback (most recent call last):
##   ...
## ValueError: too many values to unpack (expected 2)

And even if you provided data of the right shape and did something like this:

my_rdd = df.rdd.map(lambda row: (int(row[0]), row)).partitionBy(len(keys))

it wouldn't really make much sense. Partitioning this way isn't particularly meaningful for DataFrames. See my answer to How to define partitioning of DataFrame? for more details.
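If the goal is simply to control how the DataFrame is partitioned by a column, a hedged sketch of that approach (it relies on repartition accepting column arguments, available from Spark 1.6 onward; the column name "key" is a placeholder for whatever the key column in df is actually called):

# Assumes Spark 1.6+; "key" stands in for the actual key column of df.
repartitioned = df.repartition(len(keys), "key")

# Rows sharing the same key value are hashed into the same partition,
# without leaving the DataFrame API.
print(repartitioned.rdd.getNumPartitions())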

Regarding python - pyspark: "too many values" error after repartitioning, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33837408/
