python - Spark FileAlreadyExistsException on stage failure

Tags: python dataframe apache-spark amazon-s3 pyspark

I am trying to write a DataFrame to an S3 location after repartitioning it. However, whenever the write stage fails and Spark retries the stage, it throws a FileAlreadyExistsException.

When I resubmit the job, it works fine as long as Spark completes the stage in a single attempt.

Below is my code block:

df.repartition(<some-value>).write.format("orc").option("compression", "zlib").mode("Overwrite").save(path)

I believe Spark should delete the files left by the failed stage before retrying. I know this would be resolved by setting retries to zero, but Spark stages are expected to fail occasionally, so that is not a proper solution.
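For reference, here is a minimal sketch of the retries-to-zero workaround mentioned above, assuming the setting is applied when the SparkSession is built (spark.task.maxFailures limits task retry attempts; this is illustrative only, not the fix being asked for):

from pyspark.sql import SparkSession

# Sketch only: spark.task.maxFailures=1 disables task retries, which avoids
# the retry-time FileAlreadyExistsException but fails the job on any
# transient task error, so it is not a real fix.
spark = (
    SparkSession.builder
    .appName("no-task-retries")
    .config("spark.task.maxFailures", "1")
    .getOrCreate()
)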

The error is as follows:

Job aborted due to stage failure: Task 0 in stage 6.1 failed 4 times, most recent failure: Lost task 0.3 in stage 6.1 (TID 740, ip-address, executor 170): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://<bucket-name>/<path-to-object>/part-00000-c3c40a57-7a50-41da-9ce2-555753cab63a-c000.zlib.orc
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:242)
    at org.apache.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
    at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:170)
    at org.apache.orc.OrcFile.createWriter(OrcFile.java:843)
    at org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter(OrcOutputFormat.java:50)
    at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:43)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:121)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

I am using Spark 2.4 on EMR. Please suggest a solution.

Edit 1: Please note that this issue is not about the overwrite mode; I am already using it. As the question title suggests, the problem is the files left over when a stage fails. The attached Spark UI screenshot may make this clearer. [Spark UI screenshot]

Best Answer

Set spark.hadoop.orc.overwrite.output.file=true in your Spark configuration.

You can find more details about this configuration here - OrcConf.java
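A minimal sketch of how this might look in PySpark, reusing the question's write snippet; the input path, partition count, and the choice to set the flag via SparkSession.builder (rather than on spark-submit) are assumptions for illustration:

from pyspark.sql import SparkSession

# Sketch: enable ORC's overwrite-on-create so a retried task attempt can
# replace the partial part-file left behind by the failed attempt.
spark = (
    SparkSession.builder
    .appName("orc-write-overwrite-on-retry")
    .config("spark.hadoop.orc.overwrite.output.file", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3://<bucket-name>/<input-path>/")  # placeholder input

(
    df.repartition(200)  # placeholder partition count
    .write.format("orc")
    .option("compression", "zlib")
    .mode("overwrite")
    .save("s3://<bucket-name>/<path-to-object>/")
)

The same setting can also be passed at submit time, e.g. spark-submit --conf spark.hadoop.orc.overwrite.output.file=true.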

Regarding python - Spark FileAlreadyExistsException on stage failure, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57471781/
