scala - FileNotFound error in Spark

Tags: scala hadoop apache-spark hdfs

I am running a simple Spark program on a cluster:

import org.apache.spark.{SparkConf, SparkContext}

val logFile = "/home/hduser/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()

println()
println()
println()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
println()
println()
println() 
println()
println()

I get the following error:

 15/10/27 19:44:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 6]
 15/10/27 19:44:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
 15/10/27 19:44:01 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 7]
 15/10/27 19:44:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
 15/10/27 19:44:01 INFO TaskSchedulerImpl: Cancelling stage 0
 15/10/27 19:44:01 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:55) failed in 7.636 s
 15/10/27 19:44:01 INFO DAGScheduler: Job 0 failed: count at SimpleApp.scala:55, took 7.810387 s
 Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.19): java.io.FileNotFoundException: File file:/home/hduser/README.md does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The file is in the right place. If I replace README.md with README.txt, it works fine. Can someone help?

Thanks

Best Answer

If you are running a multi-node cluster, make sure every node has the file at the same path on its own file system. Or, you know, just use HDFS.

In the multi-node case, the path "/home/hduser/README.md" is shipped to the worker nodes as well, but README.md itself probably exists only on the master. When the workers try to read the file, they do not look at the master's file system; each one looks for it on its own local file system. If the same file exists at the same path on every node, the code will most likely work. To achieve that, copy the file to the same path on every node, as sketched below.
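A minimal sketch of the copy-to-every-node approach, reusing the sc from the question and assuming the file has already been copied (e.g. with scp) to /home/hduser/README.md on every node; the explicit file:// scheme just makes it obvious that each executor reads its own local disk:

// assumes README.md exists at this exact path on every node in the cluster
val logFile = "file:///home/hduser/README.md"
val logData = sc.textFile(logFile).cache()
println(s"lines: ${logData.count()}")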

As you have noticed, the solution above is cumbersome. Hadoop's file systems, HDFS among them, solve this problem (and more); you should look into them.
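A minimal sketch of the HDFS approach, assuming a working HDFS cluster; the upload command and the /user/hduser/README.md target path are illustrative, not taken from the question:

// upload once from the master, e.g.:
//   hdfs dfs -put /home/hduser/README.md /user/hduser/README.md
// every executor then reads the same copy through the hdfs:// scheme
val logFile = "hdfs:///user/hduser/README.md"
val logData = sc.textFile(logFile).cache()
println(s"lines: ${logData.count()}")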

Regarding scala - FileNotFound error in Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33375451/
