hadoop - Spark cannot read a local file

Tags: hadoop apache-spark file-permissions emr

I have a local file on all the Spark nodes of an EMR cluster, with the following permissions:

-rw-rw---- 1 test_user test_group 30 Jun 21 14:20 /tmp/foo_test

I am running the cluster as ec2-user with the YARN scheduler. So that Spark/YARN can access the file, I added test_group as a secondary group of the yarn user on all nodes:
$ sudo -u yarn groups
  yarn hadoop test_group
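
For reference, a minimal sketch of how the secondary group can be added and verified on each node; the user and group names are taken from the question, and note that processes already running (such as the NodeManager) may need a restart before the new membership takes effect:

$ sudo usermod -a -G test_group yarn   # append (-a) test_group to the yarn user's supplementary groups
$ sudo -u yarn groups                  # verify: the output should now include test_group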

In spark-shell, I get the following error when reading the file:
scala> val rdd = sc.textFile("file:///tmp/foo_test")
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0:
java.io.FileNotFoundException: /tmp/foo_test (Permission denied)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:111)
    at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:207)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:141)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:771)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

How can I read a file that has group-level permissions in Spark on EMR?

Best Answer

Change the permissions to 770, i.e. chmod -R 770 /tmp, and try again.
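
A minimal sketch of the suggested fix, scoped here to the single file from the question rather than recursively to all of /tmp; the sudo check simply confirms that the yarn user, which runs the executor processes, can actually open the file:

$ sudo chmod 770 /tmp/foo_test     # owner and group get full access, as the answer suggests
$ sudo -u yarn cat /tmp/foo_test   # sanity check: the yarn user should now be able to read the file

After changing the permissions on every node, re-running sc.textFile("file:///tmp/foo_test") in spark-shell should no longer fail with Permission denied, provided the file exists on each executor node.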

Regarding "hadoop - Spark cannot read a local file", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37947403/
