scala - Error when reading from an S3 bucket

Tags: scala hadoop amazon-web-services amazon-s3 apache-spark

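I get an exception when I try to read files from S3 with Spark. The error and the code are below. The folder contains many files named part-00000, part-00001, and so on, which were produced by Hadoop. Their sizes range from 0 KB to several GB.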

16/04/07 15:38:58 INFO NativeS3FileSystem: Opening key 'titlematching214/1.0/bypublicdemand/part-00000' for reading at position '0'
16/04/07 15:38:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/titlematching214%2F1.0%2Fbypublicdemand%2Fpart-00000' XML Error Message: InvalidRange The requested range is not satisfiable bytes=0-0 1AED523DF401F17E CBYUH1h3WkC7/g8/EFE/YyHbzxoNTpRBiX6QMy2RXHur17lYTZXd7XxOWivmqIpu0F7Xx5zdWns=

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadMatches {
  def main(args: Array[String]): Unit = {
    val config = new SparkConf().setAppName("RunAll").setMaster("local")
    val sc = new SparkContext(config)
    val hadoopConf = sc.hadoopConfiguration
    // Register the HDFS and local file system implementations explicitly
    hadoopConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem")
    hadoopConf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem")
    // Credentials for the s3n:// connector
    hadoopConf.set("fs.s3n.awsAccessKeyId", "myRealKeyId")
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "realKey")
    val sqlConext = new SQLContext(sc)

    // Read every part file as lines of text, then parse those lines as JSON
    val datset = sc.textFile("s3n://altvissparkoutput/titlematching214/1.0/*/*")
    val ebayRaw = sqlConext.read.json(datset)
    val data = ebayRaw.first()
  }
}

Best answer

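You may be able to read the dataset directly from S3 by passing the path to the JSON reader, instead of loading it with sc.textFile first.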

    val datset = "s3n://altvissparkoutput/titlematching214/1.0/*/*"
    val ebayRaw = sqlConext.read.json(datset)
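
For completeness, here is a minimal sketch of that approach as a standalone program. It assumes the same Spark 1.x-era API (SQLContext and the s3n:// connector) used in the question. The object name ReadMatchesDirect and the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are hypothetical choices for this illustration, not part of the original post, and nothing here has been verified to resolve the InvalidRange error.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name; the question's object is called ReadMatches.
object ReadMatchesDirect {
  // Glob taken from the question.
  val inputPath = "s3n://altvissparkoutput/titlematching214/1.0/*/*"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReadMatchesDirect").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Assumption: credentials come from environment variables.
      // sys.env(...) throws NoSuchElementException if a variable is unset.
      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

      val sqlContext = new SQLContext(sc)
      // Pass the path straight to the JSON reader, with no intermediate RDD[String]
      val ebayRaw = sqlContext.read.json(inputPath)
      ebayRaw.printSchema()
      println(ebayRaw.first())
    } finally {
      sc.stop() // release the local Spark context even if the read fails
    }
  }
}
```

Reading credentials from the environment keeps the keys out of the source file; hard-coding them as in the question works the same way as far as Spark is concerned.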

Regarding "scala - Error when reading from an S3 bucket", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36479624/
