I'm hitting an exception when trying to read files from S3 with Spark. The error and code are below. The folder consists of many files named part-00000, part-00001, and so on, produced by Hadoop; their sizes range from 0 KB to a few GB.
16/04/07 15:38:58 INFO NativeS3FileSystem: Opening key 'titlematching214/1.0/bypublicdemand/part-00000' for reading at position '0'
16/04/07 15:38:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/titlematching214%2F1.0%2Fbypublicdemand%2Fpart-00000' XML Error Message:
InvalidRange
The requested range is not satisfiable
bytes=0-0
1AED523DF401F17ECBYUH1h3WkC7/g8/EFE/YyHbzxoNTpRBiX6QMy2RXHur17lYTZXd7XxOWivmqIpu0F7Xx5zdWns=
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadMatches {
  def main(args: Array[String]): Unit = {
    val config = new SparkConf().setAppName("RunAll").setMaster("local")
    val sc = new SparkContext(config)

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem")
    hadoopConf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem")
    hadoopConf.set("fs.s3n.awsAccessKeyId", "myRealKeyId")
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "realKey")

    val sqlConext = new SQLContext(sc)
    val datset = sc.textFile("s3n://altvissparkoutput/titlematching214/1.0/*/*")
    val ebayRaw = sqlConext.read.json(datset)
    val data = ebayRaw.first()
  }
}
Best Answer
Perhaps you can read your dataset directly from S3.
val datset = "s3n://altvissparkoutput/titlematching214/1.0/*/*"
val ebayRaw = sqlConext.read.json(datset)
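Putting the answer's change into the full program, the idea is to pass the S3 glob path straight to `read.json` and let the JSON data source handle the listing and splits, instead of first materializing an `RDD[String]` with `sc.textFile`. A minimal sketch under the same assumptions as the question (Spark 1.x API, `s3n://` credentials set on the Hadoop configuration, same bucket path; the object name `ReadMatchesFixed` is made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadMatchesFixed {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RunAll").setMaster("local"))
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "myRealKeyId")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "realKey")
    val sqlConext = new SQLContext(sc)

    // Hand the glob path directly to the JSON reader rather than
    // going through sc.textFile(...).
    val datset = "s3n://altvissparkoutput/titlematching214/1.0/*/*"
    val ebayRaw = sqlConext.read.json(datset)
    println(ebayRaw.first())
  }
}
```

Note that this is cluster/credential-dependent and cannot run outside a Spark environment; the relevant change is only the two `datset`/`ebayRaw` lines, which mirror the answer's snippet.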
Regarding "scala - Error reading an S3 bucket", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36479624/