hadoop - copying a file from s3:// to the local file system

Tags: hadoop amazon-web-services amazon-s3 apache-pig hdfs

I'm new to AWS. I created a cluster and connected to the master node via SSH. When I try to copy a file from s3://my-bucket-name/ to the local file://home/hadoop folder in Pig, using:

cp s3://my-bucket-name/path/to/file file://home/hadoop

I get the error:

2013-06-08 18:59:00,267 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

I can't even get into my s3 bucket. I set AWS_ACCESS_KEY and AWS_SECRET_KEY with no success. I also couldn't find Pig's configuration file to set the appropriate fields.

Any help?

EDIT: I also tried loading the file in Pig with the full s3n:// URI:

grunt> raw_logs = LOAD 's3://XXXXX/input/access_log_1' USING TextLoader as (line:chararray);
grunt> illustrate raw_logs;

I get the following error:

2013-06-08 19:28:33,342 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-06-08 19:28:33,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-08 19:28:33,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2013-06-08 19:28:33,405 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2013-06-08 19:28:33,405 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-08 19:28:33,429 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-08 19:28:33,430 [main] ERROR org.apache.pig.pen.ExampleGenerator - Error reading data. Internal error creating job configuration.
java.lang.RuntimeException: Internal error creating job configuration.
        at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:160)
        at org.apache.pig.PigServer.getExamples(PigServer.java:1244)
        at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:722)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:500)
        at org.apache.pig.Main.main(Main.java:114)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
2013-06-08 19:28:33,432 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception : Internal error creating job configuration.
Details at logfile: /home/hadoop/pig_1370719069857.log

Best Answer

First, you should use the s3n protocol (unless you stored the files on S3 using the s3 block protocol): s3 is for block storage (i.e. similar to HDFS, just backed by S3), while s3n is the native S3 file system (i.e. you get exactly the objects you see there).

You can use distcp or a simple Pig load from s3n. You can provide the access key and secret in hadoop-site.xml, as the exception indicates (see here for more information: http://wiki.apache.org/hadoop/AmazonS3), or you can add them to the URI:

raw_logs = LOAD 's3n://access:secret@XXXXX/input/access_log_1' USING TextLoader AS (line:chararray);
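If you prefer the configuration-file route instead, a minimal sketch of the entries might look like the following. The property names are the ones referenced in the exception and on the Hadoop wiki page above, with the fs.s3n.* variants used for s3n; the values are placeholders for your own credentials:

<!-- placeholder credentials - substitute your own -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>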

Make sure your secret key does not contain a backslash - otherwise it won't work.
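To come back to the original goal of pulling the file down to the local file system on the master node, one option is a plain copyToLocal from the node's shell rather than from grunt. This is a sketch under the assumptions above: the bucket and paths are placeholders, and the credentials can come either from the URI or from the configuration:

hadoop fs -copyToLocal s3n://access:secret@my-bucket-name/path/to/file /home/hadoop/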

Regarding "hadoop - copying a file from s3:// to the local file system", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/17002866/
