hadoop - s3distcp not working for me on CDH4.5

Tags: hadoop

I am running CDH4.5. I was trying to use distcp to s3n, but it has been having problems since the upgrade to 4.5, so I am now trying to get s3distcp up and running and hitting issues. I downloaded it and am running this command:

hadoop jar /usr/lib/hadoop/lib/s3distcp.jar  --src hdfs://NN:8020/path/to/destination/folder --dest s3n://acceseKeyId:secretaccesskey@mybucket/destination/

but I get the following error:

INFO mapred.JobClient:  map 100% reduce 0%
INFO mapred.JobClient: Task Id : attempt_201312042223_10889_r_000001_0, Status : FAILED
Error: java.lang.ClassNotFoundException: com.amazonaws.services.s3.AmazonS3
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.executeDownloads(CopyFilesReducer.java:209)
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:196)
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.reduce(CopyFilesReducer.java:30)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:506)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax

INFO mapred.JobClient: Job Failed: NA
13/12/12 13:55:25 INFO s3distcp.S3DistCp: Try to recursively delete hdfs:/tmp/985ffdb0-1bc8-4d00-aba6-fd9b18e905f1/tempspace
Exception in thread "main" java.lang.RuntimeException: Error running job
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1388)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568)

I then put the access key ID and secret key into core-site.xml on all of the data nodes and name nodes:

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>bippitybopityboo</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>supercalifragilisticexpialadoscious</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>bippitybopityboo</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>supercalifragilisticexpialadoscious</value>
</property>

I still get the same error when I try this:

hadoop jar /usr/lib/hadoop/lib/s3distcp.jar  --src hdfs://NN:8020/path/to/destination/folder --dest s3n://mybucket/destination/

Is there some configuration I should be doing, or am I missing a jar file, or am I running it incorrectly?
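
For what it's worth, the ClassNotFoundException in the reduce task means the AWS SDK classes are not on the task classpath, so one thing that might help is shipping the SDK jar with the job via the generic -libjars option (s3distcp runs through ToolRunner, so generic options should be honored). This is only a rough, untested sketch; the SDK jar path and version below are guesses and need to point at wherever the jar actually lives on the cluster:

# HADOOP_CLASSPATH covers the client JVM; -libjars ships the jar to the map/reduce tasks
export HADOOP_CLASSPATH=/usr/lib/hadoop/lib/aws-java-sdk-1.7.4.jar
hadoop jar /usr/lib/hadoop/lib/s3distcp.jar \
  -libjars /usr/lib/hadoop/lib/aws-java-sdk-1.7.4.jar \
  --src hdfs://NN:8020/path/to/destination/folder \
  --dest s3n://mybucket/destination/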

Thanks for any help.

Best Answer

I ran into the same problem. The fix was to change from s3n to s3a:

# change from this
--dest s3n://acceseKeyId:secretaccesskey@mybucket/destination/
# to this
--dest s3a://acceseKeyId:secretaccesskey@mybucket/destination/
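
With s3a the credentials can also stay out of the URI and live in core-site.xml, mirroring the fs.s3n.* properties in the question. A minimal sketch, assuming a Hadoop build that ships the s3a filesystem (2.6 or later); the values are the same placeholders used above:

<property>
  <name>fs.s3a.access.key</name>
  <value>supercalifragilisticexpialadoscious</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>bippitybopityboo</value>
</property>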

See also: S3 Support in Apache Hadoop

s3 is deprecated and s3n is unmaintained, so go with s3a.

S3A is the recommended S3 client for Hadoop 2.7 and later

A successor to the S3 Native, s3n:// filesystem, the S3a: system uses Amazon's libraries to interact with S3. This allows S3a to support larger files (no more 5GB limit), higher performance operations and more. The filesystem is intended to be a replacement for/successor to S3 Native: all objects accessible from s3n:// URLs should also be accessible from s3a simply by replacing the URL schema.

Unmaintained: S3N FileSystem (URI scheme: s3n://)

S3N is the S3 Client for Hadoop 2.6 and earlier. From Hadoop 2.7+, switch to s3a

A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The S3N code is stable and widely used, but is not adding any new features (which is why it remains stable).

S3N requires a compatible version of the jets3t JAR on the classpath.

Since Hadoop 2.6, all work on S3 integration has been with S3A. S3N is not maintained except for security risks; this helps guarantee security. Most bug reports against S3N will be closed as WONTFIX with the text "use S3A". Please switch to S3A if you can, and do try it before filing bug reports against S3N.

(Deprecated) S3 Block FileSystem (URI scheme: s3://)
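
Putting it together, the working command with the s3a scheme and the credentials read from core-site.xml would look roughly like this (untested sketch; it assumes the cluster's Hadoop build actually includes the s3a filesystem):

hadoop jar /usr/lib/hadoop/lib/s3distcp.jar \
  --src hdfs://NN:8020/path/to/destination/folder \
  --dest s3a://mybucket/destination/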

Regarding "hadoop - s3distcp not working for me on CDH4.5", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20552541/
