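The process copies files from one HDFS location to another location in the SAME cluster. This works fine, but hadoop fs -cp takes a long time. Can it be replaced with distcp when both paths are in the same cluster? Or is there a better solution to improve performance?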
Best answer
According to the documentation, distcp can copy data within a single cluster as well as between clusters:
https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
DistCp Version 2 (distributed copy) is a tool used for large inter/intra-cluster copying. (...) The most common invocation of DistCp is an inter-cluster copy:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each NodeManager from nn1 to nn2.
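For the same-cluster case from the question, a minimal sketch might look like the following. The NameNode address, the two paths, and the map count are placeholders I am assuming for illustration, not values from the question. Pointing both URIs at the same NameNode makes it an intra-cluster copy, and -m caps the number of simultaneous map tasks. The script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical values: replace with your own NameNode address and paths.
SRC="hdfs://nn1:8020/foo/bar"
DST="hdfs://nn1:8020/bar/foo"   # same NameNode as SRC, so an intra-cluster copy
MAPS=20                         # assumed cap on simultaneous copy (map) tasks

# Build the DistCp command line; -m limits the number of maps.
DISTCP_CMD="hadoop distcp -m $MAPS $SRC $DST"

# Default to a dry run so nothing is copied by accident.
DRY_RUN="${DRY_RUN:-1}"
if [ "$DRY_RUN" = "1" ]; then
    echo "$DISTCP_CMD"
else
    $DISTCP_CMD
fi
```

Whether this actually beats hadoop fs -cp depends on the data: DistCp runs as a MapReduce job, so it tends to pay off for many or large files and may be slower for a handful of small ones because of job startup overhead.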
Regarding replacing cp with DistCp in Hadoop, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47647717/