java - Cascading join of two files is very slow

I am using Cascading to do a HashJoin of two 300 MB files. I run the following Cascading workflow:

// select the field which I need from the first file
Fields f1 = new Fields("id_1");
docPipe1 = new Each( docPipe1, scrubArguments, new ScrubFunction( f1 ), Fields.RESULTS );   

// select the fields which I need from the second file 
Fields f2 = new Fields("id_2","category");
docPipe2 = new Each( docPipe2, scrubArguments, new ScrubFunction( f2), Fields.RESULTS ); 

// hashJoin
Pipe tokenPipe = new HashJoin( docPipe1, new Fields("id_1"), 
                     docPipe2, new Fields("id_2"), new LeftJoin());

// count the number of each "category" based on the id_1 matching id_2
Pipe pipe = new Pipe(tokenPipe );
pipe = new GroupBy( pipe , new Fields("category"));
pipe = new Every( pipe, Fields.ALL, new Count(), Fields.ALL );

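I run this Cascading program on a Hadoop cluster with 3 data nodes, each with 8 GB of RAM and 4 cores (I set mapred.child.java.opts to 4096MB), but it takes about 30 minutes to get the final result. That seems too slow to me, yet I cannot see anything wrong with either my program or the cluster. How can I make this Cascading join faster?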

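Best answer

As stated in the Cascading User Guide:

HashJoin attempts to keep the entire right-hand stream in memory for rapid comparison (not just the current grouping, since HashJoin performs no grouping). A very large tuple stream on the right-hand side may therefore exceed the configurable spill-to-disk threshold, which reduces performance and can cause memory errors. For this reason, it is advisable to put the smaller stream on the right-hand side.

Alternatively,

use CoGroup, which might help.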
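
To make the memory behavior concrete, here is a minimal plain-Java sketch of a left hash join followed by a per-category count. It is not Cascading's implementation; the class name HashJoinSketch, the method countCategories, the sample data, and the "(none)" bucket for unmatched ids are all hypothetical and only for illustration. The point is that the whole right-hand side is loaded into a hash table before any left-hand record is processed, so with two inputs of roughly 300 MB each the in-memory side is not small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashJoinSketch {

    /**
     * Left hash join followed by a count per category.
     * leftIds is the streamed (left) side; rightRows holds {id, category}
     * pairs and is the side that is loaded fully into memory.
     */
    static Map<String, Long> countCategories(List<String> leftIds, List<String[]> rightRows) {
        // Build phase: the entire right side goes into a hash table,
        // so its size determines how much memory the join needs.
        Map<String, List<String>> rightById = new HashMap<>();
        for (String[] row : rightRows) {
            rightById.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        // Probe phase: the left side is read one record at a time.
        Map<String, Long> counts = new TreeMap<>();
        for (String id : leftIds) {
            List<String> categories = rightById.get(id);
            if (categories == null) {
                // Left join: unmatched left records are kept.
                // The "(none)" label is an assumption of this sketch.
                counts.merge("(none)", 1L, Long::sum);
            } else {
                for (String category : categories) {
                    counts.merge(category, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b", "c", "a");
        List<String[]> right = Arrays.asList(
            new String[] {"a", "books"},
            new String[] {"b", "music"},
            new String[] {"a", "music"});
        System.out.println(countCategories(left, right));
    }
}
```

In this sketch only rightRows is held in memory, which is why the guide recommends placing the smaller stream on the right. A CoGroup avoids that requirement by grouping both streams on the join key instead.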

Regarding "java - Cascading join of two files is very slow", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20433003/
