java - Iterating multiple times through Text input values in a MapReduce job's Reducer

Tags: java hadoop mapreduce reduce iterable

I have two very large datasets (tables) on HDFS. I want to join them on certain columns, then group by certain columns, and then apply some aggregate functions to certain columns.
My steps are:

1- Create two jobs.

2- In the first job, in the mapper, read the rows of each dataset as the map input value and emit the join columns' values as the map output key and the remaining columns' values as the map output value.
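A common way to make the reducer's work easier later is to tag each emitted value with which dataset it came from. A minimal sketch of that tagging logic in plain Java, outside the Hadoop API (the `A`/`B` tags, CSV format, and join-column index are illustrative assumptions, not from the question):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class JoinMapperSketch {
    // Turns one CSV row into a (joinKey, taggedValue) pair, mirroring what the
    // first job's mapper would emit. 'joinCol' is the index of the join column
    // and 'tag' marks the source dataset ("A" or "B") -- both are assumptions.
    static Map.Entry<String, String> toKeyValue(String row, int joinCol, String tag) {
        String[] cols = row.split(",");
        StringBuilder rest = new StringBuilder(tag);
        for (int i = 0; i < cols.length; i++) {
            if (i != joinCol) rest.append(',').append(cols[i]);
        }
        return new SimpleEntry<>(cols[joinCol], rest.toString());
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = toKeyValue("x,foo,42", 0, "A");
        System.out.println(kv.getKey() + " -> " + kv.getValue()); // x -> A,foo,42
    }
}
```

In a real job this logic would live inside `Mapper.map()`, with `Text` keys and values; the tag lets the reducer tell the two sources apart after the shuffle.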

After mapping, the MapReduce framework performs shuffling and groups all the map output values according to map output keys.

Then, in the reducer, it reads each map output key and its values, which may include many rows from both datasets.

What I want is to iterate through the reduce input values many times so that I can perform a cartesian product.

To illustrate:

Let's say for a join key x, I have 100 matches from one dataset and 200 matches from the other. It means joining them on join key x produces 100*200 = 20,000 combinations. I want to emit NullWritable as the reduce output key and each cartesian product as a reduce output value.

An example output might be:

for join key x:

From (nullWritable),(first(1),second(1))

Over (nullWritable),(first(1),second(200))

To (nullWritable),(first(100),second(200))

How can I do that?

I can iterate only once, and I cannot cache the values because they don't fit into memory.

3- If I can do that, I will start the second job, which takes the first job's result file as its input file. In the mapper, I emit the group columns' values as the map output key and the remaining columns' values as the map output value. Then, in the reducer, by iterating through each key's values, I perform functions such as sum, avg, max, and min on some columns.
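The second job's reducer only needs a single pass over each group's values, which fits the one-shot `Iterable` Hadoop provides. A sketch of that per-group aggregation in plain Java, with an in-memory list standing in for the reducer's `Iterable<Text>` (column extraction is omitted; the inputs here are assumed to be the already-parsed numeric values of the aggregated column):

```java
import java.util.Arrays;
import java.util.List;

public class GroupAggSketch {
    // Computes sum, avg, max, and min over one group's values in a single
    // pass, mirroring what the second job's reducer would do while iterating
    // its Iterable exactly once.
    static double[] aggregate(List<Double> values) {
        double sum = 0;
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            sum += v;
            if (v > max) max = v;
            if (v < min) min = v;
        }
        return new double[] { sum, sum / values.size(), max, min };
    }

    public static void main(String[] args) {
        double[] r = aggregate(Arrays.asList(3.0, 1.0, 2.0));
        System.out.println(Arrays.toString(r)); // [6.0, 2.0, 3.0, 1.0]
    }
}
```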


Thanks a lot.

Best Answer

Since your first MR job uses the join key as the map output key, your first reducer will receive (K join_key, List values) for each reduce call. What you can do is split the values into two separate lists, one per data source, and then perform the cartesian product with nested for loops.
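The split-then-nested-loops approach can be sketched in plain Java, with an in-memory list standing in for the reducer's `Iterable<Text>`. The `A,`/`B,` prefixes assume each value was tagged with its source dataset in the mapper (an illustrative convention, not from the answer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CartesianReducerSketch {
    // For one join key: split the tagged values into two lists by source,
    // then emit every pairing with nested loops (the cartesian product).
    static List<String> join(List<String> taggedValues) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String v : taggedValues) {
            // "A,"/"B," prefixes mark the source dataset (assumed convention).
            if (v.startsWith("A,")) left.add(v.substring(2));
            else right.add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                // In the real reducer, each pairing would be written with a
                // NullWritable key and this string as the output value.
                out.add(l + "," + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = join(Arrays.asList("A,foo", "A,bar", "B,1", "B,2"));
        System.out.println(result); // [foo,1, foo,2, bar,1, bar,2]
    }
}
```

Note that this buffers both sides in memory; if the values for a hot key don't fit, only one side actually needs to be held in a list while the other streams from the iterator, so buffering the smaller side reduces the memory footprint.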

Regarding java - iterating multiple times through Text input values in a MapReduce job's Reducer, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25587811/
