I have two very large datasets (tables) on HDFS. I want to join them on certain columns, then group by certain columns, then perform some aggregate functions on certain columns.
我的步骤是:
1- Create two jobs.
2- In the first job's mapper, read the rows of each dataset as map input values and emit the join columns' values as the map output key and the remaining columns' values as the map output value.
After mapping, the MapReduce framework performs shuffling and groups all the map output values according to map output keys.
Then, in the reducer, it reads each map output key and its values, which may include many rows from both datasets.
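The map-plus-shuffle step described above can be sketched as a minimal in-memory simulation in plain Java (not actual Hadoop API code; the comma-separated row format and the "A:"/"B:" source tags are assumptions for illustration):

```java
import java.util.*;

public class JoinShuffleSketch {
    // Simulates the map phase plus shuffle: for each row, emit
    // (joinKey, taggedRemainder) and group by join key.
    // Tagging values with "A:"/"B:" records which dataset each value
    // came from, so the reducer can later separate the two sides.
    static Map<String, List<String>> mapAndShuffle(List<String> datasetA,
                                                   List<String> datasetB) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String row : datasetA) {
            String[] cols = row.split(",", 2);   // column 0 = join key
            grouped.computeIfAbsent(cols[0], k -> new ArrayList<>())
                   .add("A:" + cols[1]);
        }
        for (String row : datasetB) {
            String[] cols = row.split(",", 2);
            grouped.computeIfAbsent(cols[0], k -> new ArrayList<>())
                   .add("B:" + cols[1]);
        }
        return grouped;
    }
}
```

In a real job the tagging would happen in the Mapper and the grouping would be done by the framework's shuffle; this sketch only shows the data movement.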
What I want is to iterate through the reduce input values many times so that I can perform a Cartesian product.
To illustrate:
Let's say for a join key x, I have 100 matches from one dataset and 200 matches from the other. Joining them on key x therefore produces 100*200 = 20000 combinations. I want to emit NullWritable as the reduce output key and each Cartesian product row as the reduce output value.
An example output might be:
for join key x:
From (nullWritable),(first(1),second(1))
Over (nullWritable),(first(1),second(200))
To (nullWritable),(first(100),second(200))
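The Cartesian product asked for above looks like this in plain Java (an in-memory sketch; in a real reducer the values arrive as a single forward-only iterator, which is exactly the obstacle the question raises):

```java
import java.util.*;

public class CartesianSketch {
    // Emits every pairing of a left-side value with a right-side value,
    // i.e. left.size() * right.size() combinations for one join key.
    static List<String> cartesian(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add("(" + l + "," + r + ")");
            }
        }
        return out;
    }
}
```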
How can I do that?
I can iterate only once, and I cannot cache the values because they don't fit into memory.
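One common workaround for the single-pass restriction, sketched here under two assumptions not stated in the question (values are tagged with their source dataset, and a secondary sort makes dataset A's values arrive before dataset B's for each key), is to buffer only side A and stream side B against it, so only one side needs to fit in memory:

```java
import java.util.*;

public class StreamedJoinSketch {
    // Single pass over the reduce input values: buffer the "A:"-tagged
    // side (assumed to arrive first, e.g. via secondary sort) and stream
    // the "B:"-tagged side against it. Only side A is held in memory.
    static List<String> joinOnePass(Iterator<String> values) {
        List<String> bufferedA = new ArrayList<>();
        List<String> out = new ArrayList<>();
        while (values.hasNext()) {
            String v = values.next();
            if (v.startsWith("A:")) {
                bufferedA.add(v.substring(2));
            } else {
                // "B:" value: pair it with every buffered A value
                String b = v.substring(2);
                for (String a : bufferedA) {
                    out.add("(" + a + "," + b + ")");
                }
            }
        }
        return out;
    }
}
```

If neither side fits in memory, the usual fallback is to spill one side to the task's local disk and re-read it per streamed value, at a heavy I/O cost.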
3- If I do that, I will start the second job, which takes the first job's result file as its input file. In the mapper, I emit the group columns' values as the map output key and the remaining columns' values as the map output value. Then, in the reducer, by iterating through each key's values, I perform some functions on some columns, like sum, avg, max, min.
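The second job's reduce-side aggregation can be sketched the same way (plain Java rather than the Hadoop API; it computes the named functions over one group key's numeric values, with the single-column layout an assumption):

```java
import java.util.*;

public class GroupAggSketch {
    // For one group key, compute sum/avg/max/min over its numeric
    // values, mirroring what the second job's reducer would do per key.
    static Map<String, Double> aggregate(List<Double> values) {
        double sum = 0;
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            sum += v;
            max = Math.max(max, v);
            min = Math.min(min, v);
        }
        Map<String, Double> out = new LinkedHashMap<>();
        out.put("sum", sum);
        out.put("avg", sum / values.size());
        out.put("max", max);
        out.put("min", min);
        return out;
    }
}
```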
Thanks in advance.
Best answer
Since your first MR job uses the join key as the map output key, your first reducer will receive (K join_key, List<values>) for each reduce call.
Regarding "java - iterating multiple times through Text input values in a MapReduce job's Reducer", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25587811/