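My MapReduce (MR) program behaves strangely and I can't figure out why. Can anyone give me a hint about what is going wrong?
This is what my Mapper function looks like: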
Integer Click_ID = 0;

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
{
    String line = value.toString();
    String[] lineArr = line.split("\t");
    String nm_uv_id = lineArr[0];
    String session_id = lineArr[1];
    String time_stamp = lineArr[2];
    String click_counter = lineArr[3];
    String is_robot = lineArr[4];
    Click_ID++;
    String full_line = Click_ID + "\t" + nm_uv_id + "\t" + session_id + "\t" + time_stamp + "\t" + click_counter + "\t" + is_robot;
    context.write(new Text(session_id), new Text(full_line));
}
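So far everything works: when I set the number of reducers to 0, my mapper produces the expected output.
This is what my Reducer looks like. What I want to do is iterate twice over the values of each key. To do that, I tried caching every value of the Iterable in a separate ArrayList: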
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<Text> cache = new ArrayList<Text>();
    // first iteration: cache the values
    for (Text value : values) {
        cache.add(value);
    }
    // second iteration: write out the cached values
    for (Text entity : cache) {
        context.write(key, entity);
    }
}
}
The input I use for the MR job looks like this:
nm_uv_id_1 session_id_2 1234567891 1 is_robot_no
nm_uv_id_1 session_id_2 1234567892 2 is_robot_no
nm_uv_id_1 session_id_2 1234567893 3 is_robot_no
nm_uv_id_1 session_id_2 1234567894 3 is_robot_no
nm_uv_id_1 session_id_1 1234567895 1 is_robot_no
nm_uv_id_1 session_id_1 1234567896 2 is_robot_no
nm_uv_id_1 session_id_1 1234567897 3 is_robot_no
nm_uv_id_1 session_id_1 1234567898 4 is_robot_no
nm_uv_id_1 session_id_1 1234567899 5 is_robot_no
nm_uv_id_1 session_id_1 1234567888 6 is_robot_no
nm_uv_id_1 session_id_1 1234567890 7 is_robot_no
nm_uv_id_1 session_id_1 1234567890 8 is_robot_no
nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
nm_uv_id_1 session_id_1 1234567890 10 is_robot_no
nm_uv_id_1 session_id_3 1234567890 1 is_robot_no
nm_uv_id_2 session_id_4 1234587890 1 is_robot_no
nm_uv_id_2 session_id_4 1234587890 2 is_robot_no
nm_uv_id_2 session_id_4 1234587890 3 is_robot_no
nm_uv_id_2 session_id_4 1234587890 4 is_robot_no
nm_uv_id_2 session_id_4 1234587890 5 is_robot_no
nm_uv_id_2 session_id_4 1234587890 6 is_robot_no
nm_uv_id_2 session_id_4 1234587890 7 is_robot_no
nm_uv_id_2 session_id_4 1234587890 8 is_robot_no
nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
nm_uv_id_2 session_id_5 1234587890 1 is_robot_no
nm_uv_id_2 session_id_5 1234587890 2 is_robot_no
nm_uv_id_2 session_id_5 1234587890 3 is_robot_yes
nm_uv_id_2 session_id_5 1234587890 4 is_robot_yes
nm_uv_id_2 session_id_5 1234587890 5 is_robot_no
nm_uv_id_2 session_id_5 123457890 6 is_robot_no
But my output file looks like this:
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_1 13 nm_uv_id_1 session_id_1 1234567890 9 is_robot_no
session_id_2 2 nm_uv_id_1 session_id_2 1234567892 2 is_robot_no
session_id_2 2 nm_uv_id_1 session_id_2 1234567892 2 is_robot_no
session_id_2 2 nm_uv_id_1 session_id_2 1234567892 2 is_robot_no
session_id_2 2 nm_uv_id_1 session_id_2 1234567892 2 is_robot_no
session_id_3 15 nm_uv_id_1 session_id_3 1234567890 1 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_4 24 nm_uv_id_2 session_id_4 1234587890 9 is_robot_no
session_id_5 30 nm_uv_id_2 session_id_5 123457890 6 is_robot_no
session_id_5 30 nm_uv_id_2 session_id_5 123457890 6 is_robot_no
session_id_5 30 nm_uv_id_2 session_id_5 123457890 6 is_robot_no
session_id_5 30 nm_uv_id_2 session_id_5 123457890 6 is_robot_no
session_id_5 30 nm_uv_id_2 session_id_5 123457890 6 is_robot_no
session_id_5 30 nm_uv_id_2 session_id_5 123457890 6 is_robot_no
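I don't understand why the reducer always writes the same key-value pair for a given key. I tried a few things, and the first for loop (the one where I do the caching) seems to work fine: when I call context.write(key, value) inside it, I get the expected output. But when I use the cache in the second for loop, the program writes the strange output shown above.
Can anyone help?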
Best answer
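Hadoop reuses the same Text instance (and its underlying buffer) for every value as an optimization. Your cache therefore holds many references to one object, and that object contains whatever value was read last. You need to copy each value yourself before caching it.
I would just change your caching loop to: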
for (Text value : values) { cache.add(new Text(value)); }
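To see the effect without a Hadoop cluster, here is a minimal plain-Java sketch of the same mechanism. It does not use Hadoop: `MutableValue` and `reusingIterable` are hypothetical stand-ins for `Text` and for the `values` Iterable passed to `reduce`, written only to imitate the object reuse described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ReuseDemo {

    /** Hypothetical stand-in for Hadoop's Text: a mutable holder overwritten in place. */
    static final class MutableValue {
        private String content = "";

        MutableValue() { }

        /** Copy constructor, playing the role of new Text(value). */
        MutableValue(MutableValue other) { this.content = other.content; }

        void set(String s) { this.content = s; }

        @Override
        public String toString() { return content; }
    }

    /** Hypothetical iterable that returns the SAME object on every step, refilled with the next record. */
    static Iterable<MutableValue> reusingIterable(final List<String> records) {
        final MutableValue shared = new MutableValue();
        return () -> new Iterator<MutableValue>() {
            private int next = 0;

            @Override
            public boolean hasNext() { return next < records.size(); }

            @Override
            public MutableValue next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.set(records.get(next++)); // overwrite the shared object
                return shared;                   // hand out the same reference again
            }
        };
    }

    /** Mirrors the reducer in the question: caches references to the reused object. */
    static List<String> cacheByReference(List<String> records) {
        List<MutableValue> cache = new ArrayList<>();
        for (MutableValue value : reusingIterable(records)) {
            cache.add(value);
        }
        return render(cache);
    }

    /** Mirrors the fix: caches an independent copy of each value. */
    static List<String> cacheByCopy(List<String> records) {
        List<MutableValue> cache = new ArrayList<>();
        for (MutableValue value : reusingIterable(records)) {
            cache.add(new MutableValue(value));
        }
        return render(cache);
    }

    /** Second pass over the cache, converting each entry to a String. */
    private static List<String> render(List<MutableValue> cache) {
        List<String> out = new ArrayList<>();
        for (MutableValue value : cache) {
            out.add(value.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("click 1", "click 2", "click 3");
        System.out.println(cacheByReference(records)); // [click 3, click 3, click 3]
        System.out.println(cacheByCopy(records));      // [click 1, click 2, click 3]
    }
}
```

Caching by reference repeats the last value the iterator produced, which matches the pattern in the output file above; caching a copy keeps each record. In the real reducer, storing `value.toString()` in a `List<String>` would presumably work as well, since a String is an immutable snapshot of the contents.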
Source: "java - Caching an iterable in an ArrayList to iterate twice in a reducer does not work", a question on Stack Overflow: https://stackoverflow.com/questions/23859699/