java - Comparing two files in Hadoop MapReduce

Hi, I'm new to Hadoop and MapReduce, and I'd like to know whether the following is possible.
I'm trying to compare two files using MapReduce.
The first file might look like this:

t1 r1
t2 r2
t1 r4

The second file would look like this:

u1 t1 r1
u2 t2 r3
u3 t2 r2
u4 t1 r1

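I want the job to emit u1, u3, and u4, since those are the rows of the second file whose t/r pair also appears in the first file. The second file will be much larger than the first. I'm not sure how to go about comparing the files. Can this be done in a single MapReduce job? I'm willing to chain MapReduce jobs if necessary.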

Best answer

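You can do a map-side join: put the first (smaller) file in the distributed cache, load it in each mapper's setup(), and then iterate over the second file in the map phase.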

How to read from the distributed cache:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    // Local paths of every file registered in the distributed cache
    Path[] fileList = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (fileList == null)
    {
        return; // nothing was added to the cache
    }
    for (Path cachedFile : fileList)
    {
        // Pick out the lookup file by its name
        if (cachedFile.getName().trim().equals("mapmainfile.dat"))
        {
            fetchvalue(cachedFile, context);
        }
    }
}

public void fetchvalue(Path realfile, Context context) throws NumberFormatException, IOException
{
    // try-with-resources closes the reader even if parsing fails
    try (BufferedReader buff = new BufferedReader(new FileReader(realfile.toString())))
    {
        // some operations with the file
    }
}

How to add the file to the distributed cache:

DistributedCache.addCacheFile(new URI("/user/hduser/test/mapmainfile.dat"), conf);
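
To make the matching step concrete, here is a minimal sketch of the lookup logic in plain Java, with no Hadoop dependency so it can be run on its own. It assumes whitespace-separated fields as in the sample data; the class name MapSideJoinSketch and the methods loadPairs and matchingUsers are hypothetical names chosen for illustration, not part of any Hadoop API. In a real job, loadPairs would roughly correspond to what setup() does with the cached file, and the loop body of matchingUsers to what map() does for each line of the large file.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MapSideJoinSketch {

    // Builds the lookup set from the small file: one "t r" pair per line.
    // In a mapper this would run once, inside setup().
    public static Set<String> loadPairs(List<String> smallFileLines) {
        Set<String> pairs = new HashSet<>();
        for (String line : smallFileLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 2) {
                pairs.add(fields[0] + "\t" + fields[1]);
            }
        }
        return pairs;
    }

    // Returns the user ids from the large file whose (t, r) pair is in the set.
    // In a mapper, the loop body would be the map() call for a single line.
    public static List<String> matchingUsers(Set<String> pairs, List<String> largeFileLines) {
        List<String> users = new ArrayList<>();
        for (String line : largeFileLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 3 && pairs.contains(fields[1] + "\t" + fields[2])) {
                users.add(fields[0]);
            }
        }
        return users;
    }

    public static void main(String[] args) {
        // Sample data from the question
        Set<String> pairs = loadPairs(List.of("t1 r1", "t2 r2", "t1 r4"));
        List<String> users = matchingUsers(pairs,
                List.of("u1 t1 r1", "u2 t2 r3", "u3 t2 r2", "u4 t1 r1"));
        System.out.println(users); // prints [u1, u3, u4]
    }
}
```

Because every mapper holds the whole lookup set in memory, this approach only suits the case described in the question, where the first file is small enough to fit in each mapper's heap. No reducer is needed for the match itself, so the job can run map-only.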

Source: this question, "java - Comparing two files in Hadoop MapReduce", comes from Stack Overflow: https://stackoverflow.com/questions/32985439/
