java - Writing Hadoop MapReduce output to 2 flat files

I have a MapReduce job that takes in a number of news articles and outputs the following key-value pairs.

.
.
.
<article_id, social_tag.name, social_tag.isCompany, social_tag.code>
<article_id2, social_tag2.name, social_tag2.isCompany, social_tag.code>
<article_id, topic_code.name, topic_code.isCompany, topic_code.rcsCode>
<article_id3, social_tag3.name, social_tag3.isCompany, social_tag.code>
<article_id2, topic_code2.name, topic_code2.isCompany, topic_code2.rcsCode>
.
.
.

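As you can see, I am currently emitting two distinct kinds of rows, and right now they are mixed together in the flat files the MapReduce job produces. Is there any way I can simply write the social_tags to file1 and the topic_codes to file2, or write the social_tags to one designated group of files (social1.txt, social2.txt, etc.) and the topic_codes to another group (topic1.txt, topic2.txt, etc.)?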

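The reason I ask is so that I can easily load all of this into Hive tables later. Ideally I would create a separate table for each data type (topic_code, social_tag, etc.), so having the output already separated would help.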

Thanks in advance!

Best answer

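You can use MultipleOutputs, as already suggested. But since you asked whether there is a simple way, there is one that does not require separating the MapReduce output into different files at all. It is a quick approach, provided the data volume is not very large and the logic for telling the two kinds of records apart is not too complex.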

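First, load the mixed output files into a single Hive table (say, main_table). Then create two separate tables (topic_code and social_tag) and insert into each one the rows from the main table, filtered with a where clause.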

    hive> insert into table topic_code
        > select * from main_table
        > where $condition;

    -- $condition = the logic you would use to differentiate the records in the MR job
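Whichever route you take, the core of it is the same decision: given one record, which of the two outputs does it belong to? Below is a minimal plain-Java sketch of that decision, not the asker's actual logic. It assumes, purely for illustration, that each record is comma-separated and carries a type marker (`social_tag` or `topic_code`) as its second field; the rows in the question do not show such a field, and the `RecordRouter` class, its method names, and the sample rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RecordRouter {

    // Hypothetical output names. With MultipleOutputs these would be the
    // named outputs registered on the job (they must be alphanumeric).
    static final String SOCIAL = "social";
    static final String TOPIC = "topic";

    /**
     * Picks the output for a single record.
     * ASSUMPTION: the record is comma-separated and its second field is a
     * type marker, "social_tag" or "topic_code". Adapt this test to whatever
     * really distinguishes the two kinds of rows in your data.
     */
    public static String outputFor(String record) {
        String[] fields = record.split(",", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Malformed record: " + record);
        }
        String type = fields[1].trim();
        switch (type) {
            case "social_tag":
                return SOCIAL;
            case "topic_code":
                return TOPIC;
            default:
                throw new IllegalArgumentException("Unknown record type: " + type);
        }
    }

    /** Groups records by output name, keeping input order within each group. */
    public static Map<String, List<String>> route(List<String> records) {
        Map<String, List<String>> byOutput = new LinkedHashMap<>();
        for (String record : records) {
            byOutput.computeIfAbsent(outputFor(record), k -> new ArrayList<>()).add(record);
        }
        return byOutput;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the assumed layout:
        // article id, type marker, name, isCompany, code
        List<String> mixed = Arrays.asList(
                "article_id,social_tag,name1,true,code1",
                "article_id2,social_tag,name2,false,code2",
                "article_id,topic_code,name3,false,rcsCode1");

        for (Map.Entry<String, List<String>> entry : route(mixed).entrySet()) {
            System.out.println(entry.getKey() + " <- " + entry.getValue());
        }
    }
}
```

In a reducer, the name returned by `outputFor` could be passed as the first argument to `MultipleOutputs.write(namedOutput, key, value)`, after registering both names on the job with `MultipleOutputs.addNamedOutput(...)`; the output files are then prefixed with the named output (for example `social-r-00000`). In the Hive approach, the same test becomes the `where` clause.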

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/17184198/
