hadoop - Counting word occurrences in each row using Pig

Tags: hadoop apache-pig

I have a set of tweets containing many different fields:

raw_tweets = LOAD 'input.tsv' USING PigStorage('\t') AS (tweet_id, text, 
in_reply_to_status_id, favorite_count, source, coordinates, entities, 
in_reply_to_screen_name, in_reply_to_user_id, retweet_count, is_retweet, 
retweet_of_id, user_id_id, lang, created_at, event_id_id, is_news);

I want to find the most commonly used words for each date. I managed to group the text by date:

r1 = FOREACH raw_tweets GENERATE SUBSTRING(created_at,0,10) AS a,
     REPLACE(LOWER(text),'([^a-z\\s]+)','') AS b;
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
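To make the Pig steps above easier to follow, here is a minimal Python sketch of what they do (the sample rows and variable names are illustrative, not from the original data): lowercase the text, strip non-letter characters, group by the date prefix of `created_at`, and flatten each group into one string.

```python
import re
from collections import defaultdict

# Hypothetical sample rows (created_at, text), mirroring the two fields the Pig script uses
rows = [
    ("2017-06-18 10:00:00", "The plants are GREEN! The dog is black."),
    ("2017-06-18 11:30:00", "There are words; this is..."),
    ("2017-06-19 09:15:00", "More words, and even more words."),
]

grouped = defaultdict(list)
for created_at, text in rows:
    date = created_at[:10]                          # SUBSTRING(created_at, 0, 10)
    clean = re.sub(r"[^a-z\s]+", "", text.lower())  # REPLACE(LOWER(text), '([^a-z\\s]+)', '')
    grouped[date].append(clean)                     # GROUP r1 BY a

# FLATTEN(BagToTuple(...)): one space-joined string of cleaned text per date
flattened = {d: " ".join(texts) for d, texts in grouped.items()}
print(flattened["2017-06-18"])
```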

It now looks like:

(date text text3)
(date2 text2)

I removed the special characters, so only "real" words appear in the text. Example:

2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here

I want the output to look like:

2017-06-18 the are is
2017-06-19 more words and

I don't really care how many times a word appeared; I just want to show the most common ones, and if two words appear the same number of times, either of them will do.

Best Answer

While I'm sure there is a way to do this entirely in Pig, it would probably be harder than necessary.

In my opinion, UDFs are the way to go here, and Python is just the option I'll show because it can be registered quickly in Pig.

For example,

input.tsv

2017-06-18  the plants are green the dog is black there are words this is
2017-06-19  more words and even more words another phrase begins here

py_udfs.py

from collections import Counter
from operator import itemgetter

@outputSchema("y:bag{t:tuple(word:chararray,count:int)}")
def word_count(sentence):
    ''' Does a word count of a sentence and orders common words first '''
    words = Counter()
    for w in sentence.split():
        words[w] += 1
    # Sort the (word, count) pairs by count, descending, so the most
    # common words come first in the returned bag
    return sorted(words.items(), key=itemgetter(1), reverse=True)

script.pig

REGISTER 'py_udfs.py' USING jython AS py_udfs;
A = LOAD 'input.tsv' USING PigStorage('\t') as (created_at:chararray,sentence:chararray);
B = FOREACH A GENERATE created_at, py_udfs.word_count(sentence);
\d B

Output

(2017-06-18,{(is,2),(the,2),(are,2),(green,1),(black,1),(words,1),(this,1),(plants,1),(there,1),(dog,1)})
(2017-06-19,{(more,2),(words,2),(here,1),(another,1),(begins,1),(phrase,1),(even,1),(and,1)})
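The output above keeps every word with its count. Since the question only asks for the most common words per date, the same counting logic can be trimmed in plain Python; this is a hedged sketch, not part of the original answer, and `top_words` and `n` are names I'm introducing for illustration:

```python
from collections import Counter

def top_words(sentence, n=3):
    """Return the n most frequent words of a sentence (ties broken by first appearance)."""
    # Counter.most_common already sorts by count, descending
    return [word for word, _count in Counter(sentence.split()).most_common(n)]

print(top_words("the plants are green the dog is black there are words this is"))
```

With the sample data this yields exactly the rows the question asked for, e.g. `the are is` for 2017-06-18.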

That said, if you're doing text analysis, I would suggest:

  • Removing stop words
  • Lemmatization/stemming
  • Using Apache Spark
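For the first suggestion, stop-word removal can be as simple as filtering against a set before counting. The sketch below is illustrative; the stop-word set is a tiny hand-picked sample, not a real stop-word corpus (a library such as NLTK provides proper lists):

```python
from collections import Counter

# Tiny illustrative stop-word set; real lists are much larger
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "this", "there"}

def word_count_no_stops(sentence):
    """Count words, skipping stop words, most common first."""
    words = (w for w in sentence.split() if w not in STOP_WORDS)
    return Counter(words).most_common()

print(word_count_no_stops("the plants are green the dog is black"))
```

Filtering before counting keeps the frequency ranking focused on content words instead of function words like "the" and "is".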

Regarding hadoop - counting word occurrences in each row using Pig, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44618425/
