hadoop - Pig script: COUNT returns 0 on null field

Tags: hadoop scripting count apache-pig mortar

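I have a Pig script that loads a file by the "company" part of its JSON. When I run a count, the count comes out as 0 whenever the domain is missing from the record (or is null). How can I group these records under an empty string and still count them?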

Sample file:

{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"name": "test4 company"}}
{"company": {"name": "test4 company"}}

Expected result:

"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
"", "test4 company", 2

Actual result:

"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
, "test4 company", 0

Current Pig script:

data = LOAD 'myfile' USING org.apache.pig.piggybank.storage.JsonLoader('company: (domain:chararray, name:chararray)');
filtered = FILTER data BY (company is not null);
events = FOREACH filtered GENERATE FLATTEN(company) as (domain, name);
grouped = GROUP events BY (domain, name);
counts = FOREACH grouped GENERATE group as domain, COUNT(events) as count;
ordered = ORDER counts by count DESC;

Thanks for your help!

Best answer

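Try COUNT_STAR instead of COUNT. COUNT skips any tuple whose first field is null, which is why the group with the missing domain reports 0; COUNT_STAR counts those tuples as well: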

counts = FOREACH grouped GENERATE group as domain, COUNT_STAR(events) as count;
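
COUNT_STAR fixes the count, but the domain in that group is still null rather than the empty string the question asks for. Here is a minimal sketch of one way to get there, assuming the `filtered` relation and the field names from the question's script. The relation name `normalized` is made up for illustration, and this has not been run against the asker's data.

```pig
-- Same flatten step as in the question.
events = FOREACH filtered GENERATE FLATTEN(company) AS (domain, name);

-- Hypothetical extra step: replace a null domain with '' before grouping.
normalized = FOREACH events GENERATE
    (domain IS NULL ? '' : domain) AS domain,
    name;

grouped = GROUP normalized BY (domain, name);

-- FLATTEN(group) splits the group key back into its two fields.
counts = FOREACH grouped GENERATE
    FLATTEN(group) AS (domain, name),
    COUNT_STAR(normalized) AS count;

ordered = ORDER counts BY count DESC;
```

Once the null is replaced, the first field of every tuple is non-null, so plain COUNT should give the same numbers here; COUNT_STAR is kept so the count stays right if another field turns out to be null.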

Source: a similar question about "hadoop - Pig script: COUNT returns 0 on null field" on Stack Overflow: https://stackoverflow.com/questions/26612977/
