hadoop - pig: How to save relation when "Scalar has more than one row in the output"

Tags: hadoop apache-pig

So, I'm processing a log file that contains HTTP traffic entries. I'm trying to determine, for each status code, the number of records in each hour of the day. My ideal output would look like this:

0 (200, 234) (201, 100) (404, 5553)
1 (200, 2234) (201, 1100) (404, 53)
....

I have the following transformations:

e1 = group LINES BY (hour, statusCode);
e2 = foreach e1 generate group.hour, group.statusCode, COUNT(LINES);
e3 = group e2 by hour;
e4 = foreach e3 {
    statusCount = foreach e2 generate statusCode, $2;
    generate e3.group, statusCount;
};

When I try to "dump e4", I get the following error message:

Scalar has more than one row in the output. 1st : (0,{(0,000,1),(0,200,951),(0,206,1),(0,302,4),(0,304,20),(0,403,118),(0,500,6)}), 2nd :(1,{(1,200,781),(1,301,1),(1,304,14),(1,400,1),(1,403,111),(1,502,12)})

As you can see, the values are right there; I just need to save them... but how? I tried doing a

e5 = foreach e4 generate group, statusCount;

but I got the same output. I know I'm missing something basic, but I don't know what.
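
For what it's worth, the error comes from "e3.group": inside a FOREACH over e3, referencing the relation alias e3 by name casts it to a scalar, and that cast fails because e3 has more than one row. A minimal sketch of the corrected nested FOREACH (same aliases as above) uses the implicit group key instead:

e4 = foreach e3 {
    -- inside the nested block, e2 refers to the inner bag of grouped tuples
    statusCount = foreach e2 generate statusCode, $2;
    -- 'group' is the hour key of the current group; e3.group would cast e3 to a scalar
    generate group, statusCount;
};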

--

Best Answer

This is easy to solve; the challenge is the output format you mentioned.

Option 1:
With standard Pig you will always get the following output format (that is, a bag will wrap your output).

PigScript:

A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
-- count the records for each (hour, statusCode) pair
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
-- project each hour's (statusCode, cnt) pairs as a bag
E = FOREACH D GENERATE group,C.(statusCode,cnt);
STORE E INTO 'output' USING PigStorage();

Output:

0   {(302,2),(304,3),(403,1),(500,1)}
1   {(200,1),(301,1),(304,2),(400,1),(403,1),(502,5)}
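
The bag produced by Option 1 carries no ordering guarantee; if the status codes should appear sorted within each bag, a nested ORDER can be added before the projection (a sketch, reusing the aliases from the script above):

E = FOREACH D {
        sorted = ORDER C BY statusCode;
        GENERATE group, sorted.(statusCode,cnt);
};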

Option 2:
If you want exactly the output format you mentioned, then you have to use the custom UDF BagToTuple from piggybank.jar. Download the jar file from this link http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm and try the approach below.

PigScript:

REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
E = FOREACH D {
                   -- wrap each (statusCode, cnt) pair in a tuple, then collapse the bag into one tuple
                   mytuple = FOREACH C GENERATE TOTUPLE(statusCode,cnt);
                   GENERATE group,FLATTEN(BagToTuple(mytuple));
              };
STORE E INTO 'output1' USING PigStorage();

Output:

0   (302,2) (304,3) (403,1) (500,1)
1   (200,1) (301,1) (304,2) (400,1) (403,1) (502,5)
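
A hedged side note: Pig 0.11 and later ship BagToTuple as a builtin (org.apache.pig.builtin.BagToTuple), so on a recent Pig the REGISTER line and the piggybank download may be unnecessary; the nested FOREACH stays the same:

-- assuming Pig 0.11+, where BagToTuple resolves as a builtin without REGISTER
E = FOREACH D {
                   mytuple = FOREACH C GENERATE TOTUPLE(statusCode,cnt);
                   GENERATE group,FLATTEN(BagToTuple(mytuple));
              };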

Sample input passed to the script:

input

0       302
0       302
0       304
0       304
0       304
0       403
0       500
1       200
1       301
1       304
1       304
1       400
1       403
1       502
1       502
1       502
1       502
1       502
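
Note that PigStorage() with no argument splits fields on tabs, so the sample input above must be tab-delimited. To make the delimiter explicit, the LOAD line can equivalently be written as (an illustrative variant):

A = LOAD 'input' USING PigStorage('\t') AS (hour:int, statusCode:chararray);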

Regarding hadoop - pig: How to save relation when "Scalar has more than one row in the output", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27970918/
