So, I'm working with a log file containing HTTP traffic entries. I'm trying to determine, for each status code, the number of records for each hour of the day. So my ideal output would look like this:
0 (200, 234) (201, 100) (404, 5553)
1 (200, 2234) (201, 1100) (404, 53)
....
I have the following transformations:
e1 = group LINES BY (hour, statusCode);
e2 = foreach e1 generate group.hour, group.statusCode, COUNT(LINES);
e3 = group e2 by hour;
e4 = foreach e3 {
    statusCount = foreach e2 generate statusCode, $2;
    generate e3.group, statusCount;
};
When I try to "dump e4", I get the following error message:
Scalar has more than one row in the output. 1st : (0,{(0,000,1),(0,200,951),(0,206,1),(0,302,4),(0,304,20),(0,403,118),(0,500,6)}), 2nd :(1,{(1,200,781),(1,301,1),(1,304,14),(1,400,1),(1,403,111),(1,502,12)})
As you can see, the values are there, I just need to save them... but how? I tried doing a
e5 = foreach e4 generate group, statusCount;
but I got the same output. I know I'm missing something basic, but I can't figure out what..
--
Best answer
You can solve this easily; the challenge is the output format you mentioned.
Option 1:
With standard Pig you will always get output in the following format (i.e., a bag will contain your results).
PigScript:
A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
E = FOREACH D GENERATE group,C.(statusCode,cnt);
STORE E INTO 'output' USING PigStorage();
Output:
0 {(302,2),(304,3),(403,1),(500,1)}
1 {(200,1),(301,1),(304,2),(400,1),(403,1),(502,5)}
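As a sanity check, the same aggregation can be sketched in plain Python (a hypothetical stand-in for the Pig script, not part of the original answer), using the sample input listed at the end of this answer:

```python
from collections import Counter, defaultdict

# Hypothetical in-memory copy of the (hour, statusCode) sample input
# shown at the end of the answer.
entries = [
    (0, "302"), (0, "302"), (0, "304"), (0, "304"), (0, "304"),
    (0, "403"), (0, "500"),
    (1, "200"), (1, "301"), (1, "304"), (1, "304"), (1, "400"),
    (1, "403"), (1, "502"), (1, "502"), (1, "502"), (1, "502"), (1, "502"),
]

# GROUP BY (hour, statusCode) + COUNT, like relations B and C.
counts = Counter(entries)

# Regroup by hour, like relations D and E: one list ("bag") of
# (statusCode, count) pairs per hour.
per_hour = defaultdict(list)
for (hour, status), cnt in sorted(counts.items()):
    per_hour[hour].append((status, cnt))

for hour, pairs in sorted(per_hour.items()):
    print(hour, pairs)
# 0 [('302', 2), ('304', 3), ('403', 1), ('500', 1)]
# 1 [('200', 1), ('301', 1), ('304', 2), ('400', 1), ('403', 1), ('502', 5)]
```

The printed counts match the Option 1 output above, with a Python list playing the role of the Pig bag.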
Option 2:
If you want to achieve the exact output format you mentioned, then you have to use the custom UDF BagToTuple from piggybank.jar.
Download the jar file from this link http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm and try the approach below.
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
E = FOREACH D {
    mytuple = FOREACH C GENERATE TOTUPLE(statusCode,cnt);
    GENERATE group, FLATTEN(BagToTuple(mytuple));
};
STORE E INTO 'output1' USING PigStorage();
Output:
0 (302,2) (304,3) (403,1) (500,1)
1 (200,1) (301,1) (304,2) (400,1) (403,1) (502,5)
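For intuition, what BagToTuple + FLATTEN do in relation E can be sketched in plain Python (a hypothetical illustration, not the UDF itself): collapse each hour's bag of (statusCode, count) pairs into one flat row.

```python
# Assumed per-hour counts, matching the Option 1 output above.
per_hour_counts = {
    0: [("302", 2), ("304", 3), ("403", 1), ("500", 1)],
    1: [("200", 1), ("301", 1), ("304", 2), ("400", 1), ("403", 1), ("502", 5)],
}

# One row per hour: the hour, a tab, then every (status,count) tuple
# laid out side by side instead of nested in a bag.
rows = []
for hour, pairs in sorted(per_hour_counts.items()):
    rows.append(str(hour) + "\t" + " ".join(f"({s},{c})" for s, c in pairs))

print("\n".join(rows))
# 0	(302,2) (304,3) (403,1) (500,1)
# 1	(200,1) (301,1) (304,2) (400,1) (403,1) (502,5)
```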
Sample input passed to the script:
0 302
0 302
0 304
0 304
0 304
0 403
0 500
1 200
1 301
1 304
1 304
1 400
1 403
1 502
1 502
1 502
1 502
1 502
Regarding hadoop - pig: How to save relation when "Scalar has more than two rows in the output", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27970918/