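This is my entire script. It is supposed to look through a Project Gutenberg etext and strip out the header and footer text, leaving only the actual text of the book so that it can be used for further analysis.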
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
header = FILTER ranked BY SUBSTRING(line,0,41)=='*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
--STORE headers INTO '/user/PHIBBS/headers' USING PigStorage;
footer = FILTER ranked BY SUBSTRING(line,0,39)=='*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
--STORE footers INTO '/user/PHIBBS/footers' USING PigStorage;
blocks = JOIN headers BY $0, footers BY $0;
sectioned = CROSS blocks, ranked;
--STORE sectioned INTO '/user/PHIBBS/sectioned';
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '/user/PHIBBS/clean/$ebook';
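It fails with:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
If I try running only subsets of the script, it works right up until the last line. If I run the first 5 lines plus the commented-out STORE line, it works. If I run the next 3 lines plus the next commented-out STORE line, it fails. If I disable EITHER of the STORE lines, it works. So each individual STORE statement is fine on its own. Both together? ERROR 2017!
Any suggestions? I have tried two different distributions, one from Hortonworks and one from Cloudera, both clean VM images freshly downloaded from their respective websites.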
Best answer
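Given that your goal is to strip the header and footer and keep only the book, you do not actually need to store anything other than the book and the header/footer ranks. I think your problem is the line
blocks = JOIN headers BY $0, footers BY $0;
which performs a self-join on data that was loaded only once. I downloaded War and Peace, and the following code worked for me.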
$ pig -x local
# grunt>
ebook = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ranked = RANK ebook;
header = FILTER ranked BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
STORE headers INTO 'headers' USING PigStorage();
footer = FILTER ranked BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
STORE footers INTO 'footers' USING PigStorage();
/* Now re-load headers and footers for join */
h_new = LOAD 'headers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
f_new = LOAD 'footers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
blocks = JOIN h_new BY id, f_new BY id;
sectioned = CROSS blocks, ranked;
book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '__book__';
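As a way to sanity-check the Pig output on a single file, the same "keep only what lies between the markers" logic can be sketched in plain Python. This is not part of the original answer and does not use Pig. The function name strip_gutenberg_boilerplate and the constant names are hypothetical, and the sketch assumes the markers start at column 0, exactly as in the question's FILTER expressions.

```python
# Marker prefixes taken from the FILTER expressions in the question.
START_MARKER = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THIS PROJECT GUTENBERG EBOOK"


def strip_gutenberg_boilerplate(lines):
    """Return only the lines that sit strictly between a START and an END marker.

    Hypothetical helper for cross-checking the Pig result locally.
    The marker lines themselves are dropped, like the strict
    `$4 > $1 AND $4 < $3` comparison in the Pig script.
    """
    body = []
    inside = False
    for line in lines:
        if line.startswith(START_MARKER):
            inside = True       # header marker: the book text starts after this line
        elif line.startswith(END_MARKER):
            inside = False      # footer marker: the book text ended before this line
        elif inside:
            body.append(line)
    return body
```

Comparing the line count of this result with the number of records in the Pig output directory is a quick way to confirm that the header and footer ranks were matched correctly. For a file with several header/footer pairs, this sketch assumes the pairs are well formed and do not nest.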
Regarding "hadoop - Error 2017 in a simple Pig script", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24208942/