hadoop - ERROR 2017 in a simple Pig script

Tags: hadoop, apache-pig

Here is my entire script. It is supposed to scan a Project Gutenberg etext, strip off the header and footer text, and keep only the actual text of the book, so it can be used for further analysis.

ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;

header = FILTER ranked BY SUBSTRING(line,0,41)=='*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
--STORE headers INTO '/user/PHIBBS/headers' USING PigStorage;

footer = FILTER ranked BY SUBSTRING(line,0,39)=='*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
--STORE footers INTO '/user/PHIBBS/footers' USING PigStorage;

blocks =  JOIN headers BY $0, footers BY $0;
sectioned = CROSS blocks, ranked;
--STORE sectioned INTO '/user/PHIBBS/sectioned';

book = FILTER sectioned BY $4 > $1 AND $4 < $3;
STORE book INTO '/user/PHIBBS/clean/$ebook';

It fails with "ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration."

If I run just subsets of the script, everything is fine up until the last line. Running the first 5 lines plus the first commented-out STORE line works. Adding the next 3 lines plus the next commented-out STORE line makes it fail. If I disable EITHER of the STORE lines it works fine, so there is nothing wrong with either individual STORE statement. Both of them together? ERROR 2017! Any suggestions? I have tried two different distributions, one from Hortonworks and one from Cloudera, both clean VM images freshly downloaded from their respective websites.
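(For context: the script is parameterized on $ebook, so a run would look roughly like the line below; the script filename and ebook path are placeholders, not details from the original post.)

$ pig -f strip_gutenberg.pig -param ebook=/path/to/etext.txt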

Best answer

Considering your goal of stripping the header/footer and keeping just the book, you really don't need to store anything other than the book and the headers/footers. I think your problem is blocks = JOIN headers BY $0, footers BY $0;, which is a self-join on data that was only loaded once. I downloaded War and Peace and this code worked for me.

$ pig -x local
# grunt>

ebook = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ranked = RANK ebook;

header = FILTER ranked BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
hlines = FOREACH header GENERATE $0;
headers = RANK hlines;
STORE headers INTO 'headers' USING PigStorage();

footer = FILTER ranked BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';
flines = FOREACH footer GENERATE $0;
footers = RANK flines;
STORE footers INTO 'footers' USING PigStorage();

/* Now re-load headers and footers for join */

-- two columns per part file: the rank added by RANK hlines/flines, and the original line number from RANK ebook
h_new = LOAD 'headers/part-m-00000' USING PigStorage() AS (id:int, col1:int);
f_new = LOAD 'footers/part-m-00000' USING PigStorage() AS (id:int, col1:int);

blocks = JOIN h_new BY id, f_new BY id;
sectioned = CROSS blocks, ranked;
book = FILTER sectioned BY $4 > $1 AND $4 < $3; -- keep lines strictly between the header and footer line numbers
STORE book INTO '__book__';
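As a side note: since the header and footer filters each match exactly one line, Pig's scalar projection (casting a single-tuple relation to a scalar, available since Pig 0.8) should let you skip both the STORE/re-LOAD round trip and the CROSS. A minimal, untested sketch, not part of the original answer:

ebook = LOAD 'pg2600.txt' USING PigStorage() AS (line:chararray);
ranked = RANK ebook; -- prepends a long field named rank_ebook

header = FILTER ranked BY SUBSTRING(line, 0, 41) == '*** START OF THIS PROJECT GUTENBERG EBOOK';
footer = FILTER ranked BY SUBSTRING(line, 0, 39) == '*** END OF THIS PROJECT GUTENBERG EBOOK';

-- header and footer each hold a single tuple, so their fields can be cast to scalars;
-- Pig raises a runtime error if either relation ever contains more than one row
book = FILTER ranked BY rank_ebook > (long)header.rank_ebook
                    AND rank_ebook < (long)footer.rank_ebook;
STORE book INTO 'book_scalar';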

Regarding "hadoop - ERROR 2017 in a simple Pig script", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24208942/
