amazon-web-services - PIG 中的 DUMP 命令不起作用

标签 amazon-web-services hadoop apache-pig elastic-map-reduce

我编写了一个简单的 PIG 程序,如下所示,用于分析 AWS 上的 google n-grams 数据集的小型修改版本。数据看起来像这样:

I am 1936 942 90
I am 1945 811 5
I am 1951 47 12
very cool 1923 118 10
very cool 1980 320 100
very cool 2012 994 302
very cool 2017 1820 612

并具有以下形式:

n-gram TAB year TAB occurrences TAB books NEWLINE

我编写了以下程序来计算每本书中 ngram 的出现次数:

inp = LOAD <insert input path here> AS (ngram:chararray, year:int, occurences:int, books:int);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;

DUMP sum_occ;

但是,DUMP 命令不起作用并给出以下错误:

892520 [main] INFO  org.apache.pig.tools.pigstats.ScriptState  - Pig features used in the script: GROUP_BY,FILTER
18/03/28 00:56:09 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
1892554 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
18/03/28 00:56:09 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
1892555 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer  - {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
18/03/28 00:56:09 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
1892591 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher  - Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
18/03/28 00:56:09 INFO tez.TezLauncher: Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
1892592 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler  - File concatenation threshold: 100 optimistic? false
18/03/28 00:56:09 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
1892593 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.AccumulatorOptimizerUtil  - Reducer is to run in accumulative mode.
18/03/28 00:56:09 INFO util.AccumulatorOptimizerUtil: Reducer is to run in accumulative mode.
1892606 [main] INFO  org.apache.pig.builtin.PigStorage  - Using PigTextInputFormat
18/03/28 00:56:09 INFO builtin.PigStorage: Using PigTextInputFormat
18/03/28 00:56:09 INFO input.FileInputFormat: Total input files to process : 1
1892626 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths to process : 1
1892627 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: joda-time-2.9.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: pig-0.17.0-core-h2.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: antlr-runtime-3.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: automaton-1.11-8.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: filter_input,groupinp,inp
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: 
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: 
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Set auto parallelism for vertex scope-240
18/03/28 00:56:09 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-240
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: sum_occ
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: sum_occ
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: sum_occ[5,10]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: sum_occ[5,10]
1892745 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: GROUP_BY
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
1892762 [main] ERROR org.apache.pig.tools.grunt.Grunt  - ERROR 2017: Internal error creating job configuration.
18/03/28 00:56:09 ERROR grunt.Grunt: ERROR 2017: Internal error creating job configuration.
Details at logfile: /mnt/var/log/pig/pig_1522196676602.log

我该如何解决这个问题?

最佳答案

如果您使用的是旧版本,请更新它(应该可以解决您的问题)

PIG 脚本是惰性求值的,因此除非您使用 DUMP 或 STORE 命令,否则您不会知道您的代码有什么问题。

当您运行您的代码时,它会再次抛出以下错误:

ERROR 1025: Invalid field projection. Projected field [occurences] does not exist in schema: group:chararray,filter_input:bag{:tuple(ngram:chararray,year:int,occurences:int,books:int)}.

更改以下行

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(filter_input.occurences) AS socc, SUM(filter_input.books) AS nbooks;

将解决这个错误。

关于amazon-web-services - PIG 中的 DUMP 命令不起作用,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/49529308/

相关文章:

amazon-web-services - 当用户尝试使用 AWS 注册和创建新账户/相册时,我们收到错误消息。请参阅下面的终端消息

sql - 执行Spark Job时GettingTask不可序列化异常

hadoop - Pig 命令问题 'Failed to read data from "/pigdata/student"'

hadoop - 双冒号在 Pig 中到底是什么意思?

shell - pig 咕shell shell 中的copyFromLocal错误

amazon-web-services - 使用 AWS CLI 启动 CloudFormation 模板时的可选参数

amazon-web-services - 如何将非常大的文件上传到 S3?

node.js - AWS Lex lambda函数Elicit插槽

hadoop - 如何才能仅从hdfs联合中的一个 namespace 中排除某些数据节点?

windows - Windows 中 python 文件的 FileNotFoundException