amazon-web-services - 未设置 Pig 模式元组。不会生成代码

标签 amazon-web-services hadoop mapreduce cloud apache-pig

我在 google n-grams 数据集上对 pig 运行了以下命令:

inp = LOAD 'link to file' AS (ngram:chararray, year:int, occurences:float, books:float);

filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);

groupinp = GROUP filter_input BY ngram;

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) as ngram, SUM(filter_input.occurences) / SUM(filter_input.books) AS ntry;

roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );

但是我得到以下错误:

DUMP roundto;
601062 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan  - Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
18/04/06 01:46:03 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
601067 [main] INFO  org.apache.pig.tools.pigstats.ScriptState  - Pig features used in the script: GROUP_BY,FILTER
18/04/06 01:46:03 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
601111 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
18/04/06 01:46:03 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
601111 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer  - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
18/04/06 01:46:03 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
601238 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher  - Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
18/04/06 01:46:03 INFO tez.TezLauncher: Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
601239 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler  - File concatenation threshold: 100 optimistic? false
18/04/06 01:46:03 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
601241 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.CombinerOptimizerUtil  - Choosing to move algebraic foreach to combiner
18/04/06 01:46:03 INFO util.CombinerOptimizerUtil: Choosing to move algebraic foreach to combiner
601265 [main] INFO  org.apache.pig.builtin.PigStorage  - Using PigTextInputFormat
18/04/06 01:46:03 INFO builtin.PigStorage: Using PigTextInputFormat
18/04/06 01:46:03 INFO input.FileInputFormat: Total input files to process : 1
601285 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths to process : 1
601285 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: joda-time-2.9.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: pig-0.17.0-core-h2.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: antlr-runtime-3.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: automaton-1.11-8.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: filter_input,groupinp,inp,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp,sum_occ
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: 
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex: 
601449 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Set auto parallelism for vertex scope-142
18/04/06 01:46:03 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-142
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: roundto,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: roundto,sum_occ
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: sum_occ[4,10],roundto[6,10]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: sum_occ[4,10],roundto[6,10]
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: GROUP_BY
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
601489 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Total estimated parallelism is 2
18/04/06 01:46:04 INFO tez.TezJobCompiler: Total estimated parallelism is 2
601531 [PigTezLauncher-0] INFO  org.apache.pig.tools.pigstats.tez.TezScriptState  - Pig script settings are added to the job
18/04/06 01:46:04 INFO tez.TezScriptState: Pig script settings are added to the job
18/04/06 01:46:04 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.8.4, revision=300391394352b074b85b529e870816a72c6f314a, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2018-03-21T23:55:28Z ]
18/04/06 01:46:04 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
18/04/06 01:46:04 INFO client.TezClient: Using org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager to manage Timeline ACLs
18/04/06 01:46:04 INFO impl.TimelineClientImpl: Timeline service address: http://ip-172-31-28-12.ec2.internal:8188/ws/v1/timeline/
18/04/06 01:46:04 INFO client.TezClient: Session mode. Starting session.
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs:///apps/tez/tez.tar.gz
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/04/06 01:46:04 INFO client.TezClient: Tez system stage directory hdfs://ip-172-31-28-12.ec2.internal:8020/tmp/temp-336429202/.tez/application_1522978297921_0003 doesn't exist and is created
18/04/06 01:46:04 INFO acls.ATSHistoryACLPolicyManager: Created Timeline Domain for History ACLs, domainId=Tez_ATS_application_1522978297921_0003
18/04/06 01:46:04 INFO impl.YarnClientImpl: Submitted application application_1522978297921_0003
18/04/06 01:46:04 INFO client.TezClient: The url to track the Tez Session: http://ip-172-31-28-12.ec2.internal:20888/proxy/application_1522978297921_0003/
607861 [PigTezLauncher-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO tez.TezJob: Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.TezClient: Submitting dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2, callerContext={ context=PIG, callerType=PIG_SCRIPT_ID, callerId=PIG-default-d73e19dc-5287-4ee2-a85d-e931327011dc }
18/04/06 01:46:10 INFO client.TezClient: Submitted dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
608409 [PigTezLauncher-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
18/04/06 01:46:10 INFO tez.TezJob: Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
608528 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher  - HadoopJobId: job_1522978297921_0003
18/04/06 01:46:11 INFO tez.TezLauncher: HadoopJobId: job_1522978297921_0003
609410 [Timer-1] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:11 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
629410 [Timer-1] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:31 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
646404 [pool-1-thread-1] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager  - Shutting down Tez session org.apache.tez.client.TezClient@3a371843
18/04/06 01:46:48 INFO tez.TezSessionManager: Shutting down Tez session org.apache.tez.client.TezClient@3a371843
2018-04-06 01:46:48 Shutting down Tez session , sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003
18/04/06 01:46:48 INFO client.TezClient: Shutting down Tez Session, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003

如何修复此错误?转储命令适用于除 roundto 以外的前几行。 Tez 客户端到底是什么?

最佳答案

我无法复制你的输出,因为我一尝试这一行就收到错误:

roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );

您不需要使用 dot operator引用这些字段(例如 sum_occ.ngram),因为它们没有嵌套在元组或包中。在没有点运算符的情况下尝试上面的行:

roundto = FOREACH sum_occ GENERATE ngram, ROUND_TO( ntry , 2 );

要回答您的第二个问题,MapReduce 和 Tez 都是可用于运行 Pig 脚本的框架。 Tez 有时可以加快 Pig 脚本的运行时间。您可以通过使用 pig -x mapreducepig -x tez 启动 Pig shell 来显式使用 MapReduce 或 Tez。 MapReduce 是默认设置,因此如果您未指定 Tez,则必须将 Hadoop 集群设置为在 Tez 中运行 Pig。

关于amazon-web-services - 未设置 Pig 模式元组。不会生成代码,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/49684096/

相关文章:

amazon-web-services - 通过云形成创建并执行 aws lambda 函数

tomcat - 无法通过 .ebextensions 文件更改 AWS 中的 tomcat 配置

hadoop - 自定义 Hadoop 类型的 ArrayWritable 实现

hadoop - reducer 类不能启动吗?在 reducer 日志中看不到 System.out.println 语句

amazon-web-services - 我想知道如何通过修改此 lambda 代码将数据导入应用程序

python - Centos - 没有名为 yum 的模块

hadoop - Reducer 无法针对不同的映射器按键分组

hadoop - 为配置单元配置 oracle sql 开发人员

java - 如何测试Hadoop工作绩效

eclipse - 当hadoop不在同一主机中时,从Eclipse执行MapReduce时出错