hadoop - Hive Testbench data generation failed

Tags: hadoop hive yarn benchmarking tez

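I cloned Hive Testbench to try running the Hive benchmarks on a Hadoop cluster built from the Apache binary distributions of Hadoop v2.9.0, Hive 2.3.0, and Tez 0.9.0.

I managed to build both data generators, TPC-H and TPC-DS. However, the next step, data generation, fails for both TPC-H and TPC-DS. The failure is very consistent: every run fails at exactly the same step and produces the same error messages.

For TPC-H, the data generation screen output is as follows: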

$ ./tpch-setup.sh 10
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Generating data at scale factor 10.
...
18/01/02 14:43:00 INFO mapreduce.Job: Running job: job_1514226810133_0050
18/01/02 14:43:01 INFO mapreduce.Job: Job job_1514226810133_0050 running in uber mode : false
18/01/02 14:43:01 INFO mapreduce.Job:  map 0% reduce 0%
18/01/02 14:44:38 INFO mapreduce.Job:  map 10% reduce 0%
18/01/02 14:44:39 INFO mapreduce.Job:  map 20% reduce 0%
18/01/02 14:44:46 INFO mapreduce.Job:  map 30% reduce 0%
18/01/02 14:44:48 INFO mapreduce.Job:  map 40% reduce 0%
18/01/02 14:44:58 INFO mapreduce.Job:  map 70% reduce 0%
18/01/02 14:45:14 INFO mapreduce.Job:  map 80% reduce 0%
18/01/02 14:45:15 INFO mapreduce.Job:  map 90% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job:  map 100% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job: Job job_1514226810133_0050 completed successfully
18/01/02 14:45:23 INFO mapreduce.Job: Counters: 0
SLF4J: Class path contains multiple SLF4J bindings.
...
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Data generation failed, exiting.

For TPC-DS, the error messages are:
$ ./tpcds-setup.sh 10
...
18/01/02 22:13:58 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
18/01/02 22:13:58 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:13:59 INFO input.FileInputFormat: Total input files to process : 1
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: number of splits:10
18/01/02 22:13:59 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
18/01/02 22:13:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514226810133_0082
18/01/02 22:14:00 INFO client.YARNRunner: Number of stages: 1
18/01/02 22:14:00 INFO Configuration.deprecation: mapred.job.map.memory.mb is deprecated. Instead, use mapreduce.map.memory.mb
18/01/02 22:14:00 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.0, revision=0873a0118a895ca84cbdd221d8ef56fedc4b43d0, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2017-07-18T05:41:23Z ]
18/01/02 22:14:00 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:00 INFO client.TezClient: Submitting DAG application with id: application_1514226810133_0082
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://192.168.10.15:8020/apps/tez,hdfs://192.168.10.15:8020/apps/tez/lib/
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/01/02 22:14:00 INFO client.TezClient: Tez system stage directory hdfs://192.168.10.15:8020/tmp/hadoop-yarn/staging/rapids/.staging/job_1514226810133_0082/.tez/application_1514226810133_0082 doesn't exist and is created
18/01/02 22:14:01 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1514226810133_0082, dagName=GenTable+all_10
18/01/02 22:14:01 INFO impl.YarnClientImpl: Submitted application application_1514226810133_0082
18/01/02 22:14:01 INFO client.TezClient: The url to track the Tez AM: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:05 INFO mapreduce.Job: The url to track the job: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO mapreduce.Job: Running job: job_1514226810133_0082
18/01/02 22:14:06 INFO mapreduce.Job: Job job_1514226810133_0082 running in uber mode : false
18/01/02 22:14:06 INFO mapreduce.Job:  map 0% reduce 0%
18/01/02 22:15:51 INFO mapreduce.Job:  map 10% reduce 0%
18/01/02 22:15:54 INFO mapreduce.Job:  map 20% reduce 0%
18/01/02 22:15:55 INFO mapreduce.Job:  map 40% reduce 0%
18/01/02 22:15:56 INFO mapreduce.Job:  map 50% reduce 0%
18/01/02 22:16:07 INFO mapreduce.Job:  map 60% reduce 0%
18/01/02 22:16:09 INFO mapreduce.Job:  map 70% reduce 0%
18/01/02 22:16:11 INFO mapreduce.Job:  map 80% reduce 0%
18/01/02 22:16:19 INFO mapreduce.Job:  map 90% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job:  map 100% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job: Job job_1514226810133_0082 completed successfully
18/01/02 22:19:54 INFO mapreduce.Job: Counters: 0
...
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table time_dim (2/24).
Optimizing table date_dim (1/24).
Optimizing table item (3/24).
Optimizing table customer (4/24).
Optimizing table household_demographics (6/24).
Optimizing table customer_demographics (5/24).
Optimizing table customer_address (7/24).
Optimizing table store (8/24).
Optimizing table promotion (9/24).
Optimizing table warehouse (10/24).
Optimizing table ship_mode (11/24).
Optimizing table reason (12/24).
Optimizing table income_band (13/24).
Optimizing table call_center (14/24).
Optimizing table web_page (15/24).
Optimizing table catalog_page (16/24).
Optimizing table web_site (17/24).
make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
make: *** [store_returns] Error 2
Data loaded into database tpcds_bin_partitioned_orc_10.

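I noticed that the target temporary HDFS directory is always empty, apart from the generated subdirectories, both while the job is running and after it fails.

At this point I don't even know whether the failure is caused by a Hadoop configuration problem, a software version mismatch, or something else. Any help?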

Best answer

I ran into a similar problem when running this job. The script succeeded once I passed it an HDFS location to which I have write permission.

./tpcds-setup.sh 10 <hdfs_directory_path>

The script still prints this error when it starts:
Data loaded into database tpcds_bin_partitioned_orc_10.
ls: `<hdfs_directory_path>/10': No such file or directory

However, the script runs to completion, and the data is eventually generated and loaded into the Hive tables.

Hope that helps.
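To make the idea above concrete, here is a minimal sketch of how one might confirm that an HDFS directory is writable before passing it to the setup script. The function name `check_writable`, the overridable `HDFS_CMD` variable, and the example path are assumptions made for illustration and are not part of Hive Testbench; only the standard `hdfs dfs -mkdir -p`, `-touchz`, and `-rm -skipTrash` subcommands are used.

```shell
# Sketch only: probe an HDFS directory for write access.
# HDFS_CMD is a hypothetical override so the check can be dry-run without a cluster.
HDFS_CMD="${HDFS_CMD:-hdfs}"

check_writable() {
  cw_dir="$1"
  cw_probe="$cw_dir/.write_probe_$$"
  # Create the directory if it is missing; fails if the parent is not writable.
  "$HDFS_CMD" dfs -mkdir -p "$cw_dir" || return 1
  # Create an empty probe file; fails if the current user cannot write here.
  "$HDFS_CMD" dfs -touchz "$cw_probe" || return 1
  # Clean up the probe file.
  "$HDFS_CMD" dfs -rm -skipTrash "$cw_probe" >/dev/null
}

# Usage (the path is a hypothetical example):
#   check_writable "/user/$(whoami)/tpcds-generate" \
#     && ./tpcds-setup.sh 10 "/user/$(whoami)/tpcds-generate"
```

If the probe fails, the directory is probably not writable by the current user, and choosing a location under that user's HDFS home directory is likely the simplest option.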

A similar question about hadoop - Hive Testbench data generation failure can be found on Stack Overflow: https://stackoverflow.com/questions/48063143/
