hadoop - How much free disk space is needed to run a Hive query?

Tags: hadoop, hive

I am running the following Hive query:

create table table_llv_N_C as
select table_line_n_passed.chromosome_number,
       table_line_n_passed.position,
       table_line_c_passed.id
from table_line_n_passed
join table_line_c_passed
  on (table_line_n_passed.chromosome_number = table_line_c_passed.chromosome_number)



and it fails with the following error:

org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=1) {"key":{"joinkey0":"12"},"value":{"_col2":"."},"alias":1}
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:258)
    ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hive-root/hive_2015-03-09_10-03-59_970_3646456754594156815-1/_task_tmp.-ext-10001/_tmp.000000_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and no node(s) are excluded in this operation. ...
The root cause may be insufficient disk space in the HDFS cluster. The disk-space details are:

hdfs dfs -df -h
Filesystem               Size     Used    Available  Use%
hdfs://x.y.ab.com:8020   159.7 G  21.9 G  110.7 G    14%

table_line_n_passed has 4,767,409 rows and is 1.1 G in size.

Similarly, table_line_c_passed has 4,717,082 rows and is 1.0 G in size.

Does Hive really need that much space (more than the 110 G available) to process this data? How can I calculate how much free space is needed before running a query? And is there any way to run the query within the available free space?
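How much space the join needs depends on the key distribution, not just on the input sizes. Since chromosome_number is a low-cardinality key (a few dozen distinct values at most), each key matches a large block of rows on both sides, and the inner join emits the product of the per-key row counts. The output row count can be estimated before materializing anything with a cheap aggregation query (a sketch; table and column names are taken from the question):

```sql
-- Estimate the row count of the join output without running the full join.
-- Each distinct chromosome_number contributes n_cnt * c_cnt joined rows,
-- so a low-cardinality join key can blow two ~1 G inputs up into terabytes.
SELECT SUM(n.cnt * c.cnt) AS estimated_output_rows
FROM (SELECT chromosome_number, COUNT(*) AS cnt
      FROM table_line_n_passed
      GROUP BY chromosome_number) n
JOIN (SELECT chromosome_number, COUNT(*) AS cnt
      FROM table_line_c_passed
      GROUP BY chromosome_number) c
  ON n.chromosome_number = c.chromosome_number;
```

As a rough illustration: if the ~4.7 million rows per table were spread evenly over, say, 24 chromosomes, that would be about 200,000 rows per key on each side, i.e. on the order of 24 × 200,000² ≈ 10¹² output rows. Multiplying the estimated row count by the average output-row width (and by the HDFS replication factor) gives a ballpark for the disk space the CREATE TABLE AS SELECT would consume.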

PS: If I add LIMIT 10000 to the above query, it runs fine.

Execution plan:

EXPLAIN create table table_llv_N_C as
select table_line_n_passed.chromosome_number,
       table_line_n_passed.position,
       table_line_c_passed.id
from table_line_n_passed
join table_line_c_passed
  on (table_line_n_passed.chromosome_number = table_line_c_passed.chromosome_number);


ABSTRACT SYNTAX TREE:
  (TOK_CREATETABLE (TOK_TABNAME table_llv_N_C) TOK_LIKETABLE (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME table_line_n_passed)) (TOK_TABREF (TOK_TABNAME table_line_c_passed)) (= (. (TOK_TABLE_OR_COL table_line_n_passed) chromosome_number) (. (TOK_TABLE_OR_COL table_line_c_passed) chromosome_number)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL table_line_n_passed) chromosome_number)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL table_line_n_passed) position)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL table_line_c_passed) id)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-4 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-4

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        table_line_c_passed
          TableScan
            alias: table_line_c_passed
            Reduce Output Operator
              key expressions:
                    expr: chromosome_number
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: chromosome_number
                    type: string
              tag: 1
              value expressions:
                    expr: id
                    type: string
        table_line_n_passed
          TableScan
            alias: table_line_n_passed
            Reduce Output Operator
              key expressions:
                    expr: chromosome_number
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: chromosome_number
                    type: string
              tag: 0
              value expressions:
                    expr: chromosome_number
                    type: string
                    expr: position
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col2}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col14
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: int
                  expr: _col14
                  type: string
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 1
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  name: bright.table_llv_N_C

  Stage: Stage-0
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://cheetah.xxx.yyyy.in:8020/user/hive/warehouse/bright.db/table_llv_n_c

  Stage: Stage-4
    Create Table Operator:
      Create Table
        columns: chromosome_number string, position int, id string
        if not exists: false
        input format: org.apache.hadoop.mapred.TextInputFormat
        # buckets: -1
        output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        name: table_llv_N_C
        isExternal: false

  Stage: Stage-2
    Stats-Aggr Operator



Time taken: 0.146 seconds

Best answer

Go to this link: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

and look for this property (or search the page for "memory usage" to get there quickly):

hive.map.aggr.hash.force.flush.memory.threshold

Also refer to this property (the second "memory usage" hit on the page):

hive.mapjoin.localtask.max.memory.usage
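Assuming a reasonably recent Hive, these properties can be overridden per session before running the query; the values below are purely illustrative, not recommendations:

```sql
-- Illustrative per-session overrides (config fragment; tune to your cluster).
-- Flush the map-side aggregation hash table earlier when memory fills up:
SET hive.map.aggr.hash.force.flush.memory.threshold=0.7;
-- Allow the local map-join task to use more of its heap before aborting:
SET hive.mapjoin.localtask.max.memory.usage=0.9;
```

Note that both properties govern memory pressure during aggregation and map joins; they will not by themselves shrink the on-disk size of the join output.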

Regarding "hadoop - How much free disk space is needed to run a Hive query", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28957347/
