My Hive insert query fails with the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Data in table2 = 1.7 TB
Query:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set mapreduce.map.memory.mb=15000;
set mapreduce.map.java.opts=-Xmx9000m;
set mapreduce.reduce.memory.mb=15000;
set mapreduce.reduce.java.opts=-Xmx9000m;
set hive.rpc.query.plan=true;
insert into database1.table1 PARTITION(trans_date) select * from database1.table2;
Error info:
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. GC overhead limit exceeded
Cluster info: total memory: 1.2 TB, total vcores: 288, total nodes: 8, node version: 2.7.0-mapr-1808
Please note:
I am trying to insert data from table2 (stored as Parquet) into table1 (stored as ORC).
The total data size is 1.8 TB.
Best answer
Adding distribute by on the partition key solves the problem:
insert into database1.table1 PARTITION(trans_date) select * from database1.table2
distribute by trans_date;
distribute by trans_date
triggers a reducer step, and each reducer then processes a single partition, which reduces memory pressure. When a single process writes many partitions at once, it keeps too many ORC write buffers in memory. Also consider adding this setting to control how much data each reducer processes:
set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB; an example only, reduce the figure to increase parallelism
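Putting the pieces together, the full statement could be sketched as below (database and table names are taken from the question; the 64 MB per-reducer figure is illustrative and should be tuned for your cluster):

```sql
-- Dynamic-partition settings from the question
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Cap the data handled per reducer (64 MB here; tune as needed)
set hive.exec.reducers.bytes.per.reducer=67108864;

-- distribute by the partition key so each reducer writes a single
-- partition, keeping only one set of ORC write buffers in memory
insert into database1.table1 partition(trans_date)
select * from database1.table2
distribute by trans_date;
```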
Regarding apache-spark - HIVE: insert query fails with error "java.lang.OutOfMemoryError: GC overhead limit exceeded", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59765693/