hadoop - Performance problem with small files on Hive

Tags: hadoop hive mapreduce hadoop2 apache-tez

I was reading an article about how small files degrade the performance of Hive queries:
https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1

I understand the first part, about overloading the NameNode.

However, the slowdown of map-reduce that he describes does not seem to happen, either for map-reduce or for Tez:

When a MapReduce job launches, it schedules one map task per block of data being processed



I do not see one mapper task being created per file. Perhaps he is referring to version 1 of map-reduce, and a lot has changed since then.

Hive version: Hive 1.2.1000.2.6.4.0-91

My table:
create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;

Data:
The following loop creates 100 small files, each containing only a few KB of data.
for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);"; done
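To confirm that the inserts really produced separate files, one option (a sketch; it assumes a non-partitioned managed table, so the statement applies to the table's single directory) is Hive's SHOW TABLE EXTENDED, which reports totalNumberFiles:

```sql
USE temp;
-- Reports totalNumberFiles and totalFileSize for the table directory.
-- After the loop above, totalNumberFiles should be on the order of 100,
-- one small ORC file per INSERT statement.
SHOW TABLE EXTENDED LIKE 'emp_orc_small_files';
```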

However, I see that only one mapper and one reducer task are created for the following query:
[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.

Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)
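A side note on the Tez run above: Tez decides the number of map tasks by grouping input splits, not by launching one task per file, and the grouping window can be tuned. A sketch, using the stock Tez grouping properties; the byte values are illustrative only:

```sql
-- Illustrative only: shrinking the grouping window makes Tez form more,
-- smaller grouped splits, so more map tasks appear for the same input.
SET tez.grouping.min-size=16384;   -- lower bound per grouped split (bytes)
SET tez.grouping.max-size=65536;   -- upper bound per grouped split (bytes)
SELECT max(salary) FROM temp.emp_orc_small_files;
```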

The result is the same with map-reduce:
hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job  -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%,  reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.36 sec   HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989

Best answer

This is because the following configuration is taking effect:

hive.hadoop.supports.splittable.combineinputformat

From the documentation:

Whether to combine small input files so that fewer mappers are spawned.



So, in essence, Hive infers that the input is a group of files smaller than the block size and combines them, reducing the number of mappers needed.
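To observe the per-file mapper behavior the article describes, the combining can be switched off. A sketch; these property names come from Hive 1.x and defaults may vary by distribution:

```sql
-- Disable small-file combining so each file can become its own split.
SET hive.hadoop.supports.splittable.combineinputformat=false;
-- On the MR engine, also fall back from the default CombineHiveInputFormat:
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SELECT max(salary) FROM temp.emp_orc_small_files;
-- With 100 small files, this should now launch roughly one mapper per file.
```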

Regarding this small-file performance problem on Hive, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52283578/
