performance - How to force Hive to distribute rows evenly among reducers when doing INSERT OVERWRITE into a partitioned table from another table

Tags: performance hive hiveql

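I want to insert into a partitioned Hive table from another Hive table. All of the data goes into a single partition of the target table. The problem is that every reducer finishes very quickly except one, which runs for a long time because all of the work goes to that single reducer.

I want to find a way to spread the work evenly across all reducers. Is there a way to do this? How can I improve the performance of the insert overwrite?

Source table DDL: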

 CREATE EXTERNAL TABLE employee ( id INT,first_name String,latst_name String,email String,gender String) STORED AS TEXTFILE LOCATION '/emp/data'

Target table DDL:

 CREATE EXTERNAL TABLE employee_stage ( id INT,first_name String,latst_name String,email String,gender String) PARTITIONED BY (batch_id bigint) STORED AS ORC LOCATION '/stage/emp/data'

Here is a snapshot of the data:

1   Helen       Perrie       [email protected]   Female
2   Rafaelita   Jancso       [email protected]   Female
3   Letti       Kelley       [email protected]   Female
4   Adela       Dmisek       [email protected]   Female
5   Lay         Reyner       [email protected]   Male
6   Robby       Felder       [email protected]   Male
7   Thayne      Brunton      [email protected]   Male
8   Lorrie      Roony        [email protected]   Male
9   Hodge       Straun       [email protected]   Male
10  Gawain      Tomblett     [email protected]   Male
11  Carey       Facher       [email protected]   Male
12  Pamelina    Elijahu      [email protected]   Female
13  Carmelle    Dabs         [email protected]   Female
14  Moore       Baldrick     [email protected]   Male
15  Sheff       Morin        [email protected]   Male
16  Zed         Eary         [email protected]   Male
17  Angus       Pollastrone  [email protected]   Male
18  Moises      Hubach       [email protected]   Male
19  Lilllie     Beetham      [email protected]   Female
20  Mortimer    De Hooge     [email protected]   Male

The source table contains more than 100M records.

This is the HQL I am using:

insert overwrite table employee_stage
PARTITION (batch_id)
SELECT
  id,
  first_name,
  latst_name,
  email,
  gender,
  123456789 as batch_id
FROM employee;

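The data ends up in a single partition.

How can I improve performance in this case? Is there a way to distribute the rows evenly across all reducers?

Best answer

I assume you are not doing JOINs or other heavy transformations in your insert overwrite query, and that the skew really happens during the insert. If you were, the question would not be about the insert.

Try adding distribute by batch_id to your insert query and rerun it. If the skew is still there, check your data: some particular batch_id may have too much data, or there may be a lot of NULLs. There are different approaches to handling skewed data. One of them is to filter out the skewed keys and load them separately. Check the logs of the long-running reducer on the Job Tracker; they will give you more information about where the problem is.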
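As a minimal sketch of that suggestion, here is the query from the question with distribute by added. The two SET lines are an assumption on my part: they are the usual session settings for dynamic-partition inserts and do not appear in the question, so drop them if your session already has them.

```sql
-- Assumed session settings for a dynamic-partition insert (not in the question).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The question's query, unchanged except for the DISTRIBUTE BY clause at the end.
INSERT OVERWRITE TABLE employee_stage
PARTITION (batch_id)
SELECT
  id,
  first_name,
  latst_name,
  email,
  gender,
  123456789 AS batch_id
FROM employee
DISTRIBUTE BY batch_id;
```

Note that in this particular query every row carries the same batch_id, so hashing on batch_id alone sends all rows to the same reducer. A variant you might try, which is my assumption and not something verified against this data, is adding a second expression such as distribute by batch_id, pmod(id, 32), where 32 is an arbitrary illustrative bucket count.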

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/44039475/
