mysql - Optimizing Sqoop data import from MySQL to Hive using import-all-tables

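I am using Sqoop 1.4.6 to import data from MySQL into Hive with the import-all-tables option. The result is fine, but the import itself is slow. For example, one database contains 40-50 tables with well under 1 million rows in total, and it takes about 25-30 minutes to complete. After investigating, it appears that most of the time is spent initializing Hive for each imported table. A plain mysqldump of the same database finishes in under 1 minute. So the question is how to reduce this initialization time, if that is indeed the cause, for example by using a single Hive session?

The import command is: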

sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true --compress --compression-codec=snappy --num-mappers 1 --connect "jdbc:mysql://..." --username ... --password ... --null-string '\\N' --null-non-string '\\N' --hive-drop-import-delims --hive-import --hive-overwrite --hive-database ... --as-textfile --exclude-tables ... --warehouse-dir=...

Update:

Sqoop version: 1.4.6.2.5.3.0-37

Hive version: 1.2.1000.2.5.3.0-37

Possibly related to:

https://issues.apache.org/jira/browse/HIVE-10319

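Best answer

Remove the --num-mappers 1 option so the import runs with the default 4 mappers, or raise it to a higher number such as --num-mappers 8 (if the hardware allows). This runs the import with more parallel tasks for tables that have a primary key. In addition, use the --autoreset-to-one-mapper option, which falls back to 1 mapper for tables without a primary key. You can also use --direct mode: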

sqoop import-all-tables \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:mysql://..." --username ... \
--password ... \
--compress --compression-codec=snappy \
--num-mappers 8 \
--autoreset-to-one-mapper \
--direct \
--null-string '\\N' \
...

Let us know whether this improves performance...

Update:

--fetch-size=<n> - where <n> is the number of entries that Sqoop fetches at a time. The default is 1000.

Increase the value of the fetch-size argument based on the volume of data that needs to be read. Set the value based on the available memory and bandwidth.
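To get a feel for what "based on the available memory" means, here is a rough back-of-the-envelope sketch. It assumes that one fetch buffers roughly fetch-size rows per mapper; the function name estimate_fetch_mb and the average row size are hypothetical and not part of Sqoop, so measure the row size for your own tables.

```shell
#!/bin/sh
# Rough estimate of the data buffered by one fetch, per mapper, in MB.
# This ignores JDBC driver and JVM object overhead, so treat it as a lower bound.
estimate_fetch_mb() {
    fetch_size=$1      # value you plan to pass to --fetch-size
    avg_row_bytes=$2   # ASSUMED average row size in bytes (hypothetical figure)
    echo $(( fetch_size * avg_row_bytes / 1024 / 1024 ))
}

# Example: 10000 rows per fetch at an assumed 512 bytes per row.
estimate_fetch_mb 10000 512
```

If the estimate is a small fraction of the mapper heap, there is probably room to raise the fetch size further.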

Increase the mapper memory from its current value to a higher number, for example: sqoop import-all-tables -D mapreduce.map.memory.mb=2048 -D mapreduce.map.java.opts=-Xmx1024m <sqoop options>
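The two memory settings are related: the JVM heap (-Xmx) has to fit inside the YARN container (mapreduce.map.memory.mb). A minimal sketch for deriving both from one container size follows; the helper name mapper_memory_opts and the heap percentage are assumptions for illustration, and a 50% share reproduces the values in the example above.

```shell
#!/bin/sh
# Print the two -D options for a given container size (MB) and heap share (%).
# The heap must stay below the container size, leaving room for non-heap memory.
mapper_memory_opts() {
    container_mb=$1    # container size for each map task, in MB
    heap_percent=$2    # ASSUMED share of the container given to the JVM heap
    heap_mb=$(( container_mb * heap_percent / 100 ))
    echo "-D mapreduce.map.memory.mb=${container_mb} -D mapreduce.map.java.opts=-Xmx${heap_mb}m"
}

# Same values as the example in the answer: a 2048 MB container with a 1024 MB heap.
mapper_memory_opts 2048 50
```

The printed options belong directly after the tool name, before any Sqoop-specific options.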

Sqoop Performance Tuning Best Practices

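Tune the following Sqoop arguments in the JDBC connection or Sqoop mapping to optimize performance:

batch (for exports)
split-by and boundary-query (not needed, since we are using --autoreset-to-one-mapper; they cannot be used with import-all-tables)
direct
fetch-size
num-mappers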
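The import-side arguments from this list can be combined as in the dry-run sketch below. It only prints the command rather than running it. The function name build_import_cmd is hypothetical, the "..." in the JDBC URL is elided in the original post and left as-is, and --direct is left out here because MySQL direct mode goes through mysqldump rather than JDBC fetches.

```shell
#!/bin/sh
# Dry run: assemble the tuned import command and print it instead of executing it.
build_import_cmd() {
    mappers=$1      # value for --num-mappers
    fetch_size=$2   # value for --fetch-size
    # Generic -D options must come right after the tool name,
    # before any Sqoop-specific options.
    printf '%s ' sqoop import-all-tables \
        -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
        --connect 'jdbc:mysql://...' \
        --num-mappers "$mappers" \
        --autoreset-to-one-mapper \
        --fetch-size "$fetch_size" \
        --hive-import --hive-overwrite
    printf '\n'
}

# Print a candidate command with 8 mappers and a fetch size of 10000.
build_import_cmd 8 10000
```

Printing first makes it easy to review the option order before pasting in credentials and the remaining options from the original command.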

Source: a similar question on Stack Overflow about optimizing Sqoop data import from MySQL to Hive using import-all-tables: https://stackoverflow.com/questions/42249962/
