java - In a Sqoop file import, I want to control which data is imported into each file split by the defined mappers

Tags: java hadoop sqoop

MySQL -> select * from employee

empno | empname      | salary
======|==============|=======
  101 | Ram          |   5000
  102 | Hari         |   7000
  104 | Vamshi       |   7000
  103 | Revathy      |   7000
  105 | Jaya         |   9000
  106 | Suresh       |   8000
  107 | Ramesh       |   9000
  108 | Prasana      |  10000
  109 | Ramsamy      |  20000
  110 | Singaram     |  30000
  200 | ramanathan   |  30000
  201 | Victor       |  33000
  202 | Naveen       |  33000
  203 | Karthik      |  33000
  204 | Karthikeyan  |  33000
  205 | Somasundaram |  43000
  301 | Test1        |  50000
  302 | Test2        |  60000
  303 | Test3        |  70000

Command in Sqoop:

sqoop import --connect jdbc:mysql://<hostname>/test --username <username> --password <password> \
  --table employee --direct --verbose --split-by salary

Running the above command, Sqoop takes min(salary) and max(salary) and moves the table to HDFS, writing 10 records to the first file, 3 records to the second file, 3 records to the third file, and 3 records to the last file.

15/07/03 17:32:37 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`salary`), MAX(`salary`) FROM employee
15/07/03 17:32:37 DEBUG db.IntegerSplitter: Splits: [5,000 to 70,000] into 4 parts
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 5,000
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 21,250
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 37,500
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 53,750
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 70,000
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 5000' and upper bound '`salary` < 21250'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 21250' and upper bound '`salary` < 37500'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 37500' and upper bound '`salary` < 53750'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 53750' and upper bound '`salary` <= 70000'
15/07/03 17:32:37 INFO mapreduce.JobSubmitter: number of splits:4

I want to know how it decides the number of records that end up in each file. Is this customizable?

Best Answer

The salary range is 5000 to 70000 (i.e. min 5000, max 70000). That range is divided evenly into 4 parts:

(70000 - 5000) / 4 = 16250

Therefore:

split 1: from 5000 to 21250 (= 5000 + 16250)
split 2: from 21250 to 37500 (= 21250 + 16250)
split 3: from 37500 to 53750 (= 37500 + 16250)
split 4: from 53750 to 70000 (= 53750 + 16250)
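The boundary computation above can be sketched in Python. This is a simplified sketch of the even-interval logic shown in the IntegerSplitter log output, not Sqoop's actual implementation (which also distributes any division remainder across splits):

```python
def split_points(lo, hi, num_splits):
    """Evenly divide the closed range [lo, hi] into num_splits intervals.

    Simplified sketch of the boundary math visible in Sqoop's
    IntegerSplitter debug log; not the real Sqoop code.
    """
    step = (hi - lo) // num_splits
    points = [lo + i * step for i in range(num_splits)]
    points.append(hi)  # last split uses a closed upper bound: `salary` <= max
    return points

# Reproduces the boundaries from the log: 4 mappers over salary 5000..70000
print(split_points(5000, 70000, 4))  # → [5000, 21250, 37500, 53750, 70000]

# Changing the mapper count (-m / --num-mappers) changes the boundaries:
print(split_points(5000, 70000, 2))  # → [5000, 37500, 70000]
```

So the number of records per file is not directly customizable: Sqoop splits the *value range* of the --split-by column evenly, which means a skewed column produces skewed files. What you can control is the number of splits via -m/--num-mappers, the choice of a more uniformly distributed --split-by column, or the min/max bounds via --boundary-query.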

Regarding "java - In a Sqoop file import, I want to control which data is imported into each file split by the defined mappers", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31205275/
