java - In a Sqoop file import, I want to control which data is imported into each file split by the defined mappers

Tags: java hadoop sqoop

MySQL -> select * from employee

empno | empname      | salary
======|==============|=======
  101 | Ram          |   5000
  102 | Hari         |   7000
  104 | Vamshi       |   7000
  103 | Revathy      |   7000
  105 | Jaya         |   9000
  106 | Suresh       |   8000
  107 | Ramesh       |   9000
  108 | Prasana      |  10000
  109 | Ramsamy      |  20000
  110 | Singaram     |  30000
  200 | ramanathan   |  30000
  201 | Victor       |  33000
  202 | Naveen       |  33000
  203 | Karthik      |  33000
  204 | Karthikeyan  |  33000
  205 | Somasundaram |  43000
  301 | Test1        |  50000
  302 | Test2        |  60000
  303 | Test3        |  70000

Command in Sqoop:

sqoop import --connect jdbc:mysql://<hostname>/test --username <username> --password <password> \
  --table employee --direct --verbose --split-by salary

Running the above command, Sqoop takes min(salary) and max(salary) and moves the table to HDFS, writing 10 records to the first file, 3 records to the second file, 3 records to the third file, and 3 records to the last file.

15/07/03 17:32:37 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`salary`), MAX(`salary`) FROM employee
15/07/03 17:32:37 DEBUG db.IntegerSplitter: Splits: [5,000 to 70,000] into 4 parts
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 5,000
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 21,250
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 37,500
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 53,750
15/07/03 17:32:37 DEBUG db.IntegerSplitter: 70,000
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 5000' and upper bound '`salary` < 21250'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 21250' and upper bound '`salary` < 37500'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 37500' and upper bound '`salary` < 53750'
15/07/03 17:32:37 DEBUG db.DataDrivenDBInputFormat: Creating input split with lower bound '`salary` >= 53750' and upper bound '`salary` <= 70000'
15/07/03 17:32:37 INFO mapreduce.JobSubmitter: number of splits:4

I want to know how it decides the number of records that end up in each file. Is this customizable?

Best Answer

The salary range is 5000 to 70000 (i.e. min 5000, max 70000). That range is divided evenly into 4 parts:

(70000 - 5000) / 4 = 16250

Therefore:

split 1: from 5000 to 21250 (= 5000 + 16250)
split 2: from 21250 to 37500 (= 21250 + 16250)
split 3: from 37500 to 53750 (= 37500 + 16250)
split 4: from 53750 to 70000 (= 53750 + 16250)
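The boundary computation above can be sketched in Python. This is a simplified sketch of the even-interval logic shown in the IntegerSplitter log output, not Sqoop's actual implementation (which also distributes any division remainder across splits):

```python
def split_points(lo, hi, num_splits):
    """Evenly divide the closed range [lo, hi] into num_splits intervals.

    Simplified sketch of the boundary math visible in Sqoop's
    IntegerSplitter debug log; not the real Sqoop code.
    """
    step = (hi - lo) // num_splits
    points = [lo + i * step for i in range(num_splits)]
    points.append(hi)  # last split uses a closed upper bound: `salary` <= max
    return points

# Reproduces the boundaries from the log: 4 mappers over salary 5000..70000
print(split_points(5000, 70000, 4))  # → [5000, 21250, 37500, 53750, 70000]

# Changing the mapper count (-m / --num-mappers) changes the boundaries:
print(split_points(5000, 70000, 2))  # → [5000, 37500, 70000]
```

So the number of records per file is not directly customizable: Sqoop splits the *value range* of the --split-by column evenly, which means a skewed column produces skewed files. What you can control is the number of splits via -m/--num-mappers, the choice of a more uniformly distributed --split-by column, or the min/max bounds via --boundary-query.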

Regarding "java - In a Sqoop file import, I want to control which data is imported into each file split by the defined mappers", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31205275/
