hadoop - newAPIHadoopRDD reads from HBase take too long (mainly due to DNS.reverseDns)

Tags: hadoop apache-spark dns hbase apache-zookeeper

Recently, while testing my cluster with Spark and HBase, I used newAPIHadoopRDD to read records from an HBase table. I found that newAPIHadoopRDD was very slow, and that the time taken was proportional to the number of region servers.

The Spark debug log below (debug logging was enabled for the test) shows the process:

17/03/02 22:00:30 DEBUG AbstractRpcClient: Use SIMPLE authentication for service ClientService, sasl=false
17/03/02 22:00:30 DEBUG AbstractRpcClient: Connecting to slave111/192.168.10.111:16020
17/03/02 22:00:30 DEBUG ClientCnxn: Reading reply sessionid:0x15a8de8a86f0444, packet:: clientPath:null serverPath:null finished:false header:: 5,3  replyHeader:: 5,116079898,0  request:: '/hbase,F  response:: s{116070329,116070329,1488462020202,1488462020202,0,16,0,0,0,16,116070652} 
17/03/02 22:00:30 DEBUG ClientCnxn: Reading reply sessionid:0x15a8de8a86f0444, packet:: clientPath:null serverPath:null finished:false header:: 6,4  replyHeader:: 6,116079898,0  request:: '/hbase/master,F  response:: #ffffffff000146d61737465723a3136303030fffffff4ffffffa23affffffc8ffffffb6ffffffb1ffffffc21a50425546a12a66d617374657210ffffff807d18ffffffcffffffff4fffffffffffffff9ffffffa82b10018ffffff8a7d,s{116070348,116070348,1488462021202,1488462021202,0,0,0,97546372339663909,54,0,116070348} 
17/03/02 22:00:30 DEBUG AbstractRpcClient: Use SIMPLE authentication for service MasterService, sasl=false
17/03/02 22:00:30 DEBUG AbstractRpcClient: Connecting to master/192.168.10.100:16000
17/03/02 22:00:30 DEBUG RegionSizeCalculator: Region tt,3,1488442069431.21d34666d310df3f180b2dba093d910d. has size 0
17/03/02 22:00:30 DEBUG RegionSizeCalculator: Region tt,,1488442069431.cb8696957957f824f1a16210768bf197. has size 0
17/03/02 22:00:30 DEBUG RegionSizeCalculator: Region tt,1,1488442069431.274ddaa4abb34f0408cac0f33107529c. has size 0
17/03/02 22:00:30 DEBUG RegionSizeCalculator: Region tt,2,1488442069431.05dd84aacb7f2587e325c8baf4c27613. has size 0
17/03/02 22:00:30 DEBUG RegionSizeCalculator: Region sizes calculated
17/03/02 22:00:38 DEBUG Client: IPC Client (480943798) connection to master/192.168.10.100:9000 from hadoop: closed
17/03/02 22:00:38 DEBUG Client: IPC Client (480943798) connection to master/192.168.10.100:9000 from hadoop: stopped, remaining connections 0
17/03/02 22:00:43 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:00:56 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:01:00 DEBUG TableInputFormatBase: getSplits: split -> 0 -> HBase table split(table name: tt, scan: , start row: , end row: 1, region location: slave104)
17/03/02 22:01:10 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:01:23 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:01:30 DEBUG TableInputFormatBase: getSplits: split -> 1 -> HBase table split(table name: tt, scan: , start row: 1, end row: 2, region location: slave102)
17/03/02 22:01:37 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:01:50 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:02:00 DEBUG TableInputFormatBase: getSplits: split -> 2 -> HBase table split(table name: tt, scan: , start row: 2, end row: 3, region location: slave112)
17/03/02 22:02:03 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:02:17 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:02:30 DEBUG ClientCnxn: Got ping response for sessionid: 0x15a8de8a86f0444 after 0ms
17/03/02 22:02:30 DEBUG TableInputFormatBase: getSplits: split -> 3 -> HBase table split(table name: tt, scan: , start row: 3, end row: , region location: slave108)
17/03/02 22:02:30 INFO ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
17/03/02 22:02:30 INFO ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15a8de8a86f0444
17/03/02 22:02:30 DEBUG ZooKeeper: Closing session: 0x15a8de8a86f0444
17/03/02 22:02:30 DEBUG ClientCnxn: Closing client for session: 0x15a8de8a86f0444
17/03/02 22:02:30 DEBUG ClientCnxn: Reading reply sessionid:0x15a8de8a86f0444, packet:: clientPath:null serverPath:null finished:false header:: 7,-11  replyHeader:: 7,116080795,0  request:: null response:: null
17/03/02 22:02:30 DEBUG ClientCnxn: Disconnecting client for session: 0x15a8de8a86f0444
17/03/02 22:02:30 INFO ZooKeeper: Session: 0x15a8de8a86f0444 closed
17/03/02 22:02:30 INFO ClientCnxn: EventThread shut down
17/03/02 22:02:30 DEBUG AbstractRpcClient: Stopping rpc client
17/03/02 22:02:30 DEBUG ClientCnxn: An exception was thrown while closing send thread for session 0x15a8de8a86f0444 : Unable to read additional data from server sessionid 0x15a8de8a86f0444, likely server has closed socket
17/03/02 22:02:30 DEBUG ClosureCleaner: +++ Cleaning closure <function1> (org.apache.spark.rdd.RDD$$anonfun$count$1) +++

I am using Spark 2.1.0 and HBase 1.1.2. The getSplits operation takes far too long. Testing with 1 to 4 region servers, each region server adds about 30 seconds. The HBase table contains no records (it exists only for this test).

Is this normal? Has anyone run into the same problem?

The test code is as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.spark.api.java.JavaPairRDD;

// sc is an existing JavaSparkContext; GLOBAL.TABLE_NAME is the test table name.
Configuration hconf = HBaseConfiguration.create();
hconf.set(TableInputFormat.INPUT_TABLE, GLOBAL.TABLE_NAME);
hconf.set("hbase.zookeeper.quorum", "192.168.10.100");
hconf.set("hbase.zookeeper.property.clientPort", "2181");
// The Scan was created but never used; serialize it into the config so
// TableInputFormat actually applies it.
Scan scan = new Scan();
hconf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

JavaPairRDD<ImmutableBytesWritable, Result> results
        = sc.newAPIHadoopRDD(hconf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

long cnt = results.count();
System.out.println(cnt);
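As an aside, later HBase versions mitigate this cost by caching reverse-DNS results inside TableInputFormatBase, so each region server host is resolved at most once. A minimal sketch of that idea (the class name and structure here are hypothetical, not HBase's actual code):

```java
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolve each region server's address at most once
// and reuse the result, so N regions hosted on the same server cost a
// single reverse lookup instead of N.
class ReverseDnsCache {
    private final Map<InetAddress, String> cache = new HashMap<>();

    String resolve(InetAddress ip) {
        // getCanonicalHostName performs the reverse lookup; on failure it
        // falls back to the textual IP address, so this never throws.
        return cache.computeIfAbsent(ip, InetAddress::getCanonicalHostName);
    }
}
```

With four regions spread over four hosts this does not help, but with many regions per region server it reduces the lookup count from one per region to one per host.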

Edit

After debugging through the HBase source code, I found the cause of the slowness: the reverse DNS call in TableInputFormatBase.java is the culprit.

ipAddressString = DNS.reverseDns(ipAddress, null);

Now, how can I fix this? Can I add some DNS-to-IP mappings to the HBase configuration?
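HBase does not take literal DNS-to-IP pairs, but TableInputFormatBase does let you point the reverse lookup at a specific nameserver: the second argument of DNS.reverseDns is taken from the hbase.nameserver.address property. If a DNS server can answer PTR queries for the cluster subnet, it can be set in hbase-site.xml (the IP below is a placeholder):

```xml
<!-- hbase-site.xml: nameserver used by TableInputFormatBase for reverse DNS.
     192.168.10.1 is a placeholder; use a DNS server that actually holds
     PTR records for your cluster subnet. -->
<property>
  <name>hbase.nameserver.address</name>
  <value>192.168.10.1</value>
</property>
```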

Best answer

I used nslookup to do a reverse lookup of 192.168.10.100 and got the following result:

;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

So I ran the following commands:

sudo iptables -t nat -A POSTROUTING -s 192.168.10.0/24 -o em4 -j MASQUERADE
sudo sysctl -w net.ipv4.ip_forward=1
sudo route add default gw 'mygateway' em4

After that, the problem was solved. (These commands make this host forward traffic from the 192.168.10.0/24 subnet out through em4 and set a default gateway, so the cluster nodes' DNS queries can reach a nameserver instead of timing out.)

A similar question about hadoop - newAPIHadoopRDD reads from HBase taking too long (mainly due to DNS.reverseDns) was found on Stack Overflow: https://stackoverflow.com/questions/42558427/
