hadoop - Data size increases in HBase

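I am trying to import data from MySQL into HBase using Sqoop. The MySQL table has around 9 million records and is nearly 1.2 GB in size. The replication factor of the Hadoop cluster is 3.
These are the issues I am facing: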

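1. The data size after importing into HBase is more than 20 GB! Ideally it should be close to, say, 5 GB (1.2 GB * 3 + some overhead).
2. The HBase table has VERSIONS defined as 1. If I import the same table from MySQL again, the file size under /hbase/ increases (it nearly doubles), although the row count in the HBase table stays the same. This looks odd, because I am inserting the same rows into HBase, so the file size should stay constant, just as the row count does.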

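As far as I understand, the file size should not increase in the second case when I import the same set of rows, because the maximum number of versions kept for each entry should be only one.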

Any help would be appreciated.

Best answer

It depends. According to this blog:

So to calculate the record size: Fixed part needed by KeyValue format = Key Length + Value Length + Row Length + CF Length + Timestamp + Key Type = (4 + 4 + 2 + 1 + 8 + 1) = 20 Bytes

Variable part needed by KeyValue format = Row + Column Family + Column Qualifier + Value

Total bytes required = Fixed part + Variable part

So for the above example let's calculate the record size:

First Column = 20 + (4 + 4 + 10 + 3) = 41 Bytes
Second Column = 20 + (4 + 4 + 9 + 3) = 40 Bytes
Third Column = 20 + (4 + 4 + 8 + 6) = 42 Bytes

Total Size for the row1 in above example = 123 Bytes

To Store 1 billion such records the space required = 123 * 1 billion = ~ 123 GB
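The arithmetic quoted above can be reproduced with a short Python sketch. It assumes only the 20-byte fixed part and the per-cell byte lengths given in the quote. The function names and the `replication` parameter are hypothetical helpers introduced for illustration, and the figure is a raw KeyValue estimate: it does not account for compression, HFile indexes, or store files that have not been compacted yet.

```python
# Fixed part of one KeyValue, as listed in the quoted blog:
# key length + value length + row length + CF length + timestamp + key type
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1  # 20 bytes


def keyvalue_size(row_len, family_len, qualifier_len, value_len):
    """Estimated bytes for one cell: fixed part plus variable part."""
    return FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len


def estimate_table_bytes(row_size, row_count, replication=1):
    """Hypothetical helper: raw estimate for row_count identical rows."""
    return row_size * row_count * replication


# (row, column family, qualifier, value) lengths from the quoted example
cells = [(4, 4, 10, 3), (4, 4, 9, 3), (4, 4, 8, 6)]
row_size = sum(keyvalue_size(*cell) for cell in cells)

print(row_size)                                        # bytes per row
print(estimate_table_bytes(row_size, 1_000_000_000))   # bytes for 1 billion rows
```

Note that the row key and column family name are repeated in every cell, so a table with many columns per row pays that cost once per column. Plugging in the real key, family, qualifier, and value lengths from your own schema, together with the HDFS replication factor, gives a more realistic target than the size of the MySQL table.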

I think your calculation is simply incorrect. If you share your schema design with us, we can work through the math.

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/18656483/
