hadoop - HBase sorting efficiency

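In my HBase table, I have an employee named "Simon" at row 100, and another employee with the same name, "Simon", at row 4000. Now I want to fetch all employees named "Simon" from my Employee table. The row key is each employee's SSN.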

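My question is: if I issue a query to fetch all employees named "Simon", how efficient is the search in HBase? The first "Simon" is at row 100 and the second is at row 4000, so to find the employees named "Simon", HBase has to go through the whole table. How efficient is the search, given that we are doing a full table scan in this case?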

Best answer

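If you have to do a full table scan (and with SSN as the row key, you do), then it is not going to be a great solution. In fact, if you have a very large number of rows, it will be a terrible solution.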
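To make the cost concrete, here is a minimal sketch of what a lookup by name amounts to when the row key is the SSN. It uses a plain Python dictionary as a stand-in for the table, not the HBase client API, and the SSNs and names are made-up illustration data.

```python
# Toy in-memory stand-in for an HBase table: row key (SSN) -> columns.
# All SSNs and names below are invented for illustration.
employee_table = {
    "111-11-1111": {"name": "Simon"},
    "222-22-2222": {"name": "Alice"},
    "333-33-3333": {"name": "Simon"},
    "444-44-4444": {"name": "Bob"},
}


def find_by_name_full_scan(table, name):
    """Return (matching row keys, number of rows examined)."""
    examined = 0
    matches = []
    # Rows are visited in row-key order, which says nothing about the name,
    # so every row has to be examined.
    for row_key in sorted(table):
        examined += 1
        if table[row_key]["name"] == name:
            matches.append(row_key)
    return matches, examined


matches, examined = find_by_name_full_scan(employee_table, "Simon")
print(matches, examined)
```

The number of rows examined always equals the table size, no matter how few rows match.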

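What most relational database management systems (or "SQL databases") do to solve this problem is create an index. Since you are using a "NoSQL database", it will not create the index for you automatically.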

Let's look at how to create an index by hand so that a particular kind of query can be served efficiently.

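Suppose you have a set of entities S, where each entity E in S has a unique key K(E) and an attribute value V(E). Suppose further that your entities live in an HBase table, one per row, with K(E) as the row key of each entity E.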

An index of S with respect to V is another table, which usually takes one of three forms.

Index table 1

Suppose V(E) is also unique for each entity E. Then an index of S with respect to V is a table with one entity per row, where the row key is V(E) and a column holds K(E).

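To look up an entity E by V(E), simply go to that row to find K(E), then get row K(E) from the main table if you need the rest of the entity.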

If your attribute values V(E) are unique, use this approach.

Think of a table of Employee entities, where each employee has a unique EmployeeID within the company, K(E). The main Employee table could use the unique EmployeeID as the row key, and the Employee_SSN_Index could use the employee's SSN, V(E) (which is also unique), as its row key. This provides a fast lookup of employees by their SSNs.
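A minimal sketch of this first form, using Python dictionaries as in-memory stand-ins for the main table and the index table (this is not the HBase client API). The function names, the `ssn` column name, and the sample rows are assumptions made for illustration.

```python
# Main table stand-in: EmployeeID K(E) -> columns. Data is invented.
employees = {
    "E001": {"name": "Simon", "ssn": "111-11-1111"},
    "E002": {"name": "Alice", "ssn": "222-22-2222"},
}


def build_unique_index(table, attribute):
    """Build an index with row key V(E) and a single value K(E)."""
    index = {}
    for key, columns in table.items():
        value = columns[attribute]
        if value in index:
            # This form only works when V(E) is unique.
            raise ValueError(f"duplicate value for {attribute}: {value}")
        index[value] = key
    return index


def lookup_by_unique_value(table, index, value):
    """Two point lookups: index row V(E) -> K(E), then main row K(E)."""
    key = index.get(value)
    return None if key is None else table[key]


ssn_index = build_unique_index(employees, "ssn")
print(lookup_by_unique_value(employees, ssn_index, "222-22-2222"))
```

Both steps are lookups by row key, so the cost does not grow with the number of employees the way a full scan does.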

Index table 2

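Suppose V(E) may not be unique for each entity E; that is, duplicates may exist. Then an index of S with respect to V is a table with one entity per row, where the row key is V(E) ++ K(E).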

To find all entities E with a given V(E), simply do a prefix scan for rows whose key begins with V(E).

There is a variant for the case when the length of V(E) is not fixed, so that it may be impossible to tell where V(E) ends and K(E) begins. A separator may be placed between V(E) and K(E) in the row key, for example V(E) ++ "|" ++ K(E). In this case, the prefix to scan is V(E) ++ "|".

An Employee_Department_Index table could use the DepartmentID of the department an employee works in as the attribute value V(E).
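The prefix scan can be sketched with a sorted list of row keys standing in for the index table, since HBase keeps rows sorted by row key. This is an in-memory illustration rather than the HBase client API. It indexes the employee name from the question instead of a DepartmentID, and the sample rows, function names, and the assumption that "|" never occurs inside a name are all hypothetical.

```python
import bisect

SEPARATOR = "|"  # assumed never to occur inside an attribute value

# Main table stand-in: EmployeeID K(E) -> columns. Data is invented.
employees = {
    "E100": {"name": "Simon"},
    "E200": {"name": "Simone"},
    "E300": {"name": "Alice"},
    "E4000": {"name": "Simon"},
}


def build_composite_index(table, attribute):
    """Index row keys of the form V(E) ++ "|" ++ K(E), kept sorted."""
    return sorted(
        columns[attribute] + SEPARATOR + key for key, columns in table.items()
    )


def prefix_scan(sorted_row_keys, prefix):
    """Seek to the first key >= prefix, then read until the prefix stops matching."""
    start = bisect.bisect_left(sorted_row_keys, prefix)
    results = []
    for row_key in sorted_row_keys[start:]:
        if not row_key.startswith(prefix):
            break
        results.append(row_key)
    return results


def find_keys_by_value(index, value):
    """Return K(E) for every entity whose attribute equals value."""
    prefix = value + SEPARATOR
    return [row_key[len(prefix):] for row_key in prefix_scan(index, prefix)]


name_index = build_composite_index(employees, "name")
print(find_keys_by_value(name_index, "Simon"))
```

Scanning with the prefix "Simon|" skips the row for "Simone", which a scan on the bare prefix "Simon" would also return. Only the matching rows are read, however far apart the corresponding employees sit in the main table.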

Index table 3

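Suppose V(E) may not be unique for each entity E; that is, duplicates may exist. Then an index of S with respect to V is a table with one group of entities per row, where the row key is V(E) and a column family F holds one column with qualifier K(E) per entity. That is, entities are grouped into rows by attribute value.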

To find all entities E with a given V(E), get row V(E), requesting all columns in column family F.

This approach should really be kept to the case where the number of entities in each row of the index is small.
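A minimal sketch of this third form, again with nested Python dictionaries as a stand-in for the index table rather than the HBase client API. The family name "F" follows the text above; the sample rows, the function names, and the choice to leave each cell value empty are assumptions for illustration.

```python
# Main table stand-in: EmployeeID K(E) -> columns. Data is invented.
employees = {
    "E100": {"name": "Simon"},
    "E300": {"name": "Alice"},
    "E4000": {"name": "Simon"},
}


def build_grouped_index(table, attribute, family="F"):
    """Index rows keyed by V(E); each row has one column per K(E) in the family."""
    index = {}
    for key, columns in table.items():
        row = index.setdefault(columns[attribute], {family: {}})
        # The qualifier K(E) carries the information; the cell value is unused.
        row[family][key] = b""
    return index


def find_keys_by_value(index, value, family="F"):
    """A single row get: return every qualifier K(E) stored under row V(E)."""
    row = index.get(value)
    return sorted(row[family]) if row else []


name_index = build_grouped_index(employees, "name")
print(find_keys_by_value(name_index, "Simon"))
```

Every entity sharing a value lands in the same index row, which is why the approach suits values shared by only a few entities.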

For more on hadoop - HBase sorting efficiency, see the similar question on Stack Overflow: https://stackoverflow.com/questions/24947725/
