hadoop - AWS Hive + Kinesis on EMR = Understanding check-pointing

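I have an AWS Kinesis stream, and I created an external table in Hive that points at it. I then created a DynamoDB table for the checkpoints and, following the documentation, set these properties in my Hive query: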

set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=my_dynamodb_table;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=my_logical_name;
set kinesis.checkpoint.iteration.no=0;

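I have the following questions:

Do I always have to start with iteration.no set to 0?
Does this always start from the beginning of the script (oldest Kinesis record about to be evicted)?
Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?
To re-execute the script on the same data, is it enough to re-run the query with the same execution number?
If I execute a select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?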

Given the DynamoDB checkpoint entry:

{"startSeqNo":"1234",
 "endSeqNo":"5678",
 "closed":false}

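What's the meaning of the closed field?
Are sequence numbers incremental, and is there a relation between the start and end (e.g. end - start = number of records read)?
I noticed that sometimes there is only the endSeqNum (no startSeqNum), how should I interpret that?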

I know that is a lot of questions, but I could not find these answers in the documentation.

Best answer

Check out the Kinesis documentation and the Kinesis Storage Handler Readme, which contain answers to many of your questions.

Do I always have to start with iteration.no set to 0?

Yes, unless you are doing some advanced logic that requires you to skip a known or already processed part of the stream.

Does this always start from the beginning of the script (oldest Kinesis record about to be evicted)?

Yes.

Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?

This is handled by the Hive script, since it queries all the data in the Kinesis stream on every run.
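If a scheduled job does need to pass an explicit iteration number, the bookkeeping could look like the sketch below. It assumes, without confirmation from this answer, that each run should use the previous run's iteration number plus one and that the wrapper script remembers the last completed iteration itself. `hive_checkpoint_preamble` and `next_iteration` are hypothetical helper names; the property names and values are the ones from the question.

```python
def next_iteration(last_completed):
    """Hypothetical bookkeeping: None means no run has completed yet."""
    return 0 if last_completed is None else last_completed + 1


def hive_checkpoint_preamble(iteration_no,
                             table="my_dynamodb_table",
                             logical_name="my_logical_name"):
    """Build the Hive `set` statements for one scheduled run."""
    if iteration_no < 0:
        raise ValueError("iteration_no must be >= 0")
    # Property names copied from the question; only the iteration number varies.
    settings = {
        "kinesis.checkpoint.enabled": "true",
        "kinesis.checkpoint.metastore.table.name": table,
        "kinesis.checkpoint.metastore.hash.key.name": "HashKey",
        "kinesis.checkpoint.metastore.range.key.name": "RangeKey",
        "kinesis.checkpoint.logical.name": logical_name,
        "kinesis.checkpoint.iteration.no": str(iteration_no),
    }
    return "\n".join(f"set {key}={value};" for key, value in settings.items())
```

A cron wrapper could call `hive_checkpoint_preamble(next_iteration(last))`, prepend the result to the query file, and record the new number only after Hive exits successfully.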

To re-execute the script on the same data, is it enough to re run the query with the same execution number?

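Since Kinesis data is a 24-hour sliding window, the data has (probably) changed since your last query, so you would likely want to query all the records again in the Hive job.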

If I execute a select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?

Yes, you should expect the results to change as the stream changes.

Given the DynamoDB checkpoint entry: What's the meaning of the closed field?

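Although this is an internal detail of the Kinesis Storage Handler, I believe it indicates whether the shard is a parent shard, that is, whether it is open and accepting new data or closed and no longer accepting new data. If you scale your stream up or down, the parent shards exist for another 24 hours and contain all the data written up to the point you scaled, but no new data is inserted into them.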

Are sequence number incremental and is there a relation between the start and end (EG: end - start = number of records read)?

New sequence numbers generally increase over time; that is the only guidance Amazon gives on this.
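Because sequence numbers are stored as decimal strings, as in the checkpoint entry from the question, ordering them means comparing them as integers rather than as text. The sketch below assumes only that later records tend to get larger numbers. It does not treat the difference between two sequence numbers as a record count, since nothing in the guidance above supports that. `compare_sequence_numbers` is a hypothetical helper name.

```python
def compare_sequence_numbers(a, b):
    """Return -1, 0 or 1 by comparing two sequence numbers numerically."""
    x, y = int(a), int(b)  # Python ints are arbitrary precision
    return (x > y) - (x < y)


# Plain string comparison gets this wrong: "999" sorts after "1234" as text.
text_order_says_smaller = "999" < "1234"
numeric_order = compare_sequence_numbers("999", "1234")
```

Converting with `int` avoids the overflow that a fixed-width integer type could hit on very long sequence numbers.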

I noticed that sometimes there is only the endSeqNum (no startSeqNum), how should I interpret that?

It means the shard is open and still accepting new data (it is not a parent shard).
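To see which of these fields a given checkpoint item actually carries, the entry can be parsed as JSON. This is a minimal sketch that uses the field names from the question's example and assumes the item has already been fetched from DynamoDB as a JSON string. `summarize_checkpoint` is a hypothetical helper, and it only reports what is present without assigning any further meaning to the fields.

```python
import json


def summarize_checkpoint(raw):
    """Report which checkpoint fields are present in a JSON entry."""
    entry = json.loads(raw)
    return {
        "has_start": "startSeqNo" in entry,
        "has_end": "endSeqNo" in entry,
        # A missing "closed" field is reported as False here (an assumption).
        "closed": bool(entry.get("closed", False)),
    }


example = '{"startSeqNo":"1234", "endSeqNo":"5678", "closed":false}'
summary = summarize_checkpoint(example)
```

Running this over every item in the checkpoint table is a quick way to spot entries that have an end sequence number but no start.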

Regarding hadoop - AWS Hive + Kinesis on EMR = Understanding check-pointing, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30035344/
