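cassandra - Understanding OpsCenter metrics using nodetool cfstats and cfhistograms results

I am benchmarking a Cassandra cluster, so I am using the cassandra-stress tool. I was able to insert 1M records into one of the tables with a replication factor of 2, CL quorum, and 40 threads on 2 nodes, running stress from 11.43.600.66.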

./cassandra-stress user profile=demo.yaml n=1000000 ops(insert=1,likelyquery0=2) cl=quorum -node 11.43.600.66,11.43.600.65 -rate threads=40

**demo.yaml script:**  
columnspec:  
  - name: user_name  
    size: gaussian(20..45)  
    population: gaussian(10000..50000)  
  - name: system_name  
    size: gaussian(20..45)  
    population: gaussian(50..60)  
  - name: time  
    size: uniform(15..25)  
    population: uniform(100000..1000000)  
  - name: request_uri  
    size: gaussian(50..80)  
    population: gaussian(100..150)  

insert:  
  partitions: fixed(1)            
  select:  fixed(1)/1000        
  batchtype: UNLOGGED

I am trying to understand the results of nodetool cfstats and cfhistograms against OpsCenter. The table-level metrics from OpsCenter for write and read request latency (ms/op) are:
[Images: WriteRequestLatency graphs, ReadRequestLatency graphs]
The cfhistograms results used to work out write and read latency. Latencies are in microseconds:
[Image: cfhistograms stats]
The cfstats results, in milliseconds:
[Image: cfstats results]

a) As per the results of cfhistograms and cfstats  
Write Latency: 0.0117ms = 11.7 micros
Read Latency:  0.0943ms = 94.3 micros
This would approximately match the results at 50% as 
Write Latency: 10micros
Read Latency: 103micros

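Question 1: Which percentile do cfstats and cfhistograms base their results on? I would always consider the 95th percentile, but at 95% the cfstats results do not match cfhistograms here. Am I considering something wrong?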

b) From OpsCenter results:
Write Latency: 1.6ms/op = 1600 micros
Read Latency:  1.9ms/op = 1900 micros

Question 2: Why do the cfhistograms and OpsCenter results not match? Could it be that the OpsCenter y-axis values for write and read request latency should be in micros/op rather than ms/op?

Versions:
Cassandra 2.1.8.689
OpsCenter 5.2.2

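Please let me know if I am wrong..!!
Thanks

Best answer

These are two different kinds of metrics, and the statistics behind them are tracked differently.

First, the cluster read/write latency is the coordinator's view, which includes possible cross-node communication. From OpsCenter, if you hover over the metric to see its definition: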

The average response times (in milliseconds) of a client write. The time period starts when a node receives a client write request, and ends when the node responds back to the client. Depending on consistency level and replication factor, this may include the network latency from writing to the replicas.

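In cfhistograms you are looking at the local latency on that node, which OpsCenter also stores under the CF: or TBL: metrics (depending on version). There is a percentile graph that actually shows this: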

The min, median, max, 90th, and 99th percentile of the response time to read data from the memtable and sstables for a specific table. The elapsed time from when the replica receives the request from a coordinator and sends a response.

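So, going by the descriptions of these two metrics, they measure reads and writes at different levels.

In addition, the statistics used to measure them are different.

The average latency takes the total time spent on coordinator writes since the last check and divides it by the number of coordinator writes since the last check (see https://github.com/apache/cassandra/blob/94ff639429a65acb5f122ed559e98dd60a40e42d/src/java/org/apache/cassandra/metrics/LatencyMetrics.java#L125 ). This can be far from what you expect: there may be a large number of sub-millisecond requests, yet a single 30-second request can pull the average up to 1 ms.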
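To make that effect concrete, here is a minimal sketch of a mean computed from counter deltas. The dictionary keys `count` and `total_latency_us` and the workload numbers are assumptions made up for illustration; they are not Cassandra's actual field names or measurements.

```python
from statistics import median

def mean_latency_since_last_check(prev, curr):
    """Mean latency = delta of total time / delta of request count."""
    delta_count = curr["count"] - prev["count"]
    if delta_count == 0:
        return 0.0
    return (curr["total_latency_us"] - prev["total_latency_us"]) / delta_count

# Hypothetical workload: 59,999 requests at 500 us plus one 30 s request.
latencies_us = [500] * 59_999 + [30_000_000]

# Two snapshots of the cumulative counters (names are illustrative).
prev = {"count": 0, "total_latency_us": 0}
curr = {"count": len(latencies_us), "total_latency_us": sum(latencies_us)}

mean_us = mean_latency_since_last_check(prev, curr)
median_us = median(latencies_us)

print(f"mean   = {mean_us / 1000:.3f} ms")    # pulled up by the one outlier
print(f"median = {median_us / 1000:.3f} ms")  # what a typical request saw
```

In this made-up workload the mean comes out at roughly 1 ms while the median stays at 0.5 ms, which is the kind of gap the answer describes.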

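The "better" metrics still lose some statistical detail, but they describe the latency distribution far more faithfully. These (cfhistograms, and the percentiles in OpsCenter) are updated by recording latencies into buckets, where each bucket represents a time range. The histogram is updated as requests are served. OpsCenter records the state of the histogram every minute and, from the difference between snapshots, can work out how many requests fell into each bucket during that period. This also allows statistically more accurate merging of data across nodes in the cluster view. If one node served 1000 requests and another served 1, averaging the two nodes' values would give that single request half of the weight. Adding up the per-bucket totals from every node gives a better representation of the actual latency distribution. There is still some loss here, but it is relatively small: each bucket represents a range, and we do not know where within that range each request fell, but the ranges are small enough to be "good enough" and to represent the order of magnitude well.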
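The snapshot-difference and bucket-merging idea can be sketched as follows. The bucket boundaries and the helper names (`record`, `diff`, `merge`, `percentile`) are hypothetical and far coarser than what Cassandra or OpsCenter actually use; the point is only to show why summing bucket counts beats averaging per-node values.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in microseconds (illustrative only).
BUCKET_BOUNDS_US = [100, 1_000, 10_000, 100_000, 1_000_000]

def record(counts, latency_us):
    """Increment the bucket whose range contains latency_us."""
    idx = min(bisect_left(BUCKET_BOUNDS_US, latency_us), len(counts) - 1)
    counts[idx] += 1

def diff(prev, curr):
    """Requests per bucket between two snapshots of a cumulative histogram."""
    return [c - p for p, c in zip(prev, curr)]

def merge(*histograms):
    """Combine nodes by summing their per-bucket counts."""
    return [sum(column) for column in zip(*histograms)]

def percentile(counts, p):
    """Upper bound of the bucket that holds the p-th percentile request."""
    total = sum(counts)
    if total == 0:
        return None
    target = p / 100 * total
    running = 0
    for bound, n in zip(BUCKET_BOUNDS_US, counts):
        running += n
        if running >= target:
            return bound
    return BUCKET_BOUNDS_US[-1]

node_a = [0] * len(BUCKET_BOUNDS_US)
node_b = [0] * len(BUCKET_BOUNDS_US)
for _ in range(200):            # earlier traffic, before the last snapshot
    record(node_a, 80)
snap_a, snap_b = list(node_a), list(node_b)

for _ in range(1000):           # node A: 1000 fast requests this interval
    record(node_a, 80)
record(node_b, 500_000)         # node B: a single slow request

interval_a = diff(snap_a, node_a)
interval_b = diff(snap_b, node_b)
merged = merge(interval_a, interval_b)

# Averaging the per-node medians gives the lone slow request half the weight.
avg_of_medians = (percentile(interval_a, 50) + percentile(interval_b, 50)) / 2
print("average of per-node medians:", avg_of_medians)
print("median of merged buckets   :", percentile(merged, 50))
```

Note that `percentile` can only report a bucket boundary, not an exact latency, which is the small remaining loss the answer mentions.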

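There are a few versions of nodetool cfhistograms to be wary of. They used a reservoir sampling algorithm (Vitter's R, https://en.wikipedia.org/wiki/Reservoir_sampling ) instead of a histogram, which rests on the idea that a normal distribution can be represented by a small sample of the data. Unfortunately, latencies follow a very heavy-tailed, non-normal distribution, so the reported values can easily be off by several orders of magnitude. See https://issues.apache.org/jira/browse/CASSANDRA-8662

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/34365264/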
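For readers unfamiliar with the technique, below is a textbook sketch of reservoir sampling (Algorithm R) applied to a made-up heavy-tailed workload. It is not the code Cassandra or its metrics library used; the function name, sample size, and latency values are assumptions chosen for illustration.

```python
import random

def reservoir_sample(stream, k, rng):
    """Algorithm R: keep a uniform random sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir

rng = random.Random(42)

# Hypothetical workload: mostly 100 us requests, a handful of 2 s outliers.
latencies_us = [100] * 99_995 + [2_000_000] * 5
rng.shuffle(latencies_us)

sample = reservoir_sample(latencies_us, 100, rng)
print("true max  :", max(latencies_us))
print("sample max:", max(sample))
```

Each request ends up in the sample with probability k/n, here 100/100,000 = 0.1%, so a sample of this size will usually contain none of the outliers, and any maximum or high percentile computed from it understates the tail.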
