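When I send a batch of inserts to a single table, where each row has a unique key and uses IF NOT EXISTS, I run into a problem if even one of the rows already exists.
I need the condition to apply to each row in the batch, not to the batch as a whole. Say I have a table "users" with a single column "user_name" that already contains the row "jhon", and I now try to import new users: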
BEGIN BATCH
INSERT INTO "users" ("user_name") VALUES ('jhon') IF NOT EXISTS;
INSERT INTO "users" ("user_name") VALUES ('mandy') IF NOT EXISTS;
APPLY BATCH;
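It does not insert "mandy" because "jhon" already exists. What can I do to isolate them?
I have a lot of rows to insert, around 100-200K, so I need to use a batch.
Thanks!
Best answer
First of all: what you describe is documented as the intended behavior: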
In Cassandra 2.0.6 and later, you can batch conditional updates introduced as lightweight transactions in Cassandra 2.0. Only updates made to the same partition can be included in the batch because the underlying Paxos implementation works at the granularity of the partition. You can group updates that have conditions with those that do not, but when a single statement in a batch uses a condition, the entire batch is committed using a single Paxos proposal, as if all of the conditions contained in the batch apply.
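Which basically confirms it: as soon as one statement in a batch carries a condition, the whole batch is committed through a single Paxos proposal, so either the entire batch succeeds or none of it does. Note also that your inserts target different partitions, while the passage above says a conditional batch may only contain updates to the same partition.
That said, in Cassandra batches are not meant for speeding up bulk loads; they exist to create pseudo-atomic logical operations. From http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html :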
Batches are often mistakenly used in an attempt to optimize performance. Unlogged batches require the coordinator to manage inserts, which can place a heavy load on the coordinator node. If other nodes own partition keys, the coordinator node needs to deal with a network hop, resulting in inefficient delivery. Use unlogged batches when making updates to the same partition key.
The coordinator node might also need to work hard to process a logged batch while maintaining consistency between tables. For example, upon receiving a batch, the coordinator node sends batch logs to two other nodes. In the event of a coordinator failure, the other nodes retry the batch. The entire cluster is affected. Use a logged batch to synchronize tables, as shown in this example:
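With your schema, every INSERT targets a different partition, which puts a lot of extra load on your coordinator.
You can run the 200k inserts from a client with asynchronous execution, and they will run quite fast, probably as fast as (or faster than) what you would see with a batch.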
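To make the last suggestion concrete, here is a minimal Python sketch of the pattern: one conditional insert per row, with many of them in flight at once, so that an existing "jhon" cannot block "mandy". It is only an illustration under stated assumptions. The names insert_each_if_absent and FakeUsersTable are made up for this example, and the fake table merely stands in for Cassandra so the sketch runs without a cluster. With the DataStax Python driver, the callable could plausibly wrap session.execute(prepared_insert, (name,)).was_applied, or you could use the driver's own asynchronous calls instead of a thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def insert_each_if_absent(
    insert_if_absent: Callable[[str], bool],
    user_names: Iterable[str],
    concurrency: int = 32,
) -> Tuple[List[str], List[str]]:
    """Run one conditional insert per row, up to `concurrency` at a time.

    `insert_if_absent` is a hypothetical callable that performs a single
    INSERT ... IF NOT EXISTS and returns True if the row was applied.
    Returns (inserted, skipped); every row is decided on its own.
    """
    names = list(user_names)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map keeps results in the same order as the inputs.
        applied_flags = list(pool.map(insert_if_absent, names))
    inserted = [n for n, ok in zip(names, applied_flags) if ok]
    skipped = [n for n, ok in zip(names, applied_flags) if not ok]
    return inserted, skipped


class FakeUsersTable:
    """In-memory stand-in for the "users" table (an assumption, not Cassandra)."""

    def __init__(self, existing: Iterable[str]) -> None:
        self._rows = set(existing)
        self._lock = threading.Lock()

    def insert_if_absent(self, user_name: str) -> bool:
        # Mimics the per-row outcome of INSERT ... IF NOT EXISTS.
        with self._lock:
            if user_name in self._rows:
                return False
            self._rows.add(user_name)
            return True


table = FakeUsersTable(["jhon"])
inserted, skipped = insert_each_if_absent(table.insert_if_absent, ["jhon", "mandy"])
print(inserted, skipped)  # ['mandy'] ['jhon']
```

Because each row is its own statement, a row that already exists only causes that one insert to be skipped. The concurrency limit keeps the number of in-flight requests bounded, which matters when loading 100-200K rows.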
Source: a similar question about Cassandra batches with IF NOT EXISTS conditions on Stack Overflow: https://stackoverflow.com/questions/29909050/