I have set up a replica set with three machines (192.168.122.21, 192.168.122.147 and 192.168.122.148), and I am using the Java SDK to interact with the MongoDB cluster:
ArrayList<ServerAddress> addrs = new ArrayList<ServerAddress>();
addrs.add(new ServerAddress("192.168.122.21", 27017));
addrs.add(new ServerAddress("192.168.122.147", 27017));
addrs.add(new ServerAddress("192.168.122.148", 27017));
this.mongoClient = new MongoClient(addrs);
this.db = this.mongoClient.getDB(this.db_name);
this.collection = this.db.getCollection(this.collection_name);
After connecting, I insert a simple test document many times:
for (int i = 0; i < this.inserts; i++) {
    try {
        this.collection.insert(new BasicDBObject(String.valueOf(i), "test"));
    } catch (Exception e) {
        System.out.println("Error on inserting element: " + i);
        e.printStackTrace();
    }
}
When I simulate a crash of the primary node (power loss), the MongoDB cluster fails over successfully:
19:08:03.907+0100 [rsHealthPoll] replSet info 192.168.122.21:27017 is down (or slow to respond):
19:08:03.907+0100 [rsHealthPoll] replSet member 192.168.122.21:27017 is now in state DOWN
19:08:04.153+0100 [rsMgr] replSet info electSelf 1
19:08:04.154+0100 [rsMgr] replSet couldn't elect self, only received -9999 votes
19:08:05.648+0100 [conn15] replSet info voting yea for 192.168.122.148:27017 (2)
19:08:10.681+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
19:08:10.910+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
19:08:16.394+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
19:08:22.876+.
19:08:22.912+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
19:08:23.623+0100 [SyncSourceFeedbackThread] replset setting syncSourceFeedback to 192.168.122.148:27017
19:08:23.917+0100 [rsHealthPoll] replSet member 192.168.122.148:27017 is now in state PRIMARY
The MongoDB driver on the client side recognizes this as well:
Dec 01, 2014 7:08:16 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: Read timed out
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.SocketTimeoutException: connect timed out
Dec 01, 2014 7:08:36 PM com.mongodb.DBTCPConnector setMasterAddress
WARNING: Primary switching from /192.168.122.21:27017 to /192.168.122.148:27017
But it still keeps trying to connect to the old node (forever):
Dec 01, 2014 7:08:50 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
.....
Dec 01, 2014 7:10:43 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException -message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
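From the moment the primary fails and a secondary becomes primary, the document count in the database stays the same. Here is the output from the same node during the process: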
"rs0":SECONDARY> db.test_collection.find().count() 12260161
"rs0":PRIMARY> db.test_collection.find().count() 12260161
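Update: With WriteConcern Unacknowledged it works as designed. Inserts also go to the new primary, and all operations issued during the election are lost.
With WriteConcern Acknowledged, the operation appears to wait indefinitely for an ACK from the crashed primary. That would explain why the program continues once the crashed server is started again and rejoins the cluster as a secondary. In my case, though, I don't want the driver to wait forever; it should raise an error after a certain time.
Update: WriteConcern Acknowledged also works as expected when the mongod process on the primary is killed. In that case failover takes only about 3 seconds. No inserts are executed during that time, and once the new primary is elected, the inserts continue.
So I only run into the problem when simulating a node failure (power loss / network down). In that case the operation hangs until the failed node is up again.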
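To illustrate the behaviour asked for here (an error after a bounded wait instead of an indefinite hang), below is a minimal sketch that uses only the JDK. `BoundedCall` and `callWithTimeout` are hypothetical names, not part of the MongoDB driver, and the lambdas merely stand in for `collection.insert(...)`. The 2.x Java driver also exposes `connectTimeout` and `socketTimeout` on `MongoClientOptions.Builder`, which may well be the more direct fix; this sketch only shows the general pattern.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (an assumption for illustration), not a MongoDB driver class.
final class BoundedCall {
    private BoundedCall() {}

    /**
     * Runs op on a worker thread and waits at most timeoutMillis for its result.
     * Throws TimeoutException if the operation has not finished by then.
     */
    static <T> T callWithTimeout(Callable<T> op, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // a stuck operation must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(op);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a thread blocked in socket I/O may not react to this.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the operation's own exception rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an insert that is acknowledged quickly.
        String result = callWithTimeout(() -> "inserted", 1000);
        System.out.println(result);

        // Stand-in for an insert that hangs waiting on a dead primary.
        try {
            callWithTimeout(() -> { Thread.sleep(5000); return "never"; }, 200);
        } catch (TimeoutException e) {
            System.out.println("insert timed out");
        }
    }
}
```

Note that the caller gets control back after the timeout, but the underlying operation may still be blocked on the socket, so a timeout configured in the driver itself is likely cleaner where available.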
Best answer
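Is your application still functional? Since that server is still in your seed list, the driver will keep trying to connect to it, as far as I know. As long as any other server in the seed list can become primary, your application should still work.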
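If writes fail only while a new primary is being elected, one way to keep the application working through that window is to retry with a growing delay. The following is a rough sketch under that assumption, using only the JDK; `RetryingCall` and `callWithRetry` are hypothetical names, not driver API, and the lambda stands in for the insert from the question.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (an assumption for illustration), not a MongoDB driver class.
final class RetryingCall {
    private RetryingCall() {}

    /**
     * Calls op up to maxAttempts times, sleeping between failed attempts and
     * doubling the delay each time. Rethrows the last failure if none succeed.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not retry when the calling thread has been interrupted.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for an insert that fails twice (no primary yet) and then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("no primary yet");
            }
            return "inserted";
        }, 5, 50);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A retried write is only safe if applying it twice is harmless, since the first attempt may have reached the server before the failure was reported.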
Regarding "java - MongoDB SDK failover not working", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27235072/