环境:

Ubuntu14.04 + Postgresql9.4.

以下是我的设置:(“->”表示物理流复制 PSR)

Master1 -> Slave1 (primary) -> Slave2

此行为正确 - Master1 上的更改反射(reflect)在 Slave1 上，然后反射(reflect)在 Slave2 上。

如果我禁用Master1，并使用trigger_file将Slave1提升为Master，则Slave1将成功提升 - 我可以写入Slave1。

但是，新升级的 Slave1 和 Slave2 之间的复制停止。

这是预期的行为吗？我期望复制能够像这样继续:

Slave1 -> Slave2

这样对 Slave1 的写入就会反射(reflect)在 Slave2 中

更新

日志:

Slave1 提升:

2017-10-03 16:43:20 BST  @ LOCATION:  libpqrcv_connect, libpqwalreceiver.c:107
2017-10-03 16:43:25 BST  @ FATAL:  XX000: could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "192.168.20.55" and accepting
        TCP/IP connections on port 5432?

2017-10-03 16:43:25 BST  @ LOCATION:  libpqrcv_connect, libpqwalreceiver.c:107
2017-10-03 16:43:30 BST  @ LOG:  00000: trigger file found: /var/lib/postgresql/9.4/main/failover_trigger.5432
2017-10-03 16:43:30 BST  @ LOCATION:  CheckForStandbyTrigger, xlog.c:11440
2017-10-03 16:43:30 BST  @ LOG:  00000: redo done at 0/19000740
2017-10-03 16:43:30 BST  @ LOCATION:  StartupXLOG, xlog.c:7032
2017-10-03 16:43:30 BST  @ LOG:  00000: last completed transaction was at log time 2017-10-03 16:41:23.430752+01
2017-10-03 16:43:30 BST  @ LOCATION:  StartupXLOG, xlog.c:7037
2017-10-03 16:43:30 BST  @ LOG:  00000: selected new timeline ID: 2
2017-10-03 16:43:30 BST  @ LOCATION:  StartupXLOG, xlog.c:7153
2017-10-03 16:43:30 BST  @ LOG:  00000: archive recovery complete
2017-10-03 16:43:30 BST  @ LOCATION:  exitArchiveRecovery, xlog.c:5459
2017-10-03 16:43:30 BST  @ LOG:  00000: MultiXact member wraparound protections are now enabled
2017-10-03 16:43:30 BST  @ LOCATION:  DetermineSafeOldestOffset, multixact.c:2619
2017-10-03 16:43:30 BST  @ LOG:  00000: database system is ready to accept connections
2017-10-03 16:43:30 BST  @ LOCATION:  reaper, postmaster.c:2795
2017-10-03 16:43:30 BST  @ LOG:  00000: autovacuum launcher started
2017-10-03 16:43:30 BST  @ LOCATION:  AutoVacLauncherMain, autovacuum.c:431

从机2

2017-10-03 16:43:30 BST  @ LOG:  00000: replication terminated by primary server
2017-10-03 16:43:30 BST  @ DETAIL:  End of WAL reached on timeline 1 at 0/190007A8.
2017-10-03 16:43:30 BST  @ LOCATION:  WalReceiverMain, walreceiver.c:446
2017-10-03 16:43:30 BST  @ LOG:  00000: fetching timeline history file for timeline 2 from primary server
2017-10-03 16:43:30 BST  @ LOCATION:  WalRcvFetchTimeLineHistoryFiles, walreceiver.c:669
2017-10-03 16:43:30 BST  @ LOG:  00000: record with zero length at 0/190007A8
2017-10-03 16:43:30 BST  @ LOCATION:  ReadRecord, xlog.c:4184
2017-10-03 16:43:30 BST  @ LOG:  00000: restarted WAL streaming at 0/19000000 on timeline 1
2017-10-03 16:43:30 BST  @ LOCATION:  WalReceiverMain, walreceiver.c:374
2017-10-03 16:43:30 BST  @ LOG:  00000: replication terminated by primary server
2017-10-03 16:43:30 BST  @ DETAIL:  End of WAL reached on timeline 1 at 0/190007A8.

从机1 IP:

192.168.20.56

从机2 IP:

192.168.20.53

pg_hba.conf允许Slave2连接到Slave1进行复制:

Slave1 pg_hba.conf 段:

host    replication     replication     192.168.20.53/32        trust

从机1恢复.完成:

standby_mode = 'on'
primary_conninfo = 'user=replication host=192.168.20.55 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
trigger_file = '/var/lib/postgresql/9.4/main/failover_trigger.5432'

Slave2 recovery.conf:

standby_mode = 'on'
primary_conninfo = 'user=replication host=192.168.20.56 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'

非常感谢对此的任何帮助。

更新及解决方案

感谢@Vao Tsun的回答，在Slave2 recovery.conf中将recovery_target_timeline设置为“latest”，并重新启动Slave2 postgresql服务器(而不是重新加载)允许复制过程重新启动:

standby_mode = 'on'
primary_conninfo = 'user=replication host=192.168.20.56 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
recovery_target_timeline = 'latest'

最佳答案

你在slave1日志中看到:

2017-10-03 16:43:30 BST  @ LOG:  00000: selected new timeline ID: 2

在slave2中:

017-10-03 16:43:30 BST  @ DETAIL:  End of WAL reached on timeline 1 at 0/190007A8.

所以slave2在升级后并没有切换到时间线2。

正如我在评论中所说，您需要在slave2 recovery.conf中recovery_target_timeline='latest'

https://www.postgresql.org/docs/current/static/recovery-target-settings.html

recovery_target_timeline (string) Specifies recovering into a particular timeline. The default is to recover along the same timeline that was current when the base backup was taken. Setting this to latest recovers to the latest timeline found in the archive, which is useful in a standby server. Other than that you only need to set this parameter in complex re-recovery situations, where you need to return to a state that itself was reached after a point-in-time recovery. See Section 25.3.5 for discussion.

关于Postgresql 9.4 级联复制故障转移，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/46529323/

Postgresql 9.4 级联复制故障转移

环境:

更新