This doesn't sound like a big deal, until you think about a cluster with load balancing across 16 read-only replicas. Every one of those you restart breaks a bunch of application connections. Looking at how timeline switching works, it didn't seem like there was even a good reason for the restriction; really, the only thing that seemed to be blocking it was that primary_conninfo comes from recovery.conf, which is only read at server start. I'd hoped that the merger of recovery.conf and postgresql.conf would solve this, but that patch got punted to 9.5 due to conflicts with SET PERSISTENT.
So, I set out to find a workaround, as well as to prove that it was only the deficiencies of recovery.conf preventing us from doing no-restart remastering. And I found one, thanks to a question someone asked me at pgDay NYC.
In this setup, M1 is the current master. M2 is the replica designated as the failover target. R1 and R2 are additional replicas. "proxy" is a simple TCP proxy; in fact, I used a Python proxy written in 100 lines of code for this test. You can't use a Postgres proxy like pgBouncer here because it won't accept a replication connection.
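All of the replication connections go through the proxy: each replica's primary_conninfo points at the proxy's address rather than at any particular master, and recovery_target_timeline is set to 'latest' so the standby will follow the new master's timeline after promotion. Each replica's recovery.conf presumably looks something like this (the host and port match the proxy address that shows up in the logs below; the user name is just a placeholder):

standby_mode = 'on'
primary_conninfo = 'host=172.31.11.254 port=9999 user=replicator'
recovery_target_timeline = 'latest'

The proxy itself can be as dumb as you like. Here's a minimal sketch of the idea in Python (not the exact 100-line proxy I used); the upstream address is a placeholder you'd replace with your real master:

import socket
import threading

LISTEN = ("0.0.0.0", 9999)            # address the replicas connect to
UPSTREAM = ("m1.example.com", 5432)   # current master (placeholder); edit and restart to remaster

def pump(src, dst):
    # Copy bytes in one direction until either side closes, then shut both down.
    try:
        while True:
            data = src.recv(8192)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        for s in (src, dst):
            try:
                s.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass

def handle(client):
    # Open a connection to the current master and splice the two sockets together.
    try:
        server = socket.create_connection(UPSTREAM)
    except OSError:
        client.close()   # master unreachable; drop the client, it will retry
        return
    threading.Thread(target=pump, args=(client, server), daemon=True).start()
    threading.Thread(target=pump, args=(server, client), daemon=True).start()

def main():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(LISTEN)
    listener.listen(16)
    while True:
        conn, _addr = listener.accept()
        handle(conn)

if __name__ == "__main__":
    main()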
Remastering time!
- Shut down M1
- Promote M2
- Restart the proxy, now pointing to M2
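Concretely, with the toy proxy above, the three steps look something like this (the data directory paths are assumptions; adjust to your layout):

# on M1: stop the old master
pg_ctl -D /var/lib/pgsql/9.3/data stop -m fast

# on M2: promote the designated replacement
pg_ctl -D /var/lib/pgsql/9.3/data promote

# on the proxy host: point the proxy's upstream at M2 and restart it
# (with the sketch above, that means editing UPSTREAM and re-running the script)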
But: what happened to R1 and R2? Did they remaster without restarting? Yes, indeedy, they did!
LOG: entering standby mode
LOG: redo starts at 0/21000028
LOG: consistent recovery state reached at 0/210000F0
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/22000000 on timeline 6
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 6 at 0/2219C1F0.
FATAL: could not send end-of-streaming message to primary: no COPY in progress
FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "172.31.11.254" and accepting
TCP/IP connections on port 9999?
LOG: fetching timeline history file for timeline 7 from primary server
LOG: started streaming WAL from primary at 0/22000000 on timeline 6
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 6 at 0/2219C1F0.
LOG: new target timeline is 7
LOG: restarted WAL streaming at 0/22000000 on timeline 7
LOG: redo starts at 0/21000028
LOG: consistent recovery state reached at 0/210000F0
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/22000000 on timeline 6
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 6 at 0/2219C1F0.
FATAL: could not send end-of-streaming message to primary: no COPY in progress
FATAL: could not connect to the primary server: could not connect to server: Connection refused
Is the server running on host "172.31.11.254" and accepting
TCP/IP connections on port 9999?
LOG: fetching timeline history file for timeline 7 from primary server
LOG: started streaming WAL from primary at 0/22000000 on timeline 6
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 6 at 0/2219C1F0.
LOG: new target timeline is 7
LOG: restarted WAL streaming at 0/22000000 on timeline 7
Not only does this give us a new remastering workaround for high-availability configurations on 9.3, it also shows that, as soon as we get around to merging recovery.conf into postgresql.conf, restarting to remaster can be eliminated entirely.
I think there is another solution without any programming and any middlemen. You can map port 5432 on your master to some port (say 2345) on your replicas via SSH tunneling, and then make your replicas connect to localhost 2345. And, after turning the first master off and promoting the second one, just stop the old SSH tunnels and start the new ones from the promoted master. That's it. And, BTW, I have some automation for SSH tunneling with a watchdog, if the latter matters.
Oh, I forgot to give the link to the automation I was talking about: https://github.com/grayhemp/pgcookbook/blob/master/ssh_tunnel_with_compression_setup.md
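(For illustration, the tunnel described above can be opened on each replica with something like the following, where the user and host names are placeholders; the replica's primary_conninfo would then use host=localhost port=2345.)

ssh -f -N -L 2345:localhost:5432 postgres@m1.example.com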
Thanks!
For my part, I find a centrally managed TCP proxy to be substantially superior to separate stunnels for each replica, but do whatever works for you!
I wonder if using a VIP on the "current" master would work nicely, as this is already a technique used for HA setups. Perhaps you'd need to kill the walreceiver, but it looks feasible.