- Users on version 9.3
- Users of binary replication
Annoyingly, you'll have to do some additional stuff after you update:
- If using binary replication, you need to take a new base backup of each replica after updating.
- You should run the following on each production database after updating:
VACUUM; -- optionally, ANALYZE as well
This second step is critical for users on 9.3, and a generally good idea for users on other versions. Note that, while VACUUM is a non-blocking operation, it can create a lot of IO, so it's a good idea to do this during a slow traffic period and monitor it for pile-ups.
More information about the replication issue is here.
So, how did this happen? Well, all three issues were unexpected side effects of fixes we applied for other issues in earlier versions. For example, the replication issue is the result of the combination of two independent fixes for failover issues, both of which needed to be fixed. Since all of these issues depend on both timing and heavy traffic, and their effect is subtle (as in, a few missing or extra rows), none of our existing testing was capable of showing them.
If anyone wants to devise destruction tests to detect subtle data corruption issues in PostgreSQL before we release code -- if anyone can -- please offer to help!
Josh,
ReplyDeleteA few years ago the Slony guys built a testing framework to help track down race conditions that were causing them grief. It might be worth reaching out to them to see if there is anything there that still exists and could be leveraged.
My doubt about the replication bug is about the warm/hot standby.
ReplyDeleteIs not clearly stated the bug doesn't affect the warm standby but as far as I see from the hacker's discussion this bug affects only the hot standby.
As I have very strict maintenance windows for now I've rebuilt my failovers and I'm keeping them in warm standby.
As all the failovers are 9.2.6, and as far I can see the issue is in the standby code, this should give me a peace of mind before the master's upgrade.
Ta!
Federico
Federico, make sure you have hot_standby=off in your postgresql.conf when restoring. hot_standby=on is what triggers the bug.
DeleteThanks :)
Delete> while VACUUM is a non-blocking operation, it can create a lot of IO, so it's a good idea to do this during a slow traffic period
ReplyDeleteNote that you can also use vacuum_cost_delay (and other vacuum_cost_* settings) to throttle vacuum. Here's what I used:
SET vacuum_cost_delay = 5; SET vacuum_freeze_table_age = 0; VACUUM VERBOSE;