Database Soup: October 2012

Tuesday, October 30, 2012

pgConf.EU 2012 report

I just got back from my first European PostgreSQL conference since 2009. If you missed it, you should be sorry you did. Me, I mostly went because it was in Prague this year ... my sweetie really wanted to go, so off we went.

Nearly 300 people attended from what I heard, and the rooms were certainly full enough to support that estimate. I got to see a lot of people I don't see at pgCon. Highlights of the conference for me were:

Hallway Track: getting to hash stuff out with Joe Celko, Peter Geoghegan, Rob Napier, Hannu Krosing, and Laurenz Albe, whom I never see otherwise. Not all at the same time, of course!
Playing Chess in the DB: Gianni Colli's pgChess extension and supporting presentation were excellent. If you want evidence that you can do anything with Postgres, load up pgChess!
GSOC Students: Atri Sharma, Alexander Korotkov, and Qi Huang all came to the conference, got to meet the community, and presented on their projects from Google Summer of Code.
Lightning Talks: I love LTs, of course. Harald did a particularly good job running them.
Boat Party: Heroku rented a party boat to cruise up the Vltava River while we drank lots of Budvar and talked databases.

Of course, I presented stuff too: Elephants and Windmills, about some of our work in wind power, and I rehashed Aaron's PostgreSQL Drinking Game as a lightning talk at Dave's request. Follow links for slides.

Thanks to the whole pgConf organizing crew for a great time, and for helping with getting the students to the conference.

Not sure I'll make it next year; October is often a busy time for other conferences, and I think it's more important for me to speak at non-Postgres conferences than to our existing community. However, if they hold it in Barcelona, I might have to find a way to go.

Saturday, October 13, 2012

Freezing Your Tuples Off, Part 2

Continued from Part 1.

vacuum_freeze_min_age

The vacuum_freeze_min_age setting determines the youngest XID which will be changed to FrozenXID on data pages which are being vacuumed anyway. The advantage of setting it low is that far more XIDs will already be frozen when the data page is finally evicted from memory. This is ideal from a maintenance perspective, as the data page may never need to be read from disk for freezing. The disadvantage of setting it too low is additional time and CPU used during vacuum, especially if the page ends up being vacuumed several times before it's written out.

The other disadvantage of setting it low, according to the pgsql-hackers list, is that you will have less data for reconstructive database forensics if your database gets corrupted. However, given PostgreSQL's very low incidence of corruption bugs, this is not a serious consideration when balanced against the very high cost of a vacuum freeze on a really large table.

Therefore, we want the setting to be low, but not so low that XIDs used a few minutes ago are expiring. This is where things get difficult and not auto-tunable in current Postgres. You really want XIDs to freeze after a few hours, or maybe a day if your application makes use of long-running transactions frequently. But Postgres has no single-step way to determine how many XIDs you use per hour on average.

The best way is to monitor this yourself. In recent versions of PostgreSQL, you can get this from pg_stat_database, which has the counters xact_commit and xact_rollback. If these are part of a monitoring scheme you already have in place (such as Nagios/Cacti, Ganglia or Munin), then you can look at the transaction rates using those tools. If not, you need to follow these three steps:

1. Run this query, and write down the result:

SELECT txid_current();

2. Wait three or four hours (or a day, if you use long-running transactions)

3. Run the query again.

4. Subtract the number from the first query run from the second query run. If the 2nd number is lower, then you've wrapped around zero, and should try again.

5. Round up to the nearest multiple of 10.

So, as an example:

josh=# select txid_current();
txid_current
--------------
1000811
... wait four hours ...

postgres=# select txid_current();
sum
---------
2062747

So my transaction burn rate is 1,061,936 for four hours, which I round to 1,000,000. I then set vacuum_freeze_min_age to 1000000. The approximate burn rate, 250,000 per hour, is also a handy figure to keep to figure out when XID wraparound will happen next (in about 1 year, if I was starting from XID 3).

Without doing the above, it's fairly hard to estimate a reasonable level, given that XID burn rate depends not only on the amount of write activity you're doing, but how you're grouping the writes into transactions. For example, in a data collection database, if you're doing each imported fact as a separate standalone INSERT, you could be burning a million XIDs per hour, but if you're batching them in batches of a thousand rows, that cuts you down to 1000 XIDs per hour. That being said, if you really don't have time to check, here's my rules-of-thumb settings for vacuum_freeze_min_age:

Low write activity (100 per minute or less): 50000
Moderate write activity (100-500 per minute): 200000
High write activity (500 to 4000 per minute): 1000000
Very high write activity (higher than 4000 per minute): 10000000

You'll notice that all of these are lower than the default which ships in postgresql.conf. That default, 100 million, is overly conservative, and means that preemtive freezing almost never happens on databases which run with the defaults. Also, the default of 100m is half of 200m, the default for autovacuum_freeze_max_age, meaning that even after you've completely vacuumed an entire table, you're left with many XIDs which are 50% of freeze_max_age old. This causes more wraparound vacuums than are necessary.

Also, to some degree, this isn't worth worrying about below 500 writes/minute, given that it takes 8 years to reach XID wraparound at that rate. Few PostgreSQL installations go 8 years without a dump/reload.

To be continued in Part 3 ...

NOTE: The above blog entry was updated in April 2015 to reflect and improved method suggested by Noah below.

Tuesday, October 9, 2012

DVDStore Benchmark Tonight on SFPUG Live

Tonight Jignesh Shah of VMware will be presenting on running the DVDStore benchmark with PostgreSQL at SFPUG. As always, you can catch it live on video if you can't be there. No promises on video quality, though!

Postgres SMF Permissions on Joyent SmartOS

A few of our clients run on the Joyent cloud. Uniquely, Joyent runs SmartOS, an Illumos (i.e. OpenSolaris) distribution, which is the default and supported OS on VMs. Like Solaris 10 and 11, SmartOS controls server services using the Service Management Framework, or SMF, and XML-configuration-driven tool which supports version-specific start, stop, restart, reload and auto-vivification for PostgreSQL.

Now, for replication failover between SmartOS hosts on Joyent, one of the things we need to do is STONITH ("Shoot The Other Node In The Head") which includes shutting down Postgres on the old master if it's still running. For various setup reasons, our scripts need to do this as the "postgres" user, not as root. But, thanks to auto-vivification, if we just "pg_ctl stop", SMF will automatically restart PostgreSQL after a few minutes. So we needed to grant permission for the postgres user to enable, disable and restart the PostgreSQL service.

Unfortunately, permissions on services are a wierd alchemy of SMF configuration and RBAC, the Solaris security framework. After several hours of struggling with how to do this, Joyent support came to the rescue. Ryan Puckett gave me this formula:

echo "postgres.applications:::Manage Postgres::" >> /etc/security/auth_attr
echo "postgres::::type=normal;auths=postgres.applications" >> /etc/user_attr
svccfg -s postgresql setprop general/action_authorization = astring: postgres.applications
svccfg -s postgresql setprop general/value_authorization = astring: postgres.applications

And I'm delighted to report that, after running the above commands, the postgres user can run "/usr/sbin/svcadm disable postgresql". Yay!

Monday, October 8, 2012

Determining Furthest Ahead Replica

Here's a common situation for PostgreSQL binary replication setups with multiple replicas: the master is down, and you want to promote the "farthest ahead" replica so that you can remaster the other replicas from it with a minimum of fuss and data loss. Sadly, this is something we make stupidly hard to do in PostgreSQL 9.0, 9.1 and 9.2. While our project is great at making things possible, we're not so great at making them easy.

Checking Log Position

By "furthest ahead", we mean "which replica has received the most recent data from the master before the master went down?" The relevant place to check this is the replica's transaction log buffer, which is where data goes which has been received from the master.

When you start out, it looks simple. You check the last XLOG receive location on each replica, and take the "highest" one. Let's try that!

replica1=# select pg_last_xlog_receive_location();

        pg_last_xlog_receive_location 

       -------------------------------

        41A/10808DE8

replica2=# select pg_last_xlog_receive_location();

        pg_last_xlog_receive_location 

       -------------------------------

        41A/FFD1560

Uhhh ...

Yeah, not helpful at all, is it? We give you a completely undocumented pair of hexidecimal values. To make things even more fun, Postgres doesn't have any built-in functions for converting hexidecimal to numeric.

Checking in PostgreSQL 9.1

In 9.1 we added pg_last_xact_replay_timestamp(), which gives you a timestamp for the replica replay:

replica1=# select pg_last_xact_replay_timestamp();

 pg_last_xact_replay_timestamp 

-------------------------------

 2012-10-05 10:35:47.527-07

Yay! That solves everything. I can just take the replay timestamp from each replica and pick the latest one, right?

Well, no. Sorry.

First, that's the replay timestamp, and what we're more interested in is the received timestamp, which we can't get. WAL records are first received, and then replayed, so if a replica is under load the replay location can be behind the received location, sometimes considerably. When remastering, we're going to terminate any load on the new master and let it catch up, so the replay location matters less than the received one in limiting data loss.

More importantly, a replica with a higher received location will refused to remaster from a replica with a lower one, even if it's not completely caught up on replay. This means that promoting the replica with the highest replay timestamp does not necessarily give you the replica from which all other replicas can remaster. Checking replay timestamp also has a race condition built in if all replicas aren't done replaying yet. You could wait for all replicas to catch up on replay, but if your master is down, you probably don't want to.

The remaining issue is that it's theoretically possible to have to servers with the same received timestamp, but with different log locations, especially if the external code you're using to compare them rounds down the milleseconds.

Still, pg_last_xact_replay_timestamp() is an OK way to find the furthest ahead replica if you have no other choice. It's just not ideal or determinative.

Checking in 9.2

In 9.2, we supply pg_xlog_location_diff(), which lets us compare two xlog locations to see which one is higher:

replica2=# select pg_xlog_location_diff('1/10808DE8','1/FFD1560');

 pg_xlog_location_diff 

-----------------------

8616072

If it's positive, the first value is higher. If it's negative, the second. The problem with this approach becomes obvious if you have, say, seven replicas; you'd have to run 1:1 comparisons on for each pair of replicas, which means doing six individual comparisons, each of which comparisons involves a database call, and a clever piece of code which can treat these comparisons as a sort routine.

More importantly to a lot of users, this function is only available to 9.2, and most users haven't upgraded to 9.2 yet. So this is a determinative solution, but won't really work for a lot of people.

The reason I can't use 9.2's pg_xlogfile_name_offset() for comparisons is that it doesn't work on replicas, making the function of questionable utility.

Workaround Solution

What we could really use is a workaround which would do all of the following:

Give us the received location of each replica as an externally sortable, preferably numeric, value.
Do so with a single, simple database call to each replica.
Not require installation of external procedural languages or C functions, so that there aren't issues with installing or compiling software on each node.

Happily, we can fulfill all three of the above conditions using some ugly-but-functional PL/pgSQL functions, available here.

The way you would use them is as follows:

The master goes down.
Check xlog_location_numeric() on each backend.
Pick the backend with the highest (or tied for highest) numeric position, and check how far behind it is in replay using replay_lag_mb(), but see below.
If the highest replica isn't too far behind on replay, promote it.
If the highest replica is too far behind, drop to the next-highest and check replay lag.

The purpose of the check in replay is to avoid promoting a replica which happens to have the highest received location, but for some reason (such as a really long-running transaction) is actually hours behind on replay. This means making a business decision between data loss and speed of failover, so only you can set the threshold here.

One way you could check the replay lag is using pg_last_xact_replay_timestamp(), at least on 9.1 and later. However, since the master is down and you don't necessarily know the exact time it went down, that means checking the timestamp on all replicas, and comparing against the most recent one. It also means waiting a bit for the replicas to catch up in order to avoid race conditions.

A second, and my preferred, way to check how far the candidate new master is behind in replay is to check how many bytes different pg_last_xlog_replay_location() is from pg_last_xlog_receive_location(). Not only does this not require checking all replicas (i.e. you can test only your candidate new master), but the number of bytes is more directly related to required replay time than the clock time lag is. A replica requires no time to replay portions of the log which relate to idle periods on the master.

In 9.2, this is easily done using pg_xlog_location_diff:

replica2=# SELECT pg_xlog_location_diff(pg_xlog_last_receive_location(), pg_xlog_last_replay_location());

pg_xlog_location_diff

----------------------------

16192

When this value reaches zero, you know this replica is completely caught up. You can also set a threshold for how far you want a replica to be behind on replay before promoting it, such as 1GB (the numeric value is roughly equivalent to bytes).

However, pg_xlog_location_diff doesn't exist in 9.0 or 9.1. So I've provided my own function, replay_lag_mb(), which gives the approximate replay lag in megabytes, regardless of PostgreSQL version.

bench=# select replay_lag_mb();
replay_lag_mb
---------------
93.7

I've also provided a simpler function, all_replayed(), which returns true if replay is caught up.

Hopefully this provides a solution for many of you. And hopefully this leads to us having something simpler built into 9.3!

Notes on the Functions

Link again, in case you didn't see it above: replay location functions.

First, thanks very much to Andrew Geirth ("RhodiumToad") for verifying my logic and correcting my math. And, of course, Simon Riggs, Fujii Masao, Robert Haas, Heikki Linnakangas, Magnus Hagander, and all of the other developers who worked on binary replication.

all_replayed() and replay_lag_mb() are designed to be run only on replicas. They will return NULL on standalone or master servers.
these functions will return the same results regardless of which database they're installed in. However, they can only be called from the database in which they are installed. So you might want to install them in the "postgres" scratch database.
xlog_location_numeric() returns a numeric value which can exceed a 64-bit integer in size. So make sure your calling and sorting code is prepared to handle a larger-than-eight-byte number.
The math for numeric log position is only valid through 9.2. In 9.3, we change the multiplier for the logfile, so we'll need new functions (or, hopefully, something built in).
Yes, I really do have to do that horrible ugly conversion (string to hex to bitmap to bigint to numeric) to turn the hex values to numeric.

If anyone writes administrative code or recipes utilizing these functions, please link it in the comments.

Also, please note that if your real concern is to have a failover replica with the lowest data loss and fastest failover, the best answer is to have a synchronous standby which receives no query load, and fail over to that. Assuming you can afford the extra server, that's better than any of the workarounds above.