We are now ready to accept your submissions to present at PyPgDay. Please submit as soon as possible; the deadline is January 20.
We are particularly looking for speakers who can present on using PostgreSQL and Python together, including Django, GeoDjango, Pylons, psycopg2, SQLAlchemy, PL/Python, NumPy, and the new PostgreSQL JSON features. General talks about PostgreSQL administration and performance, pitched for a developer audience, are also welcome, as are how-to talks, case studies, personal experiences, and new software introductions. We will also have lightning talks.
Submit now!
Tuesday, December 18, 2012
Saturday, December 15, 2012
Reminder: pgCon Call For Papers
The pgCon Call For Papers has been open since December 1st. This is part of an effort by the conference committee to get all the speakers notified on time this year, so please submit your proposals before January 15th. See you at pgCon!
Friday, December 7, 2012
Save The Date: PyPgDay
PostgreSQL has become the "default database" of Python users everywhere. And Python has become the most popular scripting language for Postgres. So, to celebrate that, we're throwing a miniconference at PyCon 2013.
PyPgDay will be a full-day PostgreSQL event at PyCon this year, on March 13th (all day), at the Santa Clara Convention Center. Watch the PyPgDay wiki page and this blog for updates on the event, as we get it organized.
In other event news, the Call For Papers for pgCon is now open. Please submit your PostgreSQL hacking, development, performance, and advanced technique proposals.
Freezing Your Tuples Off, Part 3
First, read Part 1 and Part 2 of this series.
vacuum_freeze_table_age
To understand the purpose of this parameter, you have to understand how vacuum has worked since version 8.4. Vacuum no longer usually scans the entire table; instead, it has a bitmap (called the visibility_map) which tells it which data pages in the table have reclaimable space, so that it scans only the portion of the table which is "dirty". This is a considerable optimization for large tables, where only 20% of the rows might have been updated or deleted since the last vacuum cycle.
This does, however, introduce a problem. Since old, untouched data pages aren't being vacuumed (they don't have any reclaimable rows anymore), they are not getting frozen regardless of the setting of vacuum_freeze_min_age. This makes it much more likely you'll hit an eventual wraparound vacuum when you least expect it.
The idea of vacuum_freeze_table_age is to complement autovacuum_freeze_max_age by telling the database: "If it's almost freeze time for this table, and you were vacuuming the table anyway, then scan the whole table and freeze old tuples." In theory, this should allow you to hold off wraparound vacuum by running a regular, cancellable vacuum instead. In practice, though, that's a dubious benefit for large tables, since either way you get a full-table scan and a vacuum which runs for a long time. Small, heavily updated tables will tend to get most of their pages vacuumed most of the time anyway, and thus be unaffected by this setting.
As such, there's no point in setting it to a particularly low or creative level. Simply set it to 80% of whatever autovacuum_freeze_max_age is set to (recommended: 800 million).
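For illustration, here's how those two settings might look together in postgresql.conf, assuming you've raised autovacuum_freeze_max_age to 1 billion as suggested in Part 1 (example values, not universal recommendations):

# example only: 1 billion on a high-transaction-volume database (see Part 1),
# with vacuum_freeze_table_age at 80% of it
autovacuum_freeze_max_age = 1000000000
vacuum_freeze_table_age = 800000000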
Flexible Freeze
By now it should have occurred to you that Postgres has a serious problem with vacuum freeze behavior on large tables. It's the unfortunate side-effect of optimizations in several other areas. You can tweak the settings as I recommended, but you'll still be in danger of having a wraparound vacuum kick in and saturate your IO at some unexpected time. So what can you do to get ahead of this and proactively freeze tables before they reach that point?
Well, you have one piece of information PostgreSQL doesn't. You know when the slow periods in your application usage are. Try following this "flexible freeze" program in a cron script:
During expected "slow periods", run the query from Part 1 to get a list of tables, and then:
- Set PostgreSQL to "soft vacuum" with a high vacuum_cost_delay (such as 50ms).
- Set PostgreSQL to aggressively freeze tuples, with vacuum_freeze_table_age at 50% of autovacuum_freeze_max_age, and vacuum_freeze_min_age set to 10% of its usual value.
- Until the slow period is almost over, loop through the list, vacuuming each table.
- This will help you avoid wraparound vacuum when you least expect it.
Or, you can use our Flexible Freeze Python script.
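For the cron-driven approach, here's a minimal sketch of what one slow-period run might execute. It assumes your normal settings are autovacuum_freeze_max_age = 1 billion and vacuum_freeze_min_age = 1 million, and that postgres_log and statements came out of the Part 1 query; substitute your own tables and numbers:

-- "soft vacuum": throttle the vacuum's IO during the slow period
SET vacuum_cost_delay = 50;              -- milliseconds
-- aggressive freezing: 50% of the assumed 1 billion autovacuum_freeze_max_age,
-- and 10% of the assumed 1 million vacuum_freeze_min_age
SET vacuum_freeze_table_age = 500000000;
SET vacuum_freeze_min_age = 100000;
-- loop through the table list until the slow period is almost over
VACUUM ANALYZE postgres_log;
VACUUM ANALYZE statements;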
Further Work
Obviously the suggestions in this article are workarounds. Where PostgreSQL really needs to go is to find some way to avoid needing to vacuum old, cold data ever. The obstacle to doing this is that nobody has figured out how.
One of the first methods suggested was to have an 8-byte XID, which would postpone wraparound by a billion cycles. However, since there are two XIDs on every row header in the database, this would dramatically increase database sizes for many users, especially the users who need more XIDs in the first place. It would also increase memory requirements for a lot of Postgres operations. A second method I suggested during the 9.2 cycle was to have an "XID cycle" counter in the page header of each data page. This would have the advantage of not increasing the size of rows. However, it would have the drawback of failing in cases where there were rows from more than one XID cycle in the page.
Further, both of these methods hit a major performance problem: the CLOG. PostgreSQL's CLOG tracks which XIDs committed and which rolled back, and thus which rows are valid and which are orphaned. Currently, checking the CLOG is quite rapid because it's kept small. Any patch which involves increasing the size of XIDs or keeping more of them will cause the CLOG to bloat by multiples, substantially affecting response times on transaction-processing workloads. That's unacceptable. We've discussed radically restructuring the CLOG, but that's a major enough change that it would need to come with other benefits than just avoiding freeze.
An alternate approach to never freezing old data at all would be to develop some kind of daemon which did background freezes of chunks of large tables. This daemon would need to be able to tell when the system was "busy" and stop work. It would also need to be able to track what old data pages it had visited and which it hadn't, and which ones need to be revisited because they've been dirtied.
Anyone wanting to take up the above -- either by coding or by funding -- let me know! Until then, happy freezing!
Monday, November 12, 2012
PostGIS 2.0 in Berkeley Tomorrow
Brian Hamlin will be presenting PostGIS 2.0 at Soda Hall in Berkeley tomorrow. If you can't make it, it will probably be live on video (not sure about wireless quality).
Tuesday, October 30, 2012
pgConf.EU 2012 report
I just got back from my first European PostgreSQL conference since 2009. If you missed it, you should be sorry you did. Me, I mostly went because it was in Prague this year ... my sweetie really wanted to go, so off we went.
Nearly 300 people attended from what I heard, and the rooms were certainly full enough to support that estimate. I got to see a lot of people I don't see at pgCon. Highlights of the conference for me were:
- Hallway Track: getting to hash stuff out with Joe Celko, Peter Geoghegan, Rob Napier, Hannu Krosing, and Laurenz Albe, whom I never see otherwise. Not all at the same time, of course!
- Playing Chess in the DB: Gianni Colli's pgChess extension and supporting presentation were excellent. If you want evidence that you can do anything with Postgres, load up pgChess!
- GSOC Students: Atri Sharma, Alexander Korotkov, and Qi Huang all came to the conference, got to meet the community, and presented on their projects from Google Summer of Code.
- Lightning Talks: I love LTs, of course. Harald did a particularly good job running them.
- Boat Party: Heroku rented a party boat to cruise up the Vltava River while we drank lots of Budvar and talked databases.
Thanks to the whole pgConf organizing crew for a great time, and for helping with getting the students to the conference.
Not sure I'll make it next year; October is often a busy time for other conferences, and I think it's more important for me to speak at non-Postgres conferences than to our existing community. However, if they hold it in Barcelona, I might have to find a way to go.
Saturday, October 13, 2012
Freezing Your Tuples Off, Part 2
Continued from Part 1.
vacuum_freeze_min_age
The vacuum_freeze_min_age setting determines the youngest XID which will be changed to FrozenXID on data pages which are being vacuumed anyway. The advantage of setting it low is that far more XIDs will already be frozen when the data page is finally evicted from memory. This is ideal from a maintenance perspective, as the data page may never need to be read from disk for freezing. The disadvantage of setting it too low is additional time and CPU used during vacuum, especially if the page ends up being vacuumed several times before it's written out.
The other disadvantage of setting it low, according to the pgsql-hackers list, is that you will have less data for reconstructive database forensics if your database gets corrupted. However, given PostgreSQL's very low incidence of corruption bugs, this is not a serious consideration when balanced against the very high cost of a vacuum freeze on a really large table.
Therefore, we want the setting to be low, but not so low that XIDs used a few minutes ago are expiring. This is where things get difficult and not auto-tunable in current Postgres. You really want XIDs to freeze after a few hours, or maybe a day if your application makes use of long-running transactions frequently. But Postgres has no single-step way to determine how many XIDs you use per hour on average.
The best way is to monitor this yourself. In recent versions of PostgreSQL, you can get this from pg_stat_database, which has the counters xact_commit and xact_rollback. If these are part of a monitoring scheme you already have in place (such as Nagios/Cacti, Ganglia or Munin), then you can look at the transaction rates using those tools. If not, you need to follow these steps:
1. Run this query, and write down the result:
SELECT txid_current();
2. Wait three or four hours (or a day, if you use long-running transactions)
3. Run the query again.
4. Subtract the result of the first query run from the result of the second. If the second number is lower, then you've wrapped around zero, and should try again.
5. Round up to the nearest multiple of 10.
So, as an example:
josh=# select txid_current();
txid_current
--------------
1000811
... wait four hours ...
postgres=# select txid_current();
txid_current
--------------
2062747
So my transaction burn rate is 1,061,936 for four hours, which I round to 1,000,000. I then set vacuum_freeze_min_age to 1000000. The approximate burn rate, 250,000 per hour, is also a handy figure to keep to figure out when XID wraparound will happen next (in about 1 year, if I was starting from XID 3).
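If you do already sample pg_stat_database, a rough sketch of the equivalent measurement is below; run it twice, a few hours apart, and subtract. Note that it also counts read-only transactions, which don't consume XIDs, so treat the result as an upper bound:

SELECT sum(xact_commit + xact_rollback) AS total_xacts
FROM pg_stat_database;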
Without doing the above, it's fairly hard to estimate a reasonable level, given that XID burn rate depends not only on the amount of write activity you're doing, but on how you're grouping the writes into transactions. For example, in a data collection database, if you're doing each imported fact as a separate standalone INSERT, you could be burning a million XIDs per hour, but if you're batching them a thousand rows at a time, that cuts you down to 1000 XIDs per hour. That being said, if you really don't have time to check, here are my rules-of-thumb settings for vacuum_freeze_min_age:
- Low write activity (100 per minute or less): 50000
- Moderate write activity (100-500 per minute): 200000
- High write activity (500 to 4000 per minute): 1000000
- Very high write activity (higher than 4000 per minute): 10000000
You'll notice that all of these are lower than the default which ships in postgresql.conf. That default, 100 million, is overly conservative, and means that preemptive freezing almost never happens on databases which run with the defaults. Also, the default of 100m is half of 200m, the default for autovacuum_freeze_max_age, meaning that even after you've completely vacuumed an entire table, you're left with many XIDs which are 50% of freeze_max_age old. This causes more wraparound vacuums than are necessary.
Also, to some degree, this isn't worth worrying about below 500 writes/minute, given that it takes 8 years to reach XID wraparound at that rate. Few PostgreSQL installations go 8 years without a dump/reload.
To be continued in Part 3 ...
NOTE: The above blog entry was updated in April 2015 to reflect an improved method suggested by Noah below.
Tuesday, October 9, 2012
DVDStore Benchmark Tonight on SFPUG Live
Tonight Jignesh Shah of VMware will be presenting on running the DVDStore benchmark with PostgreSQL at SFPUG. As always, you can catch it live on video if you can't be there. No promises on video quality, though!
Postgres SMF Permissions on Joyent SmartOS
A few of our clients run on the Joyent cloud. Uniquely, Joyent runs SmartOS, an Illumos (i.e. OpenSolaris) distribution, which is the default and supported OS on VMs. Like Solaris 10 and 11, SmartOS controls server services using the Service Management Framework, or SMF, an XML-configuration-driven tool which supports version-specific start, stop, restart, reload and auto-vivification for PostgreSQL.
Now, for replication failover between SmartOS hosts on Joyent, one of the things we need to do is STONITH ("Shoot The Other Node In The Head") which includes shutting down Postgres on the old master if it's still running. For various setup reasons, our scripts need to do this as the "postgres" user, not as root. But, thanks to auto-vivification, if we just "pg_ctl stop", SMF will automatically restart PostgreSQL after a few minutes. So we needed to grant permission for the postgres user to enable, disable and restart the PostgreSQL service.
Unfortunately, permissions on services are a weird alchemy of SMF configuration and RBAC, the Solaris security framework. After several hours of struggling with how to do this, Joyent support came to the rescue. Ryan Puckett gave me this formula:
echo "postgres.applications:::Manage Postgres::" >> /etc/security/auth_attr echo "postgres::::type=normal;auths=postgres.applications" >> /etc/user_attr svccfg -s postgresql setprop general/action_authorization = astring: postgres.applications svccfg -s postgresql setprop general/value_authorization = astring: postgres.applications
And I'm delighted to report that, after running the above commands, the postgres user can run "/usr/sbin/svcadm disable postgresql". Yay!
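As a quick sanity check (my own addition, not part of Ryan's formula), all three of the operations we wanted should now succeed when run as the postgres user:

/usr/sbin/svcadm disable postgresql
/usr/sbin/svcadm enable postgresql
/usr/sbin/svcadm restart postgresql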
Monday, October 8, 2012
Determining Furthest Ahead Replica
Here's a common situation for PostgreSQL binary replication setups with multiple replicas: the master is down, and you want to promote the "farthest ahead" replica so that you can remaster the other replicas from it with a minimum of fuss and data loss. Sadly, this is something we make stupidly hard to do in PostgreSQL 9.0, 9.1 and 9.2. While our project is great at making things possible, we're not so great at making them easy.
By "furthest ahead", we mean "which replica has received the most recent data from the master before the master went down?" The relevant place to check this is the replica's transaction log buffer, which is where data goes which has been received from the master.
When you start out, it looks simple. You check the last XLOG receive location on each replica, and take the "highest" one. Let's try that!
Uhhh ...
Yeah, not helpful at all, is it? We give you a completely undocumented pair of hexidecimal values. To make things even more fun, Postgres doesn't have any built-in functions for converting hexidecimal to numeric.
In 9.1 we added pg_last_xact_replay_timestamp(), which gives you a timestamp for the replica replay:
Yay! That solves everything. I can just take the replay timestamp from each replica and pick the latest one, right?
Well, no. Sorry.
First, that's the replay timestamp, and what we're more interested in is the received timestamp, which we can't get. WAL records are first received, and then replayed, so if a replica is under load the replay location can be behind the received location, sometimes considerably. When remastering, we're going to terminate any load on the new master and let it catch up, so the replay location matters less than the received one in limiting data loss.
More importantly, a replica with a higher received location will refused to remaster from a replica with a lower one, even if it's not completely caught up on replay. This means that promoting the replica with the highest replay timestamp does not necessarily give you the replica from which all other replicas can remaster. Checking replay timestamp also has a race condition built in if all replicas aren't done replaying yet. You could wait for all replicas to catch up on replay, but if your master is down, you probably don't want to.
The remaining issue is that it's theoretically possible to have to servers with the same received timestamp, but with different log locations, especially if the external code you're using to compare them rounds down the milleseconds.
Still, pg_last_xact_replay_timestamp() is an OK way to find the furthest ahead replica if you have no other choice. It's just not ideal or determinative.
In 9.2, we supply pg_xlog_location_diff(), which lets us compare two xlog locations to see which one is higher:
If it's positive, the first value is higher. If it's negative, the second. The problem with this approach becomes obvious if you have, say, seven replicas; you'd have to run 1:1 comparisons on for each pair of replicas, which means doing six individual comparisons, each of which comparisons involves a database call, and a clever piece of code which can treat these comparisons as a sort routine.
More importantly to a lot of users, this function is only available to 9.2, and most users haven't upgraded to 9.2 yet. So this is a determinative solution, but won't really work for a lot of people.
The reason I can't use 9.2's pg_xlogfile_name_offset() for comparisons is that it doesn't work on replicas, making the function of questionable utility.
What we could really use is a workaround which would do all of the following:
The way you would use them is as follows:
One way you could check the replay lag is using pg_last_xact_replay_timestamp(), at least on 9.1 and later. However, since the master is down and you don't necessarily know the exact time it went down, that means checking the timestamp on all replicas, and comparing against the most recent one. It also means waiting a bit for the replicas to catch up in order to avoid race conditions.
A second, and my preferred, way to check how far the candidate new master is behind in replay is to check how many bytes different pg_last_xlog_replay_location() is from pg_last_xlog_receive_location(). Not only does this not require checking all replicas (i.e. you can test only your candidate new master), but the number of bytes is more directly related to required replay time than the clock time lag is. A replica requires no time to replay portions of the log which relate to idle periods on the master.
In 9.2, this is easily done using pg_xlog_location_diff:
When this value reaches zero, you know this replica is completely caught up. You can also set a threshold for how far you want a replica to be behind on replay before promoting it, such as 1GB (the numeric value is roughly equivalent to bytes).
However, pg_xlog_location_diff doesn't exist in 9.0 or 9.1. So I've provided my own function, replay_lag_mb(), which gives the approximate replay lag in megabytes, regardless of PostgreSQL version.
bench=# select replay_lag_mb();
replay_lag_mb
---------------
93.7
I've also provided a simpler function, all_replayed(), which returns true if replay is caught up.
Hopefully this provides a solution for many of you. And hopefully this leads to us having something simpler built into 9.3!
First, thanks very much to Andrew Geirth ("RhodiumToad") for verifying my logic and correcting my math. And, of course, Simon Riggs, Fujii Masao, Robert Haas, Heikki Linnakangas, Magnus Hagander, and all of the other developers who worked on binary replication.
Also, please note that if your real concern is to have a failover replica with the lowest data loss and fastest failover, the best answer is to have a synchronous standby which receives no query load, and fail over to that. Assuming you can afford the extra server, that's better than any of the workarounds above.
Checking Log Position
By "furthest ahead", we mean "which replica has received the most recent data from the master before the master went down?" The relevant place to check this is the replica's transaction log buffer, which is where data goes which has been received from the master.
When you start out, it looks simple. You check the last XLOG receive location on each replica, and take the "highest" one. Let's try that!
replica1=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
41A/10808DE8
replica2=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
41A/FFD1560
Uhhh ...
Yeah, not helpful at all, is it? We give you a completely undocumented pair of hexadecimal values. To make things even more fun, Postgres doesn't have any built-in functions for converting hexadecimal to numeric.
Checking in PostgreSQL 9.1
In 9.1 we added pg_last_xact_replay_timestamp(), which gives you a timestamp for the replica replay:
replica1=# select pg_last_xact_replay_timestamp();
pg_last_xact_replay_timestamp
-------------------------------
2012-10-05 10:35:47.527-07
Yay! That solves everything. I can just take the replay timestamp from each replica and pick the latest one, right?
Well, no. Sorry.
First, that's the replay timestamp, and what we're more interested in is the received timestamp, which we can't get. WAL records are first received, and then replayed, so if a replica is under load the replay location can be behind the received location, sometimes considerably. When remastering, we're going to terminate any load on the new master and let it catch up, so the replay location matters less than the received one in limiting data loss.
More importantly, a replica with a higher received location will refuse to remaster from a replica with a lower one, even if it's not completely caught up on replay. This means that promoting the replica with the highest replay timestamp does not necessarily give you the replica from which all other replicas can remaster. Checking replay timestamp also has a race condition built in if all replicas aren't done replaying yet. You could wait for all replicas to catch up on replay, but if your master is down, you probably don't want to.
The remaining issue is that it's theoretically possible to have two servers with the same received timestamp, but with different log locations, especially if the external code you're using to compare them rounds down the milliseconds.
Still, pg_last_xact_replay_timestamp() is an OK way to find the furthest ahead replica if you have no other choice. It's just not ideal or determinative.
Checking in 9.2
In 9.2, we supply pg_xlog_location_diff(), which lets us compare two xlog locations to see which one is higher:
replica2=# select pg_xlog_location_diff('1/10808DE8','1/FFD1560');
pg_xlog_location_diff
-----------------------
8616072
If it's positive, the first value is higher. If it's negative, the second. The problem with this approach becomes obvious if you have, say, seven replicas; you'd have to run 1:1 comparisons for each pair of replicas, which means doing six individual comparisons, each of which involves a database call, and a clever piece of code which can treat these comparisons as a sort routine.
More importantly to a lot of users, this function is only available in 9.2, and most users haven't upgraded to 9.2 yet. So this is a determinative solution, but won't really work for a lot of people.
The reason I can't use 9.2's pg_xlogfile_name_offset() for comparisons is that it doesn't work on replicas, making the function of questionable utility.
Workaround Solution
What we could really use is a workaround which would do all of the following:
- Give us the received location of each replica as an externally sortable, preferably numeric, value.
- Do so with a single, simple database call to each replica.
- Not require installation of external procedural languages or C functions, so that there aren't issues with installing or compiling software on each node.
The way you would use them is as follows:
- The master goes down.
- Check xlog_location_numeric() on each backend.
- Pick the backend with the highest (or tied for highest) numeric position, and check how far behind it is in replay using replay_lag_mb(), but see below.
- If the highest replica isn't too far behind on replay, promote it.
- If the highest replica is too far behind, drop to the next-highest and check replay lag.
One way you could check the replay lag is using pg_last_xact_replay_timestamp(), at least on 9.1 and later. However, since the master is down and you don't necessarily know the exact time it went down, that means checking the timestamp on all replicas, and comparing against the most recent one. It also means waiting a bit for the replicas to catch up in order to avoid race conditions.
A second, and my preferred, way to check how far the candidate new master is behind in replay is to check how many bytes different pg_last_xlog_replay_location() is from pg_last_xlog_receive_location(). Not only does this not require checking all replicas (i.e. you can test only your candidate new master), but the number of bytes is more directly related to required replay time than the clock time lag is. A replica requires no time to replay portions of the log which relate to idle periods on the master.
In 9.2, this is easily done using pg_xlog_location_diff:
replica2=# SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location());
pg_xlog_location_diff
----------------------------
16192
When this value reaches zero, you know this replica is completely caught up. You can also set a threshold for how far you want a replica to be behind on replay before promoting it, such as 1GB (the numeric value is roughly equivalent to bytes).
However, pg_xlog_location_diff doesn't exist in 9.0 or 9.1. So I've provided my own function, replay_lag_mb(), which gives the approximate replay lag in megabytes, regardless of PostgreSQL version.
bench=# select replay_lag_mb();
replay_lag_mb
---------------
93.7
I've also provided a simpler function, all_replayed(), which returns true if replay is caught up.
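If you'd rather not install the linked functions, a minimal sketch of an equivalent catch-up check is just a comparison of the two built-in location functions (works on 9.0 through 9.2; both return NULL on a master, so this only makes sense on a replica):

-- true when the replica has replayed everything it has received
SELECT pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
       AS all_replayed;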
Hopefully this provides a solution for many of you. And hopefully this leads to us having something simpler built into 9.3!
Notes on the Functions
Link again, in case you didn't see it above: replay location functions.
First, thanks very much to Andrew Geirth ("RhodiumToad") for verifying my logic and correcting my math. And, of course, Simon Riggs, Fujii Masao, Robert Haas, Heikki Linnakangas, Magnus Hagander, and all of the other developers who worked on binary replication.
- all_replayed() and replay_lag_mb() are designed to be run only on replicas. They will return NULL on standalone or master servers.
- these functions will return the same results regardless of which database they're installed in. However, they can only be called from the database in which they are installed. So you might want to install them in the "postgres" scratch database.
- xlog_location_numeric() returns a numeric value which can exceed a 64-bit integer in size. So make sure your calling and sorting code is prepared to handle a larger-than-eight-byte number.
- The math for numeric log position is only valid through 9.2. In 9.3, we change the multiplier for the logfile, so we'll need new functions (or, hopefully, something built in).
- Yes, I really do have to do that horrible ugly conversion (string to hex to bitmap to bigint to numeric) to turn the hex values to numeric.
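For illustration, a sketch of that sort of conversion follows. This is not the linked xlog_location_numeric() function, just an example of the technique, using the pre-9.3 multiplier of 255 16MB segments (4278190080 bytes) per log file id:

CREATE OR REPLACE FUNCTION xlog_position_numeric(pos text)
RETURNS numeric LANGUAGE sql IMMUTABLE AS $$
  -- split 'X/Y', convert each hex half to bigint via bit(64), then combine
  SELECT ('x' || lpad(split_part($1, '/', 1), 16, '0'))::bit(64)::bigint::numeric
         * 4278190080
       + ('x' || lpad(split_part($1, '/', 2), 16, '0'))::bit(64)::bigint::numeric;
$$;
-- e.g.: SELECT xlog_position_numeric(pg_last_xlog_receive_location());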
Also, please note that if your real concern is to have a failover replica with the lowest data loss and fastest failover, the best answer is to have a synchronous standby which receives no query load, and fail over to that. Assuming you can afford the extra server, that's better than any of the workarounds above.
Friday, September 28, 2012
Freezing Your Tuples Off, Part 1
Given that it's currently around 50F, damp and foggy in San Francisco (so much for Indian Summer), my thoughts turn to freezing. Specifically, freezing transaction IDs. There are three settings associated with this behavior in postgresql.conf, which I'll be discussing below:
- autovacuum_freeze_max_age
- vacuum_freeze_min_age
- vacuum_freeze_table_age
The Need To Freeze
Every write transaction or standalone statement in PostgreSQL gets assigned a unique transaction ID (or XID). This XID determines visibility precedence, i.e. which other concurrent users can see row changes associated with the transaction. Read-only transactions do not get assigned an XID. These XIDs get stamped on each row in storage, in the form of xmin and xmax, denoting the minimum and maximum concurrent transactions which can "see" the row. By putting these in the row headers, we decentralize transaction conflict resolution, which supports PostgreSQL's very high concurrency capabilities.
However, there's a problem with XIDs: they're a four-byte integer. This means that after 2.1 billion transactions, they need to wrap around. And by the time the XID counter wraps around, the old XIDs need to be gone from all of the old row headers. This means removing the XID from xmin and xmax and replacing it with the special FrozenXID value, indicating a transaction which is so old everyone can see it.
Now, 2.1 billion transactions seems like a lot, but it's not really. If you're doing just 100 write transactions per second (average), you'll go through that in about 10 months. This means that we need to be constantly cleaning old data pages of old XIDs to be ready for wraparound. And, indeed, PostgreSQL is prepared to do this, through the autovacuum facility.
autovacuum_freeze_max_age
The first autovacuum setting is autovacuum_freeze_max_age, which is an absolute ceiling on how long autovacuum will let you go before it kicks in and starts exhaustively vacuum freezing the old XIDs on your tables with old rows. Hitting this threshold isn't a great thing, because it can generate lots of IO at unexpected times, and freeze autovacuums aren't cancellable (if you cancel them, they come right back).
You can see how close you are to autovacuum freeze anywhere with this query:
select max(age(datfrozenxid)) from pg_database;
However, you don't generally really care about that; what you care about is "how close am I to forced autovacuum freeze on a large table?" Here's that:
SELECT relname, age(relfrozenxid) as xid_age,
pg_size_pretty(pg_table_size(oid)) as table_size
FROM pg_class
WHERE relkind = 'r' and pg_table_size(oid) > 1073741824
ORDER BY age(relfrozenxid) DESC LIMIT 20;
More specifically, that says "give me the top 20 tables over 1GB, sorted by the age of their oldest XID". Results look something like this:
relname | xid_age | table_size
------------------------+-----------+------------
postgres_log | 199785216 | 12 GB
statements | 4551790 | 1271 MB
normal_statement_times | 31 | 12 GB
Now, that above is trouble brewing. We see that postgres_log is at an XID age of 199 million, which is awfully close to the default autovacuum_freeze_max_age of 200 million. And having autovac decide to read and rewrite an entire 12GB table, possibly at peak usage times, can cause unexpected poor application performance.
One thing we can do is raise autovacuum_freeze_max_age. The default, 200m, is very conservative; it's only 10% of our shut-down-the-database threshold. On high-transaction-volume databases, I generally raise it to 1 billion, which is still only the 50% mark.
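As an illustration (adjust to your own transaction volume), that's a one-line change in postgresql.conf:

# roughly 50% of the 2.1 billion wraparound ceiling
autovacuum_freeze_max_age = 1000000000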
However, that's just putting off the problem. What we should do is manually freeze the table at a time which is convenient for the application, i.e. during a low activity period or a scheduled downtime. Then we run:
VACUUM FREEZE postgres_log;
Even better, we'd like to prevent ourselves from getting that close to the threshold in the first place. How do we do that? Well, there's some other tuning parameters.
To be continued in Part 2 ...
Monday, September 10, 2012
PostgreSQL 9.2 out the gate
And ... we're off! PostgreSQL 9.2.0 is finally out, and you can now deploy it for production applications. So lemme talk a little about one of my favorite features from the release, as well as linking you to PostgreSQL 9.2 clothing (yes, really, see bottom of post).
pg_stat_statements
Back in version 8.4, Itagaki Takahiro introduced a new Extension called pg_stat_statements. This extension creates a dynamic system view which shows you recently executed queries and aggregates both counts and total time to execute. The idea was to give you rapid feedback on which of your queries were "hot" and consuming a lot of system resources, instead of forcing you to wait for logging and log processing and get information the next day.
However, pg_stat_statements suffered a major limitation; unlike log processors like pgfouine and pgbadger, pg_stat_statements could only aggregate queries with the same structure and different constants if they were called using the PostgreSQL prepared plans interface. This meant that for the majority of applications (everything except Java, really), queries were not aggregated at all, making pg_stat_statements not that useful in practice.
Peter Geoghegan decided to fix that in this release (funded by the folks at Heroku.com), using the query plan for each query to "normalize" out the constants so that identical queries could be aggregated, regardless of how they were sent to the database. Suddenly, that means I can depend a lot less on log processing in order to tell clients what their worst queries are, and I can even troubleshoot potential bad queries in a new application deployment within minutes of it going live.
We're going to be installing pg_stat_statements on most of our clients who upgrade to 9.2, and apparently Heroku is going to make it available by default.
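For reference, turning it on is a two-step process; a sketch follows (the IO timing columns discussed below additionally require track_io_timing, a new 9.2 setting):

-- in postgresql.conf (requires a restart):
--   shared_preload_libraries = 'pg_stat_statements'
--   track_io_timing = on        -- needed for blk_read_time / blk_write_time
-- then, in the database you'll query it from:
CREATE EXTENSION pg_stat_statements;
-- reset the counters whenever you want a fresh measurement window:
SELECT pg_stat_statements_reset();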
Peter went a bit further, though, so pg_stat_statements also now collects cache hit/dirty and IO timing information. This gives us the ability to diagnose issues with bad application queries which we couldn't do before, even with the delay of log processing.
So the new pg_stat_statements looks something like this:
userid | 10
dbid | 18018
query | SELECT tbalance - sum(delta) FROM pgbench_tellers JOIN pgbench_history USING (tid) WHERE pgbench_tellers.tid = ? group by tid, tbalance;
calls | 158207
total_time | 5999651.60500004
rows | 158084
shared_blks_hit | 80551793
shared_blks_read | 2
shared_blks_dirtied | 0
shared_blks_written | 0
local_blks_hit | 0
local_blks_read | 0
local_blks_dirtied | 0
local_blks_written | 0
temp_blks_read | 0
temp_blks_written | 0
blk_read_time | 169.666
blk_write_time | 0
Notice all the new columns. blk_read_time is particularly interesting; it shows how much time we're spending on physical IO (block reads from outside shared_buffers) to access the records we need. Lemme see which queries are doing a lot of physical IO relative to the amount of data they need:
SELECT query, total_time,
blk_read_time / calls as read_time_query,
blk_read_time / rows as read_time_row
FROM pg_stat_statements
WHERE calls > 0 AND rows > 0
ORDER BY blk_read_time / calls desc LIMIT 5;
Out of those results, this one is not a surprise:
query | select count(*) from pgbench_accounts;
total_time | 33765.04
read_time_query | 7934.87833333333
read_time_row | 7934.87833333333
But this one is:
query | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
total_time | 18529.712
read_time_query | 6.67768562644125
read_time_row | 6.67768562644125
From that, it seems like we're often updating rows which aren't in cache, and paying a heavy price for it.
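A related check I find handy (my own sketch, not something from the new view's documentation) is the per-query shared-buffers hit ratio, which points at the same not-in-cache queries from a different angle:

SELECT query,
       shared_blks_hit::float
         / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_ratio
FROM pg_stat_statements
WHERE calls > 0
ORDER BY hit_ratio NULLS LAST
LIMIT 5;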
Buy 9.2 Shirts!
To promote/celebrate/cash in on the 9.2 release, I've posted tshirts and hoodies on the PostgreSQL Zazzle store. These feature an original design by cartoonist Chris Lowrance. 15% of the price of these items goes towards buying swag for PostgreSQL advocacy volunteers. So, get one now!
Sunday, September 9, 2012
SFPUG Streaming: PostgreSQL 9.2
This Tuesday (7:15 PM PDT), we'll be broadcasting the SFPUG meeting live on JustinTV. I'll be presenting "PostgreSQL 9.2: Full Throttle Database", showing off all the features of the new version which is coming out very soon. If you can't make it to the meeting in person, then tune in and watch!
If you can make it to the meeting (RSVP here), or at least join the waiting list, then we've got a fun and exciting meeting lined up for you. The meeting is hosted at the new Disqus offices, and we'll have beer and pizza (sponsored by Heroku.com), and cake (sponsored by EngineYard).
We'll also have a fundraising auction with some of the best auction items ever, so bring your wallet and/or checkbook (checks, cash preferred):
- Mark Wong's hand-crocheted blue elephant (one of these went for $500 at pgCon)
- Linux skateboard, courtesy of The Linux Foundation.
Saturday, August 25, 2012
Wrong defaults for zone_reclaim_mode on Linux
My coworker Jeff Frost just published a writeup on "zone reclaim mode" in Linux, and how it can be a problem. Since his post is rather detailed, I wanted to give a "do this" summary:
- zone_reclaim_mode defaults to the wrong value for database servers, 1, on some Linux distributions, including Red Hat.
- This default will both cause Linux to fail to use all available RAM for caching, and throttle writes.
- If you're running PostgreSQL, make sure that zone_reclaim_mode is set to 0.
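Checking and fixing it is standard sysctl work; a quick sketch:

# check the current value (1 means zone reclaim is on)
cat /proc/sys/vm/zone_reclaim_mode

# turn it off immediately
sysctl -w vm.zone_reclaim_mode=0

# and persist the change across reboots
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf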
Friday, August 17, 2012
Today's Security Update: XML vulnerabilities
If you're subscribed to the PostgreSQL News feed (and if not, why aren't you?) you already know that we released another security patch today. This update release patches two security holes having to do with XML functionality, which can be used by any authenticated user for privilege escalation. Everyone should apply the updates at the next reasonable downtime, and users of contrib/xml2 should schedule a downtime this weekend.
You'll notice that this security update release is on a Friday. This is not our normal practice, since it makes it difficult for IT departments to respond in a timely fashion (especially the folks in Australia/Japan, who will be getting the announcement on Saturday). However, as our number of packagers for different platforms has grown, it's become increasingly difficult to schedule coordinated releases, especially during the summer and the India/Pakistan holiday season. As it is, the Solaris folks will be getting their binaries late.
Anyway, about these vulnerabilities: The first one is a vulnerability in the built-in xmlparse() function. Among other things, this function used to allow users to validate the XML against a DTD contained in an external file. However, a creative user would be able to call any file which the "postgres" user has permissions on, and read parts of that file in the error messages from the DTD reader. Since "files the postgres user has access to" includes pg_hba.conf and .pgpass (if defined), this is a dangerous capability. As such, we have had to disable the DTD-validation feature. This validation feature may return at a later date if we can figure out a reasonable way to make it safe.
The second security hole is in the outdated extension (or "contrib module"), xml2. Users still use xml2 rather than the built-in XML because of its XSLT support. The problem is that xslt_process() had a documented "feature" which allowed the function to transform XML documents fetched from external resources, such as URLs -- without requiring superuser permissions. Not only could this be used to read local files on the PostgreSQL server, it could be used to write them as well, making this a much worse security hole if you have xml2 installed. As such, the xslt_process() feature has been disabled, and will probably not return.
We've been sitting on both of these patches for an embarrassingly long time (for our project, at least), because we were looking for a solution which didn't involve disabling functionality which is potentially valuable to some users. Sadly, we were not able to find one.
Wednesday, August 15, 2012
My Autumn Travel Schedule
Just updating folks on where I'll be and what I'll be presenting, in case anyone wants to say "hello" or buy me a beer:
August: LinuxCon, San Diego. Presenting "The Accidental DBA".
September: Postgres Open, Chicago. Presenting an updated "Super Jumbo Deluxe". Also, doing a big PostgreSQL 9.2 talk for SFPUG.
October: pgconf.EU. Not sure what I'll be presenting; talk acceptances aren't final yet. But, hey, Prague!
November: no conferences, thank goodness.
December: Back to San Diego for Usenix LISA. Working a booth and doing a Guru Session with Selena Deckelmann and Joe Conway, and likely a BOF as well. Drop by the booth! Possible San Francisco pgDay in December; watch this space for more information.
Monday, August 13, 2012
Launchpad minimizes downtime using 9.1 replication
Canonical staff member Robert Collins has a nice blog post about how Launchpad reduced downtime for agile schema deployment from an hour down to 5 seconds. They did this over the last couple of months by moving from Slony to Slony+pgBouncer, and then to pgBouncer+Binary Replication. This is what we make nice tools for: to make our users' lives easier.
Friday, August 10, 2012
MySQL-to-PostgreSQL Migration Data from The451.com
As you know, due to PostgreSQL's wide redistribution, worldwide user base, and liberal licensing policies (i.e. no registration required), hard data on PostgreSQL adoption is somewhat hard to come by. That's why I'm very grateful to The451.com, an open-source-friendly analyst service, for sharing with me the PostgreSQL-relevant contents of their report, MySQL vs. NoSQL and NewSQL: 2011-2015.
The451.com is no stranger to open source databases; their analytics services are built using PostgreSQL, MySQL and probably others. So their analysis is a bit more on-target than other analysts (not that I would be thinking of anyone in particular) who are still treating open source databases as a fringe alternative. Matt Aslett is their database and information storage technology guru.
The part of their report I found the most interesting was this part:
In fact, despite significant interest in NoSQL and NewSQL products, almost as many MySQL users had deployed PostgreSQL as a direct replacement for MySQL (17.6%) than all of the NoSQL and NewSQL databases combined (20%).
And:
Our survey results support this anecdotal evidence, with PostgreSQL by far the most popular choice among users as a direct replacement for MySQL, ahead of MySQL derivatives such as MariaDB, Percona Server and Amazon's RDS (Relational Database Service).
I'd had some idea that we were seeing a lot of post-Oracle-acquisition MySQL refugees in the PostgreSQL community. However, given the strong appeal that the new databases have for web developers, I'd assumed that three times as many developers were going to non-relational solutions as were to PostgreSQL. I'm pleasantly surprised to see these figures from the451.com.
Of course, this means that we PostgreSQL folks really need to work even harder at being welcoming to former MySQL users. They're our users now, and if they have funny ideas about SQL syntax, we need to be encouraging and helpful rather than critical. And, above all, we should not bash MySQL.
The summary report also has the first solid figure I've seen on PostgreSQL market share in a while (hint: it's higher than 10%). You can access it here. To access the full base report, you need to apply for a full free trial membership. As a warning, that free trial registration feeds into their sales reps, so don't be surprised if you get a call later.
Thursday, August 9, 2012
See you in Chicago!
Just bought my plane tickets for Postgres Open in Chicago. I'm really looking forward to this one, which will be even more user/application-developer-oriented than the first Postgres Open. We have a keynote by Jacob Kaplan-Moss, co-creator of Django, and talks by staff from Heroku, EngineYard, and Evergreen, as well as Paul Ramsey of PostGIS and some of the usual suspects. Register and buy a plane ticket now! There's still time!
I'll be presenting an updated version of my PostgreSQL for data warehousing talk, Super Jumbo Deluxe. My coworker Christophe will do PostgreSQL When It's Not Your Job.
Chicago is a great city to visit, too, and September is a good time to be there weather-wise. It's generally sunny and pleasant but not too warm, and the flying spiders are gone. There are tons of museums and world-renowned restaurants. In fact, I'm bringing Kris this year.
Oh, and there's still Sponsorship slots open, hint, hint.
Monday, August 6, 2012
PostgresXC Live Streaming at SFPUG Aug. 7
Once again, we will have SFPUG Live streaming. This month's presentation is Mason Sharp presenting PostgresXC -- the Clustered, Write-Scalable Postgres. Video will be on Justin.TV; I will try to make HD video work this time. We'll see!
Streaming video should start around 7:15PM PDT, +/- 10 minutes.
Monday, July 9, 2012
SFPUG July: MADLib
Tomorrow, Hitoshi Harada of Greenplum will be presenting MADlib, the analytics and machine learning library for PostgreSQL, at the offices of SwitchFly in San Francisco. The meetup is already full, so if you can't make it, you can tune in on live streaming video.
This time, we'll make sure the video stays up so that folks can watch it later.
Friday, June 22, 2012
PostgreSQL Wants Your Submission
So there are three Calls For Papers right now which need/want more PostgreSQL speakers: Postgres Open, PostgreSQL Europe, and LinuxConf AU.
Postgres Open's deadline is June 26th, so please submit something soon. We're looking for case studies, new innovations, howtos and other talks. Postgres Open is for PostgreSQL's business and user community, so commercial forks and products are welcome. You can also sponsor!
PostgreSQL Europe will be in Prague this year. This will mean higher-than-usual attendance; I'm submitting a talk for the first time in 3 years. You should, too!
LinuxConf Australia wants more PostgreSQL talks! Please submit one. If we can get enough PostgreSQL people there -- especially PostgreSQL people from Australia -- we could hold a miniconf. (As a warning, though, travel sponsorships are limited).
Sunday, June 17, 2012
Creating a Table With Exactly One Row
Here's a situation which comes up a lot in database design: the one-row table. That is, a table designed to hold one and only one row, which contains some settings or metadata about the database or application. For example, imagine that we want to have a one-row table which holds the current database revision version and the date it was last upgraded:
CREATE TABLE db_version (
version NUMERIC NOT NULL,
upgraded_on TIMESTAMPTZ NOT NULL
);
Since the database is only one specific version at any time, it would cause problems for the upgrade scripts if there were more than one row in this table. Which row should they believe? So you want to ensure that there's never more than one row.
You could, of course, do this using a trigger. However, a trigger is complex, and a BEFORE INSERT trigger needs to either count the rows, or be installed after a sample row is added. Either is more work than it needs to be. Try this instead:
CREATE UNIQUE INDEX db_version_one_row
ON db_version((version IS NOT NULL));
This works because the version column is already declared NOT NULL, so the expression (version IS NOT NULL) evaluates to true for every row. Since you've created a unique index on that expression, you can never insert a second row: it would produce the same indexed value and violate the uniqueness condition.
Unfortunately, there's no clever trick to preventing that one row from being deleted; that requires a BEFORE DELETE trigger:
CREATE FUNCTION db_version_no_delete ()
RETURNS trigger
LANGUAGE plpgsql AS $f$
BEGIN
RAISE EXCEPTION 'You may not delete the DB version!';
END; $f$;
CREATE TRIGGER db_version_no_delete
BEFORE DELETE ON db_version
FOR EACH ROW EXECUTE PROCEDURE db_version_no_delete();
Now this all works as expected; you can insert the first row to start the table, and only update it after that:
postgres=# INSERT INTO db_version VALUES ( '1.0',now());
INSERT 0 1
postgres=# INSERT INTO db_version VALUES ( '1.1',now());
ERROR: duplicate key value violates unique constraint "db_version_one_row"
DETAIL: Key ((version IS NOT NULL))=(t) already exists.
postgres=# UPDATE db_version SET version = '1.1', upgraded_on = now();
UPDATE 1
postgres=# delete from db_version;
ERROR: You may not delete the DB version!
STATEMENT: delete from db_version;
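One caveat I should add: a row-level DELETE trigger doesn't fire on TRUNCATE, so a determined user could still empty the table that way. If you want to close that gap as well, something like this should work, reusing the same trigger function (statement-level TRUNCATE triggers are available since 8.4):

CREATE TRIGGER db_version_no_truncate
BEFORE TRUNCATE ON db_version
FOR EACH STATEMENT EXECUTE PROCEDURE db_version_no_delete();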
Friday, June 15, 2012
postgresql.conf: max_locks_per_transaction
PostgreSQL's postgresql.conf file has around 150 parameters which you don't need to care about 98% of the time (as opposed to the 50-or-so settings you do need to care about more often), except that every once in a while you run into a situation which requires you to learn about some obscure parameter. That is, after all, why it's a changeable setting and not just hard-coded. max_locks_per_transaction is one such setting.
The purpose of max_locks_per_transaction is to determine the size of the virtual locks "table" in memory. By default, it's set to 64, which means that Postgres is prepared to track up to (64 X number of open transactions) locks. For example, if you have it set at the default, and you currently have 10 concurrent sessions with transactions open, you can have up to 640 total locks held between all sessions. The reason to have a limit is to avoid using dedicated shared memory if you don't need more locks than that.
Most of the time, for most users, the default is fine. But every once in a while, it's not:
2012-06-11 14:20:05.703 PDT,"processor","breakpad",17155,"[local]",4fd660cd.4303,2,"SELECT",2012-06-11 14:19:09 PDT,86/199551,0,ERROR,53200,"out of shared memory",,"You might need to increase max_locks_per_transaction.",,,,"select j.id, pj.uuid, 1, j.starteddatetime from jobs j right join priority_jobs_2849 pj on j.uuid = pj.uuid",,,""
The above helpful message is from the activity log. Unfortunately, the error which the client gets is just "out of shared memory", which is not that helpful ("what do you mean 'out of shared memory'? I have 4GB!").
The reason why the database above ran out of locks was that a few sessions were holding up to 1800 locks, most of them RowExclusiveLock. Given that a lock in Postgres is usually a lock on an object (like a table or part of a table) and not on a row, holding 1800 locks in one transaction is somewhat unusual. Why so many locks?
Well, the database in question has three tables each of which has over a hundred partitions. One frequent application activity was running an UPDATE against each of these partitioned tables with no partition condition in it, causing the UPDATE to check all partitions of each table. This resulted in RowExclusiveLocks on each partition and each index on each partition ... 1800 locks in total. Note that it needs this lock even though only one partition had rows which were actually updated; despite the name, it's a lock on the table or index, not on a specific row necessarily. This lock prevents certain things, like dropping the table or index while the locking transaction is still running.
So that's one case where you probably want to increase max_locks_per_transaction from its out-of-the-box value: if your database has several tables with many partitions each, or if you use table inheritance a lot for other purposes. Especially since increasing it requires a restart.
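If you suspect you're in this situation, a quick look at pg_locks will show you which backends are holding piles of locks. This is just a sketch, and 256 below is an illustrative value, not a recommendation:

-- which backends currently hold the most locks?
SELECT pid, count(*) AS locks_held
FROM pg_locks
GROUP BY pid
ORDER BY locks_held DESC
LIMIT 10;

Then raise the limit in postgresql.conf and restart:

max_locks_per_transaction = 256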
Tuesday, June 12, 2012
Video SFPUG June 13: Build Your Application Inside Postgres
Tomorrow night (June 13th, 7:15 PM PDT), Leon Starr will be presenting "Building Your Application Inside PostgreSQL" for the San Francisco PostgreSQL User Group. As usual, this SFPUG will be broadcast live on video.
In Leon's words, the presentation will be about:
Leon Starr has built a full featured open source Executable UML model editor entirely in plpgsql. WHY? Leon presents the case for building a tightly constrained schema and letting the RDBMS do the heavy logic lifting. Code patterns will be presented where object and db principles are intertwined with an eye toward deriving the best of both worlds. Naturally, plpgsql has its limitations as does the presenter who is relatively new to the language. You are invited to poke fun at his novice coding techniques and weigh in with your experiences and thoughts regarding the appropriateness of stored procedures vs. conventional code.
We have a new broadcaster, JustinTV, who is also physically hosting the event. JustinTV runs on PostgreSQL, so they're going to be our live streaming platform of choice for the foreseeable future. I also have a new webcam, so hopefully video quality will be better than last time.
Thursday, May 31, 2012
pgCon Article up on LWN.net
As I mentioned, I'm back to writing for Linux Weekly News. My latest, a summary of interesting developments at pgCon, is now up. Note that reading this article requires an LWN subscription; since they pay me, I don't want to give away the content for free. If you don't want to subscribe, wait two weeks and the article will be free.
I also wrote a summary of what's in the PostgreSQL 9.2 Beta for LWN, which does not require a subscription. Enjoy!
Monday, May 21, 2012
Testing 9.2: Autovacuum Logging
Since PostgreSQL 9.2 is in beta now, I've been using some generously donated cloud server time on GoGrid to test various features in the new release. One of the ones which is working brilliantly is the new, more verbose logging for Autovacuum.
Greg Smith and Noah Misch added additional information to the logging you get when you set log_autovacuum_min_duration. This includes information on the work done by autovacuum (buffers, pages and tuples read and written), as well as information on when autovacuum "skips" because it can't get a lock. Here's a sample:
LOG: automatic analyze of table "bench2.public.pgbench_tellers" system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: automatic vacuum of table "bench2.public.pgbench_branches": index scans: 0
pages: 0 removed, 2 remain
tuples: 114 removed, 200 remain
buffer usage: 44 hits, 0 misses, 3 dirtied
avg read rate: 0.000 MiB/s, avg write rate: 45.334 MiB/s
system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: automatic analyze of table "bench2.public.pgbench_branches" system usage: CPU 0.00s/0.00u sec elapsed 0.02 sec
LOG: automatic analyze of table "bench2.public.pgbench_tellers" system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
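If you want to see this output on your own 9.2 test instance, the relevant postgresql.conf setting looks like this: 0 logs every autovacuum action, a positive number is a minimum duration in milliseconds, and -1 (the default) turns the logging off.

log_autovacuum_min_duration = 0    # log all autovacuum and autoanalyze activity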
This will go a long way towards helping users troubleshoot autovacuum in the future. Great work, Greg and Noah!
Monday, May 7, 2012
Amazonstravaganza tommorrow, on UStream
Tomorrow we're going to have two back-to-back SFPUG presentations about running PostgreSQL on AWS: Scalr presenting their stuff, and "Amazon With Reduced Tears" by Christophe Pettus. The event is sold out at 100 RSVPs, though, so I suggest you watch it on video!
Barring severe technical difficulties, it will be presented live on SFPUG Ustream. Festivities should start around 7:15 PM PDT.
Sunday, May 6, 2012
Random Page Cost Revisited
Because the planner has limited information about the host system PostgreSQL is running on, we offer several planner configuration variables, which can be set either dynamically or in the postgresql.conf file. One of these, which has been a source of much contention in the past, is Random Page Cost (or random_page_cost, hereafter RPC). For the last couple of years RPC was not a variable you needed to worry about, but now, thanks to advances in storage technology, it's become one of the essential parameters again.
Random Page Cost is meant to represent the ratio between the cost of looking up one row (out of many) via sequential reads and the cost of looking up a single row individually using random access (disk seeks). This factor strongly influences the planner's decisions to use indexes vs. table scans, and composite vs. simple indexes, in executing queries.
For many generations of spinning hard drives, this ratio remained about the same, around 4.0.
Things have changed recently though. In fact, there are five now-common types of storage where you want to change RPC in order to match a different scan/seek ratio for different hardware.
1. High-End NAS/SAN: when working well, your large storage box combines a large volatile cache with many many disks, improving the number of concurrent seeks you can do. However, if storage is not direct-attach, you usually have limited throughput. This means that a random_page_cost of 2.5 or 3.0 is more appropriate.
2. Amazon EBS and Heroku: EBS, especially when "RAIDed", is an extreme example of the NAS/SAN case. Seeks are fast with virtually unlimited concurrency, while storage bandwidth is erratic and slow, limiting full table scans. As such, we often use a random_page_cost of 1.1 to 2.0.
3. Bad SANs: the advent of iSCSI seems to have inspired a large variety of very poor-performing inexpensive storage units, such as the lower-end Dell Equallogics and various SATA-based Promise units. These boxes are cheaper for a reason, folks! Anyway, their quirks often give these boxes unique performance characteristics which strongly affect database access and prompt you to modify planner parameters. For example, we discovered through testing that one Promise box had decent throughput, but seek rates slower than a single drive due to a combination of architecture and driver issues. As such, we set RPC to 6.0. You'll have to trust the results of bonnie++ and fio in order to figure out where you should set things, though; each inferior unit is inferior in a different way.
4. SSDs: in general, SSDs have not lived up to their early hype for performance; random writes and sequential scans on most SSDs are only incrementally better than on HDDs. One area where SSDs shine, though, is random page access; a good array of SSDs can make even large index scans faster than sequential scans. Therefore, use a random_page_cost of 1.5 to 2.5.
5. NvRAM (or NAND): durable memory, such as FusionIO and Virident drives, do deliver on most of the performance promises originally made about SSDs. Everything is faster, sometimes orders of magnitude faster, than HDDs, most of all random access. While the concept of NvRAM might make you inclined to set RPC to 1, forcing almost 100% index access, that's going a little far because PostgreSQL still has a disk page structure. Therefore, I suggest 1.0 to 1.5.
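As a reminder of where to set it: you can change random_page_cost globally in postgresql.conf, per-session while testing query plans, or (on 9.0 and later) per-tablespace, which is handy when only part of your data lives on SSD. A quick sketch, with illustrative values and a made-up tablespace name:

-- per-session, while testing a query plan
SET random_page_cost = 1.5;
-- per-tablespace; "ssd_space" is just an example name
ALTER TABLESPACE ssd_space SET (random_page_cost = 1.5);
-- or globally, by setting random_page_cost in postgresql.conf and reloading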
Note that two other variables, effective_cache_size and effective_io_concurrency also affect seek/scan behavior. Possibly I will blog about them at a different time. Further, I have only been able to test with relatively few models and configurations of the above types of storage, so test, monitor and check your results before committing to a long-term configuration change.
Thursday, May 3, 2012
SELECT fire_proton_torpedoes();
At pgCon this year you will have the chance to reduce your fellow attendees to their component elements, and win prizes for it!
We will be having a Schemaverse tournament for the length of the conference. Schemaverse is an amazing game which demonstrates that PostgreSQL isn't just a database, it's an entire development environment. You build a fleet of starships, fight battles, take planets, trade, and conquer the galaxy ... all from the psql command line. Write PL/pgSQL functions to command your armies!
To make things even more interesting, if you can hack into the Postgres server, you can win the game.
The Schemaverse tournament will begin Wednesday night and end shortly before the closing session at pgCon. The winner will get some one-of-a-kind PostgreSQL and Schemaverse swag, and even runners up will get posters or tshirts.
If you've already registered for pgCon, look for email from Dan Langille about how to sign up for the tournament. In the meantime, get over to the Schemaverse website and practice your space battle skills to get ready for the big event.
Monday, April 30, 2012
Call for Lightning Talks for pgCon
Hackers, users, pgCon attendees:
You want to give a lightning talk at pgCon!
Yes, you do. The fun, the glory, the laughter, the everlasting fame! These can all be yours.
Be one of the ten "brave and true" who put together five minutes about PostgreSQL tools, experiences, forks, ideas, websites, or even (especially) jokes. Anything from "hacking wal files" to "the PostgreSQL drinking game" is an acceptable topic for the lightning talks. A short schedule:
- Right Now: send your lightning talk idea to light@pgcon.org. I'll need a title, speaker full name, speaker cell phone number, and brief (one sentence) description.
- This Friday: I'll get back to you with acceptance (or not)
- Friday, May 11th (or sooner) you get me your PDF slides for the talk.
- Thursday, May 17, 5:15 PM: meet Magnus Hagander in the plenary presentation room for your order of speaking.
- Thursday, May 17, 5:30PM to 6:30PM: you and 9 others deliver your talks
Friday, April 27, 2012
Sharding Postgres with Instagram
On Tuesday last week we had a terrific SFPUG meeting at which Mike Krieger of Instagram explained how they grew and eventually sharded their 2TB of Postgres data to support 27 million users. It's a great presentation which explains the growth process of a successful web/mobile startup, as well as how to horizontally scale PostgreSQL.
Yes, you too can use PostgreSQL to make One Billion Dollars!
Video is on UStream. Sorry you can't see the slides in the video; we had technical issues with the camera. The slides are here, so you can click along on your own while Mike talks.
Tuesday, April 24, 2012
Red Hat Kernel cache clearing issue
Recently, mega-real-estate sales site Tigerlead called us with a very strange problem. One of their dedicated PostgreSQL servers refused to use most of its available RAM, forcing the system to read from disk. Given that the database was 60GB in size and the server had 96GB of RAM, this was a painful performance degradation.
Output of free -m:
total used free shared buffers cached
Mem: 96741 50318 46422 0 21 44160
-/+ buffers/cache: 6136 90605
Swap: 90111 3 90107
As you can see here, the system is only using half the free memory for cache, and leaving the other half free. This would be normal behavior if only half the cache were needed, but IOstat also showed numerous and frequent reads from disk, resulting in IOwaits for user queries. Still, there could be other explanations for that.
So, I tried forcing a cache fill by doing a pgdump. This caused the cache to mostly fill free memory -- but then Linux aggressively cleared the cache, again getting it down to around 40GB of cache within a few minutes. This seemed to be the case no matter what we did, including tinkering with the vm parameters, increasing the size of the swap file, and changing shared_buffers. This was highly peculiar; it was as if Linux was convinced that we had half as much RAM as we did.
What fixed the problem was changing the kernel version. It turns out that kernel 2.6.32-71.29.1.el6.x86_64, released by Red Hat during a routine update, has some kind of cache management issue which can't be fixed in user space. Fortunately, they now have a later kernel version out as an update.
Before:
[root ~]# free -g
total used free shared buffers cached
Mem: 94 24 70 0 0 19
[root ~]# uname -a
Linux server1.company.com 2.6.32-71.29.1.el6.x86_64 #1 SMP Mon
Jun 27 19:49:27 BST 2011 x86_64 x86_64 x86_64 GNU/Linux
After:
[root ~]# free -g
total used free shared buffers cached
Mem: 94 87 6 0 0 83
[root ~]# uname -a
Linux server1.company.com 2.6.32-220.4.2.el6.x86_64 #1 SMP Tue
Feb 14 04:00:16 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
That's more like it! Thanks to Andrew Kerr of Tigerlead for helping figure this issue out.
I don't know if other Linux distributors released the same kernel in any routine update. I haven't seen this behavior (yet) with Ubuntu, Debian, or SuSE. If you see it, please report it in the comments, or better yet, on the appropriate mailing list.
Monday, April 23, 2012
GSOC Begins
We have accepted the projects for PostgreSQL's participation in the 2012 Google Summer of Code. They are:
- JDBC Foreign Data Wrapper, by Atri, mentored by Merlin Moncure
- Document Collection Foreign Data Wrapper, by Zheng Yang (a returning student), mentored by Satoshi Nagayasu
- Implementing TABLESAMPLE, by Qi, mentored by Stephen Frost
- Better Indexing for Ranges, by Alexander, mentored by Heikki Linnakangas
- xReader Streaming xLog Reader, by Aakash, mentored by Kevin Grittner
When you see these students on the PostgreSQL mailing lists, please be friendly and help them out.
Saturday, April 14, 2012
Of Booze and Brogrammers
There's been a little bit of noise about the culture of "hard partying" at programmer conferences. Ryan complains about binge drinking, and Kevin complains about parties so loud you can't talk. While I think both of these things are undesirable, I don't see that either is on any dramatic increase overall, actually, and certainly not more than it's on the increase in American society in general (you wanna see real binge drinking? Watch Mad Men or the Food Network). However, I do agree that there is too much emphasis on high-decibel boozing at many current tech conferences, and that it should be changed.
Since I'm often a conference organizer, I wanted to approach this issue from a conference organizer's perspective. Prepare yourself for a long, rambling post about drunken parties, brogrammers, conference organizing, teetotallers, pgCon, SCALE, and middle age.
Why Limit Drunken Partying at Conferences?
I drink. I like beer and wine, a lot. I enjoy going to parties, and have been known to attend parties at tech conferences where I got more than a little buzzed. More than once I missed Friday morning sessions at OSCON entirely. The fact that I don't do this anymore is almost entirely due to changes in my life: I'm now 41, married, and CEO of a company.
So, Ryan's feelings aside, what's wrong with loud, drunken parties at conferences? Ryan doesn't have to attend them if he doesn't want to.
Well, first, Ryan isn't alone. The career of programming in general appeals to people who don't like parties, and I think if you did a survey of any large, general developer conference you'd find that 25% to 50% of the attendees either didn't drink, didn't like parties, or both. So if boozy parties are the only form of evening entertainment at your conference, you're deliberately excluding a quarter to half of your attendees. You're encouraging them to go home in the evening, and if they do, they're liable not to come back to the conference the next day.
Speaking of the next day, large numbers of hung-over conferencegoers make for very poor attendance at morning sessions at the conference, which makes scheduling hell. Do you schedule unpopular talks first thing Sunday morning and basically write them off? Or do you schedule popular talks in hopes that it'll get people out of bed, and risk offending an important speaker?
Secondly, the parties have taken the place of the BOFs we used to have. Many conferences used to have great Birds-Of-a-Feather informal sessions in the evenings. These offered a great opportunity for projects and vendors who couldn't get on the regular session schedule to reach their users, and for group discussions as a follow-on for the sessions during the day. However, I've generally given up on trying to hold PostgreSQL BOFs at most conferences now, since I'm always head-to-head against some party with free food and beer. The last time I had a BOF at OSCON, for example, eight people attended.
Finally, there's the liability. Whether or not your conference is officially sponsoring or hosting the party, if you put it on your schedule and announce it, you are liable in the eyes of the court for bad things which happen there: property damage, alcohol poisoning, accidental injury, and sexual assault. Frankly, I'm surprised that there hasn't been an alcohol-related death or felony arrest at a major tech conference yet; it's only a matter of time.
Why Have Boozy Bashes?
Of course, there's quite a few reasons why we have loud parties at conferences. Among them:
- Unmarried 25-Year-Olds are a large minority of the conference population, and do a lot of partying. It's part of being 25 years old.
- The Brogrammers are desperately trying to prove to themselves that, while they may be programmers, they're not geeks. Drunken parties are part of this self-deception.
- People Want To Blow Off Steam after having their brains crammed full all day. Many/most don't want to extend 9 hours of hacking into 14.
- Some Vendors/Sponsors Prefer Parties as their way of reaching your attendees, and are willing to pay for it.
- It's Easier For Overworked Conference Organizers to arrange a party than other evening activities which actually require planning.
Where this becomes a problem, though, is when the high-octane parties are the only things to do in the evening, or when they are emphasized as the main point of going to the conference. This is where the last reason above is important.
It's certainly easier, as a conference organizer, to get a vendor to sponsor a bash near the conference center than it is to plan other activities. All you have to do is connect the vendor's PR person with a nearby restaurant or hotel, and your work is done. Sometimes you don't even have to do that much. However, that's really not good enough; if you're too overworked to plan the conference activities after 5:30pm, then recruit a new volunteer who likes planning social things. You'll be pleasantly surprised to discover that your community does, indeed, have such people, and that they're thrilled to be able to contribute something meaningful.
Some Alternatives to Loud Parties
Not all conferences are selling Party-Only tickets like JSConf is. In fact, some conferences have done a very good job of providing interesting alternatives to loud, hard-drinking parties for evening activities. Let me provide a few as examples you can emulate:
Bring Back The BOFs
Unlike the O'Reilly conferences, Usenix LISA has kept their BOFs, going all evening each evening of the conference. Rather than help vendors sponsor parties, they encourage vendors to sponsor BOFs with refreshments. As a result, the BOFs at LISA are very well attended; I'd estimate that 2/3 of the conference attendees stay all evening for the BOFs.
Due to my lack of academic credentials, I usually can't get a PostgreSQL talk into the main program at LISA, but this doesn't bother me because the BOFs are so awesome. Last year there were more than 60 people in the PostgreSQL BOF, and I was able to present all about 9.1.
Parties Don't Need to Be All About Drinking
LinuxCon 2011 decided to have a big "20th Anniversary of Linux" party. So they rented a dance hall and had a 1920's-themed party with a swing band. Further, they encouraged attendees to bring period party clothes, and had a costumer on hand for those who didn't own any. The party had ersatz gambling, a photo booth, a raffle, and dance lessons. It was terrific, and I don't say that just because I won the raffle. For the first two hours, the music was low enough you could easily talk.
More importantly, the party was fun even if you didn't drink. Compare that with the common conference party, which has a bad DJ in a big barren hotel ballroom with all the free booze you can manage. At parties like that there's nothing to do but drink; it's too loud to talk and there's nothing else to do. Even as a drinker, that fits my definition of a sucky party and I'll go elsewhere.
Shall We Play A Game?
The most incredibly awesome party alternative of the year was provided by SCALE10x, who booked a Saturday night "games night" after the Weakest Geek and the end of the BOFs. This included head-to-head hackable FPS games (for the hackers), vintage arcade games, board games and RPGs, including a brand-new board game there for beta testing, Nerf weapons, and a Lazertag arena. Local attendees brought their whole families, including kids to the event. It was the best time I've had after dark at a conference in years.
Sure, there was alcohol at the event. But since it was a no-host bar and there were so many distractions, nobody drank very much. I had one beer; any more would have gotten me creamed by the 12-year-olds in the Lazertag arena. And even the 25-year-olds and the brogrammers had a good time.
SCALE has, if anything, led the way in party alternatives. Friday night they have Ignite talks, and Saturday they've always had The Weakest Geek, a pseudo-game show on stage. While their Games Night may be a bit elaborate for smaller conferences, you can reasonably plan fun entertainment which doesn't require getting soused.
Learning Moderation
So this is my call to conference organizers: the professional world of programming is not a frat house. We can provide evening activities at conferences which are either an alternative to alcohol and loud music, or which are still fun even if you don't drink. Yes, it's a bit more work, but it's worth it.
I'm not saying that we shouldn't have loud parties. Parties are fun, and in demand by a large portion of your attendees. Just that we shouldn't only have boozy bashes to the exclusion of all other evening activities. It's called "moderation", and a good thing to practice in all portions of life.
And then we'll all have more fun.
Friday, April 13, 2012
An open "Dear United" letter
Dear United:
I'm afraid that we'll no longer be seeing each other regularly. I've been seeing some new airlines, and they've made me realize what an abusive partner you've been. You've taken me for granted, cheated and abused me, and I'm not putting up with it anymore.
I fly a lot for business. As a result, I've had frequent flyer status on United/Star Alliance for years ... generally either Gold or Silver, depending on how many international trips I make. The benefits and perks you gave me kept me coming back to you, even as you fell further and further behind technologically, and have been treating your customers with steadily declining courtesy.
This year, you stripped away all of my useful benefits as a Silver member. I no longer get Economy Plus seats, free checked baggage, free upgrades, or really anything else which makes spending time with you less unpleasant. I'm in the back of the plane with the proles, and you've made the back of the plane into a pretty nasty place.
The new airlines I've been seeing (VirginAmerica, JetBlue, and Southwest) are all younger, hipper, better looking, and have much better senses of humor than you. More importantly, they appreciate me and treat me like a valued person. JetBlue and Virgin have more leg room in coach, onboard entertainment systems, wireless internet, and decent-quality food and booze on sale. Southwest is plainer, but makes up for it by being "fast" (if you know what I mean), and by checking my luggage for free. All three of my new airlines' web sites are an entire generation of technology ahead of yours.
We've both known this was coming since you acquired Continental. We knew that being bigger and fatter wasn't going to make you more sensitive to my needs. You've treated me shabbily, and it's time to end it.
Sincerely,
A former Mileage Plus member
P.S.: I have a suggestion for a new United motto:
"We know you don't have a choice when you fly. That's why you're on United."
Thursday, April 12, 2012
FOSS4G-NA 2012
I just spent the week in DC speaking at FOSS4G-NA, or as I like to call it, "Fossforgna". I came out at the invitation of the conference committee, and did a talk on "PostgreSQL 9.2 and Beyond" and a keynote on "Firehose Engineering"; slides for both are on our presentations page. My "BOF" on Postgres performance was less successful, because the attendees brought really hard problems. Here's a heads-up: PostGIS desperately needs parallel query support.
The big news, of course, is the release of PostGIS 2.0. I also saw excellent talks by CartoDB (Maps-as-a-service, running on PostGIS), the Metropolitan Airports Commission (flight tracking using PostGIS and PL/R), the British Antarctic Survey (tracking pack ice with PostGIS), and NOAA (their mapping system for international emergencies, especially poignant on Tuesday). There were multiple talks about using opengeo stacks with PostGIS for various forms of ecological and biological tracking, including ocean floor diversity, deforestation, and endangered species tracking. Overall, really cool stuff which makes me proud to have played a very small part in helping enable it.
The hallway track was also very busy and I met a lot of people I never knew were using PostGIS before.
There were somewhere between 300 and 400 people at FOSS4G-NA, and with only a couple of exceptions, every single one was using PostGIS. The conference really brought home to me that PostGIS users probably now make up the majority of the PostgreSQL user base, and we folks in "mainstream" PostgreSQL ought to be paying more attention to that. While PostgreSQL is just one contender in a crowded relational database market, PostGIS is the unquestioned first choice for GIS applications.
In any case, it was a terrific conference for open source geospatial application developers of all kinds, and a great opportunity to make business contacts in the opengeo world. I recommend attending Fossforgna again next year.
Wednesday, April 4, 2012
Reminder: GSOC applications close tomorrow
Just a quick reminder that you have less than 40 hours to apply for Google Summer of Code. If you're a student, time to get moving!
Monday, April 2, 2012
Baby you can drive my CAR
Photo courtesy K Lars Lohn. Used with permission.
Yesterday my colleague Robert Hodges made a post disputing the CAP theorem. While I have some issues with his logic -- and after all, it was a beery post put up on April 1 -- his post does bring up an issue I've meant to blog about for a while, so here goes.
CAP is an exclusionary triad: it consists of three elements, any pair of which excludes the other. Another exclusionary triad all my readers will be familiar with is PQT: Price, Quality, Time. Thing is, CAP is only one of several possible triads in distributed system design, and is not the most interesting triad for clustered database designers.
Most of the time you can ignore Partition Tolerance in clustered database design, since the cluster is limited to a single data center and partial network failure isn't considered that great a risk. This isn't universally true; for example, CAP rears its ugly headgear whenever you contemplate automated failover for a failover/pooling proxy. However, there are more substantial tradeoffs to be made even in a single-network system.
There are several other possible combinations which fit the exclusionary triad model and relate more closely to database design. For example, CRT (Complexity, Response Time, Throughput) is a triad for read requests in a distributed database. I may write more about CRT later.
Today, though, I'm going to write about one of the triads which govern writes to a clustered database system: CAR, for Consistency, Availability, and Response Time. You can choose any two of the three, but not all three. To explain:
Consistency refers to having writes succeed or fail, and produce the same results, regardless of which node you connect to. Ideal Consistency would be the consistency of a single node.
Availability is the clustered database's ability to survive the failure of one or more nodes while still servicing client requests. Ideal Availability would support any number of failed nodes as long as one node was still running.
Response Time is the speed of service of a single write request. Ideal Response Time is the response time of a single node.
While real systems make some compromises on all elements of the triad, we can show this as a valid triad by defining the three extreme cases:
CR: a traditional, single-node transactional database is Consistent and has good Response Times. Obviously, it cannot survive node failure.
AR: a non-transactional, multi-node key-value database with fully replicated data on each node, which accepts writes to any node and asynchronously attempts to copy those writes to the other nodes, would have perfect Availability and single-node Response Times, while completely abandoning Consistency.
CA: on the other hand, consider a multi-node database in which each node has duplicate copies of the data and all transactions are written to all nodes using 2-phase commit (2PC). This database would have perfect Consistency and maximum Availability, at the cost of Response Times which get progressively slower as the cluster grows.
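To make the Response Time cost of the CA corner concrete, here's a minimal latency sketch in Python. It isn't taken from any real database's replication code; the per-node latencies are invented for illustration. It simply compares a 2PC-style write, which must wait for the slowest of N replicas before acknowledging, against an async AR-style write that returns after the local node alone.

import random

def node_latency_ms():
    # Assumed latency model: each node answers in 5-50 ms,
    # with an occasional 200 ms outlier (e.g. a busy or slow node).
    base = random.uniform(5, 50)
    return base + (200 if random.random() < 0.05 else 0)

def ca_write(n_nodes):
    # CA-style (2PC-ish): the commit is acknowledged only after
    # every replica has responded, so we wait for the slowest one.
    return max(node_latency_ms() for _ in range(n_nodes))

def ar_write():
    # AR-style: acknowledge after the local write; replication
    # to the other nodes happens asynchronously, off the clock.
    return node_latency_ms()

if __name__ == "__main__":
    random.seed(42)
    trials = 10000
    for n in (1, 3, 5, 9):
        ca = sum(ca_write(n) for _ in range(trials)) / trials
        ar = sum(ar_write() for _ in range(trials)) / trials
        print(f"{n} nodes: CA-style avg {ca:6.1f} ms, AR-style avg {ar:6.1f} ms")

With the occasional slow node in the mix, the CA-style average climbs steadily as you add replicas, while the AR-style average stays flat: exactly the progressive Response Time penalty described above.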
If you look at today's open source clustered databases, you'll find that they have all chosen one side or another of the above triangles, with some compromising in one direction or another. For example:
CR: mainstream PostgreSQL and MySQL.
AR: MongoDB, CouchDB, Bucardo.
CA: Continuent, VoltDB, Synchronous Replication, PostgresXC.
Obviously there are more design differences between the databases above than where they fall on the CAR triangle. But it's a good way to look at the tradeoffs you need to make. Probably more than CAP.