Wednesday, December 11, 2013
... Postgres. You wanna document store? We got your document store, right here.
Thanks to Craig Kerstiens, I just learned about the pgREST project. This project turns PostgreSQL into a RESTful JSON document store, capable of running ad-hoc JavaScript code (either LiveScript or v8) in order to do searches and batch modifications on the server.
pgBSON, a Postgres extension implementing support for a MongoDB-compatible BSON datatype, also recently went 1.0, signalling that it's ready for production use.
These are powerful demonstrations of how adaptable Postgres is. If it's data, we can find a way to do it. And unlike a typical document store database, you can do relational-SQL stuff as well, giving you your choice of interface.
pgREST is also our first major project out of our Taiwanese community, so I'm pretty excited about it.
Of course, we're still waiting for fully indexable, binary-storage JSON in 9.4, but that's on schedule for September.
Friday, December 6, 2013
Checksums! On December 10th
SFPUG will be hosting committer Jeff Davis talking about his 9.3 feature, data page checksums for PostgreSQL, and how to use them. As usual, we will have live video of his presentation. The presentation should start around 7:15 PM PST.
Thursday, December 5, 2013
Why you need to apply today's update real soon
So, another month, another PostgreSQL update. This one is a lot more critical than most because it patches up to three data loss issues, depending on what version of PostgreSQL you're on and which features you're using. While all users need to plan to apply this update sometime during the next couple weeks, two kinds of users need to schedule an update downtime for this weekend, if at all possible:
- Users on version 9.3
- Users of binary replication
Annoyingly, you'll have to do some additional stuff after you update:
- If using binary replication, you need to take a new base backup of each replica after updating.
- You should run the following on each production database after updating:
VACUUM; -- optionally, ANALYZE as well
This second step is critical for users on 9.3, and a generally good idea for users on other versions. Note that, while VACUUM is a non-blocking operation, it can create a lot of IO, so it's a good idea to do this during a slow traffic period and monitor it for pile-ups.
More information about the replication issue is here.
So, how did this happen? Well, all three issues were unexpected side effects of fixes we applied for other issues in earlier versions. For example, the replication issue is the result of the combination of two independent fixes for failover issues, both of which needed to be fixed. Since all of these issues depend on both timing and heavy traffic, and their effect is subtle (as in, a few missing or extra rows), none of our existing testing was capable of showing them.
If anyone wants to devise destruction tests to detect subtle data corruption issues in PostgreSQL before we release code -- if anyone can -- please offer to help!
Wednesday, November 20, 2013
PostGIS Day Tomorrow
The PostGIS project is throwing a "PostGIS Day" online tomorrow. As part of this, I will be on a panel with Paul Ramsey, Stephen Mather and Bborie Park at 10am PST tomorrow (Thursday) on Google Hangout. Join us there!
Thursday, November 14, 2013
First Look at PostgreSQL RDS on Amazon
By now a lot of you have heard about the new Postgres RDS on Amazon Web Services. This is something I've been anticipating for quite a while, so I decided to log in and try it out right away. The summary: the service is up, it's real, they did some nice things, the press release isn't completely accurate, and they did fork PostgreSQL.
First off, looking at the feature list and what they decided to include with PostgreSQL, they seem to have used Heroku as a template for what they should do. So each instance includes PostGIS, hstore, PL/Perl, PL/Python, etc. They also automate point-in-time recovery in some way, although I need to test that out. I was particularly interested to see how they could support PL/Python, which requires superuser access to create functions.
So I logged into pgExperts' AWS account, and spun up an instance of Postgres RDS.
This gave me warm fuzzies:
Latest version of PostgreSQL, and "auto-minor version upgrade", which is on by default. Yay! No nagging AWS users to apply the darned updates, already. There's also a bunch of options for configuration, backups, maintenance windows, security, etc. Altogether it's somewhat more "gear-head" than Heroku; you have to configure these things, AWS doesn't do it for you. By default, for example, port 5432 is blocked from the internet until you create a new "security group" to unblock it.
Deployment of my new instance took longer than I'm used to with AWS (around 15min), but that's possibly because the announcement today has meant a rush of people trying the service. Besides, AWS has me spoiled for deployment times.
Monday, November 11, 2013
SFPUG November: Gandi.net and Postgres (live video)
As usual, we plan to have live video of Wednesday's San Francisco PostgreSQL User Group meeting. This one will involve Gandi.net showing off their new PostgreSQL-As-A-Service offering, as well as talking about how Gandi.net uses PostgreSQL internally to support their services. Since PostgreSQL.org uses Gandi.net to host many of our domains, it's a wonderful friendly full-circle group hug. Tune in! We're liable to start around 7:15 PM PST.
Sunday, November 10, 2013
Moving tablespaces
PostgreSQL has a feature called "tablespaces" which allows you to put separate data directories on different mount points. There's a variety of uses for this, including: storing data somewhere different from $PGDATA, putting your hot tables on SSD, or expanding the amount of storage space available to your database without taking it down. In some cases, they can even be used to help parallelize IO access.
However, it's fairly common that at some point you'll want to move an existing tablespace to a different mount point, because you're changing storage, rearranging the filesystem, or making backups easier. There are two different ways to move tablespaces in current PostgreSQL: the slow online way and the fast downtime way.
The slow online way works like this:
- create a new tablespace in the desired new location
- go through all of the tables/indexes stored in the old tablespace, and do ALTER <object> ... SET TABLESPACE <new_tablespace> on each of them.
- drop the old tablespace
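As a minimal sketch, the online method looks something like this (the tablespace, table, index, and path names here are all made up for illustration):
CREATE TABLESPACE new_space LOCATION '/mnt/new_disk/pg_data';
ALTER TABLE big_table SET TABLESPACE new_space;       -- repeat for each table in the old tablespace
ALTER INDEX big_table_pkey SET TABLESPACE new_space;  -- ... and for each index
DROP TABLESPACE old_space;                            -- only succeeds once the old tablespace is empty
Keep in mind that each ALTER ... SET TABLESPACE physically copies its object, and takes an exclusive lock on that object while it does so; that's why this method is slow even though the database stays up.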
The second approach requires a database downtime and some understanding of how PostgreSQL stores tablespaces. Note that this technique will not work on versions prior to 9.2, since those versions also store the tablespace path in a system table, which forces you to use the online method.
If you look in your $PGDATA directory, you will find a subdirectory called pg_tblspc. This directory contains entries like this:
lrwxrwxrwx 1 josh josh 26535 -> /home/josh/tblspc
Each listing is a link to a directory location. If you follow that link, you find something like this:
drwx------ 3 josh josh 4096 PG_9.3_201306121
... in the target directory, there is a subdirectory named after the exact version which created it, which actually contains your tablespace data. At this point, a different way to move the tablespace should have occurred to you, and it does work.
- shut down the database system
- move or copy the PG_9* subdirectory from the old tablespace mount to the new one.
- change the link in pg_tblspc to point to the new mount point
- start the database system back up
The reason why this is a "faster" method is that step 2 allows you to use whatever filesystem copying tricks you wish to speed things up, such as a 2-stage rsync. It also allows you to handle cases where both tablespaces won't be online at the same time.
Also note that I'm not sure this works the same way on Windows.
Hope that helps!
Friday, November 8, 2013
ALTER TABLE and downtime, part II
Now, you'll note in part I that I said that doing certain ALTER operations was "free", in quotes. That's because these operations aren't actually worry-free; there's the major issue of locking.
Currently, any ALTER TABLE you run requires an ACCESS EXCLUSIVE lock on the altered table in order to commit. This is a lock which blocks not only writes, but reads as well. You don't see read locks much in PostgreSQL, so one tends to forget they exist until you do something like running an ALTER TABLE in a database which processes 15,000 queries per second. Then, this sort of thing can happen:
- ALTER TABLE waits behind a long-running reporting or write query.
- SELECT queries start to pile up behind the ALTER TABLE statement
- server starts performing badly because of the pile-up, and is effectively locked up for a few minutes.
- the pile-up takes several more minutes to clear, even after the ALTER has gone through.
This means that, for deployments which involve ALTER TABLE, you'll need to plan on a brief downtime, or at least watch the database closely, if you run an extremely-high-transaction-rate database with many multi-statement transactions. In many cases I get around this by polling for a manual lock on the table, and proceeding with the ALTER once I get one. This doesn't work well with existing schema management tools, though.
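A minimal sketch of that lock-polling approach, using a hypothetical users table (the SET STATISTICS change is just an example ALTER):
BEGIN;
-- NOWAIT makes the lock attempt fail immediately instead of queueing behind long-running queries
LOCK TABLE users IN ACCESS EXCLUSIVE MODE NOWAIT;
ALTER TABLE users ALTER COLUMN hat_size SET STATISTICS 1000;
COMMIT;
-- if the LOCK fails, roll back, sleep briefly, and retry until it succeeds
On 9.3 and later you can get a similar effect by setting lock_timeout before running the ALTER, so that it gives up rather than blocking everyone else.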
I'm not just talking about ALTER TABLE ADD/DROP COLUMN here; this happens for any ALTER TABLE statement, including ALTER TABLE SET STATISTICS. Unfortunately, while PostgreSQL DDL statements are transaction-safe, our system tables themselves are not currently MVCC, which means that the only way we can modify anything in the system tables is to take an exclusive lock on it.
Fortunately, thanks to Robert Haas, that just changed in 9.4 development. In the future, we should be able to do the following operations without taking more than a share lock on the table:
- SET STATISTICS
- DROP TRIGGER/CONSTRAINT
- DROP a foreign key from a referencing table
- SET DEFAULT
- SET NOT NULL
- ADD COLUMN
There is one special case of ALTER TABLE you need to know about in addition to this, though. That's ALTER TABLE ALTER COLUMN ... TYPE <new_data_type>. More on this in a succeeding post.
DDL = Data Definition Language, i.e. CREATE/ALTER/DROP statements.
MVCC = Multi-Version Concurrency Control, the way we ensure that users can see different versions of the database depending on their current transaction context.
Tuesday, November 5, 2013
ALTER TABLE and downtime, part I
One of PostgreSQL's headline features -- at least, compared to other relational databases -- is the ability to do some ALTER TABLE ALTER COLUMN commands for "free". It's part of how we try to make things easier for DBAs and Devops staff, by not requiring a downtime for operations which require one on other databases. But there are some limitations and caveats.
First, let me give you the good news. The following operations affect metadata only and require neither rewriting the table nor any per-row verification of values:
- DROP COLUMN
- ADD COLUMN with no DEFAULT
ADD COLUMN can be free too, provided that the new column allows NULLs and doesn't have a DEFAULT value set. You might think the following two transactions are almost identical:
ALTER TABLE users ADD COLUMN hat_size TEXT DEFAULT 'L';
BEGIN;
ALTER TABLE users ADD COLUMN hat_size TEXT;
ALTER TABLE users ALTER COLUMN hat_size SET DEFAULT 'L';
COMMIT;
... but the first one will rewrite the table entirely, whereas the second one will not. If you try both of those, and look at the results, it becomes clear why:
Version 1:
user_id | hat_size
---------+----------
1021 | L
1022 | L
1023 | L
1025 | L
1026 | L
Version 2:
user_id | hat_size
---------+----------
1021 |
1022 |
1023 |
1025 |
1026 |
As you see, if you do the ADD and the DEFAULT in a single statement, Postgres helpfully pre-populates the column for you. Whereas if you do them in two statements, it doesn't, and therefore doesn't rewrite the whole table. Yes, I realize that this behavior is fairly arbitrary, but changing it at this point would break a lot of backwards compatibility. And the "auto-populate the default" feature was added in the first place (in 7.2 IIRC) because people wanted it.
Now, above I said that certain actions were "free". The reason for the quotation marks is that these ALTERS do actually have costs. More on that in my next post.
Thursday, October 24, 2013
SFPUG Wants Your Feedback
The San Francisco PostgreSQL User Group is looking for feedback and recommendations for what we should do in 2014. As such, I've set up a survey here. We are also looking for speakers, sponsors and venue hosts; contact me or fill out the survey. Thanks!
Friday, October 18, 2013
Adjusting defaults for PostgreSQL.conf
For PostgreSQL 9.4, we've been discussing adjusting defaults for postgresql.conf to give users a better out-of-box experience. Our existing defaults were set at a time when we used SysV memory, and when the average developer laptop or low-end server had 512MB of RAM. So, a survey.
Here are some things contributors have already adjusted:
- shared_buffers now goes up to 128MB automatically depending on available RAM reservation (we did this for 9.3)
- effective_cache_size is automatically set to shared_buffers X 3 (might be X 4 by the time we release)
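If you're curious what your own installation is currently using, you can check the relevant settings in pg_settings; the list of names below is just a sample:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem', 'maintenance_work_mem');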
However, one thing which came up during this discussion is that many people may be running PostgreSQL on virtual machines which have less than 1GB of RAM available to PostgreSQL. I don't think they are, but we lack data. So I put up a survey to see what amounts of RAM our users really have on their smallest machines/VMs. Please fill out the survey so that we can have reasonable input for development, and if you have comments, please put them below.
Wednesday, October 16, 2013
SFPUG: Stephen Frost Hacking Live
Tommorrow night (October 17th) we will have live streaming video of Stephen Frost presenting on how to hack PostgreSQL. Video will start around 7:40PM PDT, hopefully. This is for the October Surprise meetup.
Thursday, October 10, 2013
How urgent is that Update?
We've released a cumulative bugfix release today for all supported versions of PostgreSQL, in case you somehow missed it. This update doesn't contain anything time-critical for most users. You'll want to add it to your upgrade queue for your next scheduled downtime, but you don't need to schedule an update for this weekend. Unless, of course, you are one of the two dozen or so users who reported the bugs fixed.
However, you don't really need me to tell you that, because it's in the language of the release. More on this below.
A couple of notable fixes in this release:
- Hstore-JSON extension fix: if you are using both Hstore and JSON in Postgres 9.3, you should run ALTER EXTENSION hstore UPDATE after updating your server. Due to a mistake in extension versioning, users missed out on some JSON-hstore conversion functions in 9.3.
- Guarantee transmission of all WAL files: users with replication setups and stringent data retention requirements will want to apply the update sooner rather than later because of this fix. It fixes an issue where under some circumstances a promoted replica might ignore a few transactions it actually received from the master before failover, considering them lost.
Over the last dozen or so update releases, we've standardized the language we use to explain how critical the fixes in the release are for most users.
First, there's the question of what kind of update it is:
- Bugfix Release or Update Release: this usually contains two to four months of accumulated bug fixes for various reported bugs. Some of these bugs may be critical data loss bugs; see the release contents to check this.
- Security Release: the update contains a patch for one or more security holes. This makes the update at least somewhat urgent, because as soon as the security patch is posted online, the black hats start working on exploits (assuming they didn't before then).
Then we use some standard language about the general criticality of bugs fixed in the release:
- All users of the affected versions are strongly urged to apply the update immediately: the release contains a high-exposure security hole or critical data loss bug which affects all users. Stop reading, declare a downtime, and apply the update now. Fortunately, we've only had one of these in the last 6 years.
- All users should apply this update as soon as possible: the release contains one or more security holes and/or major data loss bugs expected to affect most users. Schedule a downtime tonight or this weekend and apply the update.
- All users of _________ should apply this update immediately/as soon as possible: the same as above, but only for users of a particular feature or extension.
- Users should apply this update at the next available opportunity: the release contains some low-risk security holes and/or serious data loss bugs which only occur under certain circumstances. Updating to this release shouldn't be put off for more than a couple of weeks, but you can wait for a good low-traffic window to take the downtime.
- Users should update at the next regular maintenance window/scheduled downtime: this update contains nothing critical or even serious for most users; it's largely a cumulative collection of bugfixes. The update should be added to the general queue of OS, library and application updates for the production servers.
Note, however, that I said most users. Every update we push out is urgent for at least a few users who are hit by the specific bugs involved. For example, for the current release, users who depend heavily on the following features should consider updating sooner rather than later:
- Range indexes
- SP-GiST
- JSON
- regular expressions
- prepared query plans
- SSL connections
- more than 24GB (!) of work memory
That's why we're reluctant to adopt any system which rates update releases on some absolute scale. It doesn't matter how critical the update is in general; it matters how critical it is to you.
Monday, October 7, 2013
Fancy SQL Monday: Finding broken keys
One of the things which usually happens on the rare occasions when you have PostgreSQL database corruption is that indexes become corrupted so that index entries no longer match the underlying table. For regular indexes, this can be fixed with a REINDEX or pg_repack. But it becomes a real problem when the index was used for a primary key, unique index or foreign key, because then duplicate or broken data gets inserted into the tables, and attempts to rebuild the index will fail.
Database-wide, the easiest way to find these broken indexes is to do a pg_dump and pg_restore of your whole database; data which violates keys will show up as errors on the pg_restore.
The second thing we need to do is find the duplicate or missing data so that we can "clean" it and then rebuild the index. First, we'll need to disable index access to make sure that we don't hit the index and get erroneous information back:
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SET enable_indexonlyscan = off; -- 9.2+ only!
So for our first SQL trick, we want to query a table with a simple SERIAL surrogate key and find duplicate IDs. Then, for each of these IDs, we want to get the entire referenced row so we can look at it and decide how to fix the data. One way to do that is using a WITH clause and an aggregate:
WITH dups AS (
SELECT ticketid, count(*)
FROM tickets
GROUP BY ticketid
HAVING count(*) > 1 )
SELECT *
FROM tickets
JOIN dups USING ( ticketid )
ORDER BY ticketid;
And that gives us a list of duplicate ticketids. From there, we can choose to delete rows, assign new IDs, or do something else, depending on the nature of the duplication. Of course, if you never created any real keys (in addition to the surrogate key), you're going to have a devil of a time figuring things out.
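If you decide that a particular extra copy should simply be dropped, one approach is to look at the physical row versions and delete one by its ctid; the id and ctid values below are made up for illustration:
SELECT ctid, * FROM tickets WHERE ticketid = 1021;               -- inspect the individual copies
DELETE FROM tickets WHERE ticketid = 1021 AND ctid = '(8190,3)'; -- remove the copy you don't want
Do the inspection and the delete close together, since ctids can change when rows are updated or the table is rewritten.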
However, the above query format isn't the only way we can search for duplicate IDs. We can also use a windowing query, which is often easier to type if we have multicolumn unique constraints:
WITH dupcount AS (
SELECT comments.*,
count(*) OVER (
PARTITION BY post, commentor, commented_at)
AS entrycount
FROM comments )
SELECT * FROM dupcount
WHERE entrycount > 1
ORDER BY post, commentor, commented_at;
Finding broken foreign keys is a bit more straightforward. In this case, you just want to use an outer join to find the missing parent records:
SELECT shoppingcart_item.*
FROM shoppingcart_item
LEFT OUTER JOIN shoppingcart
ON shoppingcart_item.shoppingcart_id
= shoppingcart.shoppingcart_id
WHERE shoppingcart.shoppingcart_id IS NULL
ORDER BY shoppingcart_item.shoppingcart_id;
Once you've found those, though, you can't do much more than dump them to a file and delete the records, since there's no way to find out what the missing parent record data was.
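For the delete step, a NOT EXISTS anti-join selects the same orphaned rows as the outer join above; this is just a sketch using the same hypothetical tables:
DELETE FROM shoppingcart_item
WHERE NOT EXISTS (
    SELECT 1 FROM shoppingcart
    WHERE shoppingcart.shoppingcart_id = shoppingcart_item.shoppingcart_id );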
Happy de-corrupting!
Friday, October 4, 2013
De-corrupting TOAST Tables
In a database which has been mildly corrupted, the second issue you're liable to run into is corrupt TOAST tables. You'll know you have one because you'll be running a SELECT against the table or a pg_dump, and you'll get an error message like this one:
ERROR: unexpected chunk size 226 (expected 175) in final chunk 1 for toast value 8846932 in pg_toast_2877446
TOAST tables are where PostgreSQL stores offline compressed data for large values, like big text fields, geometries, large arrays, and BLOBs. What the server is telling you in that error message is that the size of the compressed block in the TOAST table doesn't match its metadata, and as a result Postgres can't decompress it.
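As an aside, the number at the end of the TOAST table's name is the OID of the table which owns it, so you can identify the affected table by casting that number to regclass (the OID here is the one from the error message above):
SELECT 2877446::regclass;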
Among other things, this is generally an indication that either you've found a PostgreSQL bug (which you should report, if you're running on the latest update version), or you've had a serious corruption event, if you didn't know already. So you should definitely do a full corruption check on the database.
But, what to do about the bad TOAST record?
Obviously, if you have a backup from just before the corruption occurred, the answer is to restore that backup. But if you don't: this is fortunately hand-correctable.
The first thing to do is to find the actual row or rows where the corruption is. I do that with a simple DO script like this:
DO $f$
declare
curid INT := 0;
vcontent TEXT;
badid INT;
begin
FOR badid IN SELECT message_id FROM messages LOOP
curid = curid + 1;
if curid % 100000 = 0 then
raise notice '% rows inspected', curid;
end if;
begin
SELECT contents
INTO vcontent
FROM messages where message_id = badid;
vcontent := substr(vcontent,1000,2000);
exception
when others then
raise notice 'data for message % is corrupt', badid;
continue;
end;
end loop;
end;
$f$;
The goal here is to reach into the TOASTed value for each row (the large-text contents field) and de-compress it, which should raise any errors you're going to get. Of course, you need to tailor the above for your own tables.
Once you've identified one or more rows, there are 3 steps you need to take to clean up the corrupted TOAST record. Note that this does not involve recovering the corrupted value: that's already lost, and you'll do yourself a favor by not trying to hang onto it. Get that from your backups.
Once you've located the row or rows which have corrupt TOAST data, you need to do three things to make sure that the DB is clean:
- Re-index the TOAST table (in case it's the index which is corrupt)
- Break the link between the live row and the corrupt TOAST record.
- Rebuild the regular table in order to expunge the bad TOAST pointer.
Re-indexing the TOAST table first covers the case where it's the TOAST index, rather than the data, which is corrupt:
REINDEX TABLE pg_toast.pg_toast_2877446;
Breaking the link is the most obvious step:
UPDATE messages SET contents = '' WHERE message_id = 432212;
By setting the TOASTED column to a new value, we make sure that the "current" version of the row does not link to the bad TOAST entry. I did say we'd be destroying data here, yes?
But that's not all we have to do. The old row version still exists, and will continue to produce errors every time VACUUM visits it (preventing VACUUMs of that table from completing). So we need to build a new version of the table which does not include that old row version.
Fortunately, there is a utility to do this: the super-awesome pg_repack, by Daniele Varrazzo, Josh Kupershmidt and Itagaki Takahiro. This utility will save your life someday, so buy Daniele, Josh and/or Itagaki a drink of their choice if you see them.
pg_repack -n -t messages messagedb
Now, while pg_repack does not need an exclusive lock for most of the rebuild, it does need three things, which you need to be prepared for:
- enough disk space to make a full copy of the table and its indexes.
- a short exclusive lock at the very end of processing, including on related tables by FK.
- to wait for all long-running transactions to complete
Wait, though, what's this?
ERROR: could not create unique index "messages_pkey"
DETAIL: Key (message_id)=(795437) is duplicated.
Oh-oh. Looks like we have more work to do. See you Monday!
Thursday, October 3, 2013
First thing you do is take a backup
Last week seems to have been a week for "pilot error": not just on SouthWest, but among our clientele at PGX. Three customers in two days managed to corrupt their databases through sysadmin mistakes, such as remounting the SAN with PostgreSQL running. Which reminded me of an important tip for what you should do whenever anything goes really wrong with your database server:
- Shut down PostgreSQL if it's running.
- Check if you have current valid backups and/or replicas which are not already corrupt. If so, make sure these are protected.
- If there are no backups, make a full copy of the database files and all transaction logs.
Many things you will try to rescue a corrupt database or a damaged server can cause further destruction of data. Worse, certain kinds of corruption (caused by bad storage or bad RAM, for example) will spread to the replicas and/or your backups given time. In a worst case, you can fiddle with your damaged DB server long enough that the regularly scheduled backup kicks in ... and overwrites a good backup with a corrupt one.
Of course, maintaining a few days or weeks of rotational backups is an even better idea, and protects you against many forms of data loss. I recommend Barman or pgRollDump for this purpose.
Saturday, September 21, 2013
Free VMs for Reviewers
In case you somehow missed it, the second CommitFest for PostgreSQL 9.4 development is underway. While we don't have quite as many patches in the queue as we had last time, we have quite a few, and more than we've had for any CF2 before. This means we need as many reviewers as possible!
In order to make sure that nobody is prevented from reviewing due to not having a suitable development environment available, the PostgreSQL community is offering free virtual machines for reviewing and testing patches for this CommitFest.
If you want a VM for this purpose, please email me the following at any point during the CF:
- Your public SSH key.
- Which patch(es) you intend to review/test.
- Preference for CentOS or Ubuntu.
- When you expect to do the review.
VMs will come equipped with all tools required to build PostgreSQL, plus a checkout of the PostgreSQL code from the beginning of the CF. You may need to install additional requirements yourself if you are reviewing extensions or special features with dependencies.
Note that I may take a day to respond with your VM access.
Monday, September 9, 2013
More about my favorite 9.3 Features (video and more)
If you somehow missed it, PostgreSQL 9.3.0 is now available, just in time for your back-to-school shopping. 9.3 includes lots of great stuff, some of which I've already been using in development, and I'll tell you more about my favorites below. There's also a survey and we'll have live video of SFPUG doing 9.3 on Thursday.
No More SHMMAX
We didn't emphasize this in the release announcement -- mainly because it's like removing a wart: you don't really want to talk about it -- but this is the one 9.3 change liable to make life easier for more developers than any other. We've stopped using SysV memory for anything other than the postmaster startup lock, which means that you can now adjust shared_buffers to your heart's content without needing to mess with sysctl.conf. Let alone the satanic incantations you have to go through on the Mac.
This also clears one of the main barriers to writing simple autotuning scripts. Which means I'm out of excuses for not having written one.
Custom Background Workers
Need a daemon to do background work alongside Postgres, such as scheduling, queueing, maintenance, or replication? Maybe you want to intercept MongoDB-formatted queries and rewrite them for Postgres? Custom background workers allow you to create your own "autovacuum daemon" which does whatever you want it to.
Michael Paquier will be presenting Background Workers for SFPUG on Thursday the 12th (7:30PM PDT). Details on our Meetup Page, including a link to live video for those of you not in the Bay Area.
Streaming-Only Cascading
This has been my biggest desire since 9.2 came out; we were so close to not needing to worry about archiving, ever, for small databases. And now we're there. You can make chains of replicas, fail over to one of them, remaster, make a replica at a new data center the master, change the direction of replication, and lots more configurations without needing to worry about WAL archiving and all its overhead.
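For anyone who hasn't set up cascading before, pointing a replica at another replica is just a recovery.conf on the downstream standby, along these lines (the hostname and replication role are placeholders):
# recovery.conf on the downstream replica
standby_mode = 'on'
primary_conninfo = 'host=replica1.example.com port=5432 user=replicator'
recovery_target_timeline = 'latest'    # lets the standby follow its upstream across a promotion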
If you combine this with Heikki's work on pg_rewind, things get even more flexible since you don't have to resnapshot for failback anymore.
I'll be presenting a live demo of this feature at the SFPUG meeting, including connecting replicas in a ring (all replicas -- no master!).
So, what's your favorite 9.3 feature? Vote here!
Friday, August 23, 2013
PostgreSQL plus Vertica on Tuesday: SFPUG Live Video
This upcoming Tuesday, the 27th, SFPUG will have live streaming video of Chris Bohn from Etsy talking about how he uses PostgreSQL and Vertica together to do data analysis of Etsy's hundreds of gigabytes of customer traffic. That's barring technical difficulties with the video or internet, of course.
The video will be on the usual SFPUG Video Channel. It is likely to start around 7:15PM PDT. Questions from the internet will be taken on the attached chat channel.
For those in San Francisco, this event will be held at Etsy's new downtown SF offices, and Etsy is sponsoring a Tacolicious taco bar. Of course, the event is already full up, but you can always join the waiting list.
In other, related events, sfPython will be talking about PostgreSQL performance, and DjangoSF will be talking about multicolumn joins, both on Wednesday the 28th. I'll be at DjangoSF, doing my "5 ways to Crash Postgres" lightning talk.
Thursday, August 8, 2013
Another reason to set pg_hba.conf restrictively
pg_hba.conf can be a pain in the neck. You're expecting some burst traffic, you deploy 8 new virtual appservers, fire them up, and ... they can't connect to Postgres. Gah!
That kind of experience makes it tempting to set everything to:
host all all 0.0.0.0/0 md5
But this month I ran into another example of why you shouldn't. While everyone thinks about malefactors as the main reason to have tight restrictions on which hosts can connect via pg_hba.conf, one should remember that accidents are far more common than malice. As was the case this time.
I set up a benchmark test for a client which involved replaying 8 hours of activity logs against a test copy of their database. What they didn't tell me (because they didn't know due to poor internal documentation) was that their database used DBlink to write data to another database several times a minute.
Thank goodness they had exercised good pg_hba.conf practices! All I got was a bunch of DBlink errors, instead of scribbling all over production data. Whew!
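For reference, "good pg_hba.conf practices" here just means scoping each entry to a specific database, role, and subnet rather than opening everything up; a hypothetical example entry:
host    appdb    appuser    10.0.1.0/24    md5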
Tuesday, August 6, 2013
Speaker Boot Camp at LinuxCon
Are you technically expert but locutionarily inept? Are you giving your first or second talk this year? Do you present frequently, but don't feel like you're "connecting" with your audience? Are you going to LinuxCon in New Orleans?
Then you should attend the Give a Great Tech Talk workshop the evening before LinuxCon. It's a sort of "speaker boot camp"; if you are speaking at LinuxCon, or better are attending LinuxCon and planning to speak at other conferences in 2013-2014, you should consider attending.
So that speakers can have coaching before the beginning of the conference, the session will be from 5pm to 8:30pm on Sunday the 15th (concurrent with the Meet and Greet), before the start of LinuxCon, at the Hyatt. The Linux Foundation will provide food. Join us and learn to deliver a better talk!
Monday, August 5, 2013
Fancy SQL Monday: generate_series() and cross joins
So, here's the situation: you want to create a report which shows totals of registrations by day and by category, and shows zeros for each day and category you don't have any registrations for. Your first attempt looks like this:
SELECT date_trunc('day',registered_on) as registration_date,
category,
count(*) as no_registrations,
sum(fee) as total_earned
FROM registrations
WHERE registered_on BETWEEN '2013-07-01' and '2013-08-01'
GROUP BY date_trunc('day', registered_on), category
ORDER BY date_trunc('day', registered_on), category;
However, the results of this query have a problem: if there were no registrations in a particular category on a particular day, it doesn't display any results. You want it to display a zero. So, how do you do that?
Well, what you want is to multiply the set of days in the month with the set of categories, forming a matrix where every category/day is represented. The SQL syntax to say "give me all of X multiplied by all of Y" is a CROSS JOIN. Normally, you don't want to do a cross join because it could result in millions or billions of rows, which would eat all the memory on your server. But this time, you want one.
To do a cross join, though, you need the set of all days. This is easy if you have a calendar table, but if you don't ... generate_series() to the rescue! generate_series() is a terrific utility which creates one row for each iteration of a set expressed by start, end, interval. So, like so:
generate_series('2013-07-01','2013-08-01',interval '1 day')
Putting that together with the categories in a cross join, we get this:
SELECT category, reg_day
FROM categories
CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
AS gs(reg_day)
That gives us our grid. Now we're going to use a LEFT JOIN to attach it to the set of registrations, so that we get zeros for days without registrations. The easiest way to do this is to wrap the above query in a WITH clause, otherwise known as a CTE.
WITH cat_days AS (
SELECT category, reg_day
FROM categories
CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
AS gs(reg_day)
)
SELECT cat_days.reg_day,
cat_days.category,
count(registrations.registration_id) as registrations,
coalesce(sum(fee),0.0) as total_earned
FROM cat_days
LEFT JOIN registrations ON
cat_days.category = registrations.category
AND cat_days.reg_day <= registrations.registered_on
AND ( cat_days.reg_day + interval '1 day') >
registrations.registered_on
GROUP BY reg_day, cat_days.category
ORDER BY reg_day, cat_days.category;
So: CROSS JOINs, WITH clauses, and LEFT JOINs can actually be useful. Enjoy!