The issue, as I understand it, was that autovaccum has been (since 8.3.0) re-using a pointer for multiple passes over the same leaf page while pruning dead nodes in b-tree indexes. If this occurred at the same time a concurrent process was inserting an new index pointer into dead space on the same leaf page at the same time, autovacuum could cause that index pointer to be flagged dead and removed. As a result, a row would exist in the table but not in the index -- a form of mild data loss or data corruption.
The above depends on split-millesecond timing to occur, which is how we were able to miss the issue for so long. The odds of it happening for any particular insert, even on a busy system, were miniscule, maybe 1 in a million. Further, it required frequent inserts into a table which also had heavy updates and deletes. Not much to worry about overall, yes? Well, no.
This is where the million monkeys come in. Just as the apochryphal mega-primate literary club will, eventually, due to random typewriter-key-mashing, produce War and Peace, with enough database activity your chances of hitting this bug move from "highly improbable" to "nearly certain". Every high-traffic production database has at least one table which receives heavy inserts, updates, and deletes (think queue table), and given a year or so might have a couple billion writes and a million autovacuum runs. Suddenly "one in a million" timing becomes 99.5% probability.
Indeed, on examination, we've found this kind of index corruption on four production customer databases. And we haven't checked nearly all of them.
Obviously this means you should update your production databases as soon as you can (downtime this weekend, perhaps?). Folks have also asked me about how to check for this form of index corruption. It's actually quite simple:
First, do a count of the rows in the table by full table scan:
staffos=# SELECT count(*) FROM candidates;
count
-------
1449
(1 row)
Then, force an index scan, and do a count(*) using a condition you know to be universally true for an indexed column in the table:
staffos=# set enable_seqscan = false;
SET
staffos=# select count(*) from candidates where id > 0;
count
-------
1448
(1 row)
Results like the above would indicate that you've encountered the index corruption bug. Note that you need to test each index to be sure. You fix it by reindexing:
staffos=# REINDEX INDEX candidates_pkey;
REINDEX
staffos=# select count(*) from candidates where id > 0;
count
-------
1449
(1 row)
Hopefully this helps with the bug. Our apologies for not catching this issue earlier, but in our defense, PostgreSQL's record on subtle bugs like this is much better than competing databases. This does show you why it's vital to stay up to date with the most recent update releases, though!