Database Soup: Finding Duplicate Indexes

Recently a client asked us to help them find and weed out duplicate indexes. We had some old queries to do this, but they tended to produce a lot of false positives, and in a database with over 2000 indexes that wasn't going to cut it. So I rewrote those queries to make them a bit more intelligent and discriminating, and to supply more information to the user on which to base decisions about whether to drop an index.

Here's the first query, which selects only indexes which have exactly the same columns. Let me explain the columns of output it produces:

schema_name, table_name, index_name: the obvious
index_cols: a comma-delimited list of index columns
indexdef: a CREATE statement for how the index was created, per pg_indexes view
index_scans: the number of scans on this index per pg_stat_user_indexes

Now, go run in on your own databases. I'll wait.

So, you probably noticed that we still get some false positives, yes? That's because an index can have all the same columns but still be different. For example, it could use varchar_pattern_ops, GiST, or be a partial index. However, we want to see those because often they are functionally duplicates of other indexes even though they are not exactly the same. For example, you probably don't need both an index on ( status WHERE cancelled is null ) and on ( status ).

What about indexes which contain all of the columns of another index, plus some more? Like if you have one index on (id, name) you probably don't need another index on just (id). Well, here's a query to find partial matches.

This second query looks for indexes where one index contains all of the same columns as a second index, plus some more, and they both share the same first column. While a lot of these indexes might not actually be duplicates, a lot of them will be.

Obviously, you could come up with other variations on this, for example searching for all multicolumn indexes with the same columns in a different order, or indexes with the same first two columns but others different. To create your own variations, the key is to edit the filter criteria contained in this clause:

WHERE EXISTS ( SELECT 1
    FROM pg_index as ind2
    WHERE ind.indrelid = ind2.indrelid
    AND ( ind.indkey @> ind2.indkey
     OR ind.indkey <@ ind2.indkey )
    AND ind.indkey[0] = ind2.indkey[0]
    AND ind.indkey <> ind2.indkey
    AND ind.indexrelid <> ind2.indexrelid
)

... and change it to figure out the factors which give you the most real duplicates without missing anything.

Happy duplicate-hunting!

4 comments:

a-altairSeptember 18, 2014 at 10:42 PM
Nice +1
Full Query For Me Is:
SELECT *
FROM pg_index AS ind
JOIN pg_class c ON ind.indexrelid = c.oid
WHERE EXISTS ( SELECT 1
FROM pg_index AS ind2
WHERE ind.indrelid = ind2.indrelid
AND ( ind.indkey @> ind2.indkey
OR ind.indkey <@ ind2.indkey )
AND ind.indkey[0] = ind2.indkey[0]
AND ind.indkey <> ind2.indkey
AND ind.indexrelid <> ind2.indexrelid
)
Mr. MuskratSeptember 19, 2014 at 10:02 AM
On 9.3 I had to add some casts to the second query:

AND ( ind.indkey::int[] @> ind2.indkey::int[]
OR ind.indkey::int[] <@ ind2.indkey::int[] )

Thursday, September 18, 2014

Finding Duplicate Indexes

4 comments: