So the real test of the different tagging data structures and types is how they perform. The tests below were run on a 10-million-document database on an AWS m3.large virtual server. Importantly, the total database size is around 7GB on a machine with 3.75GB of RAM: I've deliberately sized things so that none of the tables will fit into shared_buffers.
Our first check was for size, including indexes. All other things being equal, a smaller table with smaller indexes is faster to query, and leaves more RAM free for caching other data. So smaller is better.
tagtest=# select relname,
            pg_size_pretty(pg_total_relation_size(relid))
          from pg_stat_user_tables;
    relname     | pg_size_pretty
----------------+----------------
 tags           | 96 kB
 doc_tags_id    | 1551 MB
 doc_tags_text  | 2106 MB
 doc_tags_json  | 1285 MB
 doc_tags_array | 1036 MB
How about a graph of that?
The text array is the clear winner here, at half the size of the default text-tag option, and both of the advanced data types are smaller than either of the "normalized" versions. This is largely because of PostgreSQL's per-row overhead of 24 bytes: the normalized tables need one row per document-tag pairing, while JSONB and arrays store one row per document, so that overhead really adds up.
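For reference, here's roughly what the four layouts look like. This is a sketch assuming the definitions from Part 1; the exact column names are my guess:

-- normalized with surrogate keys: one row per document-tag pair,
-- plus a small lookup table of tag names
create table tags ( tag_id serial primary key, tag text unique );
create table doc_tags_id ( doc_id int, tag_id int );

-- normalized with the tag text inline: one row per document-tag pair
create table doc_tags_text ( doc_id int, tag text );

-- JSONB and array: one row per document, however many tags it has
create table doc_tags_json ( doc_id int, tags jsonb );
create table doc_tags_array ( doc_id int, tags text[] );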
For our first timing test, we compare the time required to retrieve all of the tags for a single document, which is probably the most common single operation. In this case I tested both documents with 1 tag (75% of all documents) and documents with 9 tags (< 1% of documents).
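Sketched out against the table layouts above (the doc_id value is arbitrary), the lookups look something like this:

-- IDs: requires a join against the tags lookup table
select t.tag
from doc_tags_id dt
join tags t on t.tag_id = dt.tag_id
where dt.doc_id = 12345;

-- text: no join, but still one row per tag
select tag from doc_tags_text where doc_id = 12345;

-- JSONB and array: a single row holds all of the document's tags
select tags from doc_tags_json where doc_id = 12345;
select tags from doc_tags_array where doc_id = 12345;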
These timings are all pretty close, except for the IDs method. That's because the doc_tags_id table needs to join against the tags table to look up any tag, which is a major drawback of the surrogate key approach; we'll see that penalty in other tests too. Also notice that both the IDs and the text approaches take more time the more tags a document has, whereas JSONB and arrays take constant time, because each of them looks up just one row regardless.
The next test simulates looking up paginated search results when searching by tag. This comes in three parts: first, we search for a common tag and retrieve the first "page" of 25 results. Then we search for the 11th page of 25 results, since high offsets can sometimes trigger very different query plans and are a common issue in search workloads. The queries for JSONB and arrays use the "contains" operator to do a GIN index search, and look like this:
select doc_id
from doc_tags_json
where tags @> '["beta"]'::jsonb
order by doc_id limit 25 offset 0;
select doc_id
from doc_tags_array
where tags @> ARRAY['math']
order by doc_id limit 25 offset 0;
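The 11th-page variant is identical except for the offset (pages 1 through 10 cover the first 250 rows); for example, for the array version:

select doc_id
from doc_tags_array
where tags @> ARRAY['math']
order by doc_id limit 25 offset 250;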
Finally, we search for an uncommon tag, where getting 25 results requires checking an entire index or the base table.
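The query is the same shape as before, just with a rare tag (the tag name here is illustrative):

select doc_id
from doc_tags_json
where tags @> '["rare_tag"]'::jsonb
order by doc_id limit 25 offset 0;

For the JSONB version, the resulting plan starts with: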
Index Scan using doc_tags_json_doc_id_idx on doc_tags_json
Why is it using the doc_id index when the JSONB index is available? Well, when we implemented JSONB for 9.4, one of the unimplemented features was having a stats model for containment searches. As we've done with new data types in the past, we simply use "contsel" as our selectivity estimate, which means that all "contains" queries on JSONB fields estimate that 1% of rows will be returned. This causes the planner to mis-estimate that it can get the first 25 rows faster by searching the doc_id index and applying the JSONB condition as a filter.
I can force the planner to use the JSONB index for rare tags, but then it will use that index for common tags as well. That's bad; the doc_id index is actually much faster for common tags. So I can have fast searches for rare tags, or fast searches for common tags, but not both. Looks like that knocks JSONB out of the running.
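If you want to reproduce that experiment, here's one way to do the forcing; this is a sketch, not necessarily how the tests above were run. GIN indexes are only used via bitmap scans, so disabling plain index scans pushes the planner off the doc_id index and onto the JSONB index, and EXPLAIN ANALYZE lets you compare the planner's flat containment estimate against the actual row count:

begin;
-- disable plain index scans for this transaction only; bitmap scans
-- (the only way GIN indexes are used) remain available, so the planner
-- now picks the GIN index on tags
set local enable_indexscan = off;
explain analyze
select doc_id
from doc_tags_json
where tags @> '["rare_tag"]'::jsonb  -- tag name illustrative
order by doc_id limit 25 offset 0;
rollback;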
Continued in Part 3.