Thursday, August 14, 2014

More on 9.4 percentiles

Some of you may be following the discussion on JSONB compression on -hackers, which is the remaining major issue for 9.4.  As part of this, I needed to run some statistics on average column sizes for JSON vs JSONB tables. This makes a good example of how incredibly useful WITHIN GROUP will be in 9.4.

First, I created two versions of the data, one called "jsonic" which has text-JSON data, and one called "jsonbish" which has the JSONB version.  Then I ran some comparisons using pg_column_size on the JSON columns.

with colsizes_b as (
    select pg_column_size(jsonbish.jcol) as colsize
    from jsonbish
),
distrib_b as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_b
),
colsizes_j as (
    select pg_column_size(jsonic.jcol) as colsize
    from jsonic
),
distrib_j as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_j
)
select 'json' as thetype,
    colsize_distribution from distrib_j
union all
select 'jsonb',
    colsize_distribution
    from distrib_b;


So I'm taking the column size of each row, then sorting them by size, and then doing a percentile distribution using the new WITHIN GROUP.  There's no group there because the group is actually the whole table.  I've chosen the usual box plot percentages: minimum, 25%, median, 75%, and maximum.

And the results:

 thetype |    colsize_distribution   
---------+----------------------------
 json    | {1741,1767,1854,1904,2292}
 jsonb   | {3551,5866,5910,5958,6168}


... which demonstrates that we do indeed have a major issue with compressability.




No comments:

Post a Comment