Tuesday, August 26, 2014
Portland schedule, revised
Just so folks know: due to insurmountable issues, I will not make it to DjangoCon this year. My apologies to anyone who was planning on attending my sessions there. However, I will be at both the Portland pgDay and FOSS4G. See you in Portland, just later.
Monday, August 25, 2014
The great JSONB tradeoff
One thing we're waiting on to get 9.4 out the door is a decision on JSONB storage. Currently under debate are patches to the JSONB format which would, for users with large numbers (as in 150+) of top-level keys, cause JSONB to take up to 60% less space on disk in exchange for up to an 80% increase (that is, a slowdown) in response times on key extraction. We need your feedback as potential users of the feature to make a good decision about what's best.
At stake is the question of whether the key locations should be stored as successive offsets or as lengths. The advantage of the former is that it speeds up extraction, requiring only two lookups to locate a top-level key regardless of how many keys you have. The disadvantage is that a series of increasing offsets isn't very compressible.
For JSONB fields which consist of a few top-level keys and large values, this question makes no difference at all. However, for the worst case, 150+ top-level keys with short (under 10 bytes) values, the difference is quite dramatic. For example, I constructed a test with 183 keys, of which 175 were NUMERIC. I checked both table size and time to extract key #160 from 100,000 rows:
This is a "worst case" scenario for the difference between these two designs. Note that the extraction slowdown affects only retrieving the value to the client; it does not affect index lookups of JSONB rows, which are speedy no matter which patch is employed.
However, we're undecided on this "fix" because we don't know a couple things:
At stake is the question of whether the key locations should be saved as successive offsets, or as lengths. The advantage of the former is that it speeds up extraction by making it only two lookups to locate a top-level key regardless of the number of keys you have. The disadvantage is that a series of increasing offsets isn't very compressible.
For JSONB fields which consist of a few top-level keys and large values, this question makes no difference at all. However, for the worst case ... 150+ top-level keys with short (under 10 bytes) values, the difference is quite dramatic. For example, I constructed a test with 183 keys, of which 175 were NUMERIC. I checked both table size and time to extract key #160 from 100,000 rows:
 Metric                       | Offsets | Lengths
------------------------------+---------+---------
 Table size                   | 1147 MB | 541 MB
 Extract key #160 (100K rows) | 2.8s    | 5.2s
This is a "worst case" scenario for the difference between these two designs. Note that the extraction slowdown affects only retrieving the value to the client; it does not affect index lookups of JSONB rows, which are speedy no matter which patch is employed.
However, we're undecided on this "fix" because we don't know a couple of things:
- How likely are users to have 150+ top-level keys in one field (or keys on any single level together) with short values?
- Is up to 60% space savings in return for up to 80% extraction slowdown a good tradeoff? Or a bad one?
Thursday, August 14, 2014
More on 9.4 percentiles
Some of you may be following the discussion on JSONB compression on -hackers, which is the remaining major issue for 9.4. As part of this, I needed to run some statistics on average column sizes for JSON vs JSONB tables. This makes a good example of how incredibly useful WITHIN GROUP will be in 9.4.
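For anyone who hasn't tried the new ordered-set aggregate syntax yet, here's a minimal, self-contained illustration (toy values, not the data below):

-- percentile_cont is an ordered-set aggregate: the WITHIN GROUP
-- clause supplies the sort order the percentile is computed over.
select percentile_cont(0.5) within group (order by x) as median
from (values (1), (2), (3), (10)) as t(x);
-- median = 2.5 (continuous interpolation between 2 and 3)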
First, I created two versions of the data, one called "jsonic" which has text-JSON data, and one called "jsonbish" which has the JSONB version. Then I ran some comparisons using pg_column_size on the JSON columns.
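The setup looked something like this. It's a sketch: "rawdata" and its "doc" column are hypothetical stand-ins for wherever your source JSON lives.

-- copy the same source documents into a text-JSON table and
-- a JSONB table so their stored sizes can be compared
create table jsonic (jcol json);
create table jsonbish (jcol jsonb);

insert into jsonic select doc::json from rawdata;
insert into jsonbish select doc::jsonb from rawdata;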
-- colsizes_*: per-row storage size of each JSON document;
-- distrib_*: five-number summary of those sizes.
with colsizes_b as (
    select pg_column_size(jsonbish.jcol) as colsize
    from jsonbish
),
distrib_b as (
    select percentile_cont(array[0, 0.25, 0.5, 0.75, 1])
           within group (order by colsize) as colsize_distribution
    from colsizes_b
),
colsizes_j as (
    select pg_column_size(jsonic.jcol) as colsize
    from jsonic
),
distrib_j as (
    select percentile_cont(array[0, 0.25, 0.5, 0.75, 1])
           within group (order by colsize) as colsize_distribution
    from colsizes_j
)
select 'json' as thetype, colsize_distribution
from distrib_j
union all
select 'jsonb', colsize_distribution
from distrib_b;
So I'm taking the column size of each row, sorting the rows by size, and then computing a percentile distribution using the new WITHIN GROUP. There's no GROUP BY because the group is the whole table. I've chosen the usual box plot percentiles: minimum, 25%, median, 75%, and maximum.
And the results:
thetype | colsize_distribution
---------+----------------------------
json | {1741,1767,1854,1904,2292}
jsonb | {3551,5866,5910,5958,6168}
... which demonstrates that we do indeed have a major issue with compressibility.
Tuesday, August 12, 2014
SFPUG Video on YouTube
TL;DR: SFPUG video available on YouTube, including streaming video tonight. I need help converting old videos and making new ones.
First, we have a few SFPUG videos available on YouTube:
- Wanelo.com presented "12-Step Program for Database Application Performance"
- CitusData presented "cstore_FDW for Big Data"
We will have streaming video for tonight's SFPUG, which will be on Postgres-XL. Join the Google Event to get a link to the streaming video once it starts; expected start time is 7:15PM PST.
This brings me to the third thing, which is that I could use some help with video, in two areas:
- I desperately need someone to help take better-quality videos of SFPUG meetings, with better focus and sound. Right now I'm using a Logitech webcam, and it's just not cutting it. Video will need to be compatible with Google Hangouts, unless you have another plan for broadcasting.
- I have 7 or 8 videos of SFPUG meetings from 2013, including Tom Lane explaining the query planner, in FLV format. I need someone to transcode these to a YouTube format and upload them.
Thursday, August 7, 2014
Call for quotes for the 9.4 release announcement
We are looking for endorsement quotes for the 9.4 release.
Specifically, we want quotes from end-users or ISVs* who have not been featured for a previous release, and who can enthuse about one of the following topics:
- JSONB
- Streaming Changesets/Bi-Directional Replication
- Performance improvements in 9.4
- Something else about Flexibility, Scalability or Performance
(* While we love our PostgreSQL consulting and support companies, endorsements from them aren't news in a PR sense.)