Version 9.4 includes a number of new aggregate, array and set operations which make it vastly easier and faster to do statistical summaries and analysis. Most of these were contributed by Andrew Gierth, including the two I'm going to use below, FILTER and WITHIN GROUP. I'm also going to use MATERIALIZED VIEWS, developed by Kevin Grittner. First, though, I need to import the data. So I downloaded the survey results as a CSV, and created a table for them in PostgreSQL and loaded it up:
CREATE TABLE raw_survey (
ts timestamptz,
prf integer,
moreways integer,
devops integer,
moresql integer,
yesql integer,
);
\copy raw_survey from 'slogans.csv' with csv header
Now, Google's column-per-question format isn't very friendly to analysis and comparison; I want a more vertical orientation. So I create one as a MatView. This means that if I reimport the data in the future, or weed out obvious ballot-box stuffing, I just need to refresh it:
CREATE MATERIALIZED VIEW slogans AS
SELECT 'Performance, Reliability, Flexibility' as slogan, prf as vote
FROM raw_survey
UNION ALL
SELECT 'More Ways to Database', moreways
FROM raw_survey
UNION ALL
SELECT 'A Database for Dev and Ops', devops
FROM raw_survey
UNION ALL
SELECT 'More Than SQL', moresql
FROM raw_survey
UNION ALL
SELECT 'NoSQL + YeSQL = PostgreSQL', yesql
FROM raw_survey;
Now, for some statistics. A total or average is easy, but it's not statistically sound. A median is a much better statistic. I also want to know the balance of people who hated a slogan (1) vs. loved it and put it first (5). So, some of the new aggregates.
In the past, I've retrieved medians by either using SciPy inside PL/Python, or by doing some elaborate calculations on windowing rank. No more. Now I can do a simple one-line median using WITHIN GROUP. WITHIN GROUP is a lot like a windowing aggregate, except that it returns a single summary aggregate. Shipping with version 9.4 are several such aggregates, including percentile_cont() which is one of three functions which allow you to get the value at the stated percent of a sorted group: in this case, 0.5 to get a median. Like so:
SELECT slogan,
percentile_cont(0.5) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;
slogan | percentile_cont |
---|---|
A Database for Dev and Ops | 3 |
More Than SQL | 3 |
More Ways to Database | 3 |
NoSQL + YeSQL = PostgreSQL | 3 |
Performance, Reliability, Flexibility | 4 |
"Performance, Reliability, Flexibility" is taking a clear lead here. Incidentally, percentile_cont() can take an array of values in order to give you a full box (remember, every time you say "big data" without drawing a box plot, God kills a kitten):
SELECT slogan,
percentile_cont(ARRAY[0.1,0.25,0.5,0.75,0.9]) WITHIN GROUP (ORDER BY vote)
FROM slogans
GROUP BY slogan;
slogan | percentile_cont |
---|---|
A Database for Dev and Ops | {1,2,3,3,4} |
More Than SQL | {1.4,2,3,4,5} |
More Ways to Database | {1,2,3,4,5} |
NoSQL + YeSQL = PostgreSQL | {1,1,3,4,5} |
Performance, Reliability, Flexibility | {2,3,4,5,5} |
SELECT slogan,
sum(CASE WHEN vote = 1 THEN 1 ELSE 0 END) as hates,
sum(CASE WHEN vote = 5 THEN 1 ELSE 0 END) as loves
FROM slogans
GROUP BY slogan;
Awkward, neh? Well, no more, thanks to the FILTER clause:
SELECT slogan,
count(*) FILTER ( WHERE vote = 1 ) as hates,
count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;
Isn't that way more intuitive and readable? I think it is, anyway. So, let's put it all together:
SELECT slogan,
percentile_cont(0.5) WITHIN GROUP (ORDER BY vote) as median,
count(*) FILTER ( WHERE vote = 1 ) as hates,
count(*) FILTER ( WHERE vote = 5 ) as loves
FROM slogans
GROUP BY slogan;
And the results:
slogan | median | hates | loves |
---|---|---|---|
A Database for Dev and Ops | 3 | 47 | 21 |
More Than SQL | 3 | 32 | 58 |
More Ways to Database | 3 | 39 | 55 |
NoSQL + YeSQL = PostgreSQL | 3 | 81 | 58 |
Performance, Reliability, Flexibility | 4 | 11 | 118 |
And there we have it: "Performance, Reliability, Flexibility" is the winning theme idea for 9.4. It wins on median, and on hates vs. loves counts.
Congratulations Korry Douglas; I'll contact you about shipping your Chelnik. Note that the theme will be workshopped a little bit to fit in the structure of the final 9.4 release announcement (i.e. we may change it slightly to match the sections of the actual press release), but we're going with that general idea now.
No comments:
Post a Comment