Friday, August 23, 2013
This upcoming Tuesday, the 27th, SFPUG will have live streaming video of Chris Bohn from Etsy talking about how he uses PostgreSQL and Vertica together to analyze Etsy's hundreds of gigabytes of customer traffic. That's barring technical difficulties with the video or internet, of course.
The video will be on the usual SFPUG Video Channel. It is likely to start around 7:15 PM PDT. Questions from the internet will be taken on the attached chat channel.
For those in San Francisco, this event will be held at Etsy's new downtown SF offices, and Etsy is sponsoring a Tacolicious taco bar. Of course, the event is already full, but you can always join the waiting list.
In other related events, sfPython will be talking about PostgreSQL performance and DjangoSF will be talking about multicolumn joins, both on Wednesday the 28th. I'll be at DjangoSF, giving my "5 ways to Crash Postgres" lightning talk.
Thursday, August 8, 2013
Another reason to set pg_hba.conf restrictively
pg_hba.conf can be a pain in the neck. You're expecting some burst traffic, you deploy 8 new virtual appservers, fire them up, and ... they can't connect to Postgres. Gah!
That kind of experience makes it tempting to set everything to:
host all all 0.0.0.0/0 md5
But this month I ran into yet another reason why you shouldn't. Everyone thinks of malefactors as the main reason to keep tight restrictions on which hosts can connect via pg_hba.conf, but remember that accidents are far more common than malice. As was the case this time.
I set up a benchmark test for a client which involved replaying 8 hours of activity logs against a test copy of their database. What they didn't tell me (because they didn't know due to poor internal documentation) was that their database used DBlink to write data to another database several times a minute.
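For illustration only (the connection string, table name, and values below are invented), a dblink call that writes to another database looks something like this; dblink_exec() comes from the dblink extension:
-- hypothetical example; requires the dblink extension (CREATE EXTENSION dblink)
SELECT dblink_exec(
    'host=reporting-db.internal dbname=reporting user=app',
    'INSERT INTO traffic_rollup (rollup_time, hits) VALUES (now(), 1200)'
);
A call like that inside a function or trigger means that anything replaying activity against a copy of the database will also try to reach the other server.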
Thank goodness they had exercised good pg_hba.conf practices! All I got was a bunch of DBlink errors, instead of scribbling all over production data. Whew!
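I don't know exactly what their pg_hba.conf looked like, but restrictive entries generally follow this pattern (database names, users, and addresses below are invented), granting each database and user only the network range that actually needs it:
# hypothetical example entries (TYPE  DATABASE  USER  ADDRESS  METHOD)
host    proddb       appuser     10.1.2.0/24     md5
host    reporting    etl_user    10.1.3.15/32    md5
host    all          all         127.0.0.1/32    md5
With entries like these, an unexpected server replaying production traffic gets rejected at connection time instead of quietly writing where it shouldn't.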
Tuesday, August 6, 2013
Speaker Boot Camp at LinuxCon
Are you technically expert but locutionarily inept? Are you giving your first or second talk this year? Do you present frequently, but don't feel like you're "connecting" with your audience? Are you going to LinuxCon in New Orleans?
Then you should attend the Give a Great Tech Talk workshop the evening before LinuxCon. It's a sort of "speaker boot camp": if you are speaking at LinuxCon, or, better yet, are attending LinuxCon and planning to speak at other conferences in 2013-2014, you should consider attending.
So that speakers can get coaching before the conference begins, the session will run from 5 PM to 8:30 PM on Sunday the 15th (concurrent with the Meet and Greet) at the Hyatt. The Linux Foundation will provide food. Join us and learn to deliver a better talk!
Monday, August 5, 2013
Fancy SQL Monday: generate_series() and cross joins
So, here's the situation: you want to create a report which shows totals of registrations by day and by category, and shows zeros for each day and category you don't have any registrations for. Your first attempt looks like this:
SELECT date_trunc('day',registered_on) as registration_date,
       category,
       count(*) as no_registrations,
       sum(fee) as total_earned
FROM registrations
WHERE registered_on BETWEEN '2013-07-01' and '2013-08-01'
GROUP BY date_trunc('day', registered_on), category
ORDER BY date_trunc('day', registered_on), category;
However, the results of this query have a problem: if there were no registrations in a particular category on a particular day, it doesn't display any results. You want it to display a zero. So, how do you do that?
Well, what you want is to multiply the set of days in the month with the set of categories, forming a matrix where every category/day is represented. The SQL syntax to say "give me all of X multiplied by all of Y" is a CROSS JOIN. Normally, you don't want to do a cross join because it could result in millions or billions of rows, which would eat all the memory on your server. But this time, you want one.
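As a tiny illustration (not part of the report query, and the values are made up), crossing two rows with two rows yields every combination:
-- toy example: every row of a paired with every row of b
SELECT a.x, b.y
FROM (VALUES (1), (2)) AS a(x)
     CROSS JOIN (VALUES ('red'), ('blue')) AS b(y);
-- four rows, one for each (x, y) combination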
To do a cross join, though, you need the set of all days. That's easy if you have a calendar table, but if you don't ... generate_series() to the rescue! generate_series() is a terrific utility which creates one row for each step in a series defined by a start, an end, and an interval. Like so:
generate_series('2013-07-01','2013-08-01',interval '1 day')
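If you want to see that series on its own, you can run it standalone; the cast to date is my addition (not needed for the report query) and just makes the output easier to read:
-- standalone illustration of the same series used below
SELECT gs::date AS reg_day
FROM generate_series('2013-07-01','2013-08-01', interval '1 day') AS gs;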
Putting that together with the categories in a cross join, we get this:
SELECT category, reg_day
FROM categories
     CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
          AS gs(reg_day);
That gives us our grid. Now we're going to use a LEFT JOIN to attach it to the set of registrations, so that we get zeros for days without registrations. The easiest way to do this is to wrap the above query in a WITH clause, otherwise known as a CTE.
WITH cat_days AS (
    SELECT category, reg_day
    FROM categories
         CROSS JOIN generate_series('2013-07-01','2013-08-01',interval '1 day')
              AS gs(reg_day)
)
SELECT cat_days.reg_day,
       cat_days.category,
       count(registrations.registration_id) as registrations,
       coalesce(sum(fee),0.0) as total_earned
FROM cat_days
     LEFT JOIN registrations ON
          cat_days.category = registrations.category
          AND cat_days.reg_day <= registrations.registered_on
          AND ( cat_days.reg_day + interval '1 day') > registrations.registered_on
GROUP BY reg_day, cat_days.category
ORDER BY reg_day, cat_days.category;
So: CROSS JOINs, WITH clauses, and LEFT JOINs can actually be useful. Enjoy!
Thursday, August 1, 2013
9.4 Commitfest 1 Wrap-up
The first CommitFest of the 9.4 development cycle involved an unprecedented number of patch submissions. As of today, 9.4CF1 is still open because of one patch which apparently only Tom Lane can commit. However, everything else in a brutal CommitFest has been dealt with, and it's 16 days after the CF was supposed to finish, so I think I can write the wrap-up now.
First, let's talk numbers: this CF started with 106 patches, peaked at 108, and finished with 102 (6 were moved to September). That's over 50% more patches than we've ever had for the first CommitFest of a development cycle, and more than we've had in any single CF before. Last year, CF1 involved only 59 patches. So if you are a committer and you feel exhausted, that's why. Some graphs:
As you can see, this year represents a big jump in the number of patches submitted to the first CommitFest, and if it's representative of the rest of the year, we're all going to spend a lot more time reviewing patches.
Why so many patches? Well, I can think of a few reasons:
- contributions from the staffs of SalesForce and Huawei
- beginning of 2ndQuadrant's work on streaming logical replication
- a bunch of regression test patches
- some new major contributors, including one from Google Summer of Code
- more people submitting WIP patches to get spec feedback
So what did we get out of all this patch activity? Some highlight features:
- new pgbench options
- ALTER TABLE ... ALTER CONSTRAINT
- improvements in compression of BLOBs and large rows
- a bunch more regression tests
- REFRESH MATERIALIZED VIEW CONCURRENTLY
- create background workers at runtime
- FILTER option for aggregates (see the sketch after this list)
- Part 1 of streaming logical replication
- WITH ORDINALITY to supply "row numbers" for set-returning functions
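To give a flavor of two of those, here is a minimal sketch of the new aggregate FILTER clause and of WITH ORDINALITY; the table and values are invented, but both pieces of syntax are new in 9.4:
-- hypothetical table; FILTER attaches a per-aggregate WHERE condition
SELECT count(*) FILTER (WHERE fee > 0) as paid_registrations,
       count(*) as all_registrations
FROM registrations;
-- WITH ORDINALITY adds a row-number column to a set-returning function
SELECT t.val, t.idx
FROM unnest(ARRAY['red','green','blue']) WITH ORDINALITY AS t(val, idx);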
So, what are we going to do to handle more patches? I have some thoughts on that in my next blog post ...