One of PostgreSQL's advantages over many other SQL databases is letting users modify database objects with a minimum of locking and fuss, and do it in a transaction so it can be rolled back. However, a minimum of locking isn't no locking, so one still has to plan for getting a lock to push through the database migration.
For example, say I wanted to add two new columns to a table. I already know that by making the columns nullable and not setting a default, I can add them with a metadata-only change which requires only an update to the system catalogs. However, this table change does require an ACCESS EXCLUSIVE lock for a few milliseconds to go through, which can be tricky. An ACCESS EXCLUSIVE lock blocks everything, including reads and autovacuum, which can be a bit messy if your timing is bad: you can end up waiting for that lock behind a long-running query, and then all the other database traffic piles up behind you.
Now, if we had ALTER TABLE CONCURRENTLY it would take care of this for you. But we don't (yet), so you need to fake it with a DO script. Here's an example DO script which polls for a lock every 2 seconds until it can get one, then pushes through the table change:
DO $f$
DECLARE
    ntries INT := 10;
    sleepytime INT := 2;
BEGIN
    FOR get_lock IN 1 .. ntries LOOP
        BEGIN
            -- grab the lock without waiting; errors out if it's not available
            LOCK TABLE mytable
                IN ACCESS EXCLUSIVE MODE NOWAIT;
            ALTER TABLE mytable
                ADD COLUMN new_col1 INT,
                ADD COLUMN new_col2 VARCHAR;
            RAISE INFO 'table updated';
            RETURN;
        EXCEPTION
            WHEN lock_not_available THEN
                -- couldn't get the lock; sleep and try again
                PERFORM pg_sleep(sleepytime);
        END;
    END LOOP;
    RAISE INFO 'unable to obtain lock after % tries', ntries;
END;
$f$;
The idea here is that you keep trying LOCK ... NOWAIT, which will throw an error if it can't get the lock immediately. The script then sleeps for 2 seconds and tries again. If you're using 9.3 or later, you can take an even better approach using lock_timeout:
DO $f$
DECLARE
    ntries INT := 10;
    sleepytime INT := 2;
BEGIN
    SET lock_timeout = '100ms';
    FOR get_lock IN 1 .. ntries LOOP
        BEGIN
            -- wait up to lock_timeout for the lock before erroring out
            LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE;
            ALTER TABLE mytable
                ADD COLUMN a INT,
                ADD COLUMN b INT,
                ADD COLUMN c INT;
            RAISE INFO 'table updated';
            RETURN;
        EXCEPTION
            WHEN lock_not_available THEN
                PERFORM pg_sleep(sleepytime);
        END;
    END LOOP;
    RAISE INFO 'unable to obtain lock after % tries', ntries;
END;
$f$;
The advantage of this is that it doesn't have to get the lock immediately; it waits up to 100ms before giving up. That helps in situations where the table is never completely free of read queries, without risking a serious pile-up of requests.
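If you just want to fail fast and let your migration tooling handle the retries, a minimal sketch using the same lock_timeout idea is to set it locally inside the migration transaction:

BEGIN;
SET LOCAL lock_timeout = '100ms';
-- errors out with lock_not_available if the lock isn't acquired within 100ms;
-- since it's in a transaction, a failure leaves the table untouched
ALTER TABLE mytable
    ADD COLUMN new_col1 INT,
    ADD COLUMN new_col2 VARCHAR;
COMMIT;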
You should be able to adapt this approach to the migrations you actually need to push through.
Monday, August 31, 2015
Friday, August 28, 2015
Stupid Hacks: Dictionary replace function
I write a lot more PL/pgSQL than I'd like to. Not that I don't like SQL, but as a language PL/pgSQL really shows its thrown-together origin; it's like 80's primitive. One thing that PL/pgSQL lacks is good string manipulation tools, which is particularly tragic given that the #1 thing to do in PL/pgSQL is to generate queries from parameters and run EXECUTE.
Postgres has two built-in ways to do string substitution: concatenation and format(). Both have drawbacks. Let me give you an example:
EXECUTE 'SELECT ' || col1 || ', ' || col2 || ' FROM ' || userschema ||
'.accounts WHERE ' || filterclause || ' ORDER BY ' || col1 || ',' || col2;
EXECUTE format('SELECT %s, %s FROM %s.accounts WHERE %s ORDER BY %s, %s', col1, col2, userschema, filterclause, col1, col2);
You can see the problem here. Both formats are hard to read and hard to maintain. Python and Perl have a good fix for this: dictionary/hash-based string substitution, where you can swap in the dictionary keys for the values. So I wrote up a quick hack to do this in PL/pgSQL.
Here's the dict-replace function.
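In rough outline, it's just a loop over the dictionary keys (a minimal sketch, assuming the dictionary is passed in as jsonb):

CREATE OR REPLACE FUNCTION replace_vars ( query TEXT, vardict JSONB )
RETURNS TEXT
LANGUAGE plpgsql
AS $f$
DECLARE
    var RECORD;
BEGIN
    -- swap each ${key} placeholder for its value from the dictionary
    FOR var IN SELECT key, value FROM jsonb_each_text(vardict) LOOP
        query := replace(query, '${' || var.key || '}', var.value);
    END LOOP;
    RETURN query;
END;
$f$;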
Using it, you'd replace the above with:
EXECUTE replace_vars('SELECT ${col1}, ${col2} FROM ${userschema}.accounts
WHERE ${filterclause} ORDER BY ${col1}, ${col2}', vardict);
Of course, you need to first set up the vardict as a JSON value, in the form:
vardict := '{ "col1" : "username", "col2" : "branchname", ... }'
Still, much more readable, eh? No refcounting, no repeating variables, no string breaks. I used Bash's variable substitution syntax of ${var} because it seemed like the thing least likely to conflict with user text, unlike anything involving %.
There are some caveats, though: it's not performant, and you could probably find strings which will break it, particularly if you're swapping in JSON values. It's not SQL-injection safe, so don't use it for anything that handles user input. You still have to do your own quoting of strings. And if you have access to PL/Perl or PL/Python you don't need this nonsense.
But for the minimal case, it should help.
Friday, August 14, 2015
Most of the talks chosen for pgConfSV
Since we're still working on the schedule, I wanted to give folks a preview of the talks we've picked for pgConfSV. We still have a few more talks to pick and the tutorials to finalize. Regardless, it's looking like a really exciting lineup!
We'll have folks from Heap, Wanelo, TreasureData, Rackspace, Turnitin.com, EMC, Joyent, Square, and more. The sad part will be missing three talks in each timeslot.
Note that this set of talks is not final; we're still swapping a couple of things, and some speakers have not confirmed.
pgConfSV is November 17th and 18th in South San Francisco. If you're going to AnsibleCon, that's the day after, so you can do both!
Wednesday, August 12, 2015
Two Great Tastes That Taste Great Together: cstore + Pipeline
cstore_fdw, the column-store extension for PostgreSQL by CitusData, is a really good way to add compressed storage to your application for archival data and for analytic data intended to be aggregated. Because it's a column store, though, cstore wants new data added in batches, the bigger the better. This means that you need to find some way to batch up incoming data, preferably one able to accept a lot of new data rapidly.
This is where PipelineDB comes in. Since PipelineDB is open source now, and based on 9.4.4, I can add extensions to it, including cstore_fdw. I've done so with the PipelineDB 0.7.7a Docker container, so if you use Docker it's simply available.
As a demonstration of this, I'll set up some fake clickstream data, archived to a cstore table hourly. First, I wrote a quick Python script to generate it continuously and push it to a Pipeline stream.
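The generator just does ordinary inserts into the stream (created just below); as a rough stand-in in plain SQL, with made-up ID ranges and assuming the stream accepts expressions in INSERT ... VALUES the way a regular table does, you could run something like this in a loop:

-- push a small batch of fake clicks into the Pipeline stream
INSERT INTO seenstream ( user_id, page_id, ts )
VALUES ( (random() * 100)::int + 1, (random() * 100)::int + 1, clock_timestamp() ),
       ( (random() * 100)::int + 1, (random() * 100)::int + 1, clock_timestamp() ),
       ( (random() * 100)::int + 1, (random() * 100)::int + 1, clock_timestamp() );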
Then I created the stream and continuous view in PipelineDB:
CREATE STREAM seenstream ( user_id int, page_id int, ts timestamptz );
CREATE CONTINUOUS VIEW seenfeed AS
SELECT user_id, page_id, ts
FROM seenstream
WHERE arrival_timestamp > ( clock_timestamp() - interval '90 minutes' );
Next, I created the cstore table:
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
CREATE FOREIGN TABLE seen (
user_id int,
page_id int,
ts timestamptz
)
SERVER cstore_server
OPTIONS (compression 'pglz');
Finally, I added a simple script which ran the following query once per hour:
INSERT INTO seen
SELECT user_id, page_id, ts
FROM seenfeed
WHERE ts >= ( now() - interval '1 hour' )
ORDER BY user_id, page_id, ts;
... and then I started everything in motion.
Now, for cstore the ORDER BY is vitally important; it determines how the blocks you create for the column store are organized. In this particular case, I knew that I would be doing more analysis by user. But most users would do ORDER BY ts instead.
After a few hours, I checked back, and now I can run some analytical queries on the cstore table. For example, user activity:
select user_id, count(distinct page_id), max(ts)
from seen group by user_id;
user_id | count | max
--------+-------+-------------------------------
1 | 92 | 2015-08-11 22:59:51.504777+00
2 | 86 | 2015-08-11 22:54:09.77318+00
3 | 89 | 2015-08-11 22:59:14.574697+00
page activity:
select page_id, count(distinct user_id) as duv, max(ts) as last
from seen group by page_id order by duv desc;
page_id | duv | last
--------+-----+-------------------------------
71 | 96 | 2015-08-11 22:59:38.690743+00
99 | 96 | 2015-08-11 22:58:43.004618+00
4 | 96 | 2015-08-11 22:57:45.95007+00
... and more.
Now, in a production environment, you'll want to do more than that. Depending on traffic, you might batch inserts daily instead of hourly. You might want to use several continuous views to store different summaries of the data instead of raw data. But the above should be enough to show you how well these two tools go together.
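For example, a summarizing continuous view (hypothetical, not part of my demo setup) might roll clicks up per user and hour rather than keeping every raw row:

CREATE CONTINUOUS VIEW seen_hourly AS
SELECT user_id,
       date_trunc('hour', ts) AS seen_hour,
       count(*) AS views
FROM seenstream
GROUP BY user_id, date_trunc('hour', ts);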
Tuesday, August 11, 2015
Please, security test our code!
Since Oracle was so nice as to remind everyone what software security is really like with closed-source software, I wanted to remind people how finding and reporting security issues works in PostgreSQL:
- Feel free to "reverse engineer" the code. In fact, here it is on GitHub if you want to scrutinize it.
- We generally credit security researchers who find real security holes (with limitations for duplicates, etc.). Over the last few years, some of the most critical bugs in PostgreSQL were found by professional security researchers doing things like fuzz testing.
- If you think you've found a security issue, please report it to security@postgresql.org. If it turns out to be a non-issue, we'll tell you, and you can report it as a regular bug.
- Be prepared to answer questions about your report. We showed you our code, you can show us yours.
And please ... update your servers! We send out those incremental updates for a reason, and often as not, there are security patches. More information on our Security Page.
Note: the above is my personal opinion and is not the opinion of The PostgreSQL Project or any other organization.
Friday, August 7, 2015
Understanding Unintuitive TABLESAMPLE Results
In 9.5 Alpha 2:
create table mil ( id int, val text );
insert into mil select i, i::text || '-val' from
generate_series(1,1000000) as gs(i);
analyze;
postgres=# select * from mil tablesample system ( 0.04 );
id | val
----+-----
(0 rows)
Huh, what?
So, what I didn't understand here is the way rows are selected for TABLESAMPLE SYSTEM. Since SYSTEM is page-based, I thought that it computed the requested % of pages and then picked that many pages at random. Since this table has exactly 185 rows per page, that should return 370 rows (2 pages) every time. But that's not what happened. In fact, running the following query repeatedly, I got a variety of counts:
SELECT count(*) FROM (select * from mil tablesample system ( 0.04 ) ) as a;
370
370
370
555
555
185
0
925
925? 0? What the hey?
What's really happening is that pages for SYSTEM are selected a different way: each page is checked against the sampling probability individually. This means that, while on average you'll get the number of pages you're expecting, the actual number will vary from request to request quite a bit.
This also means that SYSTEM is a bad choice for really small sample sizes, like 0.01%. BERNOULLI is better, because it checks row by row, and therefore the size of the returned sample is much more predictable. It will still have a bit of variation, though; in my testing, +/- 10% on a few hundred rows.
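For comparison, a sketch of the same check with BERNOULLI (the expected value is around 400 rows on this million-row table):

SELECT count(*)
FROM ( SELECT * FROM mil TABLESAMPLE BERNOULLI ( 0.04 ) ) AS a;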
Gulcin Yildirim has a great explanation of this on the 2nd Quadrant blog.
So, what if you need TABLESAMPLE to get a very specific number of rows for you? Well, that's why Petr Jelinek wrote the optional (loadable) SYSTEM_ROWS sampling method. This can be loaded as the tsm_system_rows extension in 9.5.
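Usage looks like this (a quick sketch against the same test table):

CREATE EXTENSION tsm_system_rows;
-- returns exactly 500 rows, assuming the table has at least that many
SELECT * FROM mil TABLESAMPLE SYSTEM_ROWS ( 500 );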
Hopefully that helps folks be less confused than me about how TABLESAMPLE works.
Thursday, August 6, 2015
Time to test 9.5 Alpha 2
PostgreSQL 9.5 Alpha 2 is out. We're still calling this one an Alpha instead of a Beta, because major replumbing is still underway. However, most of the issues reported with Alpha 1 have been fixed. A bunch of really useful bug reports came from regular users who tested out the alpha with their own workloads or tests, so please keep up the good work!
And, if you didn't test Alpha 1, well, we still need your help testing Alpha 2. Here are some ideas for what to test.
Again, to make testing easier for folks, I've created a Docker image of Alpha 2. If you already run Docker, simply run the jberkus/postgres95-test image and you can test our features without needing to worry about how to install Postgres. Specific notes on this test image are on the wiki.
As part of updating to alpha2, I managed to shrink the image by about 200MB by getting rid of a lot of build code. However, I did not remove 150MB of build and compile tools because I'm expecting testers to want to install additional extensions or tools in the image.