Thursday, July 30, 2015
Texas Trip: DjangoCon and Postgres Open
This September, I will be having a great Texas adventure, and you're invited along.
September 8-10, Austin: DjangoCon.US. I will be presenting "PostgreSQL Performance in 15 Minutes" on the 9th.
September 9th or 10th, Austin: I will speak at AustinPUG. Date, location and exact topic still TBA. Follow the AustinPUG Meetup, or check back here in a week for an update.
September 16-18, Dallas: Postgres Open: I will be doing my Replication 101 tutorial, and then Explaining Explain.
September is a great time to be in Texas: temperatures are reasonable, Texans are friendly, and there's lots of great Mexican food and barbecue. So, register for DjangoCon and/or Postgres Open today and join me!
Oh, and the photo? That's one of the hand-thrown pint cups I'm making as speaker gifts for DjangoCon. Don't you wish you'd submitted a talk now?
Tuesday, July 28, 2015
PipelineDB: streaming Postgres
If you've been following the tech news, you might have noticed that we have a new open source PostgreSQL fork called "PipelineDB". Since I've joined the advisory board of PipelineDB, I thought I'd go over what it is, what it does, and what you'd use it for. If you're not interested in PipelineDB or Postgres forks, you can stop reading now.
PipelineDB is a streaming database version of PostgreSQL. The idea of a streaming database, first introduced in the PostgreSQL fork TelegraphCQ back in 2003, is that queries are run against incoming data before it is stored, as a kind of stream processing with full query support. If the idea of a standard database is "durable data, ephemeral queries", the idea of a streaming database is "durable queries, ephemeral data". This was previously implemented in StreamBase, StreamSQL, and the PostgreSQL fork Truviso. In the Hadoop world, the concept is implemented in Apache SparkSQL.
On a practical level, streaming queries allow you to eliminate a lot of ETL and redundant or temporary storage.
PipelineDB 0.7.7 is 100% PostgreSQL 9.4, plus the ability to create Continuous Views: standing queries which produce different data each time you query them, depending on the incoming stream. The idea is that you create queries which filter and/or summarize the data you're looking for in the stream, and store only the data you want to keep, which can go into regular PostgreSQL tables.
As an example of this, we're going to use PipelineDB to do tag popularity counts on Twitter. Twitter has a nice streaming API, which gives us some interesting stream data to work with. First I spun up a PipelineDB Docker container. Connecting to it, I created the "twitter" database and a static stream called "tweets":
Creating a static stream isn't, strictly speaking, necessary; you can create a Continuous View without one. As a career DBA, though, I get the heebie-jeebies from implied object names. Also, in some future release of PipelineDB, static streams will have performance optimizations, so it's a good idea to get used to creating them now.
docker run pipelinedb/pipelinedb
josh@Radegast:~$ psql -h 172.17.0.88 -p 6543 -U pipeline
Password for user pipeline:
psql (9.4.1, server 9.4.4)
Type "help" for help.
pipeline=# create database twitter;
CREATE DATABASE
pipeline=# \c twitter
twitter=# create stream tweets ( content json );
CREATE STREAM
Then I created a Continuous View which pulls out all identified hashtags from each tweet. To do this, I have to reach deep inside the JSON of the tweet structure and use json_array_elements to expand that into a column. Continuous Views also automatically add a timestamp column called "arrival_timestamp", which is the server timestamp when that particular streaming row showed up. We can use this to create a 1-hour sliding window over the stream by comparing it to clock_timestamp(). Unlike regular views, volatile expressions are allowed in Continuous Views.
CREATE CONTINUOUS VIEW tagstream as
SELECT json_array_elements(content #>
ARRAY['entities','hashtags']) ->> 'text' AS tag
FROM tweets
WHERE arrival_timestamp >
( clock_timestamp() - interval '1 hour' );
This pulls a continuous column of tags which appear in the San Francisco stream.
Then I created a linked Docker container with all of the Python tools I need to use TwitterAPI, and wrote this little script based on a TwitterAPI example. It pulls a stream of tweets with geo turned on and identified as being in the area of San Francisco. Yes, INSERTing into the stream is all that's required for a client to deliver stream data. If you have a high volume of data, it's better to use the COPY interface, if your language supports it.
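To give you an idea of what the client side boils down to, here is roughly the kind of INSERT the script ends up issuing against the stream. The tweet JSON below is a made-up, heavily trimmed example; real tweets from the streaming API carry many more fields:
INSERT INTO tweets ( content )
VALUES ( '{ "text": "Now hiring in SF! #Hiring #SanFrancisco",
            "entities": { "hashtags": [ { "text": "Hiring" },
                                        { "text": "SanFrancisco" } ] } }' );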
Then I started it up, and it started pushing tweets into my stream in PipelineDB. After that, I waited an hour for the stream to populate with an hour's worth of data.
Now, let's do some querying:
twitter=# select * from tagstream limit 5;
tag
------------
Glendale
Healthcare
Job
Jobs
Hiring
How about the 10 most popular tags in the last hour?
twitter=# select tag, count(*) as tag_count from tagstream
          group by tag order by tag_count desc limit 10;
tag | tag_count
---------------+-----------
Hiring | 211
Job | 165
CareerArc | 163
Jobs | 154
job | 138
SanFrancisco | 69
hiring | 60
FaceTimeMeNash | 42
CiscoRocks | 35
IT | 35
I detect a theme here. Namely, one of sponsored tweets by recruiters.
Now, obviously you could do this by storing all of the tweets in an unlogged table, then summarizing them into another table, etc. However, using Continuous Views avoids a bunch of disk-and-time-wasting "store, transform, store again" steps if you're uninterested in the bulk of the data (and tweet metadata is bulky indeed). Further, I can create more Continuous Views based on the same stream, and pull different summary information from it in parallel.
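As a sketch of that last point, a second Continuous View over the same stream could track tweet volume per minute using the same arrival_timestamp column. This one is an untested example, not part of the setup I actually ran:
CREATE CONTINUOUS VIEW tweets_per_minute AS
SELECT date_trunc('minute', arrival_timestamp) AS minute,
       count(*) AS tweet_count
FROM tweets
GROUP BY minute;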
So, there you have it: PipelineDB, the latest addition to the PostgreSQL family. Download it or install it using the Docker container I built for it.
Friday, July 24, 2015
Using Docker to isolate VPN clients
As a consultant, I need to use VPNs to log into many of our client environments. Under one of our data security policies, staff are not allowed to connect to multiple customer VPNs from the same machine at the same time; plus, many VPNs do not permit this anyway. However, this causes some work-efficiency issues whenever we need to run a job for a customer which involves a lot of wait time, such as building an index.
I've used VirtualBox VMs as isolation environments, but VBox consumes a lot of memory and system resources. So I started thinking: could I use Docker containers to supply some network/file isolation while connected to multiple VPNs? The answer is yes.
Let's take the difficult case first: a Cisco IPSec VPN. First, I exported the VPN configuration from Gnome's network manager, which creates a PCF file. Next, I start a Docker container based on the official Ubuntu Trusty image, giving it the net_admin Linux capability to allow the VPN to work. I also need to mount the new PCF file into a directory in the container so I can access it.
docker run -it --cap-add net_admin \
--volume /home/josh/vpns:/mnt/vpn ubuntu:trusty
After this I need to install the connection software:
apt-get update
apt-get install vpnc ssh
Once vpnc is installed, I need to convert that PCF file to a vpnc file and install it:
cd /mnt/vpn
pcf2vpnc vpn.pcf /etc/vpnc/vpnc.conf
chmod 600 /etc/vpnc/vpnc.conf
Once that's all set up, I can start the VPN:
root@c5b7802ca120:/# vpnc vpnc
Enter password for jberkus@10.1.1.15:
VPNC started in background (pid: 78)...
Success! Sometimes, though, you'll see this error message:
Cannot open "/proc/sys/net/ipv4/route/flush"
I haven't pinned down what causes that yet, but it seems to have no effect on the actual ability to use the VPN. Probably I need one more --cap-add, but I don't yet know which one.
Anyway, once you have that running, docker commit the image and you have a saved VPN image for that particular VPN (just remember not to upload it to Docker Hub). Then you can run the image whenever you need that VPN.
OpenVPN is actually more complicated, because device access as well as net access is required. On the up side, you only need the config file to actually connect. See this project for how to set up your container, and then run:
openvpn --config /mnt/vpn/config.ovpn
Now, the caveat: Docker containers do not supply nearly the level of isolation that a full VM would, especially since we have to grant the "net_admin" cap to get the VPN to work. While containers prevent accidental network cross-connection, they would not block targeted malware which attacks the network stack. In our case, that means I can use this technique for our regular clients, but not for PCI-compliant customers or other clients with strong regulated data-isolation requirements.
But for the others, I can now easily run 2 VPNs at a time, while protecting them from cross-connection, and me from accidentally targeting the wrong customer's server.
Thanks to Rikaard Hosein, programmerq, and other helpful folks on IRC channel #docker for advice on this.
Thursday, July 23, 2015
unsafe UPSERT using WITH clauses
By now you've read about PostgreSQL 9.5 and our shiny new UPSERT feature. And you're thinking, "hey, 9.5 is still in alpha, how do I get me some UPSERT action now?" While there are some suggestions for workarounds on the PostgreSQL wiki, I'm going to show you one more method for approximating UPSERT while you help us test 9.5 so that it will come out sooner.
The generally suggested way of handling UPSERT situations is one of the following:
- Try to do an INSERT. On error, have the application retry with an UPDATE.
- Write a PL/pgSQL procedure which does insert-then-update or update-then-insert.
So what's my method? Writeable WITH clauses. This feature, introduced in 9.1, allows you to do a multi-step write transaction as a single query. For an example, let's construct a dummy table with a unique key on ID and a value column, then populate it:
create table test_upsert ( id int not null primary key,
val text );
insert into test_upsert select i, 'aaa'
from generate_series (1, 100) as gs(i);
Now, let's say we wanted to update ID 50, or insert it if it doesn't exist. We can do that like so:
WITH
newrow ( id, val ) as (
VALUES ( 50::INT, 'bbb'::TEXT ) ),
tryupdate as (
UPDATE test_upsert SET val = newrow.val
FROM newrow
WHERE test_upsert.id = newrow.id
RETURNING test_upsert.id
)
INSERT INTO test_upsert
SELECT id, val
FROM newrow
WHERE id NOT IN ( SELECT id FROM tryupdate );
The above tries to update ID=50. If no row is updated, it inserts one. This also works for multiple rows:
WITH
newrow ( id, val ) as (
VALUES ( 75::INT, 'ccc'::TEXT ),
( 103::INT, 'ccc'::TEXT )
),
tryupdate as (
UPDATE test_upsert SET val = newrow.val
FROM newrow
WHERE test_upsert.id = newrow.id
RETURNING test_upsert.id
)
INSERT INTO test_upsert
SELECT id, val
FROM newrow
WHERE id NOT IN ( SELECT id FROM tryupdate );
... and will update or insert each row as called for.
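If you want to verify the effect, a quick check against the example data (IDs 1 through 100 initially populated with 'aaa') should produce something like this:
select * from test_upsert where id in ( 50, 75, 103 ) order by id;

 id  | val
-----+-----
  50 | bbb
  75 | ccc
 103 | ccc
(3 rows)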
Given that we can do the above, why do we need real UPSERT? Well, there are some problems with this approximate method:
- It's not concurrency-safe, and can produce unexpected results given really bad timing of multiple connections wanting to update the same rows.
- It will still produce key violation errors given bad concurrency timing, just fewer of them than method 1 above.
- It's still higher overhead than 9.5's UPSERT feature, which is optimized.
- It will return INSERT 0 0 from calls that do only updates, possibly making the app think the upsert failed.
- It's not safe to use with INSERT/UPDATE triggers on the same table.
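For comparison, once you're actually on 9.5, the native equivalent of the multi-row example above should look something like this (a sketch of the new INSERT ... ON CONFLICT syntax; the values are just illustrative):
INSERT INTO test_upsert ( id, val )
VALUES ( 75, 'ddd' ), ( 104, 'ddd' )
ON CONFLICT ( id )
DO UPDATE SET val = EXCLUDED.val;
Unlike the WITH-clause workaround, this handles concurrent writers correctly, which is one more reason to help test 9.5.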
Thursday, July 2, 2015
Test PostgreSQL 9.5 Alpha in a Docker container
Yay, the first alpha of PostgreSQL 9.5 is out! This has lots of cool stuff in it, which you can read about elsewhere. What I'm posting here is a new experiment: offering up Docker images for testing the alphas and betas.
TL;DR = Image here, information and instructions on the wiki.
This occurred to me last week at DockerCon (naturally enough); one of the reasons more people don't test the PostgreSQL beta releases is that it can be a pain to install them, especially with all of the stuff you want to test. So I've just put a couple days' work into making it easier for you: all you gotta do is install the image and you can test away.
And, since you're testing in a container, you don't have to worry about messing up the PostgreSQL installation on the desktop/laptop/server you normally use. If you feel like doing your testing in a public cloud, many of them have ways to run containers as cloud servers, so that's easy to do.
I created this image crammed full of stuff: all of contrib, jsquery, pgxn, PL/python and PL/perl, a sample database, etc., so that you'd have something to test. I wasn't able to include PL/Lua due to build issues, nor PostGIS because I'm building from source, and building PostGIS from source is kinda scary. If you want to help me build a better test image, though, the Dockerfile project for it is here.
In the meantime, I've removed another excuse for not testing PostgreSQL 9.5. So what are you waiting for? Test already!