I have a data analytics project which produces multiple statistical metrics for a large volume of sensor data: percentiles (like the median and the 90th percentile) as well as average, min, and max. Originally this worked using PL/R, which was pretty good except that some of the R modules were crashy, which was not so great for uptime.
This is why, two years ago, I ripped out all of the PL/R and replaced it with PL/Python and SciPy. I love SciPy because it gives me everything I liked about R, without most of the things I didn't like. But now, I've ripped out the SciPy as well. What am I replacing it with? Well, SQL.
In version 9.4, Andrew Gierth added support for percentiles to PostgreSQL via WITHIN GROUP aggregates. As far as I'm concerned, this is second only to JSONB in reasons to use 9.4.
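These are "ordered-set aggregates": you supply the percentile fraction as a direct argument and the column to sort in a WITHIN GROUP clause. For example (the table and column names here are hypothetical):

    SELECT device_id,
           percentile_cont(0.5) WITHIN GROUP (ORDER BY reading) AS median,
           percentile_cont(0.9) WITHIN GROUP (ORDER BY reading) AS pct_90
    FROM sensor_readings
    GROUP BY device_id;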
Now, one of the more complicated uses I make of aggregates is doing "circular" aggregates, that is, producing percentiles for a set of compass headings in an effort to determine the most common facings for certain devices. Here's the PL/Python function I wrote for this, which calculates circular aggregates using the "largest gap" method. This algorithm assumes that the heading measurements are essentially unordered, so to find the endpoints of the arc we look for the two adjacent measurements (going around the circle) with the largest gap between them. We then shift the measurements to an imaginary coordinate system where the edge of that gap is the low measurement, calculate the percentiles, and shift them back. Note that this method produces garbage if the device turned through a complete circle during the aggregation period.
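A minimal sketch of that function, assuming it takes an array of headings in degrees plus a percentile fraction (the names and the exact SciPy call are illustrative, not the original code):

    CREATE OR REPLACE FUNCTION circular_percentile(headings float8[], pctile float8)
    RETURNS float8
    LANGUAGE plpythonu
    AS $$
        # find the largest gap between adjacent headings (including the
        # wraparound gap), rotate so the far edge of that gap is zero,
        # take the percentile, then rotate back
        from scipy import stats

        h = sorted(headings)
        n = len(h)
        gaps = [(h[(i + 1) % n] - h[i]) % 360 for i in range(n)]
        start = h[(gaps.index(max(gaps)) + 1) % n]
        shifted = [(x - start) % 360 for x in h]
        # scoreatpercentile takes 0-100, not a 0-1 fraction
        result = stats.scoreatpercentile(shifted, pctile * 100)
        return (result + start) % 360
    $$;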
Now, that SciPy function was pretty good and we used it for quite a while. But we were unhappy with two things: first, SciPy is rather painful as a dependency because the packaging for it is terrible; second, having PostgreSQL call out to SciPy for each iteration isn't all that efficient.
So, since 9.4 has percentiles now, I started writing a function based on the built-in SQL percentiles. Initially I was thinking it would be a PL/pgSQL function, but I was pleasantly surprised to find that I could write it entirely as a SQL function! Truly, Postgres's SQL dialect is Turing-complete.
So here's the new all-SQL function, with some helper functions.
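What follows is a sketch of that, under the same assumed signature as the PL/Python version, with a small degree-normalizing helper (again, the names are illustrative):

    CREATE OR REPLACE FUNCTION norm_degrees(x float8)
    RETURNS float8
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT x - 360 * floor(x / 360) $$;

    CREATE OR REPLACE FUNCTION circular_percentile(headings float8[], pctile float8)
    RETURNS float8
    LANGUAGE sql
    AS $$
    WITH h AS (
        SELECT norm_degrees(x) AS heading
        FROM unnest(headings) AS x
    ),
    bounds AS (
        SELECT min(heading) AS min_h FROM h
    ),
    gaps AS (
        -- distance from each heading to the next one clockwise; the
        -- largest heading wraps around to the smallest
        SELECT COALESCE(lead(heading) OVER (ORDER BY heading), min_h + 360)
                   - heading AS gap,
               COALESCE(lead(heading) OVER (ORDER BY heading), min_h)
                   AS gap_end
        FROM h, bounds
    ),
    arc_start AS (
        -- the far edge of the largest gap becomes the zero point
        SELECT gap_end AS start FROM gaps ORDER BY gap DESC LIMIT 1
    )
    SELECT norm_degrees(
               percentile_cont(pctile) WITHIN GROUP
                   (ORDER BY norm_degrees(heading - start))
               + start
           )
    FROM h, arc_start
    GROUP BY start;
    $$;

Per-device aggregation then just wraps this around array_agg (device_headings is a hypothetical table):

    SELECT device_id,
           circular_percentile(array_agg(heading), 0.5) AS median_heading
    FROM device_headings
    GROUP BY device_id;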
Then I performance-tested it, and was pleasantly surprised again. The SciPy version took 2.6 seconds* to aggregate 100,000 sets of 20 measurements. The new SQL version takes 40 milliseconds, cutting response time by 98%. Wow!
And I've eliminated a hard-to-install dependency. So it's all win. Of course, if anyone has ideas on making it even faster, let me know.
Pushing the limits of SQL to the edge of insanity.
(* note: I expect that most of the extra time for the SciPy version is in calling out to Python through PL/Python, rather than in SciPy itself.)