[HN Gopher] Function pipelines: Building functional programming ...
___________________________________________________________________
Function pipelines: Building functional programming into PostgreSQL
Author : oconnor663
Score : 252 points
Date : 2021-10-19 15:29 UTC (7 hours ago)
(HTM) web link (blog.timescale.com)
(TXT) w3m dump (blog.timescale.com)
| eska wrote:
| Cool stuff!
|
| Out of curiosity: how does the composition of the sort function
| work? Can you avoid knowing all rows at the point it occurs?
| Lockerman wrote:
 | Not yet; right now, all datapoints must be known at the point
 | the sort occurs.
| tytrdev wrote:
| _laughs in datomic_
| dkersten wrote:
 | Not everyone can use datomic though, for many reasons
 | (including cost). I'd love to use datomic, but it's just not an
 | option for me, while postgres is (and the DigitalOcean hosted
 | postgres is quite affordable and supports TimescaleDB too).
| claytonjy wrote:
| Would that be the community version of Timescale then?
|
| Looks like this is a whole new extension, does DO allow you
| to install your own, or do you think it'd be easy to make a
| case for them adding this one? (I'm not a DO user myself)
| djk447 wrote:
| (NB: Post author here)
|
| This is in the TimescaleDB Toolkit extension [1] which is
| licensed under our community license for now and it's not
| available on DO. It is available on our cloud service fully
| managed. You can also install it and run it for free
| yourself.
|
| [1]: https://github.com/timescale/timescaledb-toolkit
| ithrow wrote:
| Until you use the very slow and extremely buggy black box in
| production.
| darksaints wrote:
| This is really freaking cool. I'm not often impressed by clever
 | hacks in SQL, but this solves a really annoying problem, and the
 | specific way in which you solved it, with just custom
 | functions and operators, is ingenious and reassuring, as opposed
 | to some of the other crazy ideas I've seen for extending SQL.
|
| I would love to see more about how you implemented the lambdas,
| as that's the one thing that is still non-obvious to me.
| djk447 wrote:
| (NB: Post author here)
|
| We're thinking about doing a post on them at some point in the
| future!
|
| Any specific questions we should answer?
| darksaints wrote:
| I guess mostly how the syntax works. The `$$` thing I mostly
| recognize as being a code block in custom functions, but I'm
| not sure how it works in the context of a lambda. Also, is
| `$value` just a special identifier for the input?
|
| A separate question...is it possible to have tuple-valued
| timevectors?
| djk447 wrote:
| The docs have a bit more on it [1]. The $$ is really just a
| dollar quoted string[2], you can use any sort of string in
| there, but dollar quoting just makes it a lot easier to
| read.
|
| On your other question, not right now, but if it's
| something you want, file a github issue, it's something
| we've thought a bit about and it gets hairy pretty quickly,
| but also increases the power of these things dramatically.
| We'll probably do something with it, but it may be a little
| while and we may let this stuff bake for a bit first...
|
 | [1]: https://docs.timescale.com/timescaledb/latest/how-to-
 | guides/...
 | [2]: https://www.postgresqltutorial.com/dollar-
 | quoted-string-cons...
| ghop02 wrote:
 | While I agree the partition syntax may not be super
 | straightforward, adding a new operator as another way to sort,
 | sum, abs, etc. seems like chaos.
| claytonjy wrote:
| I think "may not be super straightforward" undersells the
| learning curve here. I took to SQL pretty naturally but gnarly
| partitioning is still the easiest way for me to get tripped up.
| Having taught a lot of data sciencey folks advanced SQL, even
| simple uses of partitioning tend to be a big barrier compared
| to what comes before.
| djk447 wrote:
| (NB: Post author here)
|
 | Totally understand the concern. For now it's just for a
 | particular datatype; all of the other operations happen
 | normally inside your query, and you can still use window
 | functions...
|
| And btw, I love window functions and will probably be doing a
| post on them, but they can be a little bit weird and have some
| really odd ordering properties with aggregates etc.
|
| The point of this is really much more about providing choices
| for folks. But yeah, if we find that people are just getting
| really confused we could re-consider, that's part of why we
| released it experimentally.
| akulkarni wrote:
| Also, the team picked `->` as the operator, which I thought
| was a very natural and intuitive choice :-)
|
| e.g., SELECT device_id, timevector(ts, val)
| -> sort() -> delta() -> abs() -> sum() as volatility
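 |
To make the pipeline's semantics concrete, here is a minimal sketch in plain Python of what `sort() -> delta() -> abs() -> sum()` computes over `(ts, val)` pairs. The function name and data are illustrative; this is not Timescale code.

```python
# Sketch of the volatility pipeline over (timestamp, value) pairs.
# Purely illustrative, not the Toolkit's actual implementation.

def volatility(points):
    """sort() -> delta() -> abs() -> sum() over (ts, val) pairs."""
    ordered = sorted(points)                          # sort() by timestamp
    vals = [v for _, v in ordered]
    deltas = [b - a for a, b in zip(vals, vals[1:])]  # delta()
    return sum(abs(d) for d in deltas)                # abs() -> sum()

print(volatility([(3, 9.0), (1, 2.0), (2, 3.0)]))  # 1.0 + 6.0 = 7.0
```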
| OJFord wrote:
| What if those functions (were something that) returned
| json{,b}?
|
| Oh or I see timevector returns a custom datatype, so I
| suppose the answer is it has to be one of a fixed number of
| provided types?
| djk447 wrote:
| (NB: post author here)
|
| We're planning on providing various outputs, json is
| definitely one format we'd want to support, want to file
| a github issue with the layout that you'd want and we can
| start getting some feedback to implement?
| Lockerman wrote:
| > Oh or I see timevector returns a custom datatype, so I
| suppose the answer is it has to be one of a fixed number
| of provided types?
|
| Exactly! These aren't general arithmetic operators,
| they're specifically for mapping arithmetic operations
| over timevectors.
| ecoulthard wrote:
| I'm looking for a more efficient way of detecting errors in
| metrics that store running totals, such as Odometer. I use the
 | delta of Odometer to compute Distance. But decreases in Odometer
 | cause incorrect Distance calculations. Odometer should never
 | decrease. It is an error if Odometer decreases. Would function
| pipelines be able to help me detect decreases in Odometer?
| djk447 wrote:
| (NB: Post author here)
|
 | Yes! You can use the `delta()` element followed by a `filter()`
 | with a lambda that looks for negative numbers, I think!
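 |
A rough sketch of that delta-then-filter idea in Python; the names and data are illustrative only, not the Toolkit API:

```python
# Detect decreases in a running total (e.g. an odometer) by taking
# deltas and filtering for negative values. Illustrative only.

def decreases(points):
    """Return (ts, delta) pairs where the running total went down."""
    ordered = sorted(points)  # sort by timestamp first
    out = []
    for (t0, v0), (t1, v1) in zip(ordered, ordered[1:]):
        d = v1 - v0           # delta()
        if d < 0:             # analogue of filter($$ $value < 0 $$)
            out.append((t1, d))
    return out

print(decreases([(1, 100.0), (2, 150.0), (3, 120.0), (4, 130.0)]))
# [(3, -30.0)]
```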
| jbylund wrote:
| What's the explain plan for a pipeline like this look like?
|
 | In my experience, one of the biggest things scaring devs away from
 | moving more logic over the wire into the db is the friction
 | between version control and database state. If I end up defining
 | 10s or 100s of custom pipeline operators in git, how do I keep
 | them in sync with the db? And am I responsible for doing all that
 | legwork myself?
| 1point618 wrote:
| This is immediately useful. The number of awkwardly nested
| subqueries that I write because I don't have this kind of
| functionality in BigQuery is a huge bummer.
|
| I am curious, how many Haskell programmers do you all have? And
| more seriously, do you have any plans to help provide this
| functionality to non-Postgres SQLs, or at least help those trying
| to take inspiration from it?
| fifilura wrote:
| I wonder if the array functions in the presto engine (used for
| example in Amazon Athena) should be able to handle this too?
|
| https://prestodb.io/docs/0.217/functions/array.html
| fifilura wrote:
| Ok, a 15 minute hack. It may be possible to make it look
| nicer.
|
| And it looked better in TFA, so good job!
 |         SELECT device_id,
 |                reduce(
 |                  zip_with(
 |                    array_sort(zip(array_agg(ts), array_agg(val))),
 |                    slice(array_sort(zip(array_agg(ts), array_agg(val))), 2, 5000),
 |                    (a, b) -> abs(a.field1 - b.field1)
 |                  ),
 |                  0, (s, x) -> s + COALESCE(x, 0), s -> s
 |                ) as res
 |         FROM (
 |           VALUES (1, 2, 1), (1, 3, 0), (1, 9, 3), (1, 3, 4),
 |                  (2, 2, 1), (2, 8, 0), (3, 4, 3)
 |         ) AS t(device_id, val, ts)
 |         GROUP BY device_id
 |
 |         device_id | res
 |         1         | 14
 |         3         | 0
 |         2         | 6
| jpitz wrote:
 | Minor nitpick: Nested subqueries ARE awkward, which is why I
 | would express
 |
 |         SELECT device_id, sum(abs_delta) as volatility
 |         FROM (
 |           SELECT device_id,
 |                  abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
 |           FROM measurements
 |           WHERE ts >= now() - '1 day'::interval) calc_delta )
 |         GROUP BY device_id;
 |
 | this way:
 |
 |         WITH temperature_delta_past_day AS (
 |           SELECT device_id,
 |                  abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
 |           FROM measurements
 |           WHERE ts >= now() - '1 day'::interval
 |           //edit - the remainder of this line is a typo: ) calc_delta )
 |         SELECT device id, sum(abs_delta) as volatility
 |         FROM temperature_delta_past_day
 |         GROUP BY device_id;
|
| To me, it is a lot more natural to use SQL's CTE syntax to
| 'predefine' my projections before making use of them, instead
| of trying to define them inline - like the difference between
| when you'd define a lambda directly inline vs defining a
| separate function for it.
|
| I don't know if trying to embed a DataFrame-esque api inside of
| SQL is a thing that would benefit me, but it is an interesting
| idea.
| jpruittsql wrote:
| CTEs are wonderful for readability, but until recent versions
| of PostgreSQL they were always "materialized" which can have
| performance implications vs. subqueries. The NOT MATERIALIZED
| option to CTEs was added in PostgreSQL 12. (Timescale
| Engineer)
| nextos wrote:
| Came here to say the same thing. Some years back, to pass my
| CS databases course, we were required to write really complex
| queries in the computer lab. We had very limited time to do
| so.
|
| Many people failed because their queries became complex
| monoliths, hard to debug or optimize when things went wrong.
|
| That's because they limited themselves to SQL-92. We were
| using Oracle, so there was no reason not to use SQL:1999. I
| made heavy use of WITH, and it was quite effortless.
| pmontra wrote:
 | I'm also using CTEs a lot, precisely because subqueries are
 | hard to reason about, and these pipelines are in turn even
 | easier.
| fifilura wrote:
| I agree. There is something magic with CTEs, just moving
| things around a little bit makes it so much easier to
| comprehend.
|
| You could even move the val and lag(val) to two different
| columns and do the abs() in the summary.
|
| That way you can query temperature_delta_past_day and see for
| yourself what it does.
| djk447 wrote:
| (NB: Post author here)
|
| Yeah. CTEs definitely make it a bit easier to read, though
| some people get more confused by them, especially because
| they don't exist in all SQL variants.
|
| And totally agree with that last bit! We want to see if it's
| useful for folks, it's released experimentally now and we'll
| see what folks can do with it. One thing that's fun and that
| we may do a post explaining a bit more is that these
| pipelines are actually values as well, so the transforms that
| you run can be stored in a column in the database as well.
|
| And that starts offering some really mind-bending stuff. The
| example I used was building on the one in the post except now
| you have thermocouples with different calibration curves. You
| can actually store a polynomial or other calibration curve in
| a column and apply the correct calibration to each individual
| thermocouple with a JOIN...which is kinda crazy, but pretty
| awesome. So we want to figure out how to use these and what
| people can do with them and see where it takes us.
| jpitz wrote:
| They've been around quite a long time, but yeah it seems
| like only SQL89 gets taught.
|
| Window Functions and CTEs are both major force multipliers
| in the language, so I always encourage folks to go learn
| them.
| EE84M3i wrote:
 | Note that your second example has a misplaced pair of parens:
 |
 |         WITH temperature_delta_past_day AS (
 |           SELECT device_id,
 |                  abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
 |           FROM measurements
 |           WHERE ts >= now() - '1 day'::interval) calc_delta )
 |         SELECT device id, sum(abs_delta) as volatility
 |         FROM temperature_delta_past_day
 |         GROUP BY device_id;
 |
 | should probably be
 |
 |         WITH temperature_delta_past_day AS (
 |           SELECT device_id,
 |                  abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
 |           FROM measurements
 |           WHERE ts >= now() - '1 day'::interval
 |         )
 |         SELECT device id, sum(abs_delta) as volatility
 |         FROM temperature_delta_past_day
 |         GROUP BY device_id;
| jpitz wrote:
| You're correct - I was in a hurry to express the idea, and
| I failed to check the where clause.
| djk447 wrote:
| (NB: Post author here)
|
| We don't have many Haskell programmers, we mostly work in Rust,
| C and Go, but we're always open to new things...
|
| This is pretty Postgres specific. From the beginning Postgres
| has focused on extensibility and allowed this sort of stuff
| with custom functions/operators/types. Many other SQL variants
| don't have that. It's one of the main things that sets Postgres
| apart from other databases, see Stonebreaker's great history of
| this, specifically the stuff on Object Relational databases
| [1].
|
| We're pretty focused on building on top of Postgres because of
| that functionality. We do have some other stuff to make
| Postgres more scalable, and you're welcome to try us out, but
| if there's something specific that BigQuery offers that you
| need feel free to file a github issue around that too. But
| yeah, no plans to do things like this in other databases, they
| just don't have the infrastructure...
|
| [1]:https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pap
| e...
| akulkarni wrote:
| (TimescaleDB co-founder)
|
| Agree - what would you need to see in PostgreSQL /
| TimescaleDB to enable you to switch from Big Query? We are
| all ears :-)
| methyl wrote:
 | Not the OP, but infinite scalability, without any
 | infrastructure work required besides loading the data, is
 | what keeps me on BigQuery.
| akulkarni wrote:
| Thanks for sharing.
|
| "infinite scalability" is something we are aiming for in
| our new vision for Timescale Cloud:
|
| https://blog.timescale.com/blog/announcing-the-new-
| timescale...
|
| We are not there today, but we are making progress.
|
| Any other feedback - please let us know :-)
| btown wrote:
| To do something similar without custom extensions, it's worth
| knowing about Postgres's lateral joins, which allow you to reuse
| columns from the left side of your join in the right side. It
| basically realizes the "enrich my results" promise that SQL
| should always have had, and makes incredibly flexible analytics
| queries possible without ever needing to indent more than once!
|
| https://heap.io/blog/postgresqls-powerful-new-join-type-late...
|
| https://ddrscott.github.io/blog/2017/what-the-sql-lateral/
| chrisjc wrote:
| So does this functional SQL just get "transpiled" into analytical
| SQL? If so, where does "transpilation" occur, in the client,
| intercepted before hitting the query engine, etc? Or has the
| query engine itself been modified to handle functional SQL?
|
| If the query engine itself has been modified, are there any
| performance gains or losses?
|
| This is a pretty interesting and compelling idea. I do wonder if
| the functional SQL approach might entice developers to write very
| bad/expensive/ineffective queries. Then again, nothing is
| stopping developers from doing the same with analytical SQL.
| ccleve wrote:
| Duplicating my question to Lockerman below. Sorry for the
| double post.
|
| Is the -> operator just syntactic sugar? Is this:
|
| SELECT device_id, timevector(ts, val) -> sort() -> delta() ->
| abs() -> sum() as volatility
|
| the same as this?
|
| SELECT device_id, sum(abs(delta(sort(timevector(ts, val))))) as
| volatility
| djk447 wrote:
| Sorry I missed before was out grabbing lunch, answered above!
| WireBaron wrote:
| This is essentially correct. It's a little more complicated
| inside, and the expansion's really more like:
|
 |         SELECT device_id,
 |                arrow_run_pipeline(
 |                  timevector(ts, val),
 |                  arrow_add_element(sort(),
 |                    arrow_add_element(delta(),
 |                      arrow_add_element(abs(), sum()))));
|
| The notable difference here is that this presents a lot more
| optimization potential, as the entire pipeline can
| conceivably be applied in one pass through the table.
| djk447 wrote:
| (NB: Post author here)
|
| So this is a great question and we may do a whole post on this
| at some point in the future. There are definitely ways to shoot
| yourself in the foot, but hopefully we can avoid at least some
| of them.
|
| The main way this happens now is that the pipeline elements
| actually return a custom data type that represents the composed
| set of functions, and we can (for at least part of the
| pipeline) handle simplification and combination of that into a
| single set of functions to run over the `timevector`.
|
 | With that said, there isn't great support for common
 | subexpression elimination in PG right now. We're hoping we can
 | solve some of that with both subqueries and maybe some fun
 | stuff in the planner/executor, and we're going to try to solve
 | it as we think about moving this from experimental to
 | stable.
|
| Right now we hardly modify the query engine at all, these are
| just normal functions that are run and they'll probably remain
 | that way for a while. The optimization here will happen more
 | inside the datastructure, or within a row, ideally to make sure
 | we're not scanning the same `timevector` multiple times. We do
 | hope to add in some vectorization and other bits as we keep
 | going, but we're not 100% sure where that's going to go yet. I'm
 | also going to ask a colleague to respond and clarify this, since
 | he's working more heavily on it and I know only some of the
 | bits...
| chrisjc wrote:
| > return a custom data type that represents the composed set
| of functions, and we can (for at least part of the pipeline)
| handle simplification and combination of that into a single
| set of functions to run over the `timevector`
|
| I'm really having trouble wrapping my head around this idea.
| I guess it speaks volumes about the potential and power of
| PostgreSQL in general.
|
| Has PostgreSQL ever been used like this in the past, or is
| this really thinking outside of the box?
|
| We don't use TimeScale or have any intention of in the
| immediate future (Snowflake provides 99% of what we need
| atm), but I'm always interested and impressed to hear what
| they're doing.
| djk447 wrote:
| (NB: Post author here)
|
 | Yeah, it's a bit mind-bending, I guess. I'll try to explain,
 | but it might need a longer-format treatment, and we will
 | probably do a piece on this in the future.
|
| Essentially, each function returns a special
| "PipelineElement" datatype that identifies the function,
| but not what it's acting on. When you string them together,
| it returns a "PipelineElement" that represents the
| composition of all the functions with the correct ordering
| etc. That can then be applied to the incoming `timevector`.
|
 | As far as I know, no one's done this before. I don't think
 | it's a completely new idea; Stonebraker thought about how
 | you can treat code as data, and thought that bringing data
 | and code closer together was an important goal of the
 | Postgres project, but this specific way of doing it is
 | pretty new and out there, I think!
|
| I think most people would think that in order to do this
| you'd need to do something like modify the parser or
| something more invasive.
|
| (We do use one planner hook on the function call, the
| expression simplification one that allows us to convert
| something like `timevector(ts, val) -> sort() -> delta() ->
| abs() -> sum()` into `timevector(ts, val) -> (sort() ->
| delta() -> abs() -> sum())` so that all the operations can
| be done in a single pass. But that's pretty minor. If we
| need to figure something out around the common
| subexpression elimination we may need to hook into the
| planner a little more, we'll see).
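 |
One way to picture "functions returning a datatype that represents the composition" is this small Python sketch. The class, operator choice, and function list are all illustrative inventions, not Timescale's actual internals:

```python
# A pipeline element stores the functions to apply, not a result.
# Chaining builds a composed element; applying it runs them in order,
# so the whole chain can be evaluated in one pass over the data.
# Purely illustrative of the idea, not the Toolkit's implementation.

class PipelineElement:
    def __init__(self, fns):
        self.fns = list(fns)

    def __rshift__(self, other):       # stand-in for the -> operator
        return PipelineElement(self.fns + other.fns)

    def apply(self, timevector):
        for fn in self.fns:            # single pass over the composed list
            timevector = fn(timevector)
        return timevector

sort_ = PipelineElement([sorted])
delta = PipelineElement([lambda v: [b - a for a, b in zip(v, v[1:])]])
abs_  = PipelineElement([lambda v: [abs(x) for x in v]])
sum_  = PipelineElement([lambda v: sum(v)])

pipeline = sort_ >> delta >> abs_ >> sum_   # a value, not a result
print(pipeline.apply([9.0, 2.0, 3.0]))      # 7.0
```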
| [deleted]
| djk447 wrote:
 | Oh yeah, and building on this: this is also the source of
 | the ability to actually store the pipeline element in the
 | database, say in a column, and apply it later, as I
 | mentioned in a comment below. That's one of the cool
 | mind-bending things this allows.
|
| The example I gave below was building on the example in
| the post, where we had thermometers. Except now think
| about if you have thermocouples and each thermocouple had
| its own calibration curve, and they were different types
| so they might even have different types of equations
| providing the calibration curve.
|
| You can now store the pipeline that does the correct
| adjustments and calculates the actual temperature
| separately for each `device_id` and JOIN to the
| "calibrations" table to apply the calculation.
|
| It's kinda crazy, but really cool and I'm hoping to do a
| post on that whole thing at some point to encourage
| people to play around with it and see what sorts of
| things it allows...
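 |
Because the composed pipeline is itself a value, a different one can be stored per device and applied via a join. Sketched in Python; the device names, curves, and table layout are hypothetical, not from the post:

```python
# Storing a different 'calibration pipeline' per device and applying
# it by joining on device_id. Illustrative only, not Timescale code.

calibrations = {
    "tc_1": lambda raw: 0.5 * raw + 1.0,         # linear curve
    "tc_2": lambda raw: 0.01 * raw ** 2 + 20.0,  # quadratic curve
}

readings = [("tc_1", 10.0), ("tc_2", 30.0), ("tc_1", 20.0)]

# The "JOIN": look up each device's stored pipeline, apply it per row.
calibrated = [(dev, calibrations[dev](raw)) for dev, raw in readings]
print(calibrated)  # [('tc_1', 6.0), ('tc_2', 29.0), ('tc_1', 11.0)]
```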
| derefr wrote:
| For your next magic trick, I'd suggest using these to
| build a Postgres equivalent to Lucene's persisted
| levenstein automata :)
| [deleted]
| [deleted]
| Lockerman wrote:
| (Timescale engineer here)
|
| To summarize a bit on what David said here[1]: there are no
| modifications to the query engine, this is all using Postgres's
| custom operator support.
|
| https://news.ycombinator.com/item?id=28920110
| ccleve wrote:
| Is this just syntactic sugar, then?
|
| Is this:
|
| SELECT device_id, timevector(ts, val) -> sort() -> delta() ->
| abs() -> sum() as volatility
|
| the same as this?
|
| SELECT device_id, sum(abs(delta(sort(timevector(ts, val)))))
| as volatility
| djk447 wrote:
| (NB Post author)
|
| Kinda. It's close to that, except in order to make it do
| that we had to actually make the functions return special
| types so it's more equivalent to something like
|
 | SELECT device_id, apply(sum, apply(abs, apply(delta,
 | apply(sort, timevector(ts, val)))))
|
| Where each of the items in there is an instance of our
| special "PipelineElement" type, which defines the function
| it's applying.
|
| Does that make any sense at all? Not sure if I'm explaining
| this well...
| ccleve wrote:
| It does make sense. Thanks.
| stingraycharles wrote:
 | I assume this special syntax allows for better
 | optimizations: the plain SQL variant is much more flexible, which
 | means you can make far fewer assumptions, and thus have less room
 | for optimizations.
|
| Not sure if they're doing that (especially for the v1 version
| of what looks to be primarily a syntactic selling point, rather
| than a perf selling point). But that's what I would do.
| djk447 wrote:
| (NB: post author here)
|
 | Yep! You got it exactly right. We're doing it for syntax
 | first, and then hopefully we'll go back and start doing
 | interesting performance optimizations on top. There are
 | definitely some things we're thinking about.
| chrisjc wrote:
| Wow, great explanation! Now I really like this idea!
| ccleve wrote:
| It appears that all these extra functions are written in Rust,
| but the core Timescale product is written in C. Why did you do it
| that way? Would you do it differently if you were starting from
| scratch today?
| djk447 wrote:
| (NB: Post author here!)
|
| This is all true! We have a bit more info on that choice in
| this post [1]. But in general the Toolkit extension [2] is
| meant to have a bit of a lighter touch than the core
 | TimescaleDB extension [3]; it's meant for moving faster,
 | working almost entirely with the custom
 | functions/types/operators approach rather than the core
 | extension, which is integrated much more deeply into the planner
 | & executor and deals with a number of other hooks. It would be
| possible to do that in Rust, but because it's so deeply
| integrated with the main Postgres codebase there would just be
| a lot of translation going on and you'd get a lot less of the
| benefits of working in Rust. So C is more natural for much of
| the core DB extension.
|
| PGX [4] is also a huge help for writing this sort of extension.
| The point of Toolkit is really to move much faster, try lots of
| things, mostly using the more SQL facing API (CREATE FUNCTION,
| CREATE AGGREGATE and the like). And PGX helps manage all of
| that really nicely. The rest of Rust being available is also
| great and we like working in the language, but without PGX it'd
| be a lot harder and we couldn't move nearly as fast.
|
 | [1]: https://blog.timescale.com/blog/introducing-
 | hyperfunctions-n...
 | [2]: https://github.com/timescale/timescaledb-toolkit
 | [3]: https://github.com/timescale/timescaledb
 | [4]: https://github.com/zombodb/pgx
___________________________________________________________________
(page generated 2021-10-19 23:00 UTC)