[HN Gopher] Query Engines: Push vs. Pull
___________________________________________________________________
Query Engines: Push vs. Pull
Author : jsnell
Score : 102 points
Date : 2021-05-01 14:12 UTC (8 hours ago)
(HTM) web link (justinjaffray.com)
(TXT) w3m dump (justinjaffray.com)
| Rusky wrote:
| I like the use of generator functions for the pull-based example-
| the syntax closely mirrors the push-based example, which drives
| home the point that the difference between push and pull lies
| entirely in whether you compose operators via an iterator API or
| a callback API. You can also see this in how the two usage
| examples are inside-out versions of each other.
|
| When you look at it from the perspective of a language designer
| or compiler, things start to look even more similar:
|
| * Values that live across iterations are stored in a closure
| (push) or generator object (pull)
|
| * In the first half of an iteration, before producing a result,
| operators dispatch to their producers via a return address on the
| call stack (push) or callbacks (pull)
|
| * In the second half of an iteration, while building up a result,
| operators dispatch to their consumers via callbacks (push) or a
| return address on the call stack (pull)
|
| The difference the article mentions between "unrolling" the two
| approaches comes from the level of abstraction where inlining is
| performed. The push model produces natural-looking output when
| inlining the closures. The less-natural result of inlining the
| pull model's `next` functions is where the control flow in loops
| shows up. (See for example the performance benefits of adding a
| push-model complement to Rust's usual pull-model Iterator trait:
| https://github.com/rust-lang/rust/pull/42782#issuecomment-30...)
|
| But, you can get the same natural-looking output from the pull
| model if you perform inlining at the level of generators! This is
| a less familiar transformation than function inlining, but it's
| still fundamentally the same idea, of erasing API boundaries and
| ABI overhead- which in this case is a more straightforward
| formulation of the various loop optimizations that can sometimes
| get rid of the control-flow-in-loops produced by inlining `next`
| functions.
|
| The difference around DAGs is similar. In the push model, a DAG
| just means a producer calling into more than one consumer
| callback. In the pull model, this is tricky because you can't
| just yield (i.e. return from `next`) to more than one place-
| return addresses form a stack! (And thus the dual problem for the
| push model shows up for operations with multiple inputs, like the
| merge join mentioned in the article.)
|
| Overall I think a more flexible approach might be to generalize
| from `yield` and iterator APIs, to algebraic effects (or, lower
| level, delimited continuations)- these let you build both push
| and pull-like APIs for a single generator-like function, as well
| as hybrid approaches that combine aspects of both, by removing
| the requirement for a single privileged call stack.
| rzzzt wrote:
| LINQ also seems to have a push-based implementation, given its
| FROM-WHERE-SELECT syntax: https://docs.microsoft.com/en-
| us/dotnet/csharp/programming-g...
| nathanaldensr wrote:
| I don't think that is correct. Underlying much of LINQ is
| _IEnumerable <T>_, which implements the classic _Current_ and
| _MoveNext()_ iterator pattern. The _foreach_ keyword uses
| _IEnumerable <T>_ to "pull" elements one at a time.
|
| LINQ's syntax is just syntactic sugar on top of chained method
| calls to various interface extension methods.
| rzzzt wrote:
| Ahh, you are right. My eyes somehow drifted away from the
| _pull_ example with yield above the "basic push-based query
| engine" header, that is the one which I've found familiar.
| mamcx wrote:
| What caught my attention is the claim of push based queries to be
| easier to generate imperative code, where I could learn about how
| do that?
| slver wrote:
| I'm new to this terminology, but it looks a bit like lazily
| computing a data stream vs. eagerly computing it, doesn't it? Or
| am I wrong.
| Ericson2314 wrote:
| It is a little a bit like that. But the regular formulations of
| "eager" and "lazy" assume expressions that are evaluated once
| (or at at least if evaluated the again will yield the same
| result).
|
| It's better to think of this stuff with constantly changing
| underlying data, which adds a lot of "color" to the domain.
| toddh wrote:
| Nicely explained. I like the idle operators/row dichotomy. At a
| system level push based systems suffer from the same weakness as
| event driven systems: incompleteness. You can never be sure you
| haven't missed an event because it was dropped during system
| startup, network partition, failover, queue full, buffer
| overflow, packet drop, etc. This doesn't always matter, but with
| a pull based system you know the result is based on all the data,
| but as you point out, that has a cost too.
| cube2222 wrote:
| Could you please explain at more length how push/pull plays a
| role here?
|
| With push-based you'd still usually use lossless transports,
| which could also have a "closing packet" which tells the
| consumer that "this stream is done", so I don't see how pull-
| based is better here.
|
| Re queue full: If I understand you correctly, queues are only a
| problem when you start having multiple parallel stages one
| after the other, with queues in-between. Then you have queueing
| both with push-based and pull-based architectures.
| cube2222 wrote:
| I'm doing a rewrite[0] of OctoSQL[1], partly to make it push-
| based instead of pull-based, and for now it's been much simpler
| to work with, won't get into the detailed reasoning in this
| comment.
|
| However, if somebody is interested, this is a great paper on how
| to architect a query compiler:
| https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod1...
|
| I found it very enlightening and simple to implement. The paper
| also delves into details why the push-based architecture has more
| room for optimizations.
|
| [0]:https://github.com/cube2222/octosql/tree/redesign
|
| [1]:https://github.com/cube2222/octosql/
| Ericson2314 wrote:
| The truth is you need almost everything to do push and pull. Just
| as pull alone does "forks" inefficiently, push alone does "joins"
| inefficiently.
|
| People shy away from this conclusion because doing everything
| both ways is quite hard and requires fancier abstractions to make
| nice than people are familiar with.
| Quekid5 wrote:
| People have been vacillating on this forever and the truth
| is... (drum roll) It depends.
| scott0129 wrote:
| > The truth is you need almost everything to do push and pull.
| Just as pull alone does "forks" inefficiently, push alone does
| "joins" inefficiently.
|
| That's really is a simple and intuitive summary. Thank you for
| that insight!
| legg0myegg0 wrote:
| The DuckDB folks are migrating from pull to push and put together
| this interesting documentation of their reasoning! They use a
| vectorized model instead of a compiled one, so it's another
| interesting comparison point. It seems like push will simplify
| how they handle parallelism.
|
| https://github.com/duckdb/duckdb/issues/1583
| catblast01 wrote:
| > This is the oldest and most well-known query execution model,
| and was first proposed in 1994.
|
| Is the author younger than 25. Not picturing a world of databases
| before 1994? Did the world even exist?
|
| I kid but regardless of when a paper describing some process was
| published the "oldest" query execution model is decidedly a lot
| older than 26 years.
___________________________________________________________________
(page generated 2021-05-01 23:00 UTC)