[HN Gopher] Why query planning for streaming systems is hard
       ___________________________________________________________________
        
       Why query planning for streaming systems is hard
        
       Author : simonz05
       Score  : 61 points
       Date   : 2021-08-15 18:44 UTC (4 hours ago)
        
 (HTM) web link (scattered-thoughts.net)
 (TXT) w3m dump (scattered-thoughts.net)
        
       | eventreduce wrote:
        | You should check out the event-reduce algorithm[1]. It scales
        | differently because it calculates new results from old results
        | + event, so it can have some benefits over materialized views
        | depending on the data size and how many clients subscribe to a
        | query.
       | 
       | [1] https://github.com/pubkey/event-reduce
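        | 
        | A minimal sketch of the old-results + event idea in Python
        | (the function and event shape here are my own illustration,
        | not the library's actual API):
        | 
        |   def apply_event(old_results, event, matches, sort_key):
        |       doc = event['doc']
        |       # drop any stale copy of the changed document
        |       results = [d for d in old_results
        |                  if event['op'] == 'insert' or d['id'] != doc['id']]
        |       # re-add it if the query still matches after the change
        |       if event['op'] != 'delete' and matches(doc):
        |           results.append(doc)
        |       return sorted(results, key=sort_key)
        | 
        |   # e.g. for a subscribed query "status == 'open' ORDER BY created":
        |   # apply_event(current, {'op': 'update', 'doc': changed_doc},
        |   #             matches=lambda d: d['status'] == 'open',
        |   #             sort_key=lambda d: d['created'])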
        
       | SpicyLemonZest wrote:
       | Robust operators in particular seem like an important area of
       | research to me. I've never had time to sit down and think about
       | it in detail, but I can't count the number of escalations I've
       | had that would have been instantly solved if only we had a
       | gracefully degrading broadcast join.
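        | 
        | For what it's worth, a toy Python sketch of what I mean (list
        | inputs and a hand-rolled partitioned_join as the fallback; a
        | real engine would spill to disk or shuffle instead):
        | 
        |   def partitioned_join(left, right, key, n=8):
        |       # stand-in for a shuffle/partitioned join: join each
        |       # hash-bucket pair with a bounded hash table
        |       out = []
        |       for p in range(n):
        |           bucket = {}
        |           for row in left:
        |               if hash(row[key]) % n == p:
        |                   bucket.setdefault(row[key], []).append(row)
        |           out.extend((l, r) for r in right
        |                      if hash(r[key]) % n == p
        |                      for l in bucket.get(r[key], []))
        |       return out
        | 
        |   def robust_broadcast_join(small, big, key, budget_rows=1_000_000):
        |       # build a hash table from the side the planner *thinks*
        |       # is small; if the estimate was wrong, degrade to the
        |       # partitioned join instead of running out of memory
        |       table, built = {}, 0
        |       for row in small:
        |           table.setdefault(row[key], []).append(row)
        |           built += 1
        |           if built > budget_rows:
        |               return partitioned_join(small, big, key)
        |       return [(l, r) for r in big for l in table.get(r[key], [])]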
        
       | mathgladiator wrote:
       | Underneath this is the hard truth that streaming is hard and
        | costly. I've run into many people with optimistic dreams of
        | taking SQL or another query language and making it update based
        | on data changes rather than periodic polling.
       | 
        | These systems are always expensive and yield marginal
        | improvements compared to a dumb publish-subscribe system. I've
        | got notes on publish-subscribe systems, and I am publishing a
        | paper next month about the end-result system that I built over
        | five years. For a clue about the midpoint, my SRECON talk
       | addresses this:
       | https://www.usenix.org/conference/srecon17americas/program/p...
       | 
        | A key point is that publish/subscribe over-commits the problem
        | space (i.e. the problem space can potentially be quadratic
        | depending on use), but if you give up on durability then your
        | costs can go way, way down.
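        | 
        | As a toy illustration of the quadratic case: N clients each
        | publishing one message to a channel that all N of them
        | subscribe to costs N * N deliveries per round (the numbers
        | below are made up):
        | 
        |   def deliveries_per_round(n_clients):
        |       publishes = n_clients       # one message per client
        |       fanout = n_clients          # every client subscribes
        |       return publishes * fanout
        | 
        |   print(deliveries_per_round(10_000))  # 100,000,000 deliveries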
       | 
        | The lack of durability is a problem, BUT it can be leveraged to
        | discover non-marginal use-cases that make further investment
        | worthwhile. Unfortunately, once you invest in the imperfect
        | system to the point that it operates at five+ nines during
        | normal operation, very few query-based systems are competitive.
        
       | gopalv wrote:
       | > If we build plans from operators that work well across large
       | ranges of input distributions and that don't have catastrophic
       | failure modes, then we won't be as reliant on accurate
       | statistics.
       | 
       | (TL;DR - there's never enough statistics, except what you collect
       | when running that query)
       | 
        | The problem with a lot of these approaches is that they suck at
        | small scale - a sort-merge join is very safe across different
        | data sizes, but a hash join will go through the data so much
        | faster that it doesn't make sense to always opt for the safe
        | choice.
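        | 
        | A rough Python sketch of the trade-off (illustrative only): the
        | hash join is a single pass over each input but needs the build
        | side in memory, while the sort-merge join only needs sorted
        | inputs, so it never blows up - it just goes slower at small
        | scale:
        | 
        |   def hash_join(build, probe, key):
        |       table = {}
        |       for row in build:           # build side must fit in memory
        |           table.setdefault(row[key], []).append(row)
        |       return [(b, p) for p in probe
        |               for b in table.get(p[key], [])]
        | 
        |   def sort_merge_join(left, right, key):
        |       left = sorted(left, key=lambda r: r[key])    # external sort
        |       right = sorted(right, key=lambda r: r[key])  # in practice
        |       out, i, j = [], 0, 0
        |       while i < len(left) and j < len(right):
        |           lk, rk = left[i][key], right[j][key]
        |           if lk < rk:
        |               i += 1
        |           elif lk > rk:
        |               j += 1
        |           else:
        |               # join the run of equal keys on both sides
        |               i2, j2 = i, j
        |               while i2 < len(left) and left[i2][key] == lk:
        |                   i2 += 1
        |               while j2 < len(right) and right[j2][key] == rk:
        |                   j2 += 1
        |               out.extend((l, r) for l in left[i:i2]
        |                          for r in right[j:j2])
        |               i, j = i2, j2
        |       return out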
       | 
        | When I started working on Hive, it was entirely full of "query
        | won't fail, so what if it is slow" joins - or conditional
        | tasks, which encode "try to build a hash table; if that fails,
        | fail over to a sort-merge join" into the plan. That's great
        | when it rarely fails, and the config option still bears that
        | scar in hive.auto.convert.join.noconditionaltask=true. There
        | was another pass at this with the Grace hash join[1] (plus a
        | bloom filter while spilling) to make the hash join safe against
        | overfilling.
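        | 
        | A toy in-memory sketch of the Grace idea (the real thing spills
        | partitions to disk, and per the above can carry a bloom filter
        | while spilling; this only shows the partitioning step):
        | 
        |   def grace_hash_join(build, probe, key, n_partitions=16):
        |       b_parts = [[] for _ in range(n_partitions)]
        |       p_parts = [[] for _ in range(n_partitions)]
        |       for row in build:
        |           b_parts[hash(row[key]) % n_partitions].append(row)
        |       for row in probe:
        |           p_parts[hash(row[key]) % n_partitions].append(row)
        |       out = []
        |       for bp, pp in zip(b_parts, p_parts):
        |           table = {}
        |           for row in bp:          # only one partition in memory
        |               table.setdefault(row[key], []).append(row)
        |           out.extend((b, p) for p in pp
        |                      for b in table.get(p[key], []))
        |       return out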
       | 
        | The problem was that most situations demanded things the other
        | way around: the choice of tech depended on how fast a smaller
        | prototype ran, with the assumption that larger scales would
        | simply mean throwing more hardware at it. So people who tried
        | the product thought it was slow and expected latency to grow
        | linearly with scale (rather than staying roughly flat from 1 GB
        | all the way to 100), so it wouldn't "sell" in the market.
       | 
        | Eventually, we settled on the least optimal way of doing this:
        | record statistics on the query's operators through runtime
        | counters, and if the query fails, feed the row statistics
        | collected at runtime back into the plan, re-execute the query
        | with a plan that includes the updated statistics, and take
        | another go at it.
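        | 
        | In pseudo-Python, the retry loop looks roughly like this (the
        | planner/plan interfaces here are made up for illustration):
        | 
        |   class ExecutionFailure(Exception):
        |       pass
        | 
        |   def execute_with_feedback(query, planner, max_attempts=2):
        |       observed = {}              # operator -> row counts at runtime
        |       plan = planner.plan(query) # optimistic, table-level stats only
        |       for _ in range(max_attempts):
        |           try:
        |               return plan.execute(record_into=observed)
        |           except ExecutionFailure:
        |               # feed the row counts collected before the failure
        |               # back in and re-plan with updated statistics
        |               plan = planner.plan(query, runtime_stats=observed)
        |       raise ExecutionFailure("failed even with runtime statistics")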
       | 
        | This worked better for filter selectivity, particularly where
        | we got NOT IN ('FAILED', 'ABANDONED') sorts of estimates very
        | wrong (a histogram would help, but an HLL+ is useless).
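        | 
        | A worked toy example of why distinct-count-only stats get this
        | wrong (the frequencies below are made up):
        | 
        |   ndv = 10   # distinct status values, e.g. from an HLL sketch
        |   # hypothetical real value frequencies in the table:
        |   excluded = {'FAILED': 0.70, 'ABANDONED': 0.25}
        | 
        |   # NDV-only estimate: NOT IN (2 values) keeps 1 - 2/10 = 80%
        |   ndv_estimate = 1 - len(excluded) / ndv
        |   # with a histogram you'd know it actually keeps only ~5%
        |   true_selectivity = 1 - sum(excluded.values())
        | 
        |   print(ndv_estimate, true_selectivity)   # roughly 0.8 vs 0.05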
       | 
        | Even though this puts a huge penalty on a bad plan, bad plans
        | are not common enough to need the always-robust sort of
        | mechanism - so by sacrificing the P99 at scale, you get your
        | P95 latencies way down.
       | 
        | And overall, being optimistic has worked out better for
        | whole-cluster utilization, because there's only a small
        | fraction of giant queries & everyone else with Tableau is happy
        | that everything returns fast.
       | 
       | [1] - https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join
        
       ___________________________________________________________________
       (page generated 2021-08-15 23:00 UTC)