[HN Gopher] An opinionated map of incremental and streaming systems
       ___________________________________________________________________
        
       An opinionated map of incremental and streaming systems
        
       Author : mpweiher
       Score  : 150 points
       Date   : 2021-05-04 12:45 UTC (10 hours ago)
        
 (HTM) web link (scattered-thoughts.net)
 (TXT) w3m dump (scattered-thoughts.net)
        
       | munro wrote:
       | No mention of Apache Spark? Thinking about where Spark would fit
       | in here, because it has both structured and unstructured APIs--I
       | think it would fall under most of these categories depending how
       | you use, going back to the comment in the article:
       | 
       | > Most of these systems are equally expressive and can emulate
       | each other
       | 
       | But a quick stab at how I think it could fall under this
       | taxonomy:                   [structured]              If using
       | the DataFrame API              [unstructured]              If
       | using the DStream API              ---              [low temporal
       | locality]              If you don't watermark and let the state
       | grow--which I don't have much experience with,         but I
       | think could possibly run into scalability issues, you'd have to
       | write some         imperative logic to cull state--but it's very
       | necessary for low temporal locality.              [high temporal
       | locality]              If using watermarks, it will close the and
       | take the output for you              ---              [internally
       | consistent / internally inconsistent]              Not really
       | sure how Spark falls here, I think if you use the DataFrame API,
       | you can have         it be consistent, but a lot of operations
       | aren't supported, so you may end up having to         switch to
       | DStreams and write code imperatively.              And then
       | further, what if one of the streams is processed faster than each
       | the other in         an aggregation?  I'm not sure if there's a
       | way to specific business rules around making         things
       | atomic--but I'm sure you could hack it if you dropped low level
       | enough.              I think this is touched upon in the other
       | article:         > When combining multiple streams it's important
       | to synchronize them so that the outputs         > of each reflect
       | the same set of inputs.
       | 
       | If someone has more experience with Spark [Structured] Streaming,
       | I would love to hear your thoughts. I stick mostly with Spark's
       | batch jobs, playing around with datasets in a REPL/Notebook--
       | which it really shines for.
       | 
       | That said, I'm really excited by the future of Spark Structured
       | Streaming--I want to write declarative code and say how my data
       | gets from A to B--not what needs to happen to get from A to B.
        
       | PaulHoule wrote:
       | To me the unstructured area is the future of applications
       | development and the structured area is barren and overhyped.
       | 
       | (e.g. 'eventually consistent' analytics is so 2021, because it
       | looks like you are doing something quantitative,except it doesn't
       | matter if you get the right answer. It will get you claps from a
       | certain audience but most people lose patience pretty quick when
       | the 'numbers don't add up'.)
        
         | elric wrote:
         | IMO the key with eventual consistency is making it known which
         | bits are currently known to be consistent, and which are still
         | in flux. I realize that this is vague, but how that works
         | depends on what it is you're processing and how much time it
         | takes for consistency to emerge.
         | 
         | If you're doing any kind of Important Reporting on eventually
         | consistent data, you'd better make sure that you either know
         | you're only including finalised data, or that there's a big fat
         | warning with error bars.
        
           | staticassertion wrote:
           | Exactly. People act like EC is impossible to make viable, and
           | ignore the fact that transactional logic can impose 10-100x
           | performance loss. Putting an EC system in front of a
           | transactional system can be a massive performance win.
           | 
           | This isn't quite what the Helios paper talks about, but
           | there's lots of things like async indexing in there that are
           | kinda similar in nature
        
         | dan-robertson wrote:
         | I can't really relate your first paragraph to your second, at
         | least by the taxonomy in the article which has some consistent
         | structured systems as well as inconsistent systems.
         | 
         | I also disagree that unstructured is the future. I think most
         | computations actually _are_ structured. Otherwise sql queries
         | wouldn't be so useful. I think a lot of processes basically
         | start with a big bag of foos and end up with a big bag of
         | roughly corresponding bars, so if one foo changes then there is
         | only a localised change to the output. That can be encoded with
         | an unstructured system but I find they are better for cases
         | where outputs are more scalar and the processes between inputs
         | and outputs are myriad. While you can do structured-like things
         | in an unstructured system, you lose out on a lot of the
         | advantages of that structure.
        
           | PaulHoule wrote:
           | What I want is recognition of the structured and the
           | unstructured.
           | 
           | For instance a tool like this
           | 
           | https://www.drools.org/
           | 
           | maintains a database of timely information and can use rules
           | to match events and could be used to manage the numerous
           | problems of event-based architectures.
           | 
           | That particular tool mangles Java source code badly in the
           | process of compiling rules so it gives error messages that
           | make no sense at all, even if you are looking at the
           | compiler's source code in the other window and at the running
           | compiler in the debugger.
           | 
           | In the light of the interest in "low code" and the real
           | success of "business rules" for domains they apply in I am
           | amazed there has been less effort to apply production rules
           | to the "update the ui when the database changes".
           | 
           | Even though that can look unstructured, the implementation of
           | the rules can be done with the same relational operators that
           | all those other methods used. A few year ago you would have
           | had to specified the indexes by hands, but the self-
           | optimizing database patent from Salesforce is run out now and
           | there is no reason the system can't learn the frequent query
           | patterns itself.
        
       | babarganesh wrote:
       | i found this very interesting.
       | 
       | we have a homegrown push-based streaming library in .net based on
       | reactive (it supports indexed joins of "tables") and we're
       | looking for something more infrastructural as we move to the
       | cloud.
        
       | uyt wrote:
       | I feel like this map is missing something to tease apart systems
       | that are not interchangeable in practice (UI state management vs
       | database stuff).
       | 
       | Maybe persistent vs in-memory? Like whether the data is typically
       | completely blown away and recomputed on page reload (e.g., ui
       | state, dom trees, scene graphs).
        
         | warpech wrote:
         | There are systems which are simultaneously in-memory and
         | persistent, e.g. Tarantool, Starcounter
        
       | sam0x17 wrote:
       | Side note: one thing I've really noticed lacks formalism in a lot
       | of programming languages and environments is easily streaming
       | data and performing operations on a data "pipe". I normally
       | dislike javascript, but Node.js actually does a pretty cool job
       | of it with their streams API. Really wish every language had
       | something like that built in, with flow control, etc., so you
       | could e.g. pipe data through compression => encryption => measure
       | size => upload and those sorts of flows. Even more useful is when
       | all of these steps can run somewhat in parallel e.g. while datum
       | 1 is being encrypted, datum 2 is being compressed, etc.
        
         | superdimwit wrote:
         | Go Channels?
        
         | ramchip wrote:
         | Elixir has a few interesting abstractions for that: GenStage,
         | Flow, Broadway.
         | 
         | https://github.com/dashbitco/flow
        
       | jandrewrogers wrote:
       | Great taxonomy breakdown.
       | 
       | The low temporal locality part of the design space is
       | particularly interesting. It is extremely general but for many
       | data models "low temporal locality" can mean highly variable
       | windows on a _per entity_ basis and not on the entire collection,
       | which is a challenging criteria for the way most systems are
       | designed. Physical world sensor observation data models are a
       | good example of this. Existing solutions for low temporal
       | locality data models scale quite poorly in practice, even on
       | systems nominally designed for them.
       | 
       | At the limit, low temporal locality platforms converge on the
       | same unusual design requirements as symmetric mixed-workload
       | database kernels. The latter is something we really don't (know
       | how to) build, since it has some interesting computer science
       | constraints. It is a missing piece of technology to make this
       | stuff really work well at scale.
        
       | thom wrote:
       | Would love to hear more options in that bottom left box with
       | differential dataflow et al, because for me that's where all the
       | really interesting work is happening.
        
         | efferifick wrote:
         | Checkout differential datalog. Looks super interesting:
         | https://github.com/vmware/differential-datalog
         | 
         | But I've only played with souffle, which I am not sure where it
         | lands in this taxonomy. I think it lands on the high temporal
         | locality but compared to differential datalog, the biggest
         | difference in my opinion is the souffle solves problems as a
         | batch (i.e., all the inputs are known at command issue time)
         | while differential datalog may receive inputs at runtime.
         | 
         | There's also incA: which I think competes with differential
         | datalog. https://github.com/szabta89/IncA
         | 
         | So, how I see Datalog (which is a subset of prolog) falling
         | into this taxonomy:
         | 
         | Datalog (batch-processing) would fall under structured/high-
         | temporal locality/internally consistent.
         | 
         | Datalog (incremental) would fall on structured/low-temporal
         | locality/internally consistent.
         | 
         | Looking a little closer on what is defined as "consistent"
         | here, incremental Datalog might not be "consistent" because it
         | might return an "incorrect" or "unavailable" computation for a
         | past input if you "update" or modify the input. But if one
         | restricts the subset of Differential Datalog to follow
         | functional semantics, then that would be consistent. Not really
         | super knowledgable about this projects beyond playing with them
         | and reading some papers on Datalog/Souffle.
        
         | PaulHoule wrote:
         | I think differential dataflow might have some application to
         | the less structured domain and that makes it interesting.
        
       | losvedir wrote:
       | Somewhat off-topic, but I feel like the "patronage economy" is
       | really taking off, with Patreon and GitHub sponsors. Content like
       | this is really interesting, and I've appreciated this blog's
       | various write-ups in the past.
       | 
       | There's another great blog, with writings supported by patrons,
       | whose name escapes me (I think the domain ends in something about
       | "lime", with the '.me' ccTLD).
       | 
       | Whenever I see those domains on HN, I always click, since I have
       | a high expectation the content will be good, given how it was
       | created.
        
       | mathgladiator wrote:
       | I like the unstructured/structured boundary as it mirrors the "do
       | you use a library" (some Rx) or "do you use a client" (like
       | Kafka).
       | 
       | I'm writing a programming-language ( http://www.adama-lang.org/ )
       | which is reactive, and I'm currently looking at how do I optimize
       | (prematurely, of course). I believe a combination of static
       | analysis and a good runtime working together are the future (at
       | least, for me).
       | 
       | For instance, I currently statically optimize "iterate _table
       | where id == x" to "_table.get(x)", and this was a good
       | optimization. I use the same analysis to introduce some limited
       | indices.
       | 
       | The paradox for my use case (board games) is that going further
       | isn't really needed, and the overhead of optimization starts to
       | cost more than brute force table scans. However, I'm excited
       | about this approach beyond board games.
        
       | aserafini wrote:
       | I believe Redis belongs in this map.
        
         | hrdwdmrbl wrote:
         | AFAIK it does not perform computation or control the
         | computation of other systems.
        
       | anewhnaccount2 wrote:
       | "Last updated 2021-04-18" (so not 2018?)
        
         | mpweiher wrote:
         | Fixed, thanks!
        
       ___________________________________________________________________
       (page generated 2021-05-04 23:01 UTC)