[HN Gopher] An opinionated map of incremental and streaming systems
___________________________________________________________________
An opinionated map of incremental and streaming systems
Author : mpweiher
Score : 150 points
Date : 2021-05-04 12:45 UTC (10 hours ago)
(HTM) web link (scattered-thoughts.net)
(TXT) w3m dump (scattered-thoughts.net)
| munro wrote:
| No mention of Apache Spark? Thinking about where Spark would fit
| in here, because it has both structured and unstructured APIs--I
| think it would fall under most of these categories depending on
| how you use it. Going back to the comment in the article:
|
| > Most of these systems are equally expressive and can emulate
| each other
|
| But a quick stab at how I think it could fall under this
| taxonomy:
|
| [structured] If using the DataFrame API.
|
| [unstructured] If using the DStream API.
|
| ---
|
| [low temporal locality] If you don't watermark and let the state
| grow--which I don't have much experience with, but I think it
| could run into scalability issues; you'd have to write some
| imperative logic to cull state--but it's very necessary for low
| temporal locality.
|
| [high temporal locality] If using watermarks, it will close the
| window and take the output for you.
|
| ---
|
| [internally consistent / internally inconsistent] Not really
| sure how Spark falls here. I think if you use the DataFrame API,
| you can have it be consistent, but a lot of operations aren't
| supported, so you may end up having to switch to DStreams and
| write code imperatively. And then further, what if one of the
| streams is processed faster than the other in an aggregation?
| I'm not sure if there's a way to specify business rules around
| making things atomic--but I'm sure you could hack it if you
| dropped low level enough. I think this is touched upon in the
| other article:
|
| > When combining multiple streams it's important to synchronize
| them so that the outputs of each reflect the same set of inputs.
|
| If someone has more experience with Spark [Structured] Streaming,
| I would love to hear your thoughts. I stick mostly with Spark's
| batch jobs, playing around with datasets in a REPL/Notebook--
| which it really shines for.
|
| That said, I'm really excited by the future of Spark Structured
| Streaming--I want to write declarative code and say how my data
| gets from A to B--not what needs to happen to get from A to B.
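|
| A minimal sketch of the watermark behaviour described above, in
| plain Python rather than Spark. The class name, the tumbling
| windows, and the "max event time minus allowed lateness" rule
| are assumptions made for illustration; this is not Spark's API
| or its implementation.

```python
from collections import defaultdict


class WatermarkedCounter:
    """Counts events per tumbling window and closes windows behind a watermark.

    Hypothetical illustration only: names and rules are assumptions.
    Event times are assumed to be non-negative integers.
    """

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(int)  # window start -> event count
        self.max_event_time = 0

    def watermark(self):
        # Events older than this are assumed to have (almost) all arrived.
        return self.max_event_time - self.allowed_lateness

    def add(self, event_time):
        """Ingest one event; return the (window_start, count) pairs it closes."""
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark():
            return []  # too late: that window was already emitted and dropped
        self.open_windows[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        return self._close_ready()

    def _close_ready(self):
        wm = self.watermark()
        ready = sorted(s for s in self.open_windows if s + self.window_size <= wm)
        # Popping the window is what keeps state from growing without bound.
        return [(s, self.open_windows.pop(s)) for s in ready]
```

| Without the pop in _close_ready, every window would stay in
| memory forever, which is the "let the state grow" case. The
| price of the watermark is that sufficiently late events are
| silently dropped.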
| PaulHoule wrote:
| To me the unstructured area is the future of applications
| development and the structured area is barren and overhyped.
|
| (e.g. 'eventually consistent' analytics is so 2021, because it
| looks like you are doing something quantitative, except it doesn't
| matter if you get the right answer. It will get you claps from a
| certain audience but most people lose patience pretty quick when
| the 'numbers don't add up'.)
| elric wrote:
| IMO the key with eventual consistency is making it known which
| bits are currently known to be consistent, and which are still
| in flux. I realize that this is vague, but how that works
| depends on what it is you're processing and how much time it
| takes for consistency to emerge.
|
| If you're doing any kind of Important Reporting on eventually
| consistent data, you'd better make sure that you either know
| you're only including finalised data, or that there's a big fat
| warning with error bars.
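|
| One way to picture "finalised data plus error bars" is the
| sketch below. The Partition type, the finalised flag, and the
| max_pending bound are all hypothetical, and it assumes pending
| updates can only add to a total, never subtract.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    total: float
    finalised: bool            # True once no further updates are expected
    max_pending: float = 0.0   # assumed upper bound on what may still arrive


def report(partitions):
    """Return (finalised_total, low, high) so in-flux data stays visible."""
    settled = sum(p.total for p in partitions if p.finalised)
    pending = [p for p in partitions if not p.finalised]
    low = settled + sum(p.total for p in pending)      # what we have seen so far
    high = low + sum(p.max_pending for p in pending)   # worst case still to come
    return settled, low, high
```

| A report can then print the finalised total on its own, or the
| low/high range with a warning, rather than a single number that
| quietly mixes settled and unsettled data.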
| staticassertion wrote:
| Exactly. People act like EC is impossible to make viable, and
| ignore the fact that transactional logic can impose 10-100x
| performance loss. Putting an EC system in front of a
| transactional system can be a massive performance win.
|
| This isn't quite what the Helios paper talks about, but
| there's lots of things like async indexing in there that are
| kinda similar in nature
| dan-robertson wrote:
| I can't really relate your first paragraph to your second, at
| least by the taxonomy in the article which has some consistent
| structured systems as well as inconsistent systems.
|
| I also disagree that unstructured is the future. I think most
| computations actually _are_ structured. Otherwise sql queries
| wouldn't be so useful. I think a lot of processes basically
| start with a big bag of foos and end up with a big bag of
| roughly corresponding bars, so if one foo changes then there is
| only a localised change to the output. That can be encoded with
| an unstructured system but I find they are better for cases
| where outputs are more scalar and the processes between inputs
| and outputs are myriad. While you can do structured-like things
| in an unstructured system, you lose out on a lot of the
| advantages of that structure.
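|
| The "bag of foos to bag of bars" shape can be sketched as a
| keyed map where changing one input recomputes exactly one
| output. IncrementalMap and its recomputations counter are
| hypothetical names for illustration, not taken from any system
| in the article.

```python
class IncrementalMap:
    """Keeps bars[key] == f(foos[key]) and recomputes only keys that change."""

    def __init__(self, f):
        self.f = f
        self.foos = {}
        self.bars = {}
        self.recomputations = 0  # how many times f has been called

    def upsert(self, key, foo):
        if key in self.foos and self.foos[key] == foo:
            return  # nothing changed, so the output is untouched
        self.foos[key] = foo
        self.bars[key] = self.f(foo)  # localised change: one foo, one bar
        self.recomputations += 1

    def delete(self, key):
        self.foos.pop(key, None)
        self.bars.pop(key, None)
```

| The structure (one output per input key) is what makes the
| update cheap. An unstructured dependency graph could express
| the same thing, but it would have to discover that locality
| rather than being given it.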
| PaulHoule wrote:
| What I want is recognition of the structured and the
| unstructured.
|
| For instance a tool like this
|
| https://www.drools.org/
|
| maintains a database of timely information and can use rules
| to match events and could be used to manage the numerous
| problems of event-based architectures.
|
| That particular tool mangles Java source code badly in the
| process of compiling rules so it gives error messages that
| make no sense at all, even if you are looking at the
| compiler's source code in the other window and at the running
| compiler in the debugger.
|
| In light of the interest in "low code" and the real success of
| "business rules" in the domains they apply to, I am amazed there
| has been so little effort to apply production rules to the
| "update the ui when the database changes" problem.
|
| Even though that can look unstructured, the implementation of
| the rules can be done with the same relational operators that
| all those other methods use. A few years ago you would have
| had to specify the indexes by hand, but the self-optimizing
| database patent from Salesforce has run out now and there is
| no reason the system can't learn the frequent query patterns
| itself.
| babarganesh wrote:
| i found this very interesting.
|
| we have a homegrown push-based streaming library in .net based on
| reactive (it supports indexed joins of "tables") and we're
| looking for something more infrastructural as we move to the
| cloud.
| uyt wrote:
| I feel like this map is missing something to tease apart systems
| that are not interchangeable in practice (UI state management vs
| database stuff).
|
| Maybe persistent vs in-memory? Like whether the data is typically
| completely blown away and recomputed on page reload (e.g., ui
| state, dom trees, scene graphs).
| warpech wrote:
| There are systems which are simultaneously in-memory and
| persistent, e.g. Tarantool, Starcounter
| sam0x17 wrote:
| Side note: one thing I've really noticed lacks formalism in a lot
| of programming languages and environments is easily streaming
| data and performing operations on a data "pipe". I normally
| dislike javascript, but Node.js actually does a pretty cool job
| of it with their streams API. Really wish every language had
| something like that built in, with flow control, etc., so you
| could e.g. pipe data through compression => encryption => measure
| size => upload and those sorts of flows. Even more useful is when
| all of these steps can run somewhat in parallel e.g. while datum
| 1 is being encrypted, datum 2 is being compressed, etc.
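|
| A rough Python sketch of such a pipe, using generators. The
| stage names are made up for illustration, toy_encrypt is a
| placeholder XOR and not real encryption, and upload just
| appends to an in-memory buffer. Only zlib is a real library
| here.

```python
import zlib


def read_chunks(data, chunk_size):
    """Source stage: yield the input in fixed-size pieces."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]


def compress(chunks):
    """Streaming compression: emits output as soon as zlib produces any."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:
            yield out
    tail = compressor.flush()
    if tail:
        yield tail


def toy_encrypt(chunks, key):
    """Placeholder XOR 'cipher' for illustration only; NOT real encryption."""
    offset = 0
    for chunk in chunks:
        yield bytes(b ^ key[(offset + i) % len(key)] for i, b in enumerate(chunk))
        offset += len(chunk)


def measure(chunks, stats):
    """Pass chunks through unchanged while counting bytes."""
    for chunk in chunks:
        stats["bytes"] = stats.get("bytes", 0) + len(chunk)
        yield chunk


def upload(chunks, sink):
    """Stand-in for a network upload: appends to an in-memory bytearray."""
    for chunk in chunks:
        sink.extend(chunk)


def run_pipeline(data, key, chunk_size=4096):
    stats, sink = {}, bytearray()
    upload(measure(toy_encrypt(compress(read_chunks(data, chunk_size)), key), stats), sink)
    return stats, sink
```

| Because generators are pulled by the consumer, a slow upload
| naturally throttles the earlier stages, which gives a crude form
| of flow control. The stages interleave chunk by chunk but all
| run on one thread; running them truly in parallel would need
| threads or processes with bounded queues between stages.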
| superdimwit wrote:
| Go Channels?
| ramchip wrote:
| Elixir has a few interesting abstractions for that: GenStage,
| Flow, Broadway.
|
| https://github.com/dashbitco/flow
| jandrewrogers wrote:
| Great taxonomy breakdown.
|
| The low temporal locality part of the design space is
| particularly interesting. It is extremely general but for many
| data models "low temporal locality" can mean highly variable
| windows on a _per entity_ basis and not on the entire collection,
| which is a challenging criterion for the way most systems are
| designed. Physical world sensor observation data models are a
| good example of this. Existing solutions for low temporal
| locality data models scale quite poorly in practice, even on
| systems nominally designed for them.
|
| At the limit, low temporal locality platforms converge on the
| same unusual design requirements as symmetric mixed-workload
| database kernels. The latter is something we really don't (know
| how to) build, since it has some interesting computer science
| constraints. It is a missing piece of technology to make this
| stuff really work well at scale.
| thom wrote:
| Would love to hear more options in that bottom left box with
| differential dataflow et al, because for me that's where all the
| really interesting work is happening.
| efferifick wrote:
| Check out differential datalog. Looks super interesting:
| https://github.com/vmware/differential-datalog
|
| But I've only played with souffle, and I am not sure where it
| lands in this taxonomy. I think it lands on the high temporal
| locality side, but compared to differential datalog, the biggest
| difference in my opinion is that souffle solves problems as a
| batch (i.e., all the inputs are known at command issue time)
| while differential datalog may receive inputs at runtime.
|
| There's also IncA, which I think competes with differential
| datalog: https://github.com/szabta89/IncA
|
| So, how I see Datalog (which is a subset of prolog) falling
| into this taxonomy:
|
| Datalog (batch-processing) would fall under structured/high-
| temporal locality/internally consistent.
|
| Datalog (incremental) would fall on structured/low-temporal
| locality/internally consistent.
|
| Looking a little closer on what is defined as "consistent"
| here, incremental Datalog might not be "consistent" because it
| might return an "incorrect" or "unavailable" computation for a
| past input if you "update" or modify the input. But if one
| restricts the subset of Differential Datalog to follow
| functional semantics, then that would be consistent. I'm not
| really super knowledgeable about these projects beyond playing
| with them and reading some papers on Datalog/Souffle.
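|
| To make the batch versus incremental distinction concrete, here
| is a small Python sketch for the classic Datalog program
| path(x,y) :- edge(x,y). path(x,z) :- path(x,y), edge(y,z).
| The function names are hypothetical, and this is not how
| Souffle, differential datalog, or IncA are implemented. The
| incremental version handles insertions only.

```python
def batch_paths(edges):
    """Batch style: recompute path/2 from scratch as a fixpoint over all edges."""
    paths = set(edges)
    while True:
        derived = {(x, z) for (x, y) in paths for (y2, z) in edges if y == y2}
        if derived <= paths:
            return paths  # nothing new was derived, so this is the fixpoint
        paths |= derived


def insert_edge(edges, paths, a, b):
    """Incremental style: update edges and paths in place for one new edge.

    Returns only the newly derived paths. Every new path must go through
    the new edge, so it starts at `a` or something that reaches `a`, and
    ends at `b` or something reachable from `b`.
    """
    edges.add((a, b))
    sources = {a} | {x for (x, y) in paths if y == a}
    targets = {b} | {z for (y, z) in paths if y == b}
    new = {(x, z) for x in sources for z in targets} - paths
    paths |= new
    return new
```

| After any sequence of insertions, paths should equal
| batch_paths(edges), which is one reading of "consistent" here.
| Retracting an edge is harder, because a path may have other
| derivations that keep it alive, and this sketch does not
| attempt it.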
| PaulHoule wrote:
| I think differential dataflow might have some application to
| the less structured domain and that makes it interesting.
| losvedir wrote:
| Somewhat off-topic, but I feel like the "patronage economy" is
| really taking off, with Patreon and GitHub sponsors. Content like
| this is really interesting, and I've appreciated this blog's
| various write-ups in the past.
|
| There's another great blog, with writings supported by patrons,
| whose name escapes me (I think the domain ends in something about
| "lime", with the '.me' ccTLD).
|
| Whenever I see those domains on HN, I always click, since I have
| a high expectation the content will be good, given how it was
| created.
| mathgladiator wrote:
| I like the unstructured/structured boundary as it mirrors the "do
| you use a library" (some Rx) or "do you use a client" (like
| Kafka).
|
| I'm writing a programming language ( http://www.adama-lang.org/ )
| which is reactive, and I'm currently looking at how to optimize
| it (prematurely, of course). I believe a combination of static
| analysis and a good runtime working together is the future (at
| least, for me).
|
| For instance, I currently statically optimize "iterate _table
| where id == x" to "_table.get(x)", and this was a good
| optimization. I use the same analysis to introduce some limited
| indices.
|
| The paradox for my use case (board games) is that going further
| isn't really needed, and the overhead of optimization starts to
| cost more than brute force table scans. However, I'm excited
| about this approach beyond board games.
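|
| The scan-to-lookup rewrite mentioned above can be sketched in
| Python as follows. Table, plan, and run are hypothetical names
| and this is not Adama's implementation; it also assumes that id
| is unique per row.

```python
class Table:
    def __init__(self, rows):
        self.rows = list(rows)
        self.by_id = {row["id"]: row for row in self.rows}  # primary-key index
        self.rows_scanned = 0  # instrumentation to compare access paths

    def scan(self, field, value):
        """Brute force: test every row."""
        result = []
        for row in self.rows:
            self.rows_scanned += 1
            if row[field] == value:
                result.append(row)
        return result

    def get(self, key):
        """Index lookup: touches at most one row."""
        row = self.by_id.get(key)
        return [] if row is None else [row]


def plan(field):
    """'Static' step: choose the access path from the query shape alone."""
    return "get" if field == "id" else "scan"


def run(table, field, value):
    """Execute `iterate table where field == value` using the chosen plan."""
    if plan(field) == "get":
        return table.get(value)
    return table.scan(field, value)
```

| The decision in plan depends only on which field the query
| filters on, so it can be made before any data is seen. For
| small tables the index's build and maintenance cost may well
| exceed what the scan costs, which matches the point about
| brute force being good enough.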
| aserafini wrote:
| I believe Redis belongs in this map.
| hrdwdmrbl wrote:
| AFAIK it does not perform computation or control the
| computation of other systems.
| anewhnaccount2 wrote:
| "Last updated 2021-04-18" (so not 2018?)
| mpweiher wrote:
| Fixed, thanks!
___________________________________________________________________
(page generated 2021-05-04 23:01 UTC)