hngopher.com

       [HN Gopher] Show HN: Denormalized - Embeddable Stream Processing...
       ___________________________________________________________________
        
       Show HN: Denormalized - Embeddable Stream Processing in Rust and
       DataFusion
        
       tl;dr we built an embeddable stream processing engine in Rust using
       Apache DataFusion, check us out at
       https://github.com/probably-nothing-labs/denormalized
        
       Hey HN,
        
       We'd like to showcase a very early version of our embeddable stream
       processing engine called Denormalized. The rise of DuckDB has made
       it abundantly clear that even for many workloads of terabyte scale,
       a single-node system outshines the distributed query engines of the
       previous generation, such as Spark, Snowflake etc., in terms of both
       performance and cost.
        
       Many of the workloads DuckDB is now used for were considered "big
       data" in the previous generation, but no more. In the context of
       streaming especially, this problem is more acute. A streaming system
       is designed to incrementally process large amounts of data over a
       period of time. Even on the upper end of scale, production use cases
       of stream processing rarely perform compute on more than tens of
       gigabytes of data at a given time. Even so, the standard stream
       processing solutions such as Flink involve spinning up a distributed
       JVM cluster to compute against even the simplest of event streams.
        
       To that end, we're building Denormalized, designed to be embeddable
       in your applications and to scale up to hundreds of thousands of
       events per second with a Flink-like dataflow API. While we currently
       only support Rust, we have plans for Python and TypeScript bindings
       soon.
        
       We're built atop the DataFusion and Arrow ecosystems and currently
       support streaming joins as well as windowed aggregations on Kafka
       topics. Please check out our repo at:
       https://github.com/probably-nothing-labs/denormalized
        
       We'd love to hear your feedback.
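
For readers unfamiliar with the term, a windowed aggregation groups events into time buckets and keeps a running summary per bucket. The following is a minimal, std-only Rust sketch of a tumbling (fixed-size, non-overlapping) window. It is not Denormalized's API: `Event`, `WindowAgg`, and `tumbling_window_agg` are hypothetical names invented for this illustration, and the sketch ignores late or out-of-order events, watermarks, and Kafka entirely.

```rust
use std::collections::BTreeMap;

/// Hypothetical event: a timestamp in milliseconds and a numeric reading.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub ts_ms: u64,
    pub value: f64,
}

/// Hypothetical per-window aggregate: event count and sum of values.
#[derive(Debug, Clone, PartialEq)]
pub struct WindowAgg {
    pub count: u64,
    pub sum: f64,
}

impl WindowAgg {
    /// Mean of the values folded into this window so far.
    pub fn mean(&self) -> f64 {
        self.sum / self.count as f64
    }
}

/// Assign each event to a fixed-size (tumbling) window, keyed by the
/// window's start time, and fold it into that window's running aggregate.
pub fn tumbling_window_agg<I>(events: I, window_ms: u64) -> BTreeMap<u64, WindowAgg>
where
    I: IntoIterator<Item = Event>,
{
    assert!(window_ms > 0, "window size must be positive");
    let mut windows: BTreeMap<u64, WindowAgg> = BTreeMap::new();
    for ev in events {
        // Round the timestamp down to the start of its window.
        let start = ev.ts_ms - ev.ts_ms % window_ms;
        let agg = windows
            .entry(start)
            .or_insert(WindowAgg { count: 0, sum: 0.0 });
        agg.count += 1;
        agg.sum += ev.value;
    }
    windows
}
```

Calling `tumbling_window_agg(events, 1000)` returns one entry per one-second window that received at least one event. A real engine would also decide when a window is complete and can be emitted, which this sketch leaves out.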
        
       Author : ambrood
       Score  : 84 points
       Date   : 2024-08-15 17:16 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ethegwo wrote:
       | Neat, founder of https://tonbo.io/ here. I am excited to see
       | someone bring stream processing to DataFusion. We are working on
       | an Arrow-native embedded db and plan to support DataFusion in the
       | next release; we're interested in building the streaming feature
       | on Denormalized.
        
         | ambrood wrote:
         | thanks for the encouraging words @ethegwo. Tonbo looks very
         | cool and potentially something we could use for our state
         | backend (currently using RocksDB which we aren't that happy
         | about). Would love to chat about how we can work together. Feel
         | free to reach out to me - amey@denormalized.io
        
       | franciscojarceo wrote:
       | Can't wait for the Python SDK!
        
         | emgeee wrote:
         | it'll be coming soon!
        
       | dman wrote:
       | This looks super interesting. I built
       | https://github.com/finos/perspective in a past life but have been
       | out of the streaming analytics game for some time. Nice to see
       | single machine efficiency be a focus, will give this a try and
       | post feedback on github.
        
         | ambrood wrote:
         | this looks so clutch! curious if this was purpose built for the
         | finance industry?
        
           | dman wrote:
           | Yes it was. People wanted a realtime version of pandas for
           | hooking up their ticking charts and grids.
        
       | ztratar wrote:
       | Will be excited to see the typescript bindings once out. We may
       | be able to use this to handle some of our workloads at Embra.
       | 
       | Will reach out! Congrats on the ship.
        
         | ambrood wrote:
         | thanks @ztratar. would love to hear about your workloads at
         | Embra; that would be very helpful vis-a-vis the direction of our
         | TypeScript experience. feel free to drop us an email:
         | hello@denormalized.io
        
       | emgeee wrote:
       | Other founder here -- we've been working on this now for several
       | months and have had a lot of fun building on top of arrow and
       | datafusion
        
       | drawnwren wrote:
       | What differentiates you from e.g. Arroyo and Fluvio?
        
         | ambrood wrote:
         | while we haven't checked out Fluvio yet, we are fans of Arroyo.
         | regarding the latter, my understanding is that the team is going
         | for a SQL-first complete replacement for Flink. Denormalized is
         | meant to be an embeddable engine you can import within your
         | project. Our plan is to focus on the developer experience for
         | users building with Python and TypeScript in particular.
        
         | necubi wrote:
         | I'm the creator of Arroyo (and have talked a lot with the
         | Denormalized folks) so maybe I can answer from my perspective
         | (and Matt and Amey please correct me on any inaccuracies.)
         | 
         | First the similarities: both Arroyo and Denormalized use
         | DataFusion and Arrow and are focused on high-scale, low-latency
         | stateful stream processing.
         | 
         | Arroyo has been around a lot longer and is overall more mature.
         | It's distributed (I believe Denormalized at this point is a
         | single-node engine), supports consistent snapshotting of its
         | state, event time and watermarks, and has a wide range of
         | supported connectors (https://doc.arroyo.dev/connectors). It
         | ships with a control plane, distributed schedulers, and web ui.
         | 
         | But the use cases we're targeting are different. Arroyo is
         | programmed via SQL, and is used primarily for real-time data
         | pipelines; we aim to replace Flink SQL and kSQL.
         | 
         | Denormalized (as I understand it) is focused more on data
         | science use cases where it makes sense to have an embedded
         | engine, rather than a distributed one. It's programmed with a
         | Rust dataframe API (and soon Python).
        
       | ashekhawat wrote:
       | Nice! How feature complete is this with current industry
       | standards like Flink?
        
       | shrisukhani wrote:
       | Interesting. What use cases are you guys targeting with this?
        
       | lhnz wrote:
       | Do you have plans to make the data sources pluggable instead of
       | being Kafka specific?
        
         | ambrood wrote:
         | we absolutely do, the library itself is designed to be
         | extensible. we are currently working on adding webhooks as one
         | of our sources. are there any specific connectors/sources
         | you'd be interested in?
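
One common way to make sources pluggable in Rust is to have the pipeline depend on a trait rather than on a concrete connector. The sketch below illustrates that general pattern only; it is not Denormalized's actual design, and `Source`, `VecSource`, and `drain` are hypothetical names made up for this example. The in-memory source stands in for something like a Kafka topic or a webhook receiver.

```rust
/// Hypothetical trait: anything that can hand the pipeline its next record.
pub trait Source {
    /// Returns the next record, or `None` once the source is exhausted.
    fn next_record(&mut self) -> Option<String>;
}

/// In-memory source, standing in for a Kafka topic or a webhook receiver.
pub struct VecSource {
    records: std::vec::IntoIter<String>,
}

impl VecSource {
    pub fn new(records: Vec<String>) -> Self {
        Self {
            records: records.into_iter(),
        }
    }
}

impl Source for VecSource {
    fn next_record(&mut self) -> Option<String> {
        self.records.next()
    }
}

/// The pipeline depends only on the trait, so a new source type can be
/// added without changing this function. The "processing" step here is
/// just an uppercase transform, as a placeholder.
pub fn drain(source: &mut dyn Source) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(rec) = source.next_record() {
        out.push(rec.to_uppercase());
    }
    out
}
```

Adding a new connector under this pattern means writing one more `impl Source for ...` block; `drain` stays unchanged.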
        
       ___________________________________________________________________
       (page generated 2024-08-15 23:00 UTC)