[HN Gopher] ArkFlow: High-performance Rust stream processing engine
       ___________________________________________________________________
        
       ArkFlow: High-performance Rust stream processing engine
        
       Author : klaussilveira
       Score  : 125 points
       Date   : 2025-04-29 14:38 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | habobobo wrote:
        | Looks interesting. How does this compare to arroyo and
        | vector.dev?
        
         | tormeh wrote:
         | Also curious about any comparison to Fluvio.
        
         | necubi wrote:
         | (I'm the creator of Arroyo)
         | 
         | I haven't dug deep into this project, so take this with a grain
         | of salt.
         | 
         | ArkFlow is a "stateless" stream processor, like vector or
         | benthos (now Redpanda Connect). These are great for routing
         | data around your infrastructure while doing simple, stateless
         | transformations on them. They tend to be easy to run and scale,
         | and are programmed by manually constructing the graph of
         | operations.
         | 
          | Arroyo (like Flink or RisingWave) is a "stateful" stream
         | processor, which means it supports operations like windowed
         | aggregations, joins, and incremental SQL view maintenance.
         | Arroyo is programmed declaratively via SQL, which is
         | automatically planned into a dataflow (graph) representation.
         | The tradeoff is that state is hard to manage, and these systems
         | are much harder to operate and scale (although we've done a lot
         | of work with Arroyo to mitigate this!).
         | 
         | I wrote about the difference at length here:
         | https://www.arroyo.dev/blog/stateful-stream-processing
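          | 
          | To make the distinction concrete, here's a rough sketch in
          | plain Rust (not ArkFlow or Arroyo code; the event type and
          | functions are made up purely for illustration):
          | 
          |     use std::collections::HashMap;
          |     
          |     // A made-up event type, just for the example.
          |     struct Event {
          |         user_id: u64,
          |         bytes: u64,
          |         timestamp_secs: u64,
          |     }
          |     
          |     // Stateless: each event is handled on its own, with no
          |     // memory of earlier events, so instances can be added or
          |     // removed freely.
          |     fn stateless_transform(e: &Event) -> String {
          |         format!("user={} kb={}", e.user_id, e.bytes / 1024)
          |     }
          |     
          |     // Stateful: a windowed aggregation has to remember what
          |     // it has already seen. That per-window state is what
          |     // makes these systems harder to operate and scale.
          |     fn windowed_sum(
          |         events: &[Event],
          |         window_secs: u64,
          |     ) -> HashMap<(u64, u64), u64> {
          |         let mut totals = HashMap::new();
          |         for e in events {
          |             // Tumbling window id for this event's timestamp.
          |             let window = e.timestamp_secs / window_secs;
          |             *totals.entry((window, e.user_id)).or_insert(0) +=
          |                 e.bytes;
          |         }
          |         totals
          |     }
          |     
          |     fn main() {
          |         let events = vec![
          |             Event { user_id: 1, bytes: 4096, timestamp_secs: 10 },
          |             Event { user_id: 1, bytes: 2048, timestamp_secs: 70 },
          |         ];
          |         for e in &events {
          |             println!("{}", stateless_transform(e));
          |         }
          |         // With 60-second windows the two events fall into
          |         // different windows, so sums are kept per (window, user).
          |         println!("{:?}", windowed_sum(&events, 60));
          |     }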
        
       | fer wrote:
       | Previous discussion (46 days ago):
       | https://news.ycombinator.com/item?id=43358682
        
       | shawabawa3 wrote:
       | seems like a simplified equivalent of https://vector.dev/
       | 
        | a major difference seems to be converting things to Arrow and
        | using SQL instead of a DSL (VRL)
        
         | sofixa wrote:
         | > seems like a simplified equivalent of https://vector.dev/
         | 
         | No? Vector is for observability, to get your metrics/logs,
         | transform them if needed, and put them in the necessary
         | backends. Transformation is optional, and for cases like
         | downsampling or converting formats or adding metadata.
         | 
          | ArkFlow gets data from stuff like databases and message
          | queues/brokers, transforms it, and puts it back in databases
         | and message queues/brokers. Transformation looks like a pretty
         | central use case.
         | 
         | Very different scenarios. It's like saying that a Renault
         | Kangoo is a simplified equivalent of a BTR-80 because both have
          | wheels, an engine, and space for stuff.
        
           | rockwotj wrote:
            | It's a Rust port of Redpanda Connect (benthos), but with
            | fewer connectors
           | 
           | https://github.com/redpanda-data/connect
        
           | necubi wrote:
           | Vector is often used for observability data (in part because
           | it's now owned by Datadog) but it's not limited to that. It's
           | a general purpose stateless stream processing engine, and can
           | be used for any kind of events.
        
             | sofixa wrote:
             | Vector started for observability data only, and that's why
             | they got bought by Datadog.
        
         | hoherd wrote:
          | Incidentally, arkflow implements VRL:
          | https://github.com/arkflow-rs/arkflow/pull/273
        
       | muffa wrote:
       | Looks very similar to redpanda-connect/benthos
        
       | coreyoconnor wrote:
        | How do you educate people on stream processing? For pipeline-
        | like systems, stream processing is essential IMO: backpressure,
        | circuit breakers, etc. are critical for resilient systems. Yet I
        | have a hard time building an engineering team that can utilize
        | stream processing instead of just falling back on synchronous
        | procedures that are easier to understand (but nearly always
        | slower and more error prone).
        
         | serial_dev wrote:
          | It's important to consider whether it's even worth it.
         | 
         | I worked on stream processing, it was fun, but I also believe
         | it was over-engineered and brittle. The customers also didn't
         | want real-time data, they looked at the calculated values once
         | a week, then made decisions based on that.
         | 
          | Then I joined another company that somehow had money to pay
          | 50-100 people, and they were using CSV, sh scripts, batch
          | processing, and all that. It solved the clients' needs, and
          | they didn't need to maintain a complicated architecture or
          | code that would otherwise have been difficult to reason about.
         | 
          | The first company, the one with the stream processing, was
          | bought after I left by a competitor at a fire-sale price; some
          | of the tech was relevant to them, but the stream processing
          | stuff was immediately shut down. The acquiring company had
          | just simple batch processing, and they were printing money in
          | comparison.
         | 
         | If you think it's still worth going with stream processing,
         | give your reasoning to the team, and most reasonable developers
         | would learn it if they really believe it's a significantly
         | better solution for the given problem.
         | 
          | Not to over-simplify, but if you can't convince 5 out of 10
          | people to learn something that would make their job better,
          | either the people are not up to the task or you are wrong that
          | stream processing would make a difference.
        
           | senderista wrote:
           | Yeah that reminds me of a startup I worked at that did real-
           | time analytics for digital marketing campaigns. We went to
           | all kinds of trouble to update dashboards with 5-minute
           | latency, and real-time updates made for impressive sales
           | demos, but I don't think we had a single customer that
           | actually needed to make business decisions within 24 hours of
           | looking at the data.
        
             | serial_dev wrote:
             | We were doing TV ads analytics by detecting ads on TV
             | channels and checking web impact (among other things). The
             | only thing is, most of these ads are deals made weeks or
             | months in advance, so customers checked analytics about
             | once before a renewal... so not sure it needed to be near
             | real time...
        
             | wging wrote:
             | https://mcfunley.com/whom-the-gods-would-destroy-they-
             | first-...
        
           | nemothekid wrote:
            | I agree. Unless the downstream data is going to be used to
            | feed a system that makes automated decisions (e.g. HFT or ad
            | buying), having real-time analytics is almost never worth
            | the cost. It's almost always easier and more robust to
            | accept high tail latencies for humans to consume, and as
            | computers get faster and faster, that tail latency decreases.
           | 
            | Systems that needed complex streaming architectures in 2015
            | could probably be handled today with fast disks and a large
            | Postgres instance (or BigQuery).
        
             | porridgeraisin wrote:
              | Many successful ads feedback loops run at 15-minute
              | granularity as well!
        
           | wwarner wrote:
            | Personally I think streaming is quite a bit simpler. But as
            | you point out, no one cares!
        
         | timeinput wrote:
         | Fundamentally I think the question is what kind of streams are
         | you processing?
         | 
          | My concept of stream processing is trying to process gigabits
          | to gigabytes a second and turn it into something much, much
          | smaller so that it's manageable to store in a database and
          | analyze. To my mind, for 'stream processing', calling malloc
          | is sometimes too expensive, let alone using any of the
          | technologies called out in this tech stack.
         | 
          | I understand back pressure and circuit breakers, but they
          | have to happen at the OS / process level (for my general
          | work) -- a metric that auto-scales a microservice worker
          | after going through Prometheus + an HPA or something like
          | that ends up with too many inefficiencies to make things
          | practical. A few threads on a single machine just work,
          | whereas a 'cloud native' solution takes ages to engineer.
         | 
          | Once I'm down to a job a second or less (and that job takes
          | more than a few seconds to run, to hide the framework's
          | overhead), things like Airflow start to work and not just
          | fall flat, but at that point are these expensive frameworks
          | worth it? I'm only producing 1-1000 jobs a second.
         | 
          | Stream processing with frameworks like Faust, Airflow, Kafka
          | Streams, etc. all just seems like brittle overkill once you
          | start trying to actually deploy and use them. How do I tune
          | the PostgreSQL database for Airflow? How do I manage my S3
          | lifecycles to minimize cost?
         | 
          | A task queue + an HPA really feels more like the right kind
          | of thing to me at that scale, versus really caring too much
          | about back pressure, etc. when the data rate is 'low', but
          | I've generally been told by colleagues to reach for more
          | complicated stream processors that perform worse, are (IMO)
          | harder to orchestrate, and (IMO) harder to manage and deploy.
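          | 
          | For what it's worth, the 'few threads on a single machine'
          | version of back pressure can be as small as a bounded
          | channel. A rough sketch in plain std Rust (the record type,
          | buffer size, and counts are made up for illustration):
          | 
          |     use std::sync::mpsc::sync_channel;
          |     use std::thread;
          |     
          |     fn main() {
          |         // Bounded channel: once 1024 records are buffered,
          |         // send() blocks, which pushes back on the producer.
          |         // That is the whole back-pressure mechanism at the
          |         // process level.
          |         let (tx, rx) = sync_channel::<Vec<u8>>(1024);
          |     
          |         let producer = thread::spawn(move || {
          |             for i in 0..10_000u32 {
          |                 // Stand-in for pulling a record off the wire.
          |                 let record = i.to_le_bytes().to_vec();
          |                 tx.send(record).expect("consumer hung up");
          |             }
          |             // tx is dropped here, ending the consumer's loop.
          |         });
          |     
          |         let consumer = thread::spawn(move || {
          |             let mut total = 0usize;
          |             for record in rx {
          |                 // Stand-in for the expensive reduction step.
          |                 total += record.len();
          |             }
          |             println!("processed {} bytes", total);
          |         });
          |     
          |         producer.join().unwrap();
          |         consumer.join().unwrap();
          |     }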
        
         | jandrewrogers wrote:
         | There are both technical and organizational challenges created
         | by stream processing. I like stream processing and have done a
         | lot of work on high-performance stream engines but I am not
         | blind to the practical issues.
         | 
         | Companies are organized around an operational tempo that
         | reflects what their systems are capable of. Even if you replace
         | one of their systems with a real-time or quasi-real-time stream
         | processing architecture, nothing else in the organization
         | operates with that low of a latency, including the people. It
         | is a very heavy lift to even ask them to reorganize the way
         | they do things.
         | 
         | A related issue is that stream processing systems still work
         | poorly for some data models and often don't scale well. Most
         | implementations place narrow constraints on the properties of
         | the data models and their statefulness. If you have a system
         | sitting in the middle of your operational data model that
         | requires logic which does not fit within those limitations then
         | the whole exercise starts to break down. Despite its many
         | downsides, batching generalizes much better and more easily
         | than stream processing. This could be ameliorated with better
         | stream processing tech (as in, core data structures,
         | algorithms, and architecture) but there hasn't been much
         | progress on that front.
        
       | jll29 wrote:
       | Very interesting - is WARC support on the roadmap?
        
         | dayjah wrote:
         | Do you mean this:
         | https://en.m.wikipedia.org/wiki/WARC_(file_format) ?
         | 
         | Can you help me understand how this would plug into stream
         | processing? My immediate thought is for web page interaction
          | replays -- but that seems like sort of an exotic use case?
        
       | gotoeleven wrote:
       | How do the creators of this plan to make money?
        
         | beanjuiceII wrote:
          | Get people onboard as open source... then flip to some other
          | license, add some pricing tiers, and now those users become
          | customers even if they don't like it. Tried and true
          | methodology.
        
       | insane_dreamer wrote:
        | Does this include broker capabilities? If not, what's a
        | recommended broker these days for hosting in the cloud (i.e., on
        | an EC2 instance)? I know AWS has its own MQTT broker but it's
        | quite pricey for high volumes.
        
       | xyst wrote:
        | So Kafka Connect and Kafka Streams, but in Rust?
        
       ___________________________________________________________________
       (page generated 2025-04-29 23:00 UTC)