[HN Gopher] ArkFlow: High-performance Rust stream processing engine
___________________________________________________________________
ArkFlow: High-performance Rust stream processing engine
Author : klaussilveira
Score : 125 points
Date : 2025-04-29 14:38 UTC (8 hours ago)
(HTM) web link (github.com)
| habobobo wrote:
| Looks interesting, how does this compare to arroyo and
| vector.dev?
| tormeh wrote:
| Also curious about any comparison to Fluvio.
| necubi wrote:
| (I'm the creator of Arroyo)
|
| I haven't dug deep into this project, so take this with a grain
| of salt.
|
| ArkFlow is a "stateless" stream processor, like vector or
| benthos (now Redpanda Connect). These are great for routing
| data around your infrastructure while doing simple, stateless
| transformations on them. They tend to be easy to run and scale,
| and are programmed by manually constructing the graph of
| operations.
|
| Arroyo (like Flink or RisingWave) is a "stateful" stream
| processor, which means it supports operations like windowed
| aggregations, joins, and incremental SQL view maintenance.
| Arroyo is programmed declaratively via SQL, which is
| automatically planned into a dataflow (graph) representation.
| The tradeoff is that state is hard to manage, and these systems
| are much harder to operate and scale (although we've done a lot
| of work with Arroyo to mitigate this!).
|
| I wrote about the difference at length here:
| https://www.arroyo.dev/blog/stateful-stream-processing
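A minimal sketch of the stateless/stateful distinction described above, in plain Python rather than in either engine. The event tuples, `stateless_transform`, and `tumbling_window_sum` are hypothetical names made up for illustration; neither ArkFlow nor Arroyo exposes this API. The stateless function needs only the event in front of it, while the windowed sum has to carry a table of running totals from one event to the next, which is the state that makes such systems harder to operate.

```python
from collections import defaultdict

# Hypothetical events: (timestamp_seconds, user, bytes_sent)
EVENTS = [
    (0, "alice", 100),
    (3, "bob", 50),
    (7, "alice", 25),
    (12, "alice", 10),
    (14, "bob", 5),
]


def stateless_transform(event):
    """Stateless: the output depends only on the single event passed in."""
    ts, user, size = event
    return {"ts": ts, "user": user.upper(), "bytes": size}


def tumbling_window_sum(events, window_seconds):
    """Stateful: keeps a running total per (window, user) across events."""
    state = defaultdict(int)  # this table must survive between events
    for ts, user, size in events:
        window_start = (ts // window_seconds) * window_seconds
        state[(window_start, user)] += size
    return dict(state)


mapped = [stateless_transform(e) for e in EVENTS]
windows = tumbling_window_sum(EVENTS, window_seconds=10)
```

Any worker can run the stateless step on any event, so it scales by adding workers. The windowed sum only gives the right answer if every event for a given window and user reaches the same `state` table, and if that table is not lost on a restart.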
| fer wrote:
| Previous discussion (46 days ago):
| https://news.ycombinator.com/item?id=43358682
| shawabawa3 wrote:
| seems like a simplified equivalent of https://vector.dev/
|
| a major difference seems to be converting things to arrow and
| using SQL instead of using a DSL (vrl)
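A rough sketch of what "SQL instead of a DSL" can look like for a per-batch transformation. It uses Python's built-in `sqlite3` purely as a stand-in; this is an assumption for illustration and not how ArkFlow is implemented (the comment above says it converts data to Arrow). The table name `batch`, the columns, and `transform_batch` are all hypothetical.

```python
import sqlite3


def transform_batch(rows):
    """Load one batch of (sensor, celsius) rows and transform it with SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory table per batch
    conn.execute("CREATE TABLE batch (sensor TEXT, celsius REAL)")
    conn.executemany("INSERT INTO batch VALUES (?, ?)", rows)
    # The "pipeline step" is an ordinary query: filter, then derive a column.
    out = conn.execute(
        "SELECT sensor, celsius * 9.0 / 5.0 + 32.0 AS fahrenheit "
        "FROM batch WHERE celsius > 0 ORDER BY sensor"
    ).fetchall()
    conn.close()
    return out


result = transform_batch([("a", 100.0), ("b", -5.0), ("c", 20.0)])
```

The appeal of this style is that the filter and the derived column are expressed in a language most engineers already know, where a DSL such as VRL has to be learned separately.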
| sofixa wrote:
| > seems like a simplified equivalent of https://vector.dev/
|
| No? Vector is for observability, to get your metrics/logs,
| transform them if needed, and put them in the necessary
| backends. Transformation is optional, and for cases like
| downsampling or converting formats or adding metadata.
|
| ArkFlow gets data from stuff like databases and message
| queues/brokers, transforms it, and puts it back in databases
| and message queues/brokers. Transformation looks like a pretty
| central use case.
|
| Very different scenarios. It's like saying that a Renault
| Kangoo is a simplified equivalent of a BTR-80 because both have
| wheels, engine and space for stuff.
| rockwotj wrote:
| It's a Rust port of Redpanda Connect (benthos), but with fewer
| connectors
|
| https://github.com/redpanda-data/connect
| necubi wrote:
| Vector is often used for observability data (in part because
| it's now owned by Datadog) but it's not limited to that. It's
| a general purpose stateless stream processing engine, and can
| be used for any kind of events.
| sofixa wrote:
| Vector started for observability data only, and that's why
| they got bought by Datadog.
| hoherd wrote:
| Incidentally arkflow implements VRL:
| https://github.com/arkflow-rs/arkflow/pull/273
| muffa wrote:
| Looks very similar to redpanda-connect/benthos
| coreyoconnor wrote:
| How do you educate people on stream processing? For pipeline-like
| systems stream processing is essential IMO - backpressure/circuit
| breakers/etc are critical for resilient systems. Yet I have a
| hard time building an engineering team that can utilize stream
| processing instead of just falling back on synchronous
| procedures that are easier to understand (but nearly always
| slower and more error prone).
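For readers new to the term, backpressure can be sketched with nothing more than a bounded queue between a fast producer and a slow consumer. This is a toy model using Python's standard `queue` and `threading` modules, not code from ArkFlow or any other engine; `run_pipeline` and its parameters are hypothetical names. When the buffer is full, `put` blocks, so the producer is slowed to the consumer's pace instead of piling up unbounded work in memory.

```python
import queue
import threading
import time


def run_pipeline(n_items, buffer_size):
    """Run a fast producer against a slow consumer through a bounded buffer."""
    buf = queue.Queue(maxsize=buffer_size)  # the bound is what creates backpressure
    results = []
    max_depth = [0]  # largest queue depth the producer observed

    def producer():
        for i in range(n_items):
            buf.put(i)  # blocks while the buffer is full
            max_depth[0] = max(max_depth[0], buf.qsize())
        buf.put(None)  # sentinel: no more items

    def consumer():
        while True:
            item = buf.get()
            if item is None:
                break
            time.sleep(0.001)  # simulate slow downstream work
            results.append(item * 2)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, max_depth[0]


results, max_depth = run_pipeline(n_items=20, buffer_size=4)
```

With an unbounded queue the producer would finish almost instantly and all 20 items would sit in memory; with the bound, the observed depth never exceeds `buffer_size`. The synchronous alternative (call the slow step inline for each item) gives the same ordering but cannot overlap production with consumption.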
| serial_dev wrote:
| It's important to consider whether it's worth it, even?
|
| I worked on stream processing, it was fun, but I also believe
| it was over-engineered and brittle. The customers also didn't
| want real-time data, they looked at the calculated values once
| a week, then made decisions based on that.
|
| Then, I joined another company that somehow had money to pay
| 50-100 people, and they were using CSV, sh scripts, batch
| processing, and all that. It solved the clients' needs, and
| they didn't need to maintain a complicated architecture and the
| code that could have been difficult to reason about otherwise.
|
| After I left, the first company (the one with the stream
| processing) was bought by a competitor at a fire-sale price. Some
| of the tech was relevant for them, but the stream processing
| stuff was immediately shut down. The acquiring company had just
| simple batch processing and they were printing money in comparison.
|
| If you think it's still worth going with stream processing,
| give your reasoning to the team, and most reasonable developers
| would learn it if they really believe it's a significantly
| better solution for the given problem.
|
| Not to over-simplify, but if you can't convince 5 out of 10
| people to learn to make their job better, it's either that the
| people are not up to the task, or you are wrong that stream
| processing would make a difference.
| senderista wrote:
| Yeah that reminds me of a startup I worked at that did
| real-time analytics for digital marketing campaigns. We went to
| all kinds of trouble to update dashboards with 5-minute
| latency, and real-time updates made for impressive sales
| demos, but I don't think we had a single customer that
| actually needed to make business decisions within 24 hours of
| looking at the data.
| serial_dev wrote:
| We were doing TV ads analytics by detecting ads on TV
| channels and checking web impact (among other things). The
| only thing is, most of these ads are deals made weeks or
| months in advance, so customers checked analytics about
| once before a renewal... so not sure it needed to be near
| real time...
| wging wrote:
| https://mcfunley.com/whom-the-gods-would-destroy-they-
| first-...
| nemothekid wrote:
| I agree. Unless the downstream data is going to be used to
| feed a system to make automated decisions (ex. HFT or Ad
| buying), having real time analytics is rarely worth
| the cost. It's almost always easier and more robust to have
| high tail latencies for humans to consume and as computers
| get faster and faster that tail latency decreases.
|
| Systems that needed complex streaming architectures in 2015
| could probably be handled today with a fast disk and a large
| Postgres instance (or BigQuery).
| porridgeraisin wrote:
| Many successful ads feedback loops run at 15 minute
| granularities as well!
| wwarner wrote:
| personally i think streaming is quite a bit simpler. but as
| you point out, no one cares!
| timeinput wrote:
| Fundamentally I think the question is what kind of streams are
| you processing?
|
| My concept of stream processing is trying to process gigabits
| to gigabytes a second, and turn it into something much much
| smaller so that it's manageable to database and analyze. To my
| mind for 'stream processing' calling malloc is sometimes too
| expensive let alone using any of the technologies called out in
| this tech stack.
|
| I understand back pressure, and circuit breakers, but they have
| to happen at the OS / process level (for my general work) -- a
| metric that auto scales a microservice worker after going
| through prometheus + an HPA or something like that ends up with
| too many inefficiencies to make things practical. A few threads
| on a single machine just work, but end up taking ages to
| engineer a 'cloud native' solution.
|
| Once I'm down to a job a second (and that job takes more than a
| few seconds to run to hide the framework's overhead) or less
| things like Airflow start to work, and not just fall flat, but
| at that point are these expensive frameworks worth it? I'm
| only producing 1-1000 jobs a second.
|
| Stream processing with these frameworks like Faust, Airflow,
| Kafka Streams etc, all just seem like brittle overkill once you
| start trying to actually deploy and use them. How do I tune the
| PostgreSQL database for Airflow? How do I manage my S3 life
| cycles to minimize cost?
|
| A task queue + an HPA really feels more like the right kind of
| thing to me at that scale vs really caring too much about back
| pressure, etc when the data rate is 'low', but I've generally
| been told by colleagues to reach for more complicated stream
| processors that perform worse, are (IMO) harder to orchestrate,
| and (IMO) harder to manage and deploy.
| jandrewrogers wrote:
| There are both technical and organizational challenges created
| by stream processing. I like stream processing and have done a
| lot of work on high-performance stream engines but I am not
| blind to the practical issues.
|
| Companies are organized around an operational tempo that
| reflects what their systems are capable of. Even if you replace
| one of their systems with a real-time or quasi-real-time stream
| processing architecture, nothing else in the organization
| operates with that low of a latency, including the people. It
| is a very heavy lift to even ask them to reorganize the way
| they do things.
|
| A related issue is that stream processing systems still work
| poorly for some data models and often don't scale well. Most
| implementations place narrow constraints on the properties of
| the data models and their statefulness. If you have a system
| sitting in the middle of your operational data model that
| requires logic which does not fit within those limitations then
| the whole exercise starts to break down. Despite its many
| downsides, batching generalizes much better and more easily
| than stream processing. This could be ameliorated with better
| stream processing tech (as in, core data structures,
| algorithms, and architecture) but there hasn't been much
| progress on that front.
| jll29 wrote:
| Very interesting - is WARC support on the roadmap?
| dayjah wrote:
| Do you mean this:
| https://en.m.wikipedia.org/wiki/WARC_(file_format) ?
|
| Can you help me understand how this would plug into stream
| processing? My immediate thought is for web page interaction
| replays -- but that seems sort of exotic a use case?
| gotoeleven wrote:
| How do the creators of this plan to make money?
| beanjuiceII wrote:
| get people onboard as open source... then flip to some other
| license, add some pricing tiers, and now those users become
| customers even if they don't like it. tried and true
| methodology
| insane_dreamer wrote:
| Does this include broker capabilities? If not, what's a
| recommended broker these days (for hosting in the cloud, i.e., an
| EC2 instance; I know AWS has its own MQTT broker but it's quite
| pricy for high volumes).
| xyst wrote:
| So Kafka Connect and Kafka Streams but with Rust?
___________________________________________________________________
(page generated 2025-04-29 23:00 UTC)