[HN Gopher] Building a Durable Execution Engine with SQLite
___________________________________________________________________
Building a Durable Execution Engine with SQLite
Author : ingve
Score : 88 points
Date : 2025-11-20 13:26 UTC (1 days ago)
(HTM) web link (www.morling.dev)
(TXT) w3m dump (www.morling.dev)
| fiddlerwoaroof wrote:
| Every several years people reinvent serializable continuations
| andersmurphy wrote:
| Haha so true. Shame image based programming never really caught
| on.
|
| Janet lang lets you serialize coroutines which is fun. Makes
| this sort of stuff trivial.
| gunnarmorling wrote:
| Yupp, making that same point in the post :)
|
| > You could think of [Durable Execution] as a persistent
| implementation of the memoization pattern, or a persistent form
| of continuations.
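|
| A minimal sketch of the "persistent memoization" reading, assuming
| a made-up step_log table and a hypothetical Flow class (neither is
| taken from the post): each step's result is stored in SQLite, so a
| re-run replays finished steps from the log and only executes the
| ones that never completed.

```python
import json
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the step log. Table and column names are made up."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS step_log ("
        " flow_id TEXT, seq INTEGER, result TEXT,"
        " PRIMARY KEY (flow_id, seq))"
    )
    return db


class Flow:
    """Hypothetical helper: replays logged results, runs only unlogged steps."""

    def __init__(self, db, flow_id):
        self.db = db
        self.flow_id = flow_id
        self.seq = 0  # position of the next step within this flow

    def step(self, fn, *args):
        row = self.db.execute(
            "SELECT result FROM step_log WHERE flow_id = ? AND seq = ?",
            (self.flow_id, self.seq),
        ).fetchone()
        if row is not None:
            result = json.loads(row[0])  # already ran: reuse the stored result
        else:
            result = fn(*args)  # first execution: run, then persist
            with self.db:  # commits the insert on success
                self.db.execute(
                    "INSERT INTO step_log VALUES (?, ?, ?)",
                    (self.flow_id, self.seq, json.dumps(result)),
                )
        self.seq += 1
        return result


calls = []  # records which steps really executed


def add(a, b):
    calls.append((a, b))
    return a + b


def run(db, fail_after_first=False):
    flow = Flow(db, "flow-1")
    x = flow.step(add, 1, 2)
    if fail_after_first:
        raise RuntimeError("simulated crash")
    return flow.step(add, x, 10)


db = open_log()
try:
    run(db, fail_after_first=True)  # step 0 is logged, then the flow "crashes"
except RuntimeError:
    pass
total = run(db)  # step 0 is replayed from the log, step 1 runs for real
```

| Note that a crash between running a step and logging its result
| would make the step run again, which is why the idempotency
| discussion further down the thread matters.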
| smitty1e wrote:
| Is this reinvention somehow "transactional" in nature?
| rileymichael wrote:
| unfortunately they've never really taken off so folks reach for
| explicit state machines instead. there have been a handful of
| options on the jvm over the years (e.g. quasar, kilim) but
| they're all abandoned now, the loom continuation API is
| internal with no hint of it becoming public, kotlin's aren't
| serializable and the issue is inactive
| (https://github.com/Kotlin/kotlinx.coroutines/issues/76), etc..
| such a shame
| websiteapi wrote:
| there's a lot of hype around durable execution these days. why do
| that instead of regular use of queues? is it the dev ergonomics
| that's cool here?
|
| you can (and people already do) model steps in any arbitrarily large
| workflow and have those results be processed in a modular fashion
| and have whatever process that begins this workflow check the
| state of the necessary preconditions prior to taking any action
| and thus go to the currently needed step, or retry ones that
| failed, and so forth.
| tptacek wrote:
| We build what is effectively a durable execution "engine" for
| our orchestrator (ours is backed by boltdb and not SQLite,
| which I objected to, correctly). The steps in our workflows
| build running virtual machines and include things like
| allocating addresses, loading BPF programs, preparing root
| filesystems, and registering services.
|
| Short answer: we need to be able to redeploy and bounce the
| orchestrator without worrying about what stage each running VM
| on our platform is in.
|
| JP, the dev that built this out for us, talks a bit about the
| design rationale (search for "Cadence") here:
|
| https://fly.io/blog/the-exit-interview-jp/
|
| The library itself is open:
|
| https://github.com/superfly/fsm
| ryeats wrote:
| As you say it can be done but it's an anti-pattern to use a
| message queue as a database, which is essentially what you are
| doing for these kinds of long running tasks. The reason is that
| there is a lot of state you're likely going to want to track,
| persist, and checkpoint as a task runs. Yes, you can carefully
| string together a series of database calls chained with message
| transactions so you don't lose something when an issue happens,
| but then you also need bespoke logic to restart or retry each
| step and it can turn into a bit of a mess.
| snicker7 wrote:
| Message queues (e.g. SQS) are inappropriate for tracking long-
| running tasks/workflows. This is due to the operational
| requirements such as:
|
| - Checking the status of a task (queued, pending, failed,
|   cancelled, completed)
| - Cancelling a queued task (or pending task if the execution
|   environment supports it)
| - Re-prioritizing queued tasks
| - Searching for tasks based off an attribute (e.g. tag)
|
| You really do need a database for this.
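|
| The four operations listed above map fairly directly onto a
| single table. A rough single-process sketch with SQLite, assuming
| a hypothetical tasks schema and helper names; multiple concurrent
| workers would need stricter locking than shown here:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE tasks ("
    " id INTEGER PRIMARY KEY,"
    " status TEXT NOT NULL DEFAULT 'queued',"
    " priority INTEGER NOT NULL DEFAULT 0,"
    " tag TEXT)"
)


def enqueue(tag, priority=0):
    cur = db.execute(
        "INSERT INTO tasks (tag, priority) VALUES (?, ?)", (tag, priority)
    )
    db.commit()
    return cur.lastrowid


def status(task_id):
    row = db.execute(
        "SELECT status FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else None


def cancel(task_id):
    """Cancel a task that has not started yet; returns True if it was cancelled."""
    cur = db.execute(
        "UPDATE tasks SET status = 'cancelled' WHERE id = ? AND status = 'queued'",
        (task_id,),
    )
    db.commit()
    return cur.rowcount == 1


def reprioritize(task_id, priority):
    db.execute(
        "UPDATE tasks SET priority = ? WHERE id = ? AND status = 'queued'",
        (priority, task_id),
    )
    db.commit()


def find_by_tag(tag):
    rows = db.execute("SELECT id FROM tasks WHERE tag = ? ORDER BY id", (tag,))
    return [r[0] for r in rows]


def claim_next():
    """Mark the highest-priority queued task as pending and return its id."""
    with db:  # commits the status change on exit
        row = db.execute(
            "SELECT id FROM tasks WHERE status = 'queued'"
            " ORDER BY priority DESC, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute("UPDATE tasks SET status = 'pending' WHERE id = ?", (row[0],))
        return row[0]


a = enqueue("report")
b = enqueue("email")
c = enqueue("report")
reprioritize(c, 10)       # c jumps ahead of a and b
cancelled_b = cancel(b)   # b is still queued, so this succeeds
first = claim_next()      # picks c because of its higher priority
```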
| yyx wrote:
| Sounds like a Celery with SQLAlchemy backend.
| DenisM wrote:
| I'm reminded of classical LRU cache implementation - double
| linked list and a hash map that points to the list elements.
|
| It is a queue if we squint really hard, but it allows random
| access and reordering. Do we have durable structures of this
| kind?
|
| I can't imagine how to shoehorn this into Kafka or SQS.
| kodablah wrote:
| > is it the dev ergonomics that's cool here?
|
| Yup. Being able to write imperative code that automatically
| resumes where it left off is very valuable. It's best to
| represent durable turing completeness using modern approaches
| of authoring such logic - programming languages. Being able to
| loop, try/catch, apply advanced conditional logic, etc in a
| crash-proof algorithm that can run for weeks/months/years and
| is introspectable has a lot of value over just using queues.
|
| Durable execution is all just queues and task processing and
| event sourcing under the hood though.
| hmaxdml wrote:
| The hype is because DE is such a dev exp improvement over
| building your own queue. Good DE frameworks come with
| workflows, pub/sub, notifications, distributed queues with tons
| of flow control options, etc.
| the_mitsuhiko wrote:
| I think this is great. We should see more simple solutions to this
| problem.
|
| I recently started doing something very similar on Postgres [1]
| and I'm greatly enjoying using it. I think the total solution I
| ended up with is under 3000 lines of code for both the SQL and
| the TypeScript SDK combined, and it's much easier to use and to
| operate than many of the solutions on the market today.
|
| [1]: https://github.com/earendil-works/absurd
| gunnarmorling wrote:
| Ah, that's awesome. Definitely need to take a closer look at
| your implementation.
| nileshtrivedi wrote:
| This looks useful.
|
| How would you say it compares with pgqueuer?
| the_mitsuhiko wrote:
| I _think_ pgqueuer is like pgmq which I used a lot. pgmq is
| just the queue part, absurd does the state storage. I wrote
| some more about why it exists here:
| https://lucumr.pocoo.org/2025/11/3/absurd-workflows/
| qianli_cs wrote:
| I really enjoyed this post and love seeing more lightweight
| approaches! The deep dive on tradeoffs between different durable-
| execution approaches was great. For me, the most interesting part
| is that Persistasaurus' (cool name btw) use of bytecode generation
| via ByteBuddy is a clever way to improve DX: it can transparently
| intercept step functions and capture execution state without
| requiring explicit API calls.
|
| (Disclosure: I work on DBOS [1]) The author's point about the
| friction from explicit step wrappers is fair, as we don't use
| bytecode generation today, but we're actively exploring it to
| improve DX.
|
| [1]: https://github.com/dbos-inc
| kodablah wrote:
| > The author's point about the friction from explicit step
| wrappers is fair, as we don't use bytecode generation today,
| but we're actively exploring it to improve DX.
|
| There is value in such a wrapper/call at invocation time
| instead of using the proxy pattern. Specifically, it makes it
| very clear to both the code author and code reader that this is
| not a normal method invocation. This is important because it is
| very common to perform normal method invocations and the caller
| needs to author code knowing the difference. Java developers,
| perhaps more than most, likely prefer such invocation
| explicitness over a JVM agent doing byte code manip.
|
| There is also another reason for preferring a wrapped-like
| approach - providing options. If you need to provide options
| (say timeout info) from the call site, it is hard to do if your
| call is limited to the signature of the implementation and
| options will have to be provided in a different place.
| gunnarmorling wrote:
| I'm still swinging back and forth which approach I ultimately
| prefer.
|
| As stated in the post, I like how the proxy approach largely
| avoids any API dependency. I'd also argue that Java
| developers actually are very familiar with this kind of
| implicit enrichment of behaviors and execution semantics
| (e.g. transaction management is weaved into applications that
| way in Spring or Quarkus applications).
|
| But there's also limits to this in regards to flexibility.
| For example, if you wanted to delay a method for a
| dynamically determined period of time, rather than for a
| fixed time, the annotation-based approach would fall short.
| kodablah wrote:
| At Temporal, for Java we did a hybrid approach of what you
| have. Specifically, we do the java.lang.reflect.Proxy
| approach, but the user has to make a call instantiating it
| from the implementation. This allows users to provide those
| options at proxy creation time and not require they
| configure a build step. I can't speak for all JVM people,
| but I get nervous if I have to use a library that requires
| an agent or annotation processor.
|
| Also, since Temporal activity invocations are (often)
| remote, many times a user may only have the
| definition/contract of the "step" (aka activity in Temporal
| parlance) without a body. Finally, many times users _start_
| the "step", not just _execute_ it, which means it needs to
| return a promise/future/task. Sure this can be wrapped in a
| suspended virtual thread, but it makes reasoning about
| things like cancellation harder, and from a client-not-
| workflow POV, it makes it harder to reattach to an
| invocation in a type-safe way to, say, wait for the result
| of something started elsewhere.
|
| We did the same proxying approach for TypeScript, but we
| saw as we got to Python, .NET, and Ruby that being able to
| _reference_ a "step" while also providing options and
| having many overloads/approaches of invoking that step has
| benefits.
| roughly wrote:
| One thing that needs to be emphasized with "durable execution"
| engines is they don't actually get you out of having to handle
| errors, rollbacks, etc. Even the canonical examples everyone uses
| - so you're using a DE engine to restart a sales transaction, but
| the part of that transaction that failed was "charging the
| customer" - did it fail before or after the charge went through?
| You failed while updating the inventory system - did the product
| get marked out or not? All of these problems are tractable, but
| once you've solved them - once you've built sufficient atomicity
| into your system to handle the actual failure cases - the
| benefits of taking on the complexity of a DE system are
| substantially lower than the marketing pitch.
| hedgehog wrote:
| In my one encounter with one of these systems it induced new
| code and tooling complexity, orders of magnitude performance
| overhead for most operations, and made dev and debug workflows
| much slower. All for... an occasional convenience far
| outweighed by the overall drag of using it. There are probably
| other environments where something like this makes sense but I
| can't figure out what they are.
| throwaway894345 wrote:
| > All for... an occasional convenience far outweighed by the
| overall drag of using it
|
| If you have any long-running operation that could be
| interrupted mid-run by any network fluke (or the termination
| of the VM running your program, or your program being OOMed,
| or some issue with some third party service that your app
| talks to, etc), and you don't want to restart the whole thing
| from scratch, you could benefit from these systems. The
| alternative is having engineers manually try to repair the
| state and restart execution in just the right place and that
| scales very badly.
|
| I have an application that needs to stand up a bunch of cloud
| infrastructure (a "workspace" in which users can do research)
| on the press of a button, and I want to make sure that the
| right infrastructure exists even if some deployment attempt
| is interrupted or if the upstream definition of a workspace
| changes. Every month there are dozens of network flukes or
| 5XX errors from remote endpoints that would otherwise leave
| these workspaces in a broken state and in need of manual
| repair. Instead, the system heals itself whenever the fault
| clears and I basically never have to look at the system (I
| periodically check the error logs, however, to confirm that
| the system is actually recovering from faults--I worry that
| the system has caught fire and there's actually some bug in
| the alerting system that is keeping things quiet).
| jedberg wrote:
| I'm not sure which one you used, but ideally it's so
| lightweight that the benefits outweigh the slight cost of
| developing with them. Besides the recovery benefit, there are
| observability and debugging benefits too.
| throwaway894345 wrote:
| > they don't actually get you out of having to handle errors
|
| I wrote a durable system that recovers from all sorts of errors
| (mostly network faults) without writing much error handling
| code. It just retries automatically, and importantly the happy
| path and the error path are exactly the same, so I don't have
| to worry that my error path has much less execution than my
| happy path.
|
| > but the part of that transaction that failed was "charging
| the customer" - did it fail before or after the charge went
| through?
|
| In all cases, whether the happy path or the error path, the
| first thing you do is compare the desired state ("there exists
| a transaction charging the customer $5") with the actual
| state ("has the customer been charged $5?") and that determines
| whether you (re)issue the transaction or just update your
| internal state.
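|
| A small sketch of that compare-then-act step, using an in-memory
| stand-in for the payment system. FakeLedger, reconcile_charge and
| the order ids are all hypothetical names for illustration; a real
| payment API would look different.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLedger:
    """Stand-in for the external payment system (purely hypothetical)."""

    charges: dict = field(default_factory=dict)  # order_id -> amount in cents

    def lookup(self, order_id):
        return self.charges.get(order_id)

    def charge(self, order_id, cents):
        self.charges[order_id] = cents


def reconcile_charge(ledger, local_state, order_id, cents):
    """Move actual state toward desired state; safe to call any number of times."""
    actual = ledger.lookup(order_id)
    if actual is None:
        ledger.charge(order_id, cents)  # charge is missing: issue it
        action = "charged"
    elif actual == cents:
        action = "already charged"      # charge exists: do not issue it again
    else:
        raise ValueError(f"unexpected amount {actual} for {order_id}")
    local_state[order_id] = "paid"      # in both cases, sync internal state
    return action


ledger = FakeLedger()
state = {}

# Happy path, then a retry of the very same code path.
first = reconcile_charge(ledger, state, "order-1", 500)
second = reconcile_charge(ledger, state, "order-1", 500)

# Crash case: the charge went through but internal state was never updated.
ledger.charge("order-2", 500)
recovered = reconcile_charge(ledger, state, "order-2", 500)
```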
|
| > once you've built sufficient atomicity into your system to
| handle the actual failure cases - the benefits of taking on the
| complexity of a DE system are substantially lower than the
| marketing pitch
|
| I probably agree with this. The main value is probably not in
| the framework but rather in the larger architecture that it
| encourages--separating things out into idempotent functions
| that can be safely retried. I could maybe be persuaded
| otherwise, but most of my "durable execution" patterns seem to
| be more of a "controller pattern" (in the sense of a Kubernetes
| controller, running a reconciling control loop) and it just
| happens that any distributed, durable controller platform
| includes a durable execution subsystem.
| jedberg wrote:
| The key to a durable workflow is making each step idempotent.
| Then you don't have to worry about those things. You just run
| the failed step again. If it already worked the first time,
| it's a no-op.
|
| For example, stripe lets you include an idempotency key with
| your request. If you try to make a charge again with the same
| key, it ignores you. A DE framework like DBOS will
| automatically generate the idempotency key for you.
|
| But you're correct, if you can't make the operation idempotent,
| then you have to handle that yourself.
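|
| To illustrate the idea (not how Stripe or DBOS implement it): a
| sketch in which the idempotency key is derived deterministically
| from a workflow id and a step name, and a made-up provider
| deduplicates on that key by returning the stored response. The
| names idempotency_key and FakePaymentProvider are assumptions,
| and real providers differ in the details.

```python
import hashlib


def idempotency_key(workflow_id, step_name):
    """Deterministic key: the same workflow and step always give the same key."""
    return hashlib.sha256(f"{workflow_id}:{step_name}".encode()).hexdigest()


class FakePaymentProvider:
    """Hypothetical provider that deduplicates requests by idempotency key."""

    def __init__(self):
        self.seen = {}         # key -> response returned the first time
        self.charge_count = 0  # how many charges were really performed

    def charge(self, cents, key):
        if key in self.seen:
            return self.seen[key]  # repeated key: no second charge
        self.charge_count += 1
        response = {"charge_no": self.charge_count, "cents": cents}
        self.seen[key] = response
        return response


provider = FakePaymentProvider()
key = idempotency_key("order-42", "charge_customer")

first = provider.charge(500, key)
retry = provider.charge(500, key)  # e.g. the step is re-run after a crash
```

| Because the key depends only on the workflow and step, a retry
| after a crash produces the same key without any extra bookkeeping.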
| repeekad wrote:
| Temporal plus idempotency keys solves probably the majority
| of infrastructure normally needed for production systems
| whinvik wrote:
| Sorry for the off-topic but I have been lately seeing a lot of
| hype around durable execution.
|
| I still cannot figure out how this is any different than
| launching a workflow in something like Airflow. Is the novel
| thing here that it can be done using the same DB you already have
| running?
___________________________________________________________________
(page generated 2025-11-21 23:00 UTC)