[HN Gopher] Building a Durable Execution Engine with SQLite
       ___________________________________________________________________
        
       Building a Durable Execution Engine with SQLite
        
       Author : ingve
       Score  : 88 points
       Date   : 2025-11-20 13:26 UTC (1 days ago)
        
 (HTM) web link (www.morling.dev)
 (TXT) w3m dump (www.morling.dev)
        
       | fiddlerwoaroof wrote:
       | Every several years people reinvent serializable continuations
        
         | andersmurphy wrote:
         | Haha so true. Shame image based programming never really caught
         | on.
         | 
         | Janet lang lets you serialize coroutines which is fun. Make
         | this sort of stuff trivial.
        
         | gunnarmorling wrote:
         | Yupp, making that same point in the post :)
         | 
         | > You could think of [Durable Execution] as a persistent
         | implementation of the memoization pattern, or a persistent form
         | of continuations.
        
         | smitty1e wrote:
         | Is this reinvention somehow "transactional" in nature?
        
         | rileymichael wrote:
         | unfortunately they've never really taken off so folks reach for
         | explicit state machines instead. there have been a handful of
         | options on the jvm over the years (e.g. quasar, kilim) but
         | they're all abandoned now, the loom continuation API is
         | internal with no hint of it becoming public, kotlin's aren't
         | serializable and the issue is inactive
         | (https://github.com/Kotlin/kotlinx.coroutines/issues/76), etc..
         | such a shame
        
       | websiteapi wrote:
       | there's a lot of hype around durable execution these days. why do
       | that instead of regular use of queues? is it the dev ergonomics
       | that's cool here?
       | 
       | you can (and people already) model steps in any arbitrarily large
       | workflow and have those results be processed in a modular fashion
       | and have whatever process that begins this workflow check the
       | state of the necessary preconditions prior to taking any action
       | and thus go to the currently needed step, or retry ones that
       | failed, and so forth.
        
         | tptacek wrote:
         | We build what is effectively a durable execution "engine" for
         | our orchestrator (ours is backed by boltdb and not SQLite,
         | which I objected to, correctly). The steps in our workflows
         | build running virtual machines and include things like
         | allocating addresses, loading BPF programs, preparing root
         | filesystems, and registering services.
         | 
         | Short answer: we need to be able to redeploy and bounce the
         | orchestrator without worrying about what stage each running VM
         | on our platform is in.
         | 
         | JP, the dev that built this out for us, talks a bit about the
         | design rationale (search for "Cadence") here:
         | 
         | https://fly.io/blog/the-exit-interview-jp/
         | 
         | The library itself is open:
         | 
         | https://github.com/superfly/fsm
        
         | ryeats wrote:
         | As you say it can be done but it's an anti-pattern to use a
         | message queue as a database which is essentially what you are
         | doing for these kinds of long running tasks. The reason is that
         | their are a lot of state your likely going to want to status as
         | a task runs and persist and checkpoint yes you can carefully
         | string together a series of database calls chained with message
         | transactions so you don't lose something when an issue happens
         | but then you also need bespoke logic to restart or retry each
         | step and it can turn into a bit of a mess.
        
         | snicker7 wrote:
         | Message queues (e.g. SQS) are inappropriate for tracking long-
         | running tasks/workflows. This is due to the operational
         | requirements such as:
         | 
         | - Checking the status of a task (queued, pending, failed,
         | cancelled, completed) - Cancelling a queued task (or pending
         | task if the execution environment supports it) - Re-
         | prioritizing queued tasks - Searching for tasks based off an
         | attribute (e.g. tag)
         | 
         | You really do need a database for this.
        
           | yyx wrote:
           | Sounds like a Celery with SQLAlchemy backend.
        
           | DenisM wrote:
           | I'm reminded of classical LRU cache implementation - double
           | linked list and a hash map that points to the list elements.
           | 
           | It is a queue if we squint really hard, but it allows random
           | access and reordering. Do we have durable structures of this
           | kind?
           | 
           | I can't imagine how to shoehorn this into Kafka or SQS.
        
         | kodablah wrote:
         | > is it the dev ergonomics that's cool here?
         | 
         | Yup. Being able to write imperative code that automatically
         | resumes where it left off is very valuable. It's best to
         | represent durable turing completeness using modern approaches
         | of authoring such logic - programming languages. Being able to
         | loop, try/catch, apply advanced conditional logic, etc in a
         | crash-proof algorithm that can run for weeks/months/years and
         | is introspectable has a lot of value over just using queues.
         | 
         | Durable execution is all just queues and task processing and
         | event sourcing under the hood though.
        
         | hmaxdml wrote:
         | The hype is because DE is such an dev exp improvement over
         | building your own queue. Good DE frameworks come with
         | workflows, pub/sub, notifications, distributed queues with tons
         | of flow control options, etc.
        
       | the_mitsuhiko wrote:
       | I think this is great. We should see more simple solution to this
       | problem.
       | 
       | I recently started doing something very similar on Postgres [1]
       | and I'm greatly enjoying using it. I think the total solution I
       | ended up with is under 3000 lines of code for both the SQL and
       | the TypeScript SDK combined, and it's much easier to use and to
       | operate than many of the solutions on the market today.
       | 
       | [1]: https://github.com/earendil-works/absurd
        
         | gunnarmorling wrote:
         | Ah, that's awesome. Definitely need to take a closer look at
         | your implementation.
        
         | nileshtrivedi wrote:
         | This looks useful.
         | 
         | Hope would you say it compares with pgqueuer?
        
           | the_mitsuhiko wrote:
           | I _think_ pgqueuer is like pgmq which I used a lot. pgmq is
           | just the queue part, absurd does the state storage. I wrote
           | some more about why it exists here:
           | https://lucumr.pocoo.org/2025/11/3/absurd-workflows/
        
       | qianli_cs wrote:
       | I really enjoyed this post and love seeing more lightweight
       | approaches! The deep dive on tradeoffs between different durable-
       | execution approaches was great. For me, the most interesting part
       | is that Persistasaurus (cool name btw) use of bytecode generation
       | via ByteBuddy is a clever way to improve DX: it can transparently
       | intercept step functions and capture execution state without
       | requiring explicit API calls.
       | 
       | (Disclosure: I work on DBOS [1]) The author's point about the
       | friction from explicit step wrappers is fair, as we don't use
       | bytecode generation today, but we're actively exploring it to
       | improve DX.
       | 
       | [1]: https://github.com/dbos-inc
        
         | kodablah wrote:
         | > The author's point about the friction from explicit step
         | wrappers is fair, as we don't use bytecode generation today,
         | but we're actively exploring it to improve DX.
         | 
         | There is value in such a wrapper/call at invocation time
         | instead of using the proxy pattern. Specifically, it makes it
         | very clear to both the code author and code reader that this is
         | not a normal method invocation. This is important because it is
         | very common to perform normal method invocations and the caller
         | needs to author code knowing the difference. Java developers,
         | perhaps more than most, likely prefer such invocation
         | explicitness over a JVM agent doing byte code manip.
         | 
         | There is also another reason for preferring a wrapped-like
         | approach - providing options. If you need to provide options
         | (say timeout info) from the call site, it is hard to do if your
         | call is limited to the signature of the implementation and
         | options will have to be provided in a different place.
        
           | gunnarmorling wrote:
           | I'm still swinging back and forth which approach I ultimately
           | prefer.
           | 
           | As stated in the post, I like how the proxy approach largely
           | avoids any API dependency. I'd also argue that Java
           | developers actually are very familiar with this kind of
           | implicit enrichment of behaviors and execution semantics
           | (e.g. transaction management is weaved into applications that
           | way in Spring or Quarkus applications).
           | 
           | But there's also limits to this in regards to flexibility.
           | For example, if you wanted to delay a method for a
           | dynamically determined period of time, rather than for a
           | fixed time, the annotation-based approach would fall short.
        
             | kodablah wrote:
             | At Temporal, for Java we did a hybrid approach of what you
             | have. Specifically, we do the java.lang.reflect.Proxy
             | approach, but the user has to make a call instantiating it
             | from the implementation. This allows users to provide those
             | options at proxy creation time and not require they
             | configure a build step. I can't speak for all JVM people,
             | but I get nervous if I have to use a library that requires
             | an agent or annotation processor.
             | 
             | Also, since Temporal activity invocations are (often)
             | remote, many times a user may only have the
             | definition/contract of the "step" (aka activity in Temporal
             | parlance) without a body. Finally, many times users _start_
             | the "step", not just _execute_ it, which means it needs to
             | return a promise/future/task. Sure this can be wrapped in a
             | suspended virtual thread, but it makes reasoning about
             | things like cancellation harder, and from a client-not-
             | workflow POV, it makes it harder to reattach to an
             | invocation in a type-safe way to, say, wait for the result
             | of something started elsewhere.
             | 
             | We did the same proxying approach for TypeScript, but we
             | saw as we got to Python, .NET, and Ruby that being able to
             | _reference_ a "step" while also providing options and
             | having many overloads/approaches of invoking that step has
             | benefits.
        
       | roughly wrote:
       | One thing that needs to be emphasized with "durable execution"
       | engines is they don't actually get you out of having to handle
       | errors, rollbacks, etc. Even the canonical examples everyone uses
       | - so you're using a DE engine to restart a sales transaction, but
       | the part of that transaction that failed was "charging the
       | customer" - did it fail before or after the charge went through?
       | You failed while updating the inventory system - did the product
       | get marked out or not? All of these problems are tractable, but
       | once you've solved them - once you've built sufficient atomicity
       | into your system to handle the actual failure cases - the
       | benefits of taking on the complexity of a DE system are
       | substantially lower than the marketing pitch.
        
         | hedgehog wrote:
         | In my one encounter with one of these systems it induced new
         | code and tooling complexity, orders of magnitude performance
         | overhead for most operations, and made dev and debug workflows
         | much slower. All for... an occasional convenience far
         | outweighed by the overall drag of using it. There are probably
         | other environments where something like this makes sense but I
         | can't figure out what they are.
        
           | throwaway894345 wrote:
           | > All for... an occasional convenience far outweighed by the
           | overall drag of using it
           | 
           | If you have any long-running operation that could be
           | interrupted mid-run by any network fluke (or the termination
           | of the VM running your program, or your program being OOMed,
           | or some issue with some third party service that your app
           | talks to, etc), and you don't want to restart the whole thing
           | from scratch, you could benefit from these systems. The
           | alternative is having engineers manually try to repair the
           | state and restart execution in just the right place and that
           | scales very badly.
           | 
           | I have an application that needs to stand up a bunch of cloud
           | infrastructure (a "workspace" in which users can do research)
           | on the press of a button, and I want to make sure that the
           | right infrastructure exists even if some deployment attempt
           | is interrupted or if the upstream definition of a workspace
           | changes. Every month there are dozens of network flukes or
           | 5XX errors from remote endpoints that would otherwise leave
           | these workspaces in a broken state and in need of manual
           | repair. Instead, the system heals itself whenever the fault
           | clears and I basically never have to look at the system (I
           | periodically check the error logs, however, to confirm that
           | the system is actually recovering from faults--I worry that
           | the system has caught fire and there's actually some bug in
           | the alerting system that is keeping things quiet).
        
           | jedberg wrote:
           | I'm not sure which one you used, but ideally it's so
           | lightweight that the benefits outweigh the slight cost of
           | developing with them. Besides the recovery benefit, there is
           | observability and debugging benefits too.
        
         | throwaway894345 wrote:
         | > they don't actually get you out of having to handle errors
         | 
         | I wrote a durable system that recovers from all sorts of errors
         | (mostly network faults) without writing much error handling
         | code. It just retries automatically, and importantly the happy
         | path and the error path are exactly the same, so I don't have
         | to worry that my error path has much less execution than my
         | happy path.
         | 
         | > but the part of that transaction that failed was "charging
         | the customer" - did it fail before or after the charge went
         | through?
         | 
         | In all cases, whether the happy path or the error path, the
         | first thing you do is compare the desired state ("there exists
         | a transaction exists charging the customer $5") with the actual
         | state ("has the customer been charged $5?") and that determines
         | whether you (re)issue the transaction or just update your
         | internal state.
         | 
         | > once you've built sufficient atomicity into your system to
         | handle the actual failure cases - the benefits of taking on the
         | complexity of a DE system are substantially lower than the
         | marketing pitch
         | 
         | I probably agree with this. The main value is probably not in
         | the framework but rather in the larger architecture that it
         | encourages--separating things out into idempotent functions
         | that can be safely retried. I could maybe be persuaded
         | otherwise, but most of my "durable execution" patterns seem to
         | be more of a "controller pattern" (in the sense of a Kubernetes
         | controller, running a reconciling control loop) and it just
         | happens that any distributed, durable controller platform
         | includes a durable execution subsystem.
        
         | jedberg wrote:
         | The key to a durable workflow is making each step idempotent.
         | Then you don't have to worry about those things. You just run
         | the failed step again. If it already worked the first time,
         | it's a no-op.
         | 
         | For example, stripe lets you include an idempotency key with
         | your request. If you try to make a charge again with the same
         | key, it ignores you. A DE framework like DBOS will
         | automatically generate the idempotency key for you.
         | 
         | But you're correct, if you can't make the operation idempotent,
         | then you have to handle that yourself.
        
           | repeekad wrote:
           | Temporal plus idempotency keys solves probably the majority
           | of infrastructure normally needed for production systems
        
       | whinvik wrote:
       | Sorry for the off-topic but I have been lately seeing a lot of
       | hype around durable execution.
       | 
       | I still cannot figure out how this is any different than
       | launching a workflow in something like Airflow. Is the novel
       | thing here that it can be done using the same DB you already have
       | running?
        
       ___________________________________________________________________
       (page generated 2025-11-21 23:00 UTC)