[HN Gopher] Running Durable Workflows in Postgres Using DBOS
       ___________________________________________________________________
        
       Running Durable Workflows in Postgres Using DBOS
        
       Author : kiwicopple
       Score  : 66 points
       Date   : 2024-12-10 18:47 UTC (4 hours ago)
        
 (HTM) web link (supabase.com)
        
       | dangoodmanUT wrote:
       | Maybe I'm not seeing it, but why do none of these "postgres
       | durable packages" ever integrate with existing transactions?
       | That's the biggest flaw of temporal, and seems so obvious that
       | they can hook into transactions to ensure that starting a
       | workflow is transactionally secure with other operations, and you
       | don't have to manage idempotency yourself (or worry about
       | handling "already started" errors and holding up DB connections
       | bc you launched in transaction)
        
         | KraftyOne wrote:
         | (DBOS co-founder here) DBOS does exactly this! From the post:
         | 
         | DBOS has a special @DBOS.Transaction decorator. This runs the
         | entire step inside a Postgres transaction. This guarantees
         | exactly-once execution for database transactional steps.
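
A minimal sketch of the idea in the quote above, assuming the guarantee works by committing a step's writes and a record of its completion in the same transaction. This is not DBOS's implementation: sqlite3 stands in for Postgres, and the `accounts` and `step_log` tables and the `run_step_once` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def run_step_once(conn, workflow_id, step_name, amount):
    """Apply the step's write and log its completion in ONE transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        done = conn.execute(
            "SELECT 1 FROM step_log WHERE workflow_id = ? AND step_name = ?",
            (workflow_id, step_name),
        ).fetchone()
        if done:
            return False  # an earlier attempt already committed; skip the work
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = 1", (amount,)
        )
        conn.execute(
            "INSERT INTO step_log (workflow_id, step_name) VALUES (?, ?)",
            (workflow_id, step_name),
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute(
    "CREATE TABLE step_log ("
    "workflow_id TEXT, step_name TEXT, PRIMARY KEY (workflow_id, step_name))"
)
conn.execute("INSERT INTO accounts VALUES (1, 0)")
conn.commit()

first = run_step_once(conn, "wf-1", "credit", 10)
retry = run_step_once(conn, "wf-1", "credit", 10)  # simulated re-run after a crash
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

Because the balance update and the log row commit together, a re-run either sees the log row and skips the step, or sees neither write and performs the step from scratch.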
        
           | dangoodmanUT wrote:
           | Sorry, I mean with transactions external to the workflow
           | steps. Like, I can select, insert, and launch a workflow in
           | one transaction inside an HTTP handler
        
             | KraftyOne wrote:
             | Yeah, you can launch workflows directly from an HTTP
             | handler. So here's some code that idempotently launches a
             | background task from a FastAPI endpoint:
             | 
             | @app.get("/background/{task_id}/{n}")
             | def launch_background_task(task_id: str, n: int) -> None:
             |     with SetWorkflowID(task_id):  # Set an idempotency key
             |         # Start the workflow in the background
             |         DBOS.start_workflow(background_task, n)
             | 
             | Does that answer your question?
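
A minimal sketch of the pattern the question in this subthread describes: a handler that commits a business write and a workflow-start record in one transaction, with the workflow ID acting as the idempotency key. This is not DBOS's API. sqlite3 stands in for Postgres, and the `orders` and `workflow_queue` tables and the `handle_request` helper are hypothetical. In Postgres, `INSERT OR IGNORE` would be written as `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def handle_request(conn, order_id, task_id, fail=False):
    """Insert a business row and enqueue a workflow atomically.

    Returns True if this call enqueued the workflow, False if it already existed.
    """
    with conn:  # one transaction: both inserts commit, or neither does
        conn.execute("INSERT OR IGNORE INTO orders (id) VALUES (?)", (order_id,))
        cur = conn.execute(
            "INSERT OR IGNORE INTO workflow_queue (workflow_id, order_id) "
            "VALUES (?, ?)",
            (task_id, order_id),
        )
        if fail:
            raise RuntimeError("simulated handler failure")  # triggers rollback
        return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE workflow_queue (workflow_id TEXT PRIMARY KEY, order_id INTEGER)"
)

started = handle_request(conn, 1, "task-1")
started_again = handle_request(conn, 1, "task-1")  # duplicate request, same key

try:
    handle_request(conn, 2, "task-2", fail=True)
except RuntimeError:
    pass  # the order row and the queue row were rolled back together

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
queued_count = conn.execute("SELECT COUNT(*) FROM workflow_queue").fetchone()[0]
```

A separate worker would then pick up rows from the queue table, so the caller never has to handle an "already started" error while holding the connection open.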
        
       | yandie wrote:
       | Temporal can use postgres as a backend. What makes this product
       | different?
        
         | dangoodmanUT wrote:
         | Seems like you didn't read any of it
        
         | KraftyOne wrote:
         | DBOS Co-founder here. Great question! The big difference is
         | that DBOS runs as a library inside your program, whereas the
         | Temporal architecture is an external workflow server managing
         | tasks on distributed workers. The advantages of DBOS are:
         | 
         | 1. Simpler architecturally. Just a normal Python process versus
         | a workflow server coordinating multiple workers.
         | 
         | 2. Easier dev/debugging. A Temporal program is essentially
         | distributed microservices, with control flow split between the
         | workflow server and workers. That complicates any
         | testing/debugging because you have to work through multiple
         | components. Whereas DBOS is just your process.
         | 
         | 3. Performance. A state transition in DBOS requires only a
         | database write (~1 ms) whereas in Temporal it requires an async
         | dispatch from the workflow server (tens-hundreds of ms).
        
           | abtinf wrote:
           | Note that it is possible to embed temporal server into a
           | program. I wrote a simple demo that embeds
           | client/worker/server all into one Go app. It should be
           | straightforward to modify this to support clustering.
           | 
           | https://github.com/abtinf/temporal-a-day/blob/main/001-all-i...
        
       | exceptione wrote:
       | Bit disappointed, looked for .net core support but no.
       | 
       | Languages that are supported: Typescript and Python. I know
       | programming languages as a topic is as inflammable as religion,
       | but boy do I feel sad that these two are considered the most
       | important these days. For server applications.
       | 
       | Anyways, can people here recommend alternatives with bindings for
       | .net core?
        
         | jedberg wrote:
         | Interesting, I don't think we've ever gotten a request for .net
         | support. Our two most popular asks after PY and TS are Golang
         | and Java, which basically tracks with the StackOverflow
         | language survey:
         | 
         | https://survey.stackoverflow.co/2024/technology#most-popular...
        
         | nawgz wrote:
         | TypeScript has the best developer ergonomics
         | 
         | Python has
         | 
         | So it's obvious why people would only support these two
         | languages even when their performance isn't best in class
        
       | adamgordonbell wrote:
       | I built a small side thing using DBOS (using the Python SDK) and the
       | ergonomics were pretty nice.
       | 
       | Then I found out Qian Li and Peter Kraft from that team are
       | sharing breakdowns of interesting database research papers on
       | twitter and I've been following them for that.
       | 
       | https://x.com/petereliaskraft/status/1862937787420295672
        
       | hotpocket777 wrote:
       | Does DBOS offer a way to get messages in or data out of running
       | workflows? (Similar to signals/queries in Temporal) Interested in
       | long-running workflows, and this particular area seemed to be
       | lacking last time I looked into it.
       | 
       | I don't want to _just_ sleep; I want a workflow to be able to
       | respond to an event.
        
         | KraftyOne wrote:
         | Yes, absolutely, that's an extremely important use case! In
         | DBOS you can send messages to workflows and workflows can
         | publish events that others can read. It's all backed by
         | Postgres (LISTEN/NOTIFY under the hood).
         | 
         | Documentation:
         | https://docs.dbos.dev/python/tutorials/workflow-tutorial#wor...
         | 
         | Here's a demo e-commerce application that uses messages and
         | events to build an interactive long-running checkout workflow:
         | https://docs.dbos.dev/python/examples/widget-store
        
       | antics wrote:
       | > # Exactly-once execution
       | 
       | > DBOS has a special @DBOS.Transaction decorator. This runs the
       | entire step inside a Postgres transaction. This guarantees
       | exactly-once execution for database transactional steps.
       | 
       | Totally awesome, great work, just a small note... IME a lot of
       | (most?) pg deployments have synchronous replication turned off
       | because it is very tricky to get it to perform well[1]. If you
       | have it turned off, pg could journal the step, formally
       | acknowledge it, and then (as I understand DBOS) totally lose that
       | journal when the primary fails, causing you to re-run the step.
       | 
       | When I was on call for pg last, failover with some data loss
       | happened to me twice. So it does happen. I think this is worth
       | noting because if you plan for this to be a hard requirement,
       | (unless I'm mistaken) you need to set up sync replication or you
       | need to plan for this to possibly fail.
       | 
       | Lastly, note that the pg docs[1] have this to say about sync
       | replication:
       | 
       | > Synchronous replication usually requires carefully planned and
       | placed standby servers to ensure applications perform acceptably.
       | Waiting doesn't utilize system resources, but transaction locks
       | continue to be held until the transfer is confirmed. As a result,
       | incautious use of synchronous replication will reduce performance
       | for database applications because of increased response times and
       | higher contention.
       | 
       | I see the DBOS author around here somewhere so if the state of
       | the art for DBOS has changed please do let me know and I'll
       | correct the comment.
       | 
       | [1] https://www.postgresql.org/docs/current/warm-standby.html#SY...
        
         | KraftyOne wrote:
         | Yeah, that's totally fair--DBOS is totally built on Postgres,
         | so it can't provide stronger durability guarantees than your
         | Postgres does. If Postgres loses data, then DBOS can lose data
         | too. There's no way around that if you're using Postgres for
         | data storage, no matter how you architect the system.
        
       | efxhoy wrote:
       | Interesting! Is there anything like this that I could host
       | myself?
        
         | KraftyOne wrote:
         | Yeah, the core DBOS library is totally open-source and you can
         | run it anywhere as long as it has a Postgres to connect to.
         | Check it out: https://github.com/dbos-inc/dbos-transact-py
        
       | abelanger wrote:
       | Disclaimer: I'm a co-founder of Hatchet
       | (https://github.com/hatchet-dev/hatchet), which is a Postgres-
       | backed task queue that supports durable execution.
       | 
       | > Because a step transition is just a Postgres write (~1ms)
       | versus an async dispatch from an external orchestrator (~100ms),
       | it means DBOS is 25x faster than AWS Step Functions
       | 
       | Durable execution engines deployed as an external orchestrator
       | will always be slower than direct DB writes, but the 1ms delay
       | versus ~100ms doesn't seem inherent to the orchestrator being
       | external. In the case of Hatchet, pushing work takes ~15ms and
       | invoking the work takes ~1ms if deployed in the same VPC, and 90%
       | of that execution time is on the database. In the best case, the
       | external orchestrator should take 2x as long to write a step
       | transition (round-trip network call to the orchestrator +
       | database write), so an ideal external orchestrator would be ~2ms
       | of latency here.
       | 
       | There are also some tradeoffs to a library-only mode that aren't
       | discussed. How would work that requires global coordination
       | between workers behave in this model? Let's say, for example, a
       | global rate limit -- you'd ideally want to avoid contention on
       | rate limit rows, assuming they're stored in Postgres, but each
       | worker attempting to acquire a rate limit simultaneously would
       | slow down start time significantly (and place additional load on
       | the DB). Whereas with a single external orchestrator (or leader
       | election), you can significantly increase throughput by acquiring
       | rate limits as part of a push-based assignment process.
       | 
       | The same problem of coordination arises if many workers are
       | competing for the same work -- for example if a machine crashes
       | while doing work, as described in the article. I'm assuming
       | there's some kind of polling happening which uses FOR UPDATE SKIP
       | LOCKED, which concerns me as you start to scale up the number of
       | workers.
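
A minimal sketch of the kind of work claiming the comment above speculates about, in which several workers poll one table and each pending task must go to exactly one of them. Whether DBOS polls this way is the commenter's assumption and is not confirmed in the thread. SQLite has no `FOR UPDATE SKIP LOCKED`, so this stand-in uses a conditional `UPDATE` to make the claim atomic; the Postgres form is shown only in a comment. The `tasks` table and `claim_next` helper are hypothetical.

```python
import sqlite3

# In Postgres, the claim would typically be a single locking query such as:
#   SELECT id FROM tasks WHERE owner IS NULL
#   ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED;
# so concurrent workers skip rows another transaction has already locked.


def claim_next(conn, worker):
    """Claim the oldest unowned task; return its id, or None if none is free."""
    with conn:
        row = conn.execute(
            "SELECT id FROM tasks WHERE owner IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The "owner IS NULL" guard makes the claim safe even if another
        # worker took the row between the SELECT and this UPDATE.
        cur = conn.execute(
            "UPDATE tasks SET owner = ? WHERE id = ? AND owner IS NULL",
            (worker, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, owner TEXT)")
conn.executemany("INSERT INTO tasks (owner) VALUES (?)", [(None,)] * 3)
conn.commit()

claims = [
    claim_next(conn, "w1"),
    claim_next(conn, "w2"),
    claim_next(conn, "w1"),
    claim_next(conn, "w2"),  # nothing left to claim
]
owners = [r[0] for r in conn.execute("SELECT owner FROM tasks ORDER BY id")]
```

Every poll costs at least one query whether or not work exists, which is the scaling concern raised above as the worker count grows.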
        
       ___________________________________________________________________
       (page generated 2024-12-10 23:00 UTC)