[HN Gopher] Running Durable Workflows in Postgres Using DBOS
___________________________________________________________________
Running Durable Workflows in Postgres Using DBOS
Author : kiwicopple
Score : 66 points
Date : 2024-12-10 18:47 UTC (4 hours ago)
(HTM) web link (supabase.com)
(TXT) w3m dump (supabase.com)
| dangoodmanUT wrote:
| Maybe I'm not seeing it, but why do none of these "postgres
| durable packages" ever integrate with existing transactions?
| That's the biggest flaw of Temporal, and it seems so obvious:
| hooking into transactions ensures that starting a workflow is
| transactionally consistent with other operations, so you don't
| have to manage idempotency yourself (or worry about handling
| "already started" errors and holding up DB connections because
| you launched inside a transaction)
| KraftyOne wrote:
| (DBOS co-founder here) DBOS does exactly this! From the post:
|
| DBOS has a special @DBOS.Transaction decorator. This runs the
| entire step inside a Postgres transaction. This guarantees
| exactly-once execution for transactional database steps.
| dangoodmanUT wrote:
| Sorry, i mean with external transactions to the workflow
| steps. Like I can select, insert, and launch a workflow in a
| HTTP handler
| KraftyOne wrote:
| Yeah, you can launch workflows directly from an HTTP
| handler. So here's some code that idempotently launches a
| background task from a FastAPI endpoint:
| @app.get("/background/{task_id}/{n}")
| def launch_background_task(task_id: str, n: int) -> None:
|     with SetWorkflowID(task_id):  # Set an idempotency key
|         # Start the workflow in the background
|         DBOS.start_workflow(background_task, n)
|
| Does that answer your question?
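To make the dedupe semantics concrete, here's a minimal self-contained sketch (not the DBOS implementation; the `WorkflowLauncher` class and in-memory registry are illustrative) of how an idempotency key makes a launch safe to retry:

```python
import threading

class WorkflowLauncher:
    """Toy launcher: at most one workflow instance per idempotency key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = {}  # idempotency key -> workflow handle/result

    def start_workflow(self, key, fn, *args):
        with self._lock:
            if key in self._started:       # duplicate launch: return the
                return self._started[key]  # existing handle, start nothing
            handle = fn(*args)             # first launch: actually start
            self._started[key] = handle
            return handle

launcher = WorkflowLauncher()
runs = []

def background_task(n):
    runs.append(n)
    return f"task-{n}"

# Retrying with the same key is a no-op: the task runs exactly once.
h1 = launcher.start_workflow("task-42", background_task, 42)
h2 = launcher.start_workflow("task-42", background_task, 42)
```

In DBOS the registry is a Postgres table rather than a dict, which is what lets the existence check share a transaction with your other database writes.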
| yandie wrote:
| Temporal can use postgres as a backend. What makes this product
| different?
| dangoodmanUT wrote:
| Seems like you didn't read any of it
| KraftyOne wrote:
| DBOS Co-founder here. Great question! The big difference is
| that DBOS runs as a library inside your program, whereas the
| Temporal architecture is an external workflow server managing
| tasks on distributed workers. The advantages of DBOS are:
|
| 1. Simpler architecturally. Just a normal Python process versus
| a workflow server coordinating multiple workers.
|
| 2. Easier dev/debugging. A Temporal program is essentially
| distributed microservices, with control flow split between the
| workflow server and workers. That complicates any
| testing/debugging because you have to work through multiple
| components. Whereas DBOS is just your process.
|
| 3. Performance. A state transition in DBOS requires only a
| database write (~1 ms) whereas in Temporal it requires an async
| dispatch from the workflow server (tens-hundreds of ms).
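The library-only model can be illustrated with a toy checkpointing sketch (this assumes nothing about DBOS internals; `durable_step` and the dict-backed store are hypothetical stand-ins for a Postgres table): each step's result is recorded after it runs, so a re-execution after a crash replays recorded results instead of redoing work.

```python
checkpoints = {}  # stand-in for a Postgres table keyed by workflow ID

def durable_step(workflow_id, step_name, fn):
    """Run fn once; on replay, return the recorded result instead."""
    key = (workflow_id, step_name)
    if key in checkpoints:
        return checkpoints[key]
    result = fn()
    checkpoints[key] = result  # one write per state transition
    return result

calls = []

def workflow(workflow_id):
    a = durable_step(workflow_id, "fetch", lambda: calls.append("fetch") or 10)
    b = durable_step(workflow_id, "double", lambda: calls.append("double") or a * 2)
    return b

first = workflow("wf-1")   # runs both steps
replay = workflow("wf-1")  # replays from checkpoints; no step re-executes
```

Since every state transition is just that one store write, there is no round trip to an external workflow server on the critical path.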
| abtinf wrote:
| Note that it is possible to embed temporal server into a
| program. I wrote a simple demo that embeds
| client/worker/server all into one Go app. It should be
| straightforward to modify this to support clustering.
|
| https://github.com/abtinf/temporal-a-
| day/blob/main/001-all-i...
| exceptione wrote:
| Bit disappointed, looked for .net core support but no.
|
| Languages that are supported: TypeScript and Python. I know
| programming languages are as inflammatory a topic as religion,
| but boy do I feel sad that these two are considered the most
| important these days for server applications.
|
| Anyways, can people here recommend alternatives with bindings for
| .net core?
| jedberg wrote:
| Interesting, I don't think we've ever gotten a request for .net
| support. Our two most popular asks after PY and TS are Golang
| and Java, which basically tracks with the StackOverflow
| language survey:
|
| https://survey.stackoverflow.co/2024/technology#most-popular...
| nawgz wrote:
| TypeScript has the best developer ergonomics
|
| Python has
|
| So it's obvious why people would only support these two
| languages even when their performance isn't best in class
| adamgordonbell wrote:
| I built a small side thing using DBOS ( using python SDK) and the
| ergonomics were pretty nice.
|
| Then I found out Qian Li and Peter Kraft from that team are
| sharing breakdowns of interesting database research papers on
| twitter and I've been following them for that.
|
| https://x.com/petereliaskraft/status/1862937787420295672
| hotpocket777 wrote:
| Does DBOS offer a way to get messages in or data out of running
| workflows? (Similar to signals/queries in Temporal) Interested in
| long-running workflows, and this particular area seemed to be
| lacking last time I looked into it.
|
| I don't want to _just_ sleep; I want a workflow to be able to
| respond to an event.
| KraftyOne wrote:
| Yes, absolutely, that's an extremely important use case! In
| DBOS you can send messages to workflows and workflows can
| publish events that others can read. It's all backed by
| Postgres (LISTEN/NOTIFY under the hood).
|
| Documentation: https://docs.dbos.dev/python/tutorials/workflow-
| tutorial#wor...
|
| Here's a demo e-commerce application that uses messages and
| events to build an interactive long-running checkout workflow:
| https://docs.dbos.dev/python/examples/widget-store
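A rough sketch of the two primitives (the `set_event`/`get_event` names are modeled on the docs, but this in-memory implementation is purely illustrative; DBOS backs this with Postgres and LISTEN/NOTIFY):

```python
import threading

class EventBus:
    """Toy version of workflow events: workflows publish, others read."""

    def __init__(self):
        self._cond = threading.Condition()
        self._events = {}  # (workflow_id, key) -> value

    def set_event(self, workflow_id, key, value):
        with self._cond:
            self._events[(workflow_id, key)] = value
            self._cond.notify_all()  # stands in for Postgres NOTIFY

    def get_event(self, workflow_id, key, timeout=5.0):
        with self._cond:  # blocks until published, like LISTEN + poll
            self._cond.wait_for(
                lambda: (workflow_id, key) in self._events, timeout=timeout)
            return self._events.get((workflow_id, key))

bus = EventBus()

# A checkout workflow publishes a payment link; a handler waits for it.
def checkout_workflow():
    bus.set_event("order-7", "payment_url", "https://pay.example/order-7")

t = threading.Thread(target=checkout_workflow)
t.start()
url = bus.get_event("order-7", "payment_url")
t.join()
```

Because the events live in Postgres in the real system, a waiting workflow can crash, recover on another machine, and still observe the event.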
| antics wrote:
| > # Exactly-once execution
|
| > DBOS has a special @DBOS.Transaction decorator. This runs the
| entire step inside a Postgres transaction. This guarantees
| exactly-once execution for transactional database steps.
|
| Totally awesome, great work, just a small note... IME a lot of
| (most?) pg deployments have synchronous replication turned off
| because it is very tricky to get it to perform well[1]. If you
| have it turned off, pg could journal the step, formally
| acknowledge it, and then (as I understand DBOS) totally lose that
| journal when the primary fails, causing you to re-run the step.
|
| When I was on call for pg last, failover with some data loss
| happened to me twice. So it does happen. I think this is worth
| noting because if you plan for this to be a hard requirement,
| (unless I'm mistaken) you need to set up sync replication or you
| need to plan for this to possibly fail.
|
| Lastly, note that the pg docs[1] have this to say about sync
| replication:
|
| > Synchronous replication usually requires carefully planned and
| placed standby servers to ensure applications perform acceptably.
| Waiting doesn't utilize system resources, but transaction locks
| continue to be held until the transfer is confirmed. As a result,
| incautious use of synchronous replication will reduce performance
| for database applications because of increased response times and
| higher contention.
|
| I see the DBOS author around here somewhere so if the state of
| the art for DBOS has changed please do let me know and I'll
| correct the comment.
|
| [1] https://www.postgresql.org/docs/current/warm-
| standby.html#SY...
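For reference, synchronous replication is controlled by two settings on the primary; a sketch of a postgresql.conf fragment (the standby names here are placeholders):

```ini
# postgresql.conf on the primary -- illustrative values only
synchronous_commit = on
# at least one of the listed standbys must confirm before commit returns
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
```

With `synchronous_standby_names` empty (the default), commits are acknowledged after the local WAL flush only, which is the failover data-loss window described above.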
| KraftyOne wrote:
| Yeah, that's totally fair--DBOS is built on Postgres, so it
| can't provide stronger durability guarantees than your
| Postgres does. If Postgres loses data, then DBOS can lose data
| too. There's no way around that if you're using Postgres for
| data storage, no matter how you architect the system.
| efxhoy wrote:
| Interesting! Is there anything like this that I could host
| myself?
| KraftyOne wrote:
| Yeah, the core DBOS library is totally open-source and you can
| run it anywhere as long as it has a Postgres to connect to.
| Check it out: https://github.com/dbos-inc/dbos-transact-py
| abelanger wrote:
| Disclaimer: I'm a co-founder of Hatchet
| (https://github.com/hatchet-dev/hatchet), which is a Postgres-
| backed task queue that supports durable execution.
|
| > Because a step transition is just a Postgres write (~1ms)
| versus an async dispatch from an external orchestrator (~100ms),
| it means DBOS is 25x faster than AWS Step Functions
|
| Durable execution engines deployed as an external orchestrator
| will always be slower than direct DB writes, but the 1ms
| versus ~100ms delay doesn't seem inherent to the orchestrator
| being external. In the case of Hatchet, pushing work takes
| ~15ms and invoking the work takes ~1ms if deployed in the same
| VPC, and 90% of that execution time is on the database. In the
| best case, the external orchestrator should take 2x as long to
| write a step transition (round-trip network call to the
| orchestrator + database write), so an ideal external
| orchestrator would add ~2ms of latency here.
|
| There are also some tradeoffs to a library-only mode that aren't
| discussed. How would work that requires global coordination
| between workers behave in this model? Let's say, for example, a
| global rate limit -- you'd ideally want to avoid contention on
| rate limit rows, assuming they're stored in Postgres, but each
| worker attempting to acquire a rate limit simultaneously would
| slow down start time significantly (and place additional load on
| the DB). Whereas with a single external orchestrator (or leader
| election), you can significantly increase throughput by acquiring
| rate limits as part of a push-based assignment process.
|
| The same problem of coordination arises if many workers are
| competing for the same work -- for example if a machine crashes
| while doing work, as described in the article. I'm assuming
| there's some kind of polling happening which uses FOR UPDATE SKIP
| LOCKED, which concerns me as you start to scale up the number of
| workers.
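The contention pattern being described can be sketched in a self-contained simulation (the real query would be roughly `SELECT ... FOR UPDATE SKIP LOCKED`; the in-memory table, `poll_once`, and worker names here are all illustrative):

```python
import threading

# Toy task table: each worker atomically claims unclaimed rows -- the
# in-memory analogue of polling with
#   SELECT id FROM tasks WHERE claimed = false
#   ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED;
tasks = {i: {"claimed": False} for i in range(100)}
table_lock = threading.Lock()
claims = {}  # task id -> worker name

def poll_once(worker):
    with table_lock:  # every poll contends on the same table
        for task_id, row in tasks.items():
            if not row["claimed"]:
                row["claimed"] = True
                claims[task_id] = worker
                return task_id
    return None  # nothing left to claim

def worker_loop(name):
    while poll_once(name) is not None:
        pass  # process the claimed task, then poll again

workers = [threading.Thread(target=worker_loop, args=(f"w{i}",))
           for i in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Every task is claimed exactly once, but all eight workers serialize on the same table, which is the scaling concern raised above; a push-based assignment from a single coordinator avoids that contention.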
___________________________________________________________________
(page generated 2024-12-10 23:00 UTC)