[HN Gopher] Temporal Python - A durable, distributed asyncio eve...
       ___________________________________________________________________
        
       Temporal Python - A durable, distributed asyncio event loop (2023)
        
       Author : metadat
       Score  : 179 points
       Date   : 2024-05-07 15:56 UTC (2 days ago)
        
 (HTM) web link (temporal.io)
 (TXT) w3m dump (temporal.io)
        
       | metadat wrote:
       | Credit: @kodablah and @chippiewill, thanks for turning me onto
       | this!
       | 
       | https://news.ycombinator.com/item?id=40282650
        
       | bmitc wrote:
       | As someone familiar with asyncio, I don't understand what this is
       | or what it's for. What's an activity, workflow, or worker?
       | 
       | > See the asyncio.sleep in there? That's no normal local-process
       | sleep; that's a durable timer backed by Temporal.
       | 
       | That's the normal asyncio.sleep. What does backed by Temporal
       | mean? Reading further, it appears that Temporal is replacing the
       | default asyncio event loop. I don't understand why every third
       | party async Python library/framework feels the need to take over
       | the default event loop instead of just building on top of it.
        
         | adhamsalama wrote:
         | I think it means that your code could be resumed on a different
         | machine.
        
         | storyinmemo wrote:
         | Temporal is completely above and beyond Asyncio. It's a full
         | scheduling of work and queues that's cross-machine, cross-
         | language, and very transparent.
         | 
         | A workflow is the code that handles only deterministic actions
         | and calls activities.
         | 
         | Activities are functions that do anything you want, typically
         | affecting other systems with network or file calls.
         | 
         | A worker is the running process connected to Temporal with
         | registered workflows & activities for it to pass work to.
         | 
         | I'm doing a lot of work with alert handling and provisioning
         | systems using Temporal. Temporal in two minutes is a great
         | video explanation: https://www.youtube.com/watch?v=f-18XztyN6c
        
         | uniqueuid wrote:
         | The thing with event loops in python is that they are not a
         | single, all-governing scheduler (as e.g. in the BEAM).
         | 
         | ev loops instead are a mid-layer concept that sits below other
         | infrastructure such as threads and processes. And (perhaps
         | somewhat frustratingly) it is not too uncommon to have multiple
         | ev loops in parallel. See for example the proxy.py project,
         | which offers to run one async loop per process for a speedup.
         | 
         | As a result, there are some incentives to swap out the loop
         | itself, e.g. for faster implementations like uvloop, because
         | they are somewhat pluggable anyways.
        
           | mapcars wrote:
           | > all-governing scheduler (as e.g. in the BEAM).
           | 
           | Does this also mean they have preemptive multitasking like in
           | BEAM?
        
           | blegr wrote:
           | Good design dictates that you start one loop and build the
           | whole program around it, no? The docs for asyncio.run say as
           | much.
        
             | kodablah wrote:
             | Yes, that is good design and the event loop should
             | basically be shared process-wide (asyncio objects are
             | usually not thread safe and cannot be shared across event
             | loops). Temporal only does custom event loops in isolated
             | workflows.
        
         | helpfulContrib wrote:
         | I just had to write a Python program that handled multiple
         | async events - from a serial line, and with a tkinter GUI. The
         | only way to make it 'truly' async was to handle the runloops
         | myself, and add a separate queue, onto which I push coroutines,
         | for processing. UI events and Serial I/O events (involving
         | passing of messages to update states on both sides) all have to
         | be pushed through the same mechanism in order to gain the
         | functionality I need.
         | 
         | Sure, I 'could just use asyncio', or somehow work out how to
         | crowbar serial i/o into tkinter's runloop. But in the end,
         | writing my own just made more sense, and more importantly: it
         | works great. I can have serial i/o and UI acting independently,
         | but coordinating through a single queue .. this works so well.
         | 
         | >I don't understand why every third party async Python
         | library/framework feels the need to take over the default event
         | loop instead of just building on top of it.
         | 
         | Because you don't always have what you need to get the crowbar
         | in place, nor big enough leverage to make space for what you
         | have to do, asynchronously, in the app.
        
           | bmitc wrote:
           | But that can indeed just be done with the standard `asyncio`
           | loop. You run your GUI in a thread, run the `asyncio` event
           | loop in its own thread, pass the `asyncio` loop messages with
           | an `asyncio.Queue` and `asyncio.run_coroutine_threadsafe`,
           | and then use `asyncio.to_thread` for the serial communication
           | within the `asyncio` event loop.
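
A minimal, standard-library-only sketch of the arrangement bmitc describes: the `asyncio` loop runs in its own thread, the GUI side hands it messages through an `asyncio.Queue` via `asyncio.run_coroutine_threadsafe`, and the blocking serial read goes through `asyncio.to_thread`. The names `blocking_serial_read`, `handle_messages`, and `run_demo` are hypothetical, and the "GUI" is just the calling thread; a real tkinter app would call the same functions from its callbacks.

```python
import asyncio
import threading
import time


def blocking_serial_read() -> str:
    """Hypothetical stand-in for a blocking serial read."""
    time.sleep(0.01)
    return "OK"


async def handle_messages(inbox: asyncio.Queue, results: list) -> None:
    """Consume messages on the asyncio loop until a None sentinel arrives."""
    while True:
        msg = await inbox.get()
        if msg is None:
            break
        # Run the blocking read in a worker thread so the loop stays free.
        reply = await asyncio.to_thread(blocking_serial_read)
        results.append((msg, reply))


async def make_queue() -> asyncio.Queue:
    # Create the queue on the loop's own thread.
    return asyncio.Queue()


def run_demo() -> list:
    # Start the asyncio loop in a background thread.
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()

    results: list = []
    inbox = asyncio.run_coroutine_threadsafe(make_queue(), loop).result()
    consumer = asyncio.run_coroutine_threadsafe(
        handle_messages(inbox, results), loop
    )

    # "GUI thread" side: hand messages to the loop in a thread-safe way.
    for msg in ("button-1", "button-2", None):
        asyncio.run_coroutine_threadsafe(inbox.put(msg), loop).result()

    consumer.result(timeout=5)

    # Shut the loop down cleanly.
    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=5)
    loop.close()
    return results
```

Calling `run_demo()` returns the processed messages in order. The single queue serializes UI and serial events much like the hand-written run loop described above, without replacing the default loop.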
        
         | kodablah wrote:
         | To add to other responses here, Temporal doesn't take over the
         | default event loop in general (and users still use it for
         | clients and activities and such). Temporal workflows must be
         | deterministic and durable which means they are guaranteed to
         | run and are resumable on other machines. Therefore Temporal
         | workflows specifically operate on a custom event loop
         | implementation. It doesn't affect anything outside the
         | workflow.
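
To make "a custom event loop for workflows only" concrete, here is a toy sketch, not Temporal's implementation. It assumes CPython's behavior that `asyncio.sleep` schedules its wake-up through `loop.call_later`, and overrides that one method so timers are recorded rather than waited on. `RecordingLoop` and `workflow_like` are hypothetical names. A durable system would persist the recorded timer and fire it later, possibly on another machine; this sketch fires it on the next loop iteration.

```python
import asyncio


class RecordingLoop(asyncio.SelectorEventLoop):
    """Toy loop that records timers instead of waiting on the wall clock."""

    def __init__(self) -> None:
        super().__init__()
        self.timers = []

    def call_later(self, delay, callback, *args, context=None):
        # Interception point: log the requested delay, then run the
        # callback on the next iteration instead of after `delay` seconds.
        self.timers.append(delay)
        return self.call_soon(callback, *args, context=context)


async def workflow_like() -> str:
    await asyncio.sleep(3600)   # an hour on a normal loop
    await asyncio.sleep(86400)  # a day on a normal loop
    return "done"


def run_on_recording_loop():
    # Only this coroutine runs on the custom loop; the process-wide
    # default loop is untouched.
    loop = RecordingLoop()
    try:
        result = loop.run_until_complete(workflow_like())
    finally:
        loop.close()
    return result, loop.timers
```

`run_on_recording_loop()` returns `("done", [3600, 86400])` almost immediately: the coroutine is ordinary `asyncio` code, and only the loop it runs on differs.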
        
         | robertlagrant wrote:
         | It's like that old Joel Spolsky article[0] says:
         | 
         | > you only have to get one supergenius to write the hard code
         | to run map and reduce on a global massively parallel array of
         | computers, and all the old code that used to work fine when you
         | just ran a loop still works only it's a zillion times faster
         | which means it can be used to tackle huge problems in an
         | instant
         | 
         | If you can replace the thing that people use with a distributed
         | version, then that can make it easy to write distributed code.
         | 
         | [0] https://www.joelonsoftware.com/2006/08/01/can-your-
         | programmi...
        
         | davepeck wrote:
         | The OP seems targeted at devs who are already quite familiar
         | with Temporal and are interested in using the new Python
         | exposure.
         | 
         | FWIW, as someone who has never previously encountered Temporal,
         | and has only a vague sense of the specific problem set it's
         | trying to tackle and architectural approach it has taken, I
         | find the post to be fairly impenetrable.
         | 
         | I'd love to read a proper introduction to Temporal by way of
         | Python (and probably also by comparison to Celery).
        
       | avi_vallarapu wrote:
       | Relying on external APIs or databases within activities might
       | lead to variability in workflow execution.
       | 
       | Also, handling HTTP errors in activities by raising an
       | "ApplicationError" based on the status code might simplify
       | error handling, but might need to see how it accounts for more
       | complex scenarios where errors are transient or where a retry
       | could be successful even for some client errors like rate
       | limiting or temporary unavailability etc.
       | 
       | As the asyncio library itself has a steep learning curve, and
       | workflow systems like Temporal also use Python's native
       | asynchronous features, developers integrating the two should be
       | careful about indirect or subtle bugs, especially in error
       | handling and task management.
        
         | kodablah wrote:
         | > Relying on external APIs or databases within activities might
         | lead to variability in workflow execution.
         | 
         | This is why they are activities. Their results are stored in
         | history, the workflow remains deterministic.
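
The idea that recorded activity results keep a workflow deterministic can be sketched with a toy history. This is a simplified illustration, not the Temporal SDK: `Replayer`, `fetch_price`, and `workflow` are hypothetical names, and a real history records far more than return values.

```python
import random


class Replayer:
    """Toy history: record each activity result once, reuse it on replay."""

    def __init__(self, history=None):
        self.history = list(history or [])
        self._cursor = 0

    def execute_activity(self, fn, *args):
        if self._cursor < len(self.history):
            # Replay: reuse the recorded result, skip the side effect.
            result = self.history[self._cursor]
        else:
            # First execution: do the real call and record its result.
            result = fn(*args)
            self.history.append(result)
        self._cursor += 1
        return result


def fetch_price(item: str) -> int:
    """Non-deterministic activity (stand-in for an external API call)."""
    return random.randint(1, 100)


def workflow(ctx: Replayer) -> str:
    """Deterministic given the history: same results in, same decision out."""
    price = ctx.execute_activity(fetch_price, "widget")
    return "buy" if price < 50 else "wait"


first = Replayer()
decision = workflow(first)

# After a crash, possibly on another machine, replay from stored history.
replayed = Replayer(first.history)
same_decision = workflow(replayed)
```

`same_decision` always equals `decision`, even though `fetch_price` is random, because the replay never calls it again.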
         | 
         | > might need to see how it accounts for more complex scenarios
         | where errors are transient or where a retry could be successful
         | even for some client errors like rate limiting or temporary
         | unavailability etc.
         | 
         | Temporal allows you to specify whether an error is retryable or
         | not.
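
A sketch of the retryable/non-retryable distinction in plain Python, for the rate-limiting case raised above. This is not the Temporal API: `NonRetryableError`, `run_with_retries`, and `make_flaky_call` are hypothetical, and the status codes are simulated rather than coming from a real HTTP client.

```python
import time


class NonRetryableError(Exception):
    """Raised for failures that another attempt cannot fix."""


def run_with_retries(fn, max_attempts=5, backoff_seconds=0.0):
    """Call fn until it succeeds, retrying everything except NonRetryableError."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except NonRetryableError:
            raise  # give up immediately
        except Exception as err:
            last_error = err
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise last_error


def make_flaky_call(statuses):
    """Return a callable that yields one simulated HTTP status per attempt."""
    remaining = list(statuses)

    def call():
        status = remaining.pop(0)
        if status == 429:
            # Rate limiting is a client error, but a transient one: retry.
            raise RuntimeError("rate limited")
        if 400 <= status < 500:
            raise NonRetryableError(f"client error {status}")
        return status

    return call
```

With this split, `run_with_retries(make_flaky_call([429, 429, 200]))` returns `200` on the third attempt, while a `404` raises `NonRetryableError` on the first. The classification lives in the code that knows the failure, not in the retry loop.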
        
       | KaiserPro wrote:
       | Isn't this just threads but with more surprise gotos?
        
         | ikari_pl wrote:
         | no, the whole point of temporal is to distribute work across
         | machines without worrying too much about the orchestration.
         | 
         | workflows and activities are called remotely, and you can have
         | an autoscaled worker pool handling these calls.
         | 
         | you can retry any unit easily on failure and specify the non
         | retryable errors. What it requires in exchange is full
         | determinism - the same input should produce the same activities
         | in the same order, as a good starting point.
         | 
         | src: I've been a user for over a year.
        
         | anonzzzies wrote:
         | Threads are neither durable nor distributed.
        
       | benakh wrote:
       | Anyone migrated from celery, with / without regrets?
        
         | kodablah wrote:
         | Many Temporal users used Celery in the past. There was a
         | popular blog post a while back about issues with celery:
         | https://steve.dignam.xyz/2023/05/20/many-problems-with-
         | celer.... Here's a brief heading-by-heading listing of how
         | Temporal addresses those issues:
         | https://community.temporal.io/t/suggestion-for-blog-post-
         | abo....
         | 
         | (disclaimer, I'm the author of the post)
        
           | Izkata wrote:
           | The "API isn't Pythonic" examples are misleading, the first
           | and third are using more verbose forms of:
           | add.delay(1, 2)
           | 
           | The verbose forms are for when you want extra functionality
           | like in the second example.
           | 
           | It's relatively small compared to the other issues but it
           | sticks out because it's one of only two listed as "you'll
           | have to live with it".
        
             | kodablah wrote:
             | (to clarify my ambiguous disclaimer, I am the author of
             | OP's Temporal post, not the Celery one)
        
         | 015a wrote:
         | We migrated from an in-house redis queuing system.
         | 
         | Temporal has its own way of doing things; there are rules
         | about what you can and can't do in workflows, what has to live
         | in activities, etc. It's generally quite easy to adapt
         | existing code to work with it. We use TypeScript.
         | 
         | The worst part for us has been error/anomaly handling.
         | Workflows can sometimes hit a state where the status reads in
         | progress and errors aren't reported anywhere except buried in
         | the event log; which surfaces great in the UI but we still
         | haven't figured out how to programmatically respond to this
         | condition.
         | 
         | A good example is: we use a home-grown version of this [1] to
         | proxy large payloads to S3. However, if those payloads get
         | REALLY large, they can take some time to upload and download;
         | and if that "some time" is longer than 5 seconds, the control
         | plane will believe that the worker has died, it won't
         | reschedule, and the workflow just sits in In Progress. There's
         | always a beautiful error on the temporal dashboard, and we can
         | manually terminate/retry, but the world just seems to die when
         | this happens and we can't do error-level cleanup stuff like
         | alert the user that the thing they were doing didn't finish.
         | 
         | Temporal is also challenging to get support for. It's new, open
         | source, we don't pay for temporal cloud, and there's not a ton
         | of resources or people using it. The documentation is quite bad
         | (if you like 500,000 word pages, codegen'd library sites with
         | no comments, and one example for each feature, you'll like
         | their documentation). Given we run our own temporal cluster,
         | we've also had pretty large challenges in the self-hosting
         | world. We work through them, usually after deep-diving into the
         | temporal server code itself, but there's startlingly little
         | documentation on self-hosting, and even less community support.
         | 
         | Overall, we don't regret adopting it, but if we had a time
         | machine we wouldn't do it again. I feel it makes a series of
         | sacrifices in order to create a system that has extremely high
         | standards for processing, like financial/bank/healthcare level
         | stuff. But, not only are we not building that, but the system
         | has never behaved in a way which makes me think I'd even want
         | to use it if I worked in those industries. Obviously I feel
         | like I'm the one in the wrong here, and I'm sure it's just a
         | matter of "we screwed up something somewhere", but that leads
         | back to: bad documentation, no way to get professional support
         | without being on their cloud, and a lack of community support.
         | 
         | [1] https://github.com/DataDog/temporal-large-payload-codec
        
           | no_wizard wrote:
           | Would you not like it if you didn't self host?
           | 
           | If I'm being honest, if it is a big issue to self-host but
           | its value to developers is obvious and apparent, why not pay?
        
             | 015a wrote:
             | Nah we'd probably be fine paying temporal cloud to host the
             | control plane. Their billing is a little weird; I know
             | quite a bit about temporal-the-technology, and the pricing
             | page is literally the first time I've ever seen the word
             | "action" used. I'm familiar with workflows, activities,
             | sinks, codecs, events, but not actions; so when they bill
             | $N/million actions I have no idea what that means, and it's
             | surprising to me that _that's_ how they bill it. But I'm
             | sure there's an answer somewhere.
             | 
             | Temporal Cloud is really, really new. Like, it was in some
             | kind of closed beta for a while, with a "contact us" form,
             | as recently as a couple months ago? So, the main reason we
             | don't use it is because it simply wasn't available. It
             | looks like it's more widely available now though.
        
       | 7bit wrote:
       | Is this equal to Azure Durable Functions?
        
         | kodablah wrote:
         | Not necessarily "equal" but the basic premise is the same, yes,
         | and there is a common lineage. Azure Durable Functions sits on
         | Azure Durable Task Framework which was created by the co-
         | founder of Temporal (https://temporal.io/about).
         | 
         | (disclaimer, I'm the author of the post)
        
           | 7bit wrote:
           | Ohh, that's great to hear! I do like ADF, but the Python
           | worker is full of bugs and weird behaviour and tickets stay
           | open for months without progress. I will definitely check
           | that out!
        
             | rjbwork wrote:
             | During an evaluation I found a bug in their library. I went
             | to their Slack, posted about it, and they gave a workaround
             | in 15 minutes, created an issue in 30, and had a bugfix PR
             | ready the next day. Pretty impressed with their team.
             | 
             | Excited to get to use it at some point.
        
       | pierrebai wrote:
       | Read the example code, have a sinking feeling that it is not
       | taken from a real tested example. Either there are multiple
       | unexplained symbols or the code does not actually run.
       | 
       | For example, in "Implementing a Workflow" the execute_activity
       | refers to Purchaser.purchase, which is not declared anywhere.
       | 
       | If the execute_activity times out after 1 minute, the status
       | does not seem to be updated anywhere.
       | 
       | In "Running a Worker", do_purchaser is passed as an activity,
       | without explanation. (I guess I'd need to read the fundamental
       | Temporal docs?)
        
         | kodablah wrote:
         | Yes, it has undergone revisions since which caused function
         | name mismatch (EDIT: fixed). The execute_activity there uses
         | start_to_close_timeout which is per attempt and will retry
         | forever by default (customizable).
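
What "per attempt" means can be sketched with `asyncio.wait_for`. This is an illustration in plain asyncio, not the SDK's `execute_activity`: `execute_with_attempt_timeout` and `slow_then_fast` are hypothetical names. The timeout bounds each attempt rather than the whole call, and with no attempt limit the loop keeps retrying.

```python
import asyncio


async def execute_with_attempt_timeout(make_attempt, timeout, max_attempts=None):
    """Run make_attempt(n) with a fresh timeout for every attempt.

    max_attempts=None mirrors "retry forever"; pass a number to cap it.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            # The timeout clock restarts on each attempt.
            return await asyncio.wait_for(make_attempt(attempt), timeout)
        except asyncio.TimeoutError:
            continue  # this attempt timed out; try again
    raise asyncio.TimeoutError(f"gave up after {attempt} attempts")


async def slow_then_fast(attempt: int) -> str:
    """Simulated activity: too slow twice, then fast."""
    await asyncio.sleep(0.2 if attempt < 3 else 0)
    return f"ok on attempt {attempt}"


result = asyncio.run(execute_with_attempt_timeout(slow_then_fast, timeout=0.05))
```

Here `result` is `"ok on attempt 3"`. The first two attempts each hit the 0.05 s limit, and the caller never sees those timeouts, which is why a status updated only on failure would never change.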
         | 
         | This is more of a primer on the Python part of Temporal rather
         | than an explanation of all Temporal concepts in depth.
         | Definitely would recommend reading the fundamental docs at
         | https://docs.temporal.io/encyclopedia/. For more exact samples,
         | see https://github.com/temporalio/samples-python.
        
       | sscarduzio wrote:
       | I can't understand what layer provides the state orchestration.
       | Like, in celery is redis. What about here?
        
         | kodablah wrote:
         | The Temporal server stores events and distributes tasks. There
         | is a cloud offering or it can be self-hosted (with support for
         | Cassandra, Postgres, MySQL, and SQLite persistence). This post
         | focuses more on the Temporal Python SDK and not the general
         | platform.
        
           | kcorbitt wrote:
           | Could you or anyone else with experience with Temporal share
           | how hard it is to self-host in practice? Like, is this more
           | like Redis (self-hosting is trivial) or Supabase (nominally
           | self-hostable, but if you try to do it you'll quickly realize
           | it's a pain and the happy path is to use their hosted
           | platform).
        
             | kodablah wrote:
             | We offer a full guide to help here at
             | https://docs.temporal.io/self-hosted-guide and many users
             | of all sizes self-host Temporal. Having said that, it has
             | challenges as does running any high-available production
             | system. We offer cloud to ease this burden. You still run
             | all your code/workers and you can end-to-end encrypt all
             | data.
        
       | troebr wrote:
       | I found out about Temporal 2+ years ago and we were early
       | adopters of their cloud. It's a bit of a paradigm shift when you
       | start using it, but it is amazing at solving some types of
       | problems in a very simple manner. There are some trade-offs,
       | there always are; for us that was migrating long-lived
       | workflows. But the resulting simplicity and maintainability of
       | the code has been great. One thing that was hard with Temporal
       | was to "sell it" to business leaders because it's not a turnkey
       | solution, it's more like a piece of infra for engineers to build
       | on top of. Kind of a higher level database-queue-workflow engine
       | thing that simplifies work for engineers.
       | 
       | In short we were working on automating things like onboarding a
       | new employee, which involves creating accounts for their saas
       | apps, buying and shipping their device, email confirmations,
       | satisfaction surveys etc. So a workflow could last up to 3
       | months with some fully automated systems, and some that required
       | integrating with people (listening to Jira events to trigger
       | things, etc).
       | 
       | The error handling was the thing that sold me on Temporal,
       | because things can break just about anywhere in unpredictable
       | ways (not just code, can be process, employee quits during the
       | onboarding, customer is out of licenses etc), so we need
       | everything to be robust and be fixable by a person. With
       | homegrown queue based systems or with BPEL it can be hard
       | handling these situations (what if you need to roll back 3
       | steps?). With code you can use exceptions, write unit tests etc.
       | We use the typescript sdk, promises made it very intuitive to
       | code even some otherwise complicated scenarios (say event
       | listeners etc).
        
       | siliconc0w wrote:
       | I'd question if you really need to distribute work across
       | machines. It's great this makes distributed systems easier to
       | write but it's much better to reject the premise and avoid
       | writing them in the first place.
        
         | dandandan wrote:
         | Distributing the work across machines essentially comes for
         | free with the ability to replay workflows from any step. You
         | could run all of your worker processes on a single machine if
         | it had enough capacity, but resuming a workflow on a different
         | machine is transparent to your workloads assuming there's no
         | local state.
        
       | amackera wrote:
       | I love Temporal-- we use it at my company. It's very very good
       | for our use case, but took a while to understand how to use it.
       | We're still figuring things out (Workflow versioning is one thing
       | we suck at still).
       | 
       | That said, I'm not sure why this post from 2023 was posted here
       | today. There've been multiple updates to the Python SDK since
       | this post.
        
         | swyx wrote:
         | > Workflow versioning is one thing we suck at still
         | 
         | well it's not entirely your fault :) what practices have you
         | adopted now that you have some experience with it?
         | 
         | lower down OP mentions that they got the link from a HN
         | discussion on asyncio 2 days ago
         | https://news.ycombinator.com/item?id=40287354 . i guess the
         | upvotes are today's lucky 10,000 learning about it for the
         | first time.
        
       | addisonj wrote:
       | It is interesting seeing the comments here. The comments from
       | adopters say there is a lot of value but that it takes time to
       | get up to speed. Those new to Temporal have a lot of questions
       | seeking understanding.
       | 
       | I have spent a lot of time in the adjacent space of event driven
       | systems and there, like here, it seems like one of the biggest
       | challenges is just education.
       | 
       | I wouldn't say that EDA or workflow based systems are preferable
       | to traditional API services with DBs, just that the space they
       | occupy in the industry is so large that I think it is really
       | really hard to introduce any different paradigms, even
       | when you focus on domains where API services aren't a great fit
       | (like here with long running, complex operations).
       | 
       | My point with this comment is simply that I think if you are
       | trying to build anything that does things differently, developer
       | education is as important or even more important than design and
       | architecture, but often not considered because those building
       | these systems are already so deep into it that they can't
       | approach the problem as an outsider.
        
       | BiteCode_dev wrote:
       | I like the concept.
       | 
       | The nice thing is that it abstracts the condition checks on
       | whether something is done, has succeeded or should be retried.
       | 
       | The bad thing is that it abstracts the condition checks on
       | whether something is done, has succeeded or should be retried.
       | 
       | It's nice because that's something you do again and again, and
       | that's a lot of code. A lot of ways it can go wrong.
       | 
       | But it's bad because that's a huge chunk of black box magic that
       | may execute remotely. If you need a custom or more optimized
       | behavior anywhere in this logic, you are done for. If there is a
       | bug/problem in this logic, it's game over. I also have to imagine
       | debugging and error reporting is likely not super fun.
       | 
       | One point in particular that strikes me is that idempotence is
       | generically guaranteed with something like "has this task
       | executed without error last time". But usually, what I want is
       | something much more specific, like "have those entries been
       | updated", "has that file been created" and so on. From a
       | bird's-eye view, it looks the same, but from a system
       | reliability point of view, they are not at all the same.
       | 
       | Hard to see how they avoid duplicate results, overlapping tasks,
       | etc.
       | 
       | I don't think they really can at that level of abstraction, which
       | means you need to implement it manually.
       | 
       | In the end it seems it's a huge dep to bring in for the actual
       | practical problem it really solves well.
       | 
       | But I'm willing to be proven wrong on this one, because the tech
       | is really damn cool.
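
The "has that file been created" style of check mentioned above can live inside the task itself, so that a retried or duplicated execution becomes a no-op. A minimal sketch, assuming a local filesystem; `write_report_once` is a hypothetical name, and this says nothing about what Temporal does internally.

```python
import os
import tempfile


def write_report_once(path: str, content: str) -> bool:
    """Create the file only if it is absent.

    Returns True if this call created it, False if an earlier call did.
    """
    try:
        # O_EXCL makes "create if absent" a single atomic step, so two
        # overlapping executions cannot both succeed.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "report.txt")
    first = write_report_once(target, "v1")   # creates the file
    second = write_report_once(target, "v2")  # retry: detected, skipped
    with open(target) as f:
        stored = f.read()
```

After this runs, `first` is `True`, `second` is `False`, and `stored` is `"v1"`. The check is tied to the effect rather than to whether a previous run reported success. A crash between creating and writing the file would still need separate handling, such as writing to a temporary name and renaming.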
        
       | fortylove wrote:
       | I find Temporal itself to be effectively a clone of Amazon's
       | Simple Workflow (SWF)
        
         | kodablah wrote:
         | That's no coincidence, Temporal is founded by the creators of
         | Amazon Simple Workflow. See https://temporal.io/about.
        
       | ub-volta-toss wrote:
       | Temporal is really neat but I think it's marketed at too many use
       | cases.
       | 
       | After a year of high-scale Temporal work, I found it was only
       | good for low-scale work.
       | 
       | The onboarding and learning curve were insanely difficult and
       | complex. Ultimately it doesn't scale as well as you think. The
       | temporal team invented their own database to get around this
       | limitation.
        
       ___________________________________________________________________
       (page generated 2024-05-09 23:01 UTC)