[HN Gopher] Temporal Python - A durable, distributed asyncio eve...
___________________________________________________________________
Temporal Python - A durable, distributed asyncio event loop (2023)
Author : metadat
Score : 179 points
Date : 2024-05-07 15:56 UTC (2 days ago)
(HTM) web link (temporal.io)
(TXT) w3m dump (temporal.io)
| metadat wrote:
| Credit: @kodablah and @chippiewill, thanks for turning me on to
| this!
|
| https://news.ycombinator.com/item?id=40282650
| bmitc wrote:
| As someone familiar with asyncio, I don't understand what this is
| or what it's for. What's an activity, workflow, or worker?
|
| > See the asyncio.sleep in there? That's no normal local-process
| sleep; that's a durable timer backed by Temporal.
|
| That's the normal asyncio.sleep. What does backed by Temporal
| mean? Reading further, it appears that Temporal is replacing the
| default asyncio event loop. I don't understand why every third
| party async Python library/framework feels the need to take over
| the default event loop instead of just building on top of it.
| adhamsalama wrote:
| I think it means that your code could be resumed on a different
| machine.
| storyinmemo wrote:
| Temporal is completely above and beyond Asyncio. It's a full
| scheduling of work and queues that's cross-machine, cross-
| language, and very transparent.
|
| A workflow is the code that handles only deterministic actions
| and calls activities.
|
| Activities are functions that do anything you want, typically
| affecting other systems with network or file calls.
|
| A worker is the running process connected to Temporal with
| registered workflows & activities for it to pass work to.
|
| I'm doing a lot of work with alert handling and provisioning
| systems using Temporal. Temporal in two minutes is a great
| video explanation: https://www.youtube.com/watch?v=f-18XztyN6c
| uniqueuid wrote:
| The thing with event loops in python is that they are not a
| single, all-governing scheduler (as e.g. in the BEAM).
|
| ev loops instead are a mid-layer concept that sits below other
| infrastructure such as threads and processes. And (perhaps
| somewhat frustratingly) it is not too uncommon to have multiple
| ev loops in parallel. See for example the proxy.py project,
| which offers to run one async loop per process for a speedup.
|
| As a result, there are some incentives to swap out the loop
| itself, e.g. for faster implementations like uvloop, because
| they are somewhat pluggable anyways.
| mapcars wrote:
| > all-governing scheduler (as e.g. in the BEAM).
|
| Does this also mean they have preemptive multitasking like in
| BEAM?
| blegr wrote:
| Good design dictates that you start one loop and build the
| whole program around it, no? The docs for asyncio.run say as
| much.
| kodablah wrote:
| Yes, that is good design and the event loop should
| basically be shared process-wide (asyncio objects are
| usually not thread safe and cannot be shared across event
| loops). Temporal only does custom event loops in isolated
| workflows.
| helpfulContrib wrote:
| I just had to write a Python program that handled multiple
| async events - from a serial line, and with a tkinter GUI. The
| only way to make it 'truly' async was to handle the runloops
| myself, and add a separate queue, onto which I push coroutines,
| for processing. UI events and Serial I/O events (involving
| passing of messages to update states on both sides) all have to
| be pushed through the same mechanism in order to gain the
| functionality I need.
|
| Sure, I 'could just use asyncio', or somehow work out how to
| crowbar serial i/o into tkinters' runloop. But in the end,
| writing my own just made more sense, and more importantly: it
| works great. I can have serial i/o and UI acting independently,
| but coordinating through a single queue .. this works so well.
|
| >I don't understand why every third party async Python
| library/framework feels the need to take over the default event
| loop instead of just building on top of it.
|
| Because you don't always have what you need to get the crowbar
| in place, nor big enough leverage to make space for what you
| have to do, asynchronously, in the app.
| bmitc wrote:
| But that can indeed just be done with the standard `asyncio`
| loop. You run your GUI in a thread, run the `asyncio` event
| loop in its own thread, pass the `asyncio` loop messages with
| an `asyncio.Queue` and `asyncio.run_coroutine_threadsafe`,
| and then use `asyncio.to_thread` for the serial communication
| within the `asyncio` event loop.
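
The arrangement bmitc describes can be sketched with the standard library alone. This is a minimal illustration, assuming Python 3.9+ for `asyncio.to_thread`; `blocking_serial_read`, `make_queue`, `consume`, and `demo` are hypothetical names, and the calling thread stands in for the GUI thread.

```python
import asyncio
import threading


def blocking_serial_read(command: str) -> str:
    # Hypothetical stand-in for a blocking serial-port exchange.
    return f"ack:{command}"


async def make_queue() -> asyncio.Queue:
    # Create the queue on the loop's own thread so it belongs to that loop.
    return asyncio.Queue()


async def consume(queue: asyncio.Queue, handled: list) -> None:
    # Runs on the asyncio loop and handles commands sent from another thread.
    while True:
        command = await queue.get()
        if command is None:  # sentinel: stop consuming
            return
        # Blocking I/O goes to a worker thread so the loop stays responsive.
        reply = await asyncio.to_thread(blocking_serial_read, command)
        handled.append(reply)


def demo() -> list:
    # The asyncio loop gets its own thread; the caller plays the GUI thread.
    loop = asyncio.new_event_loop()
    loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
    loop_thread.start()

    handled = []
    queue = asyncio.run_coroutine_threadsafe(make_queue(), loop).result(timeout=5)
    consumer = asyncio.run_coroutine_threadsafe(consume(queue, handled), loop)

    # asyncio.Queue is not thread-safe, so puts are scheduled onto the loop.
    for command in ("ping", "status", None):
        asyncio.run_coroutine_threadsafe(queue.put(command), loop).result(timeout=5)

    consumer.result(timeout=5)
    loop.call_soon_threadsafe(loop.stop)
    loop_thread.join(timeout=5)
    loop.close()
    return handled


handled = demo()
```

In a real tkinter program the GUI callbacks would make the `run_coroutine_threadsafe` calls, and results would have to be handed back to the GUI thread through whatever mechanism the toolkit allows.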
| kodablah wrote:
| To add to other responses here, Temporal doesn't take over the
| default event loop in general (and users still use it for
| clients and activities and such). Temporal workflows must be
| deterministic and durable which means they are guaranteed to
| run and are resumable on other machines. Therefore Temporal
| workflows specifically operate on a custom event loop
| implementation. It doesn't affect anything outside the
| workflow.
| robertlagrant wrote:
| It's like that old Joel Spolsky article[0] says:
|
| > you only have to get one supergenius to write the hard code
| to run map and reduce on a global massively parallel array of
| computers, and all the old code that used to work fine when you
| just ran a loop still works only it's a zillion times faster
| which means it can be used to tackle huge problems in an
| instant
|
| If you can replace the thing that people use with a distributed
| version, then that can make it easy to write distributed code.
|
| [0] https://www.joelonsoftware.com/2006/08/01/can-your-
| programmi...
| davepeck wrote:
| The OP seems targeted at devs who are already quite familiar
| with Temporal and are interested in using the new Python
| exposure.
|
| FWIW, as someone who has never previously encountered Temporal,
| and has only a vague sense of the specific problem set it's
| trying to tackle and architectural approach it has taken, I
| find the post to be fairly impenetrable.
|
| I'd love to read a proper introduction to Temporal by way of
| Python (and probably also by comparison to Celery).
| avi_vallarapu wrote:
| Relying on external APIs or databases within activities might
| lead to variability in workflow execution.
|
| Also, handling HTTP errors in activities by raising an
| "ApplicationError" based on the status code might simplify error
| handling, but I'd need to see how it accounts for more complex
| scenarios where errors are transient or where a retry could
| succeed even for some client errors, like rate limiting or
| temporary unavailability.
|
| As the asyncio library itself has a steep learning curve, and
| workflow systems like Temporal also build on Python's native
| asynchronous features, developers should be careful about
| indirect or subtle bugs, especially in error handling and task
| management.
| kodablah wrote:
| > Relying on external APIs or databases within activities might
| lead to variability in workflow execution.
|
| This is why they are activities. Their results are stored in
| history, the workflow remains deterministic.
|
| > might need to see how it accounts for more complex scenarios
| where errors are transient or where a retry could be successful
| even for some client errors like rate limiting or temporary
| unavailability etc.
|
| Temporal allows you to specify whether an error is retryable or
| not.
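
The point about activity results being stored in history can be illustrated with a toy replay model. This is only a sketch of the general idea, not Temporal's implementation or API; `History`, `run_activity`, `fetch_price`, and `workflow` are all hypothetical names.

```python
import random


class History:
    """Hypothetical append-only log of activity results for one workflow run."""

    def __init__(self, events=None):
        self.events = list(events or [])
        self.cursor = 0

    def run_activity(self, fn, *args):
        if self.cursor < len(self.events):
            # Replay: reuse the recorded result instead of calling fn again.
            result = self.events[self.cursor]
        else:
            # First execution: call the activity and record its result.
            result = fn(*args)
            self.events.append(result)
        self.cursor += 1
        return result


calls = []


def fetch_price(item: str) -> float:
    # Non-deterministic side effect, standing in for an external API call.
    calls.append(item)
    return round(random.uniform(1, 100), 2)


def workflow(history: History, item: str) -> float:
    # Workflow code is deterministic given the activity results it receives.
    price = history.run_activity(fetch_price, item)
    return price * 2


first_run = History()
original = workflow(first_run, "widget")

# A second process (or machine) replays from the stored events.
replayed = workflow(History(first_run.events), "widget")
```

Here `replayed` equals `original` and `fetch_price` is called only once, because the second run reads the recorded result rather than touching the outside world.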
| KaiserPro wrote:
| Isn't this just threads but with more surprise gotos?
| ikari_pl wrote:
| no, the whole point of temporal is to distribute work across
| machines, but without worrying too much on the orchestration.
|
| workflows and activities are called remotely, and you can have
| an autoscaled worker pool handling these calls.
|
| you can retry any unit easily on failure and specify the non
| retryable errors. What it requires in exchange is full
| determinism - the same input should produce the same activities
| in the same order, as a good starting point.
|
| src: I'm a user since over a year ago.
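
The retry behaviour ikari_pl mentions (retry on failure, except for errors declared non-retryable) can be sketched in plain Python. This assumes nothing about Temporal's own API; `NonRetryableError` and `run_with_retries` are hypothetical names for illustration.

```python
import time


class NonRetryableError(Exception):
    """Hypothetical marker for failures that should never be retried."""


def run_with_retries(fn, max_attempts=3, backoff_seconds=0.0):
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn()
        except NonRetryableError:
            # Declared non-retryable: surface immediately.
            raise
        except Exception:
            if attempt >= max_attempts:
                raise
            # Exponential backoff between attempts (zero by default here).
            time.sleep(backoff_seconds * 2 ** (attempt - 1))


attempts = {"flaky": 0, "fatal": 0}


def flaky():
    # Fails twice with a transient error, then succeeds.
    attempts["flaky"] += 1
    if attempts["flaky"] < 3:
        raise ConnectionError("transient failure")
    return "done"


def fatal():
    # Fails in a way that retrying cannot fix.
    attempts["fatal"] += 1
    raise NonRetryableError("bad input")


flaky_result = run_with_retries(flaky)

try:
    run_with_retries(fatal)
    fatal_raised = False
except NonRetryableError:
    fatal_raised = True
```

The transient failure is retried until it succeeds on the third attempt, while the non-retryable one is attempted exactly once.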
| anonzzzies wrote:
| Threads are neither durable nor distributed.
| benakh wrote:
| Anyone migrated from celery, with / without regrets?
| kodablah wrote:
| Many Temporal users used Celery in the past. There was a
| popular blog post a while back about issues with celery:
| https://steve.dignam.xyz/2023/05/20/many-problems-with-
| celer.... Here's a brief heading-by-heading listing of how
| Temporal addresses those issues:
| https://community.temporal.io/t/suggestion-for-blog-post-
| abo....
|
| (disclaimer, I'm the author of the post)
| Izkata wrote:
| The "API isn't Pythonic" examples are misleading, the first
| and third are using more verbose forms of:
| add.delay(1, 2)
|
| The verbose forms are for when you want extra functionality
| like in the second example.
|
| It's relatively small compared to the other issues but it
| sticks out because it's one of only two listed as "you'll
| have to live with it".
| kodablah wrote:
| (to clarify my ambiguous disclaimer, I am the author of
| OP's Temporal post, not the Celery one)
| 015a wrote:
| We migrated from an in-house redis queuing system.
|
| Temporal has its own way of doing things; there are rules about
| what you can and can't do in workflows, what has to live in
| activities, etc. It's generally quite easy to adapt existing
| code to work with it. We use TypeScript.
|
| The worst part for us has been error/anomaly handling.
| Workflows can sometimes hit a state where the status reads in
| progress and errors aren't reported anywhere except buried in
| the event log; which surfaces great in the UI but we still
| haven't figured out how to programmatically respond to this
| condition.
|
| A good example is: we use a home-grown version of this [1] to
| proxy large payloads to S3. However, if those payloads get
| REALLY large, they can take some time to upload and download;
| and if that "some time" is longer than 5 seconds, the control
| plane will believe that the worker has died, it won't
| reschedule, and the workflow just sits in In Progress. There's
| always a beautiful error on the temporal dashboard, and we can
| manually terminate/retry, but the world just seems to die when
| this happens and we can't do error-level cleanup stuff like
| alert the user that the thing they were doing didn't finish.
|
| Temporal is also challenging to get support for. It's new, open
| source, we don't pay for temporal cloud, and there's not a ton
| of resources or people using it. The documentation is quite bad
| (if you like 500,000 word pages, codegen'd library sites with
| no comments, and one example for each feature, you'll like
| their documentation). Given we run our own temporal cluster,
| we've also had pretty large challenges in the self-hosting
| world. We work through them, usually after deep-diving into the
| temporal server code itself, but there's startlingly little
| documentation on self-hosting, and even less community support.
|
| Overall, we don't regret adopting it, but if we had a time
| machine we wouldn't do it again. I feel it makes a series of
| sacrifices in order to create a system that has extremely high
| standards for processing, like financial/bank/healthcare level
| stuff. But, not only are we not building that, but the system
| has never behaved in a way which makes me think I'd even want
| to use it if I worked in those industries. Obviously I feel
| like I'm the one in the wrong here, and I'm sure it's just a
| matter of "we screwed up something somewhere", but that leads
| back to: bad documentation, no way to get professional support
| without being on their cloud, and a lack of community support.
|
| [1] https://github.com/DataDog/temporal-large-payload-codec
| no_wizard wrote:
| Would you not like it if you didn't self host?
|
| If I'm being honest, if it is a big issue to self-host but
| its value to developers is obvious and apparent, why not pay?
| 015a wrote:
| Nah we'd probably be fine paying temporal cloud to host the
| control plane. Their billing is a little weird; I know
| quite a bit about temporal-the-technology, and the pricing
| page is literally the first time I've ever seen the word
| "action" used. I'm familiar with workflows, activities,
| sinks, codecs, events, but not actions; so when they bill
| $N/million actions I have no idea what that means, and it's
| surprising to me that _that's_ how they bill it. But I'm
| sure there's an answer somewhere.
|
| Temporal Cloud is really, really new. Like, it was in some
| kind of closed beta for a while, with a "contact us" form,
| as recently as a couple months ago? So, the main reason we
| don't use it is because it simply wasn't available. It
| looks like it's more widely available now though.
| 7bit wrote:
| Is this equal to Azure Durable Functions?
| kodablah wrote:
| Not necessarily "equal" but the basic premise is the same, yes,
| and there is a common lineage. Azure Durable Functions sits on
| Azure Durable Task Framework which was created by the co-
| founder of Temporal (https://temporal.io/about).
|
| (disclaimer, I'm the author of the post)
| 7bit wrote:
| Ohh, that's great to hear! I do like ADF, but the Python
| worker is full of bugs and weird behaviour and tickets stay
| open for months without progress. I will definitely check that
| out!
| rjbwork wrote:
| During an evaluation I found a bug in their library. I went
| to their Slack, posted about it, and they gave a workaround
| in 15 minutes, created an issue in 30, and had a bugfix PR
| ready the next day. Pretty impressed with their team.
|
| Excited to get to use it at some point.
| pierrebai wrote:
| Read the example code, and I have a sinking feeling that it is
| not taken from a real tested example. Either there are multiple
| unexplained symbols or the code does not actually run.
|
| For example, in "Implementing a Workflow" the execute_activity
| refers to Purchaser.purchase, which is not declared anywhere.
|
| If the execute_activity times out after 1 minute, the status
| does not seem to be updated anywhere.
|
| In "Running a Worker", do_purchaser is passed as an activity,
| without explanation. (I guess I'd need to read the fundamental
| Temporal docs?)
| kodablah wrote:
| Yes, it has undergone revisions since, which caused a function
| name mismatch (EDIT: fixed). The execute_activity there uses
| start_to_close_timeout which is per attempt and will retry
| forever by default (customizable).
|
| This is more of a primer on the Python part of Temporal rather
| than an explanation of all Temporal concepts in depth.
| Definitely would recommend reading the fundamental docs at
| https://docs.temporal.io/encyclopedia/. For more exact samples,
| see https://github.com/temporalio/samples-python.
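
To make "per attempt, retried forever by default" concrete, here is a small asyncio sketch of a timeout applied to each attempt rather than to the whole operation. It only borrows the parameter name `start_to_close_timeout` from the comment above; `run_activity` and `purchase` are hypothetical and this is not the Temporal SDK.

```python
import asyncio


async def run_activity(make_attempt, start_to_close_timeout, max_attempts=None):
    # max_attempts=None means "retry forever", mirroring the default described.
    attempt = 0
    while True:
        attempt += 1
        try:
            # The timeout covers this single attempt only.
            return await asyncio.wait_for(
                make_attempt(attempt), timeout=start_to_close_timeout
            )
        except asyncio.TimeoutError:
            if max_attempts is not None and attempt >= max_attempts:
                raise


async def purchase(attempt: int) -> str:
    # Hypothetical activity: too slow on the first two attempts.
    await asyncio.sleep(0.2 if attempt < 3 else 0)
    return f"purchased on attempt {attempt}"


result = asyncio.run(run_activity(purchase, start_to_close_timeout=0.05))
```

The first two attempts exceed the per-attempt timeout and are cancelled; the third completes, so no overall failure is ever reported to the caller.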
| sscarduzio wrote:
| I can't understand what layer provides the state orchestration.
| Like, in celery is redis. What about here?
| kodablah wrote:
| The Temporal server stores events and distributes tasks. There
| is a cloud offering or it can be self-hosted (with support for
| Cassandra, Postgres, MySQL, and SQLite persistence). This post
| focuses more on the Temporal Python SDK and not the general
| platform.
| kcorbitt wrote:
| Could you or anyone else with experience with Temporal share
| how hard it is to self-host in practice? Like, is this more
| like Redis (self-hosting is trivial) or Supabase (nominally
| self-hostable, but if you try to do it you'll quickly realize
| it's a pain and the happy path is to use their hosted
| platform).
| kodablah wrote:
| We offer a full guide to help here at
| https://docs.temporal.io/self-hosted-guide and many users
| of all sizes self-host Temporal. Having said that, it has
| challenges as does running any high-available production
| system. We offer cloud to ease this burden. You still run
| all your code/workers and you can end-to-end encrypt all
| data.
| troebr wrote:
| I found out about Temporal 2+ years ago, and we were early adopters
| of their cloud. It's a bit of a paradigm shift when you start using
| it, but it is amazing at solving some types of problems in a very
| simple manner. There are some trade-offs, there always are, for
| us that was migrating long lived workflows. But the resulting
| simplicity and maintainability of the code has been great. One
| thing that was hard with Temporal was to "sell it" to business
| leaders because it's not a turn key solution, it's more like a
| piece of infra for engineers to build on top of. Kind of a higher
| level database-queue-workflow engine thing that simplifies work
| for engineers.
|
| In short we were working on automating things like onboarding a
| new employee, which involves creating accounts for their saas
| apps, buying and shipping their device, email confirmations,
| satisfaction surveys etc. So a workflow could last up to 3
| months with some fully automated systems, and some that required
| integrating with people (listening to Jira events to trigger
| things, etc).
|
| The error handling was the thing that sold me on Temporal,
| because things can break just about anywhere in unpredictable
| ways (not just code, can be process, employee quits during the
| onboarding, customer is out of licenses etc), so we need
| everything to be robust and be fixable by a person. With
| homegrown queue based systems or with BPEL it can be hard
| handling these situations (what if you need to roll back 3
| steps?). With code you can use exceptions, write unit tests etc.
| We use the TypeScript SDK; promises made it very intuitive to
| code even some otherwise complicated scenarios (say event
| listeners etc).
| siliconc0w wrote:
| I'd question if you really need to distribute work across
| machines. It's great this makes distributed systems easier to
| write but it's much better to reject the premise and avoid
| writing them in the first place.
| dandandan wrote:
| Distributing the work across machines essentially comes for
| free with the ability to replay workflows from any step. You
| could run all of your worker processes on a single machine if
| it had enough capacity, but resuming a workflow on a different
| machine is transparent to your workloads assuming there's no
| local state.
| amackera wrote:
| I love Temporal-- we use it at my company. It's very very good
| for our use case, but took a while to understand how to use it.
| We're still figuring things out (Workflow versioning is one thing
| we suck at still).
|
| That said, I'm not sure why this post from 2023 was posted here
| today. There've been multiple updates to the Python SDK since
| this post.
| swyx wrote:
| > Workflow versioning is one thing we suck at still
|
| well it's not entirely your fault :) what practices have you
| adopted now that you have some experience with it?
|
| lower down OP mentions that they got the link from a HN
| discussion on asyncio 2 days ago
| https://news.ycombinator.com/item?id=40287354 . i guess the
| upvotes are today's lucky 10,000 learning about it for the
| first time.
| addisonj wrote:
| It is interesting seeing the comments here. The comments from
| adopters say there is a lot of value but that it takes time to
| get up to speed. Those new to Temporal have a lot of questions
| seeking understanding.
|
| I have spent a lot of time in the adjacent space of event driven
| systems and there, like here, it seems like one of the biggest
| challenges is just education.
|
| I wouldn't say that EDA or workflow based systems are preferable
| to traditional API services with DBs, just that the space they
| occupy in the industry is so large that I think it is really
| really hard to introduce any different paradigms, even
| when you focus on domains where API services aren't a great fit
| (like here with long running, complex operations).
|
| My point with this comment is simply that I think if you are
| trying to build anything that does things differently, developer
| education is as important or even more important than design and
| architecture, but often not considered because those building
| these systems are already so deep into it that they can't
| approach the problem as an outsider.
| BiteCode_dev wrote:
| I like the concept.
|
| The nice thing is that it abstracts the condition checks on
| whether something is done, has succeeded or should be retried.
|
| The bad thing is that it abstracts the condition checks on
| whether something is done, has succeeded or should be retried.
|
| It's nice because that's something you do again and again, and
| that's a lot of code. A lot of ways it can go wrong.
|
| But it's bad because that's a huge chunk of black box magic that
| may execute remotely. If you need a custom or more optimized
| behavior anywhere in this logic, you are done for. If there is a
| bug/problem in this logic, it's game over. I also have to imagine
| debugging and error reporting is likely not super fun.
|
| One point in particular that strikes me is that idempotence is
| generically guaranteed with something like "has this task
| executed without error last time". But usually, what I want is
| something much more specific, like "have those entries been
| updated", "has that file been created" and so on. From a
| bird's-eye view, it looks the same, but from a system reliability
| point of view, they are not at all the same.
|
| Hard to see how they avoid duplicate results, overlapping tasks,
| etc.
|
| I don't think they really can at that level of abstraction, which
| means you need to implement it manually.
|
| Ultimately it seems it's a huge dep to bring in for the actual
| practical problem it really solves well.
|
| But I'm willing to be proven wrong on this one, because the tech
| is really damn cool.
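
The kind of specific check BiteCode_dev asks for ("has that file been created") is typically written into the task itself. A minimal sketch, assuming a hypothetical `write_report_once` task whose only side effect is creating one file:

```python
import tempfile
from pathlib import Path


def write_report_once(path: Path, content: str) -> bool:
    """Return True if this call created the file, False if it already existed."""
    try:
        # Mode "x" is exclusive creation: it fails if the file already exists,
        # so a retried or duplicated call cannot overwrite earlier output.
        with path.open("x") as handle:
            handle.write(content)
        return True
    except FileExistsError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "report.txt"
    first_call = write_report_once(target, "v1")
    second_call = write_report_once(target, "v2")  # simulated duplicate run
    final_content = target.read_text()
```

The duplicate call is a no-op and the file keeps the first content, regardless of what an orchestrator believes about whether the task "ran without error last time".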
| fortylove wrote:
| I find Temporal itself to be effectively a clone of Amazon's
| Simple Workflow (SWF)
| kodablah wrote:
| That's no coincidence, Temporal is founded by the creators of
| Amazon Simple Workflow. See https://temporal.io/about.
| ub-volta-toss wrote:
| Temporal is really neat but I think it's marketed at too many use
| cases.
|
| After a year of high-scale Temporal work, I found it was only
| good for low-scale work.
|
| The onboarding and learning curve were insanely difficult and
| complex. Ultimately it doesn't scale as well as you think. The
| temporal team invented their own database to get around this
| limitation.
___________________________________________________________________
(page generated 2024-05-09 23:01 UTC)