[HN Gopher] Show HN: Hatchet - Open-source distributed task queue
       ___________________________________________________________________
        
       Show HN: Hatchet - Open-source distributed task queue
        
        Hello HN, we're Gabe and Alexander from Hatchet
        (https://hatchet.run), and we're working on an open-source,
        distributed task queue. It's an alternative to tools like Celery
        for Python and BullMQ for Node.js, primarily focused on
        reliability and observability. It uses Postgres for the
        underlying queue.
         
        Why build another managed queue? We wanted to build something
        with the benefits of full transactional enqueueing - particularly
        for dependent, DAG-style execution - and felt strongly that
        Postgres solves for 99.9% of queueing use-cases better than most
        alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ
        uses Redis). Since the introduction of SKIP LOCKED and the
        milestones of recent PG releases (like active-active
        replication), it's becoming more feasible to horizontally scale
        Postgres across multiple regions and vertically scale to 10k TPS
        or more. Many queues (like BullMQ) are built on Redis, and data
        loss can occur during OOM events if you're not careful; using PG
        helps avoid an entire class of problems.
         
        We also wanted something that was significantly easier to use and
        debug for application developers. A lot of the time, the burden
        of building task observability falls on the infra/platform team
        (for example, asking the infra team to build a Grafana view for
        their tasks based on exported Prometheus metrics). We're building
        this type of observability directly into Hatchet.
         
        What do we mean by "distributed"? You can run workers (the
        instances which run tasks) across multiple VMs, clusters and
        regions - they are remotely invoked via a long-lived gRPC
        connection with the Hatchet queue. We've attempted to optimize
        our latency to get task start times down to 25-50ms, and much
        more optimization is on the roadmap.
         
        We also support a number of extra features that you'd expect,
        like retries, timeouts, cron schedules, and dependent tasks. A
        few things we're currently working on: we use RabbitMQ
        (confusing, yes) for pub/sub between engine components and would
        prefer to just use Postgres, but we didn't want to spend
        additional time on the exchange logic until we'd built a stable
        underlying queue. We are also considering the use of NATS for
        engine-engine and engine-worker connections.
         
        We'd greatly appreciate any feedback you have and hope you get
        the chance to try out Hatchet.
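         
        If you're curious about the core pattern we're relying on, here's
        a minimal sketch of a SKIP LOCKED dequeue in Python (illustrative
        only - the table and columns are made up and aren't our actual
        schema):
         
            import psycopg2
         
            conn = psycopg2.connect("dbname=queue")
         
            def dequeue_one():
                # Claim the oldest queued task. SKIP LOCKED lets
                # concurrent workers pass over rows that another
                # transaction has already locked, instead of blocking.
                with conn, conn.cursor() as cur:
                    cur.execute(
                        """
                        UPDATE tasks SET status = 'running'
                        WHERE id = (
                          SELECT id FROM tasks
                          WHERE status = 'queued'
                          ORDER BY created_at
                          FOR UPDATE SKIP LOCKED
                          LIMIT 1
                        )
                        RETURNING id, payload
                        """
                    )
                    return cur.fetchone()  # None if queue is empty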
        
       Author : abelanger
       Score  : 250 points
       Date   : 2024-03-08 17:07 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | topicseed wrote:
       | What specific strategies does Hatchet employ to guarantee fault
       | tolerance and enable durable execution? How does it handle
       | partial failures in multi-step workflows?
        
         | abelanger wrote:
         | Each task in Hatchet is backed by a workflow [1]. Workflows are
         | predefined steps which are persisted in PostgreSQL. If a worker
         | dies or crashes midway through (stops heartbeating to the
         | engine), we reassign tasks (assuming they have retries left).
         | We also track timeouts in the database, which means if we miss
         | a timeout, we simply retry after some amount of time. Like I
         | mentioned in the post, we avoid some classes of faults just by
         | relying on PostgreSQL and persisting each workflow run, so you
         | don't need to time out with distributed locks in Redis, for
         | example, or worry about data loss if Redis OOMs. Our `ticker`
         | service is basically its own worker which is assigned a lease
         | for each step run.
         | 
         | We also store the input/output of each workflow step in the
         | database. So resuming a multi-step workflow is pretty simple -
         | we just replay the step with the same input.
         | 
         | To zoom out a bit - unlike many alternatives [2], the execution
         | path of a multi-step workflow in Hatchet is declared ahead of
          | time. There are tradeoffs to this approach; it makes things
          | much easier if you're running a single-step workflow or you
          | know the workflow execution path ahead of time. You also avoid
          | classes of problems related to workflow versioning - we can
          | gracefully drain older workflow versions with a different
          | execution path. It's also more natural to debug and see a DAG
          | execution instead of debugging procedural logic.
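          | 
          | For illustration, declaring a two-step DAG with the Python SDK
          | looks roughly like this (syntax approximate - see the docs in
          | [1] for the exact decorators):
          | 
          |     from hatchet_sdk import Hatchet
          | 
          |     hatchet = Hatchet()
          | 
          |     @hatchet.workflow(on_events=["order:created"])
          |     class OrderWorkflow:
          |         @hatchet.step()
          |         def validate(self, context):
          |             return {"valid": True}
          | 
          |         # declared dependency: only runs after validate
          |         @hatchet.step(parents=["validate"])
          |         def charge(self, context):
          |             prev = context.step_output("validate")
          |             return {"charged": prev["valid"]}
          | 
          |     worker = hatchet.worker("order-worker")
          |     worker.register_workflow(OrderWorkflow())
          |     worker.start()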
         | 
         | The clear tradeoff is that you can't try...catch the execution
         | of a single task or concatenate a bunch of futures that you
         | wait for later. Roadmap-wise, we're considering adding
         | procedural execution on top of our workflows concept. Which
         | means providing a nice API for calling `await workflow.run` and
         | capturing errors. These would be a higher-level concept in
         | Hatchet and are not built yet.
         | 
         | There are some interesting concepts around using semaphores and
         | durable leases that are relevant here, which we're exploring
         | [3].
         | 
         | [1] https://docs.hatchet.run/home/basics/workflows [2]
         | https://temporal.io [3]
         | https://www.citusdata.com/blog/2016/08/12/state-machines-to-...
        
           | topicseed wrote:
           | Thank you for the thorough response!
        
           | spenczar5 wrote:
           | What happens if a worker goes silent for longer than the
           | heartbeat duration, then a new worker is spawned, then the
           | original worker "comes back to life"? For example, because
           | there was a network partition, or because the first worker's
           | host machine was sleeping, or even just that the first worker
           | process was CPU starved?
        
             | abelanger wrote:
             | The heartbeat duration (5s) is not the same as the inactive
             | duration (60s). If a worker has been down for 60 seconds,
             | we reassign to provide some buffer and handle unstable
             | networks. Once someone asks we'll expose these options and
             | make them configurable.
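              | 
              | As a rough sketch, the reassignment check boils down to
              | something like this (hypothetical tables, purely
              | illustrative - in practice the ticker holds a lease per
              | step run):
              | 
              |     import psycopg2
              | 
              |     conn = psycopg2.connect("dbname=hatchet")
              | 
              |     def reassign_stale_runs():
              |         # Re-queue runs whose worker missed the 60s
              |         # inactive window (vs. the 5s heartbeat), as
              |         # long as they still have retries left.
              |         with conn, conn.cursor() as cur:
              |             cur.execute("""
              |                 UPDATE step_runs
              |                 SET status = 'queued', worker_id = NULL
              |                 WHERE status = 'running'
              |                 AND retries_remaining > 0
              |                 AND worker_id IN (
              |                   SELECT id FROM workers
              |                   WHERE last_heartbeat <
              |                     now() - interval '60 seconds'
              |                 )
              |             """)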
             | 
             | We currently send cancellation signals for individual tasks
             | to workers, but our cancellation signals aren't replayed if
             | they fail on the network. This is an important edge case
             | for us to figure out.
             | 
             | There's not much we can do if the worker ignores that
             | signal. We should probably add some alerting if we see
             | multiple responses on the same task, because that means the
             | worker is ignoring the cancellation signal. This would also
             | be a problem if workloads start blocking the whole thread.
        
           | sigmarule wrote:
           | I think the answer is no but just to be sure: are you able to
           | trigger step executions programmatically from within a step,
           | even if you can't await their results?
           | 
           | Related, but separately: can you trigger a variable number of
           | task executions from one step? If the answer to the previous
           | question is yes then it would of course be trivial; if not,
           | I'm wondering if you could i.e. have a task act as a
           | generator and yield values, or just return a list, and have
           | each individual item get passed off to its own execution of
           | the next task(s) in the DAG.
           | 
           | For example some of the examples involve a load_docs step,
           | but all loaded docs seem to be passed to the next step
           | execution in the DAG together, unless I'm just
           | misunderstanding something. How could we tweak such an
           | example to have a separate task execution per document
           | loaded? The benefits of durable execution and being able to
           | resume an intensive workflow without repeating work is
           | lessened if you can't naturally/easily control the size of
           | the unit of work for task executions.
        
             | abelanger wrote:
             | You can execute a new workflow programmatically, for
             | example see [1]. So people have triggered, for example, 50
              | child workflows from a parent step. As you've identified,
              | the difficult part there is the "collect" or "gathering"
              | step; we've had people hack around that by waiting for all
              | the steps from a second workflow (and falling back to the
              | list events method to get status), but this isn't an
              | approach I'd recommend and it's not well documented. And
              | there's no circuit breaker.
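              | 
              | Roughly, fanning out from a parent step looks like this
              | today (attribute and method names approximate - see the
              | admin client docs in [1]):
              | 
              |     from hatchet_sdk import Hatchet
              | 
              |     hatchet = Hatchet()
              | 
              |     def fan_out(docs):
              |         # one child workflow run per document; there's
              |         # no durable "gather" step on the parent yet
              |         for doc in docs:
              |             hatchet.client.admin.run_workflow(
              |                 "process-document",
              |                 {"doc_id": doc["id"]},
              |             )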
             | 
             | > I'm wondering if you could i.e. have a task act as a
             | generator and yield values, or just return a list, and have
             | each individual item get passed off to its own execution of
             | the next task(s) in the DAG.
             | 
             | Yeah, we were having a conversation yesterday about this -
             | there's probably a simple decorator we could add so that if
             | a step returns an array, and a child step is dependent on
             | that parent step, it fans out if a `fanout` key is set. If
             | we can avoid unstructured trace diagrams in favor of a nice
             | DAG-style workflow execution we'd prefer to support that.
             | 
             | The other thing we've started on is propagating a single
             | "flow id" to each child workflow so we can provide the same
             | visualization/tracing that we provide in each workflow
              | execution. This is similar to AWS X-Ray.
             | 
             | As I mentioned we're working on the durable workflow model,
             | and we'll find a way to make child workflows durable in the
             | same way activities (and child workflows) are durable on
             | Temporal.
             | 
             | [1] https://docs.hatchet.run/sdks/typescript-sdk/api/admin-
             | clien...
        
       | nextworddev wrote:
       | I'm interested in self hosting this. What's the recommendation
       | here for state persistence and self healing? Wish there was a
       | guide for a small team who wants to self host before trying
       | managed cloud
        
         | abelanger wrote:
         | I think we might have had a dead link in the README to our
         | self-hosting guide, here it is: https://docs.hatchet.run/self-
         | hosting.
         | 
         | The component which needs the highest uptime is our ingestion
         | service [1]. This ingests events from the Hatchet SDKs and is
         | responsible for writing the workflow execution path, and then
         | sends messages downstream to our other engine components. This
         | is a horizontally scalable service and you should run at least
         | 2 replicas across different AZs. Also see how to configure
         | different services for engine components [2].
         | 
          | The other piece of this is PostgreSQL - use your favorite
          | managed provider which has point-in-time restores and backups.
          | This is the core of our self-healing, though I'm not sure where
          | it makes sense to route writes if the primary goes down.
         | 
         | Let me know what you need for self-hosted docs, happy to write
         | them up for you.
         | 
         | [1] https://github.com/hatchet-
         | dev/hatchet/tree/main/internal/se... [2]
         | https://docs.hatchet.run/self-hosting/configuration-options#...
        
       | sixhobbits wrote:
        | One of my favourite spaces, and the presentation in the readme
        | is clear and immediately told me what it is, along with most of
        | the key information that I usually complain is missing.
        | 
        | However, I am still missing a section on why this is different
        | from the other existing and more mature solutions. What led you
        | to develop this over existing options, and what different
        | tradeoffs did you make? Extra points if you can concisely tell
        | me what you do badly that your 'competitors' do well, because I
        | don't believe there is one best solution in this space; it is
        | all tradeoffs.
        
         | sixhobbits wrote:
         | Sorry I am dumb and commented after clicking on the link. I
         | would just add your hn text to the readme as that is exactly
         | what I was looking for
        
           | abelanger wrote:
            | Done [1]. We'll expand this section over time. There are also
            | definite tradeoffs to our architecture - I spoke to someone
            | wanting the equivalent of 1.5m PutRecord/s in Kinesis, which
            | we're definitely not ready for because we persist every
            | event + task execution in Postgres.
           | 
           | [1] https://github.com/hatchet-
           | dev/hatchet/blob/main/README.md#h...
        
       | tzahifadida wrote:
        | Why not use Postgres listen/notify instead of RabbitMQ pub/sub?
        
         | anentropic wrote:
         | It uses Postgres rather than RabbitMQ:
         | https://github.com/hatchet-dev/hatchet?tab=readme-ov-file#ho...
        
           | anentropic wrote:
           | I see... apparently it uses both
        
         | abelanger wrote:
          | When I started on this codebase, we needed to implement some
          | custom exchange logic that maps very neatly to fanout exchanges
          | and non-durable queues in RabbitMQ and wasn't built out on our
          | PostgreSQL layer yet. This was a bootstrapping problem. Like I
          | mentioned in the comment, we'd like to switch to a pub/sub
          | pattern that lets us distribute our engine over multiple
          | geographies. Listen/notify could be the answer once we migrate
          | to PG 16, though there are some concerns around connection
          | poolers like pg_bouncer having limited support for
          | listen/notify. There's a GitHub discussion on this if you're
          | curious: https://github.com/hatchet-
          | dev/hatchet/discussions/224.
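          | 
          | For reference, the basic listen loop with psycopg2 looks like
          | this (channel name made up); the pg_bouncer concern is that
          | transaction pooling doesn't pin the listening session to a
          | single server connection:
          | 
          |     import select
          |     import psycopg2
          |     import psycopg2.extensions
          | 
          |     conn = psycopg2.connect("dbname=hatchet")
          |     conn.set_isolation_level(
          |         psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
          | 
          |     cur = conn.cursor()
          |     cur.execute("LISTEN task_events;")
          | 
          |     while True:
          |         # wait up to 5s for the socket to become readable,
          |         # then drain any pending notifications
          |         if select.select([conn], [], [], 5) == ([], [], []):
          |             continue
          |         conn.poll()
          |         while conn.notifies:
          |             note = conn.notifies.pop(0)
          |             print(note.channel, note.payload)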
        
           | tzahifadida wrote:
            | I use haproxy with Go listen/notify from one of the libs. It
            | works as long as the connection is up. I.e. I have a timeout
            | of 30 min configured in haproxy; after that you have to assume
            | you lost sync and recheck. That is not that bad every 30 min...
            | at least for me. You can configure it to never close...
        
       | Kluggy wrote:
       | In https://docs.hatchet.run/home/quickstart/installation, it says
       | 
       | > Welcome to Hatchet! This guide walks you through getting set up
       | on Hatchet Cloud. If you'd like to self-host Hatchet, please see
       | the self-hosted quickstart instead.
       | 
       | but the link to "self-hosted quickstart" links back to the same
       | page
        
         | abelanger wrote:
         | This should be fixed now, here's the direct link:
         | https://docs.hatchet.run/self-hosting.
        
       | kevinlu1248 wrote:
        | We're building a webhook service on FastAPI + Celery + Redis +
        | Grafana + Loki, and the experience of setting up every service
       | incrementally was miserable, and even then it feels like logs are
       | being dropped and we run into reliability issues. Felt like
       | something like this should exist already but I couldn't find
       | anything at the time. Really excited to see where this takes us!
        
         | tasn wrote:
          | That's exactly why we built Svix[1]. Building webhook
          | services, even with amazing tools like FastAPI, Celery, and
          | Redis, is still a big pain. So we just built a product to solve
         | it.
         | 
         | Hatchet looks cool nonetheless. Queues are a pain for many
         | other use-cases too.
         | 
         | 1: https://www.svix.com
        
       | pyrossh wrote:
       | How is this different from pg-boss[1]? Other than the distributed
       | part it also seems to use skip locked.
       | 
       | [1] https://github.com/timgit/pg-boss
        
         | abelanger wrote:
         | I haven't used pg-boss, and feature-wise it looks very similar
         | and is an impressive project.
         | 
         | The core difference is that pg-boss is a library while Hatchet
         | is a separate service which runs independently of your workers.
         | This service also provides a UI and API for interacting with
         | Hatchet - I don't think pg-boss has those things, so you'd
         | probably have to build out observability yourself.
         | 
         | This doesn't make a huge difference when you're at 1 worker,
         | but having each worker poll your database can lead to DB issues
         | if you're not careful - I've seen some pretty low-throughput
         | setups for very long-running jobs using a database with 60 CPUs
         | because of polling workers. Hatchet distributes in two layers -
         | the "engine" and the "worker" layer. Each engine polls the
         | database and fans out to the workers over a long-lived gRPC
         | connection. This reduces pressure on the DB and lets us manage
         | which workers to assign tasks to based on things like max
         | concurrent runs on each worker or worker health.
        
       | radus wrote:
       | You've explained your value proposition vs. celery, but I'm
       | curious if you also see Hatchet as an alternative to
       | Nextflow/Snakemake which are commonly used in bioinformatics.
        
       | bluehadoop wrote:
       | How does this compare against Temporal/Cadence/Conductor? Does
       | hatchet also support durable execution?
       | 
       | https://temporal.io/ https://cadenceworkflow.io/
       | https://conductor-oss.org/
        
         | abelanger wrote:
         | It's very similar - I used Temporal at a previous company to
         | run a couple million workflows per month. The gRPC networking
         | with workers is the most similar component, I especially liked
         | that I only had to worry about an http2 connection with mTLS
         | instead of a different broker protocol.
         | 
         | Temporal is a powerful system but we were getting to the point
         | where it took a full-time engineer to build an observability
         | layer around Temporal. Integrating workflows in an intuitive
          | way with OpenTelemetry and logging was surprisingly non-
          | trivial. We wanted to build more of a Vercel-like experience
         | for managing workflows.
         | 
         | We have a section on the docs page for durable execution [1],
         | also see the comment on HN [2]. Like I mention in that comment,
         | we still have a long way to go before users can write a full
         | workflow in code in the same style as a Temporal workflow,
         | users either define the execution path ahead of time or invoke
         | a child workflow from an existing workflow. This is also
         | something that requires customization for each SDK - like
         | Temporal's custom asyncio event loop in their Python SDK [3].
         | We don't want to roll this out until we can be sure about
         | compatibility with the way most people write their functions.
         | 
         | [1] https://docs.hatchet.run/home/features/durable-execution
         | 
         | [2] https://news.ycombinator.com/item?id=39643881
         | 
         | [3] https://github.com/temporalio/sdk-python
        
           | bicijay wrote:
            | Well, you just got a user. Love the concept of Temporal, but
            | I can't justify to the higher-ups the infra overhead needed
            | to make it work... And the cloud offering is a bit expensive
            | for small companies.
        
       | kcorbitt wrote:
       | I love your vision and am excited to see the execution! I've been
       | looking for _exactly_ this product (postgres-backed task queue
       | with workers in multiple languages and decent built-in
        | observability) for like... 3 years. Every 6 months I'll check in
       | and see if someone has built it yet, evaluate the alternatives,
       | and come away disappointed.
       | 
       | One important feature request that probably would block our
       | adoption: one reason why I prefer a postgres-backed queue over
       | eg. Redis is just to simplify our infra by having fewer servers
       | and technologies in the stack. Adding in RabbitMQ is definitely
       | an extra dependency I'd really like to avoid.
       | 
       | (Currently we've settled on graphile-worker which is fine for
       | what it does, but leaves a lot of boxes unchecked.)
        
         | abelanger wrote:
         | Thank you, appreciate the kind words! What boxes are you
         | looking to check?
         | 
         | Yes, I'm not a fan of the RabbitMQ dependency either - see here
         | for the reasoning:
         | https://news.ycombinator.com/item?id=39643940.
         | 
         | It would take some work to replace this with listen/notify in
         | Postgres, less work to replace this with an in-memory
         | component, but we can't provide the same guarantees in that
         | case.
        
         | BenjieGillam wrote:
         | Not sure if you saw it but Graphile Worker supports jobs
         | written in arbitrary languages so long as your OS can execute
         | them: https://worker.graphile.org/docs/tasks#loading-
         | executable-fi...
         | 
         | Would be interested to know what features you feel it's
         | lacking.
        
         | doctorpangloss wrote:
         | Why does the RabbitMQ dependency matter?
         | 
         | It was pretty painless for me to set up and write tests
         | against. The operator works well and is really simple if you
         | want to save money.
         | 
          | I mean, isn't Hatchet another dependency? Graphile Worker? I
         | like all these things, but why draw the line at one thing over
         | another over essentially aesthetics?
         | 
         | You better start believing in dependencies if you're a
         | programmer.
        
           | eska wrote:
           | Introducing another piece of software instead of using one
           | you already use anyway introduces new failures. That's hardly
           | aesthetics.
           | 
           | As a professional I'm allergic to statements like "you better
           | start believing in X". How can you even have objective
           | discourse at work like that?
        
             | doctorpangloss wrote:
             | > Introducing another piece of software instead of using
             | one you already use anyway introduces new failures.
             | 
             | Okay, but we're talking about this on a post about using
             | another piece of software.
             | 
              | What is the rationale for, well, this additional dependency,
             | Hatchet, that's okay, and its inevitable failures are okay,
             | but this other dependency, RabbitMQ, which does something
             | different, but will have fewer failures for some objective
             | reasons, that's not okay?
             | 
             | Hatchet is very much about aesthetics. What else does
             | Hatchet have going on? It doesn't have a lot of history,
             | it's going to have a lot of bugs. It works as a DSL written
             | in Python annotations, which is very much an aesthetic
             | choice, very much something I see a bunch of AI startups
             | doing, which I personally think is kind of dumb. Like
             | OpenAI tools are "just" JSON schemas, they don't reinvent
             | everything, and yet Trigger, Hatchet, Runloop, etc.,
             | they're all doing DSLs. It hews to a specific promotional
             | playbook that is also very aesthetic. Is this not the
             | "objective discourse at work" you are looking for?
             | 
             | I am not saying it is bad, I am saying that 99% of people
             | adopting it will be doing so for essentially aesthetic
              | reasons - and being less knowledgeable about alternatives
             | might describe 50-80% of the audience, but to me, being
             | less knowledgeable as a "professional" is an aesthetic
             | choice. There's nothing wrong with this.
             | 
             | You can get into the weeds about what you meant by whatever
             | you said. I am aware. But I am really saying, I'm dubious
             | of anyone promoting "Use my new thing X which is good
             | because it doesn't introduce a new dependency." It's an
             | oxymoron plainly on its face. It's not in their marketing
             | copy but the author is talking about it here, and maybe the
             | author isn't completely sincere, maybe the author doesn't
             | care and will happily write everything on top of RabbitMQ
             | if someone were willing to pay for it, because that
             | decision doesn't really matter. The author is just being
             | reactive to people's aesthetics, that programmers on social
             | media "like" Postgres more than RabbitMQ, for reasons, and
             | that means you can "only" use one, but that none of those
             | reasons are particularly well informed by experience or
             | whatever, yet nonetheless strongly held.
             | 
             | When you want to explain something that doesn't make
             | objective sense when read literally, okay, it might have an
             | aesthetic explanation that makes more sense.
        
               | danielovichdk wrote:
               | I fully agree with you.
               | 
               | 'But I am really saying, I'm dubious of anyone promoting
               | "Use my new thing X which is good because it doesn't
               | introduce a new dependency."'
               | 
               | "Advances in software technology and increasing economic
               | pressure have begun to break down many of the barriers to
               | improved software productivity. The ${PRODUCT} is
               | designed to remove the remaining barriers [...]"
               | 
               | It reads like the above quote from the pitch of r1000 in
               | 1985. https://datamuseum.dk/bits/30003882
        
           | blandflakes wrote:
           | And you better start critically assessing dependencies if
           | you're a programmer. They aren't free; this is a wild take.
        
         | ako wrote:
          | Funny how this is vision now. I started my career 29 years ago
          | at a company that built exactly this, but based on Oracle. The
          | agents would run on Solaris, AIX, VAX VMS, HP-UX, Windows NT,
          | IRIX, etc. It was also used to create an automated CI/CD
          | pipeline to build all binaries on all these different systems.
        
         | bevekspldnw wrote:
         | You can do a fair amount of this with Postgres using locks out
         | of the box. It's not super intuitive but I've been using just
         | Postgres and locks in production for many years for large task
         | distribution across independent nodes.
        
       | toddmorey wrote:
       | I need task queues where the client (web browser) can listen to
       | the progress of the task through completion.
       | 
       | I love the simplicity & approachability of Deno queues for
       | example, but I'd need to roll my own way to subscribe to task
       | status from the client.
       | 
       | Wondering if perhaps the Postgres underpinnings here would make
       | that possible.
       | 
       | EDIT: seems so! https://docs.hatchet.run/home/features/streaming
        
         | rad_gruchalski wrote:
         | If you need to listen for the progress only, try server-sent
         | events, maybe?: https://en.wikipedia.org/wiki/Server-
         | sent_events
         | 
          | It's dead simple: the existence of the URI means the
          | topic/channel/what-have-you exists; to access it one needs to
          | know the URI; data is streamed but there's no access to old
          | data; and multiple consumers are no problem.
        
         | abelanger wrote:
         | Yep, exactly - Gabe has also been thinking about providing per-
         | user signed URLs to task executions so clients can subscribe
         | more easily without a long-lived token. So basically, you would
         | start the workflow from your API, and pass back the signed URL
         | to the client, where we would then provide a React hook to get
         | task updates automatically. We need this ourselves once we open
         | our cloud instance up to self-serve, since we want to provision
         | separate queues per user, with a Hatchet workflow of course.
        
           | toddmorey wrote:
           | Awesome to hear!
        
       | jerrygenser wrote:
       | Something I really like about some pub/sub systems is Push
       | subscriptions. For example in GCP pub/sub you can have a
       | "subscriber" that is not pulling events off the queue but instead
       | is an http endpoint where events are pushed to.
       | 
       | The nice thing about this is that you can use a runtime like
       | cloud run or lambda and allow that runtime to scale based on http
       | requests and also scale to zero.
       | 
       | Setting up autoscaling for workers can be a little bit more
       | finicky, e.g. in kubernetes you might set up KEDA autoscaling
       | based on some queue depth metrics but these might need to be
       | exported from rabbit.
       | 
       | I suppose you could have a setup where your daemon worker is
       | making http requests and in that sense "push" to the place where
       | jobs are actually running but this adds another level of
       | complexity.
       | 
        | Is there any plan to support a push model where you can push
        | jobs over HTTP to some daemons that are holding the HTTP
        | connections open?
        
         | abelanger wrote:
         | I like that idea, basically the first HTTP request ensures the
         | worker gets spun up on a lambda, and the task gets picked up on
         | the next poll when the worker is running. We already have the
         | underlying push model for our streaming feature:
         | https://docs.hatchet.run/home/features/streaming. Can configure
         | this to post to an HTTP endpoint pretty easily.
         | 
         | The daemon feels fragile to me, why not just shut down the
         | worker client-side after some period of inactivity?
        
           | jerrygenser wrote:
           | I think it depends on the http runtime. One of the things
           | with cloud run is that if the server is not handling
           | requests, it doesn't get CPU time. So even if the first
           | request is "wake up", it wouldn't get any CPU to poll outside
           | of the request-response cycle.
           | 
           | You can configure cloud run to always allocate CPU but it's a
           | lot more expensive. I don't think it would be a good
           | autoscaling story since autoscaling is based on http requests
            | being processed. (Maybe it can be done via CPU, but that may
            | not be what you want; it may not even be CPU bound.)
        
         | alexbouchard wrote:
          | The push queue model has major benefits, as you mentioned.
         | We've built Hookdeck (hookdeck.com) on that premise. I hope we
         | see more projects adopt it.
        
       | dalberto wrote:
        | I'm curious if this supports coroutines as tasks in Python. It's
       | especially useful for genAI, and legacy queues (namely Celery)
       | are lacking in this regard.
       | 
       | It would help to see a mapping of Celery to Hatchet as examples.
       | The current examples require you to understand (and buy into)
       | Hatchet's model, but that's hard to do without understanding how
       | it compares to existing solutions.
        
       | SCUSKU wrote:
       | Looks pretty great! My biggest issue with Celery has been that
       | the observability is pretty bad. Even if you use Celery Flower,
       | it still just doesn't give me enough insight when I'm trying to
       | debug some problem in production.
       | 
       | I'm all for just using Postgres in service of the grug brain
       | philosophy.
       | 
       | Will definitely be looking into this, congrats on the launch!
        
         | abelanger wrote:
         | Appreciate it, thank you! We've spent quite a bit of time in
         | the Celery Flower console. Admittedly it's been a while, I'm
         | not sure if they've added views for chains/groups/etc - it was
         | just a linear task view when I used it.
         | 
         | A nice thing in Celery Flower is viewing the `args, kwargs`,
         | whereas Hatchet operates on JSON request/response bodies, so
         | some early users have mentioned that it's hard to get
         | visibility into the exact typing/serialization that's
         | happening. Something for us to work on.
        
         | 9dev wrote:
          | In case you're stuck with Celery for a while: I was hit with
         | this same problem, and solved it by adding a sidecar HTTP
         | server thread to the Python workers that would expose metrics
         | written by the workers into a multithreaded registry. This has
         | been working amazingly well in production for over two years
         | now, and makes it really straightforward to get custom metrics
         | out of a distributed Celery app.
        
       | acaloiar wrote:
        | A related lively discussion from a few months ago:
       | https://news.ycombinator.com/item?id=37636841
       | 
       | Long live Postgres queues.
        
       | leetrout wrote:
       | Just pointing out even though this is a "Show HN" they are,
       | indeed, backed by YC.
       | 
       | Is this going to follow the "open core" pattern or will there be
       | a different path to revenue?
        
         | MuffinFlavored wrote:
         | > path to revenue
         | 
         | There have to be at least 10 different ways between different
         | cloud providers to run a distributed task queue. Amazon, Azure,
         | GCP
         | 
         | Self-hosting RabbitMQ, etc.
         | 
          | I'm curious how they are able to convince investors that there
          | is a sizable portion of the market that they think doesn't
          | already have this solved (or already has it solved and is
          | willing to migrate).
        
           | Kinrany wrote:
           | There will be space for improvement until every cloud has a
           | managed offering with exactly the same interface. Like
           | docker, postgres, S3.
        
           | leetrout wrote:
            | I am curious to see where they differentiate themselves on
            | observability in the longer run.
            | 
            | Compared to RabbitMQ, it should be easier to see what is in
            | the queue itself without mutating it, for instance.
        
         | abelanger wrote:
         | Yep, we're backed by YC in the W24 batch - this is evident on
         | our landing page [1].
         | 
         | We're both second time CTOs and we've been on both sides of
         | this, as consumers of and creators of OSS. I was previously a
         | co-founder and CTO of Porter [2], which had an open-core model.
         | There are two risks that most companies think about in the open
         | core model:
         | 
         | 1. Big companies using your platform without contributing back
         | in some way or buying a license. I think this is less of a
         | risk, because these organizations are incentivized to buy a
         | support license to help with maintenance, upgrades, and since
         | we sit on a critical path, with uptime.
         | 
          | 2. Hyperscalers folding your product into their offering [3].
         | This is a bigger risk but is also a bit of a "champagne
         | problem".
         | 
         | Note that smaller companies/individual developers are who we'd
         | like to enable, not crowd out. If people would like to use our
         | cloud offering because it reduces the headache for them, they
         | should do so. If they just want to run our service and manage
         | their own PostgreSQL, they should have the option to do that
         | too.
         | 
         | Based on all of this, here's where we land on things:
         | 
         | 1. Everything we've built so far has been 100% MIT licensed.
         | We'd like to keep it that way and make money off of Hatchet
         | Cloud. We'll likely roll out a separate enterprise support
         | agreement for self hosting.
         | 
         | 2. Our cloud version isn't going to run a different core engine
         | or API server than our open source version. We'll write
         | interfaces for all plugins to our servers and engines, so even
         | if we have something super specific to how we've chosen to do
         | things on the cloud version, we'll expose the options to write
         | your own plugins on the engine and server.
         | 
         | 3. We'd like to make self-hosting as easy to use as our cloud
         | version. We don't want our self-hosted offering to be a second-
         | class citizen.
         | 
         | Would love to hear everyone's thoughts on this.
         | 
         | [1] https://hatchet.run
         | 
         | [2] https://github.com/porter-dev/porter
         | 
         | [3] https://www.elastic.co/blog/why-license-change-aws
        
       | Fiahil wrote:
       | How does this compare to ZeroMQ (ZMQ) ?
       | 
       | https://zeromq.org/
        
         | vector_spaces wrote:
         | Not the OP or familiar with Hatchet, but generally ZeroMQ is a
         | bit lower down in the stack -- it's something you'd build a
         | distributed task queue or protocol on top of, but not something
         | you'd usually reach for if you needed one for a web service or
         | similar unless you had very special requirements and a
         | specific, careful design in mind.
         | 
         | This tool comes with more bells and whistles and presumably
         | will be more constrained in what you can do with it, where
         | ZeroMQ gives you the flexibility to build your own protocol. In
         | principle they have many of the same use cases, like how you
         | can buy ready made whipped cream or whip up your own with some
         | heavy cream and sugar -- one approach is more constrained but
         | works for most situations where you need some whipped cream,
         | and the other is a lot more work and somewhat higher risk (you
         | can over whip your cream and end up with butter), but you can
         | do a lot more with it.
        
         | jeremyjh wrote:
         | ZeroMQ is a library that implements an application layer
         | network protocol. Hatchet is a distributed job server with
         | durability and transaction semantics. Two completely different
         | things at very different levels of the stack. ZeroMQ supports
         | fan-out messaging and other messaging patterns that could maybe
         | be used as part of a job server, but it doesn't have anything
         | to say about durability, retries, or other concerns that job
         | servers take care of, much less a user interface.
        
       | Kinrany wrote:
       | With NATS in the stack, what's the advantage over using NATS
       | directly?
        
       | sroussey wrote:
       | Ah nice! I am writing a job queue this weekend for a DAG based
       | task runner, so timing is great. I will have a look. I don't need
       | anything too big, but I have written some stuff for using
       | PostgreSQL (FOR UPDATE SKIP LOCKED for the win), sqlite, and in-
       | memory, depending on what I want to use it for.
       | 
       | I want the task graph to run without thinking about retries,
       | timeouts, serialized resources, etc.
       | 
       | Interested to look at your particular approach.
        
       | zwaps wrote:
       | You say this is for generative AI. How do you distribute
       | inference across workers? Can one use just any protocol and how
       | does this work together with the queue and fault tolerance?
       | 
       | Could not find any specifics on generative AI in your docs.
       | Thanks
        
         | abelanger wrote:
         | This isn't built specifically for generative AI, but generative
         | AI apps typically have architectural issues that are solved by
         | a good queueing system and worker pool. This is particularly
         | true once you start integrating smaller, self-hosted LLMs or
         | other types of models into your pipeline.
         | 
         | > How do you distribute inference across workers?
         | 
         | In Hatchet, "run inference" would be a task. By default, tasks
         | get randomly assigned to workers in a FIFO fashion. But we give
         | you a few options for controlling how tasks get ordered and
         | sent. For example, let's say you'd like to limit users to 1
         | inference task at a time per session. You could do this by
         | setting a concurrency key "<session-id>" and `maxRuns=1` [1].
         | This means that for each session key, you only run 1 inference
          | task. The purpose of this is fairness across sessions.
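          | 
          | As a sketch (decorator and parameter names approximate - the
          | docs in [1] below have the exact syntax), that looks something
          | like:
          | 
          |     from hatchet_sdk import Hatchet
          | 
          |     hatchet = Hatchet()
          | 
          |     @hatchet.workflow(on_events=["inference:run"])
          |     class InferenceWorkflow:
          |         # group runs by session; at most 1 run is active
          |         # per key, the rest queue up behind it
          |         @hatchet.concurrency(max_runs=1)
          |         def concurrency(self, context):
          |             return context.workflow_input()["session_id"]
          | 
          |         @hatchet.step()
          |         def run_inference(self, context):
          |             return {"output": "..."}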
         | 
         | > Can one use just any protocol
         | 
         | We handle the communication between the worker and the queue
         | through a gRPC connection. We assume that you're passing JSON-
         | serializable objects through the queue.
         | 
         | [1] https://docs.hatchet.run/home/features/concurrency/round-
         | rob...
        
       | hinkley wrote:
       | It's been about a dozen years since I heard someone assert that
       | some CI/CD services were the most reliable task scheduling
       | software for periodic tasks (far better than cron). Shouldn't the
       | scheduling be factored out as a separate library?
       | 
       | I found that shocking at the time, if plausible, and wondered why
       | nobody pulled on that thread. I suppose like me they had bigger
       | fish to fry.
        
         | abelanger wrote:
         | This reminds me of:
         | https://news.ycombinator.com/item?id=28234057
         | 
         | If you're saying that the scheduling in Hatchet should be a
         | separate library, we rely on go-cron [1] to run cron schedules.
         | 
         | [1] https://github.com/go-co-op/gocron
        
       | cybice wrote:
        | Why might Hatchet be better than Windmill? Windmill uses the
        | same approach with PostgreSQL, is very fast, and has an
        | incredibly good UI.
        
       | treesciencebot wrote:
        | Latency is really important and that is honestly why we re-wrote
        | most of this stack ourselves, but the project with the guarantee
        | of <25ms looks interesting. I wish there was an "instant" mode
        | where, if enough workers are available, it could just do direct
        | placement.
        
         | abelanger wrote:
         | To be clear, the 25ms isn't a guarantee. We have a load testing
         | CLI [1] and the secondary steps on multi-step workflows are in
         | the range of 25ms, while the first steps are in the range of
         | 50ms, so that's what I'm referencing.
         | 
         | There's still a lot of work to do for optimization though,
         | particularly to improve the polling interval if there aren't
         | workers available to run the task. Some people might expect to
         | set a max concurrency limit of 1 on each worker and have each
          | subsequent workflow take 50ms to start, which isn't the case
          | at the moment.
         | 
         | [1] https://github.com/hatchet-
         | dev/hatchet/tree/main/examples/lo...
        
       | moribvndvs wrote:
        | One repeat issue I've had at my past position is the need to
        | schedule an unlimited number of jobs, often months to a year
        | from now. Example use case: a patient schedules an appointment
        | for a follow up in 6 months, so I schedule a series of
        | appointment reminders in the days leading up to it. I might have
        | millions of these jobs.
       | 
        | I started out by just entering a record into a database queue
        | and polling every few seconds. Functional, but our IO costs for
        | polling weren't ideal, and we wanted to distribute this without
        | using stuff like ShedLock. I switched to Redis but it got
        | complicated dealing with multiple dispatchers, OOM issues, and
        | having to run a secondary job to move individual tasks in and
        | out of the immediate queue, etc. I had started looking at
        | switching to backing it with PG and SKIP LOCKED, etc., but I've
        | since changed positions.
       | 
        | I can see a similar use case on my horizon and wondered if
        | Hatchet would be suitable for it.
        
         | herval wrote:
         | why do you need to schedule things 6 months in advance, instead
         | of, say, check everything that needs notifications in a rolling
         | window (eg 24h ahead) and schedule those?
        
           | moribvndvs wrote:
           | Well, it was a dumbed down example. In that particular case,
           | appointments can be added, removed, or moved at any moment,
           | so I can't just run one job every 24 hours to tee up the next
           | day's work and leave it at that. Simply polling the database
           | for messages that are due to go out gives me my just-in-time
           | queue, but then I need to build out the work to distribute
           | it, and we didn't like the IO costs.
           | 
            | I did end up moving it to Redis: basically ZADD an execution
            | timestamp and job ID, then ZRANGEBYSCORE at my desired
            | interval and remove those jobs as I successfully distribute
            | them out to workers. I then set a fence time. At that time a
            | job runs to move stuff that should have run but didn't (rare,
            | thankfully) into a remediation queue, and load the next block
            | of items that should run between now + fence. At the service
            | level, any item with a scheduled date within the fence gets
            | ZADDed after being inserted into the normal database.
            | Anything outside the fence will be picked up at the
            | appropriate time.
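            | 
            | Roughly, the core of it was (simplified, key names made up):
            | 
            |     import time
            |     import redis
            | 
            |     r = redis.Redis()
            | 
            |     def schedule(job_id, run_at):
            |         # score = unix timestamp the job should run at
            |         r.zadd("scheduled_jobs", {job_id: run_at})
            | 
            |     def poll_due_jobs():
            |         now = time.time()
            |         due = r.zrangebyscore("scheduled_jobs", 0, now)
            |         if due:
            |             # note: not atomic with the read above
            |             r.zrem("scheduled_jobs", *due)
            |         return [j.decode() for j in due]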
           | 
            | This worked. I was able to ramp up the polling time to get
            | near-real-time dispatch while also noticeably reducing costs.
            | Problems were some occasional Redis issues (OOM and having to
            | either keep bumping up the Redis instance size or reduce
            | the fence duration), allowing multiple pollers for redundancy
            | and scale (I used ShedLock for that :/), and occasionally a
            | bug where the poller craps out in the middle of the Redis
            | work, resulting in at-least-once delivery, which required
            | downstream protections to make sure I don't send the same
            | message multiple times to the patient.
           | 
           | Again, it all works but I'm interested in seeing if there are
           | solutions that I don't have to hand roll.
        
         | kbar13 wrote:
         | can you explain why this cannot be a simple daily cronjob to
          | query for appointments coming up in the next <time window> and
          | send out notifications at that time? polling every few seconds
          | seems way overkill
        
           | moribvndvs wrote:
           | Sure: https://news.ycombinator.com/item?id=39646719
        
         | abelanger wrote:
         | It wouldn't be suitable for that at the moment, but might be
         | after some refactors coming this weekend. I wrote a very quick
         | scheduling API which pushes schedules as workflow triggers, but
         | it's only supported on the Go SDK. It also is CPU-intensive at
         | thousands of schedules, as the schedules are run as separate
         | goroutines (on a dedicated `ticker` service) - I'm not proud of
         | this. This was a pattern that made sense for the cron schedule
         | and I just adapted it for the one-time scheduling.
         | 
         | Looking ahead (and back) in the database and placing an
         | exclusive lock on the schedule is the way to do this. You
         | basically guarantee scheduling at +/- the polling interval if
         | your service goes down while maintaining the lock. This allows
         | you to horizontally scale the `tickers` which are polling for
         | the schedules.
        
       | ctoth wrote:
       | My only question is why did you call it Hatchet if it doesn't cut
       | down on your logs?
       | 
       | I'll show myself out.
        
       | fuddle wrote:
       | Looks great! Do you publish pricing for your cloud offering? For
       | the self hosted option, are there plans to create a Kubernetes
        | operator? With an MIT license, do you fear Amazon could create
        | an Amazon Hatchet Service sometime in the future?
        
         | abelanger wrote:
         | Thank you!
         | 
         | > Do you publish pricing for your cloud offering?
         | 
         | Not yet, we're rolling out the cloud offering slowly to make
         | sure we don't experience any widespread outages. As soon as
         | we're open for self-serve on the cloud side, we'll publish our
         | pricing model.
         | 
         | > For the self hosted option, are there plans to create a
         | Kubernetes operator?
         | 
         | Not at the moment, our initial plan was to help folks with a
         | KEDA autoscaling setup based on Hatchet queue metrics, which is
         | something I've done with Sidekiq queue depth. We'll probably
         | wait to build a k8s operator after our existing Helm chart is
         | relatively stable.
         | 
          | > With an MIT license, do you fear Amazon could create an
          | Amazon Hatchet Service sometime in the future?
         | 
         | Yes. The question is whether that risk is worth the tradeoff of
         | not being MIT-licensed. There are also paths to getting
         | integrated into AWS marketplace we'll explore longer-term. I
         | added some thoughts here:
         | https://news.ycombinator.com/item?id=39646788.
        
       | rheckart wrote:
       | Any plans for SDKs outside the current three? .NET Core & Java
       | would be interesting to see..
        
         | abelanger wrote:
         | Not at the moment - the biggest ask has been Rails, but on the
         | other hand Sidekiq is so beloved that I'm not sure it makes
         | sense at the moment. We have our hands very full with the 3
         | SDKs, though I'd love for us to support a community-backed SDK.
         | If anyone's interested in working on that, feel free to message
         | us in the Discord.
        
       | mfrye0 wrote:
        | I've been looking for this exact thing for a while now. I'm
        | just starting to dig into the docs and examples, and I have a
        | question on workflows.
        | 
        | I have an existing pipeline that runs tasks across two K8s
        | clusters that share a DB. Is it possible to define steps in a
        | workflow where the step run logic is set up to run elsewhere?
        | Essentially not having an inline run function defined, and
        | having another worker process listening for that step name.
        
       | fcsp wrote:
       | > Hatchet is built on a low-latency queue (25ms average start)
       | 
       | That seems pretty long - am I misunderstanding something? By my
       | understanding this means the time from enqueue to job processing,
       | maybe someone can enlighten me.
        
         | mhh__ wrote:
         | It's only a few billion instructions on a decent sized server
         | these days
        
         | abelanger wrote:
         | To clarify - you're right, this is a long time in a
         | message/event queue.
         | 
         | It's not an eternity in a task queue which supports DAG-style
         | workflows with concurrency limits and fairness strategies. The
         | reason for this is you need to check all of the subscribed
         | workers and assign a task in a transactional way.
         | 
         | The limit on the Postgres level is probably on the order of
         | 5-10ms on a managed PG provider. Have a look at:
         | https://news.ycombinator.com/item?id=39593384.
         | 
         | Also, these are not my benchmarks, but have a look at [1] for
         | Temporal timings.
         | 
         | [1] https://www.windmill.dev/blog/launch-week-1/fastest-
         | workflow...
        
       | beerkat wrote:
       | How does this compare to River Queue (https://riverqueue.com/)?
       | Besides the additional Python and TS client libraries.
        
       ___________________________________________________________________
       (page generated 2024-03-08 23:00 UTC)