[HN Gopher] River: A fast, robust job queue for Go and Postgres
___________________________________________________________________
River: A fast, robust job queue for Go and Postgres
Author : bo0tzz
Score : 251 points
Date : 2023-11-20 15:54 UTC (7 hours ago)
(HTM) web link (brandur.org)
(TXT) w3m dump (brandur.org)
| hipadev23 wrote:
| What a strange design. If a job is dependent on an extant
| transaction, then perhaps the job should run in the same code
| that initiated the transaction instead of an outside job queue?
|
| Also you pass the data a job needs to run as part of the job
| payload. Then you don't have the "data doesn't exist" issue.
| zackkitzmiller wrote:
| I agree. This design is incredibly strange, and seems to throw
| away basically all distributed systems knowledge. I'm glad
| folks are playing with different ideas, but this one seems off.
| eximius wrote:
| No, this is a fairly common pattern called having an 'outbox',
| where the emission/enqueuing of your event/message/job is tied
| to the transaction completion of the relevant domain data.
|
| We use this to ensure Kafka events are only emitted when a
| process succeeds, this is very similar.
| iskela wrote:
| So when the business data transaction commits, a notify event
| is raised and a job row is inserted. An out-of-band job broker
| listens for a notify event on the job table, or polls the table
| skipping locked rows, and takes work for processing?
| eximius wrote:
| Basically.
|
| For our particular use case, I think we're actually not
| using notify events. We just insert rows into the outbox
| table, and the poller re-emits them as Kafka events and
| deletes successfully emitted events from the table.
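| 
| Roughly, the relay loop looks like this (a Go sketch since
| that's the topic here; the table name and the publish step are
| illustrative, and delivery is at-least-once so consumers have
| to tolerate duplicates):
| 
|     // assumes: import ("context"; "database/sql")
|     func drainOutbox(ctx context.Context, db *sql.DB,
|         publish func(context.Context, []byte) error) error {
|         rows, err := db.QueryContext(ctx,
|             `SELECT id, payload FROM outbox ORDER BY id LIMIT 100`)
|         if err != nil {
|             return err
|         }
|         defer rows.Close()
| 
|         type msg struct {
|             id      int64
|             payload []byte
|         }
|         var batch []msg
|         for rows.Next() {
|             var m msg
|             if err := rows.Scan(&m.id, &m.payload); err != nil {
|                 return err
|             }
|             batch = append(batch, m)
|         }
|         if err := rows.Err(); err != nil {
|             return err
|         }
| 
|         for _, m := range batch {
|             // e.g. produce to Kafka; on failure the row stays put
|             // and is retried on the next poll.
|             if err := publish(ctx, m.payload); err != nil {
|                 return err
|             }
|             if _, err := db.ExecContext(ctx,
|                 `DELETE FROM outbox WHERE id = $1`, m.id); err != nil {
|                 return err
|             }
|         }
|         return nil
|     }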
| youerbt wrote:
| Either both the business data and the job are committed, or
| neither is. Then, as you write, a worker can pick it up either
| by polling or by listening for an event. A bonus, from an
| implementation perspective, is that if a worker selects a row
| FOR UPDATE (locking the job so others can't pick it up) and
| dies, Postgres releases the lock once the dead connection's
| transaction is rolled back, making the job available for other
| workers.
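| 
| A sketch of that model (illustrative schema; the worker keeps
| the transaction open while it works, so a crash ends in a
| rollback and the row is free for someone else):
| 
|     // assumes: import ("context"; "database/sql"; "errors")
|     func workOne(ctx context.Context, db *sql.DB, run func([]byte) error) error {
|         tx, err := db.BeginTx(ctx, nil)
|         if err != nil {
|             return err
|         }
|         defer tx.Rollback() // no-op after a successful Commit
| 
|         var id int64
|         var payload []byte
|         err = tx.QueryRowContext(ctx, `
|             SELECT id, payload FROM jobs
|             WHERE run_at <= now()
|             ORDER BY run_at
|             LIMIT 1
|             FOR UPDATE SKIP LOCKED`).Scan(&id, &payload)
|         if errors.Is(err, sql.ErrNoRows) {
|             return tx.Commit() // nothing to do
|         }
|         if err != nil {
|             return err
|         }
| 
|         if err := run(payload); err != nil {
|             return err // rollback: job stays queued and unlocked
|         }
|         if _, err := tx.ExecContext(ctx,
|             `DELETE FROM jobs WHERE id = $1`, id); err != nil {
|             return err
|         }
|         return tx.Commit()
|     }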
| brandur wrote:
| Author here.
|
| Wanting to offload heavy work to a background job is about as
| old a best practice as exists in modern software engineering.
|
| This is especially important for the kind of API and/or web
| development that a large number of people on this site are
| involved in. By offloading expensive work, you take that work
| out-of-band of the request that generated it, making that
| request faster and providing a far superior user experience.
|
| Example: User sign-up where you want to send a verification
| email. Talking to a foreign API like Mailgun might be a 100 ms
| to multisecond (worst case scenario) operation -- why make the
| user wait on that? Instead, send it to the background, and give
| them a tight < 100 ms sign up experience that's so fast that
| for all intents and purposes, it feels instant.
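| 
| In rough pseudocode (the shape of the idea, not River's actual
| API; the schema is illustrative):
| 
|     // assumes: import ("context"; "database/sql"; "encoding/json")
|     func signUp(ctx context.Context, db *sql.DB, email string) error {
|         tx, err := db.BeginTx(ctx, nil)
|         if err != nil {
|             return err
|         }
|         defer tx.Rollback()
| 
|         var userID int64
|         if err := tx.QueryRowContext(ctx,
|             `INSERT INTO users (email) VALUES ($1) RETURNING id`,
|             email).Scan(&userID); err != nil {
|             return err
|         }
| 
|         // The job commits (or rolls back) together with the user row;
|         // a worker talks to Mailgun later, out-of-band of the request.
|         args, _ := json.Marshal(map[string]any{"user_id": userID})
|         if _, err := tx.ExecContext(ctx,
|             `INSERT INTO jobs (kind, args) VALUES ('verification_email', $1)`,
|             args); err != nil {
|             return err
|         }
|         return tx.Commit()
|     }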
| stouset wrote:
| GP isn't taking umbrage with the concept of needing to
| offload work to a background process.
| hipadev23 wrote:
| > Wanting to offload heavy work to a background job is about
| as old a best practice as exists in modern software
| engineering.
|
| Yes. I am intimately familiar with background jobs. In fact
| I've been using them long enough to know, without hesitation,
| that you don't use a relational database as your job queue.
| toolz wrote:
| as far as I'm aware the most popular job queue library in
| elixir depends on postgres and has performance
| characteristics that cover the vast majority of background
| processing needs I've come across.
|
| I wonder if maybe you've limited yourself by assuming
| relational DBs only have features for relational data. That
| isn't the case now, and really hasn't been for quite some
| time.
| qaq wrote:
| Postgres-based job queues work fine if you have, say, 10K
| transactions per second and jobs on average do not take
| significant time to complete (things will run fine on a
| fairly modest instance). They also give guarantees that
| traditional job queues do not.
| lazyant wrote:
| > I've been using them long enough to know, without
| hesitation, that you don't use a relational database as
| your job queue.
|
| I'm also very familiar with jobs and I have used the usual
| tools like Redis and RMQ, but I wouldn't make a blanket
| statement like that. There are people using RDBMSes as queues
| in prod, so we have some counter-examples. I wouldn't mind at
| all getting rid of another system (not just one server but the
| cluster of RMQ/Redis you need for HA). If there's a big risk
| in using pg as the backend for a task queue, I'm all ears.
| teraflop wrote:
| It's not strange at all to me. The job is "transactional" in
| the sense that it _depends_ on the transaction, and should be
| triggered iff the transaction commits. That doesn't mean it
| should run _inside_ the transaction (especially since long-
| running transactions are terrible for performance).
|
| Passing around the job's data separately means that now you're
| storing two copies, which means you're creating a point where
| things can get out of sync.
| hipadev23 wrote:
| > should be triggered iff the transaction commits
|
| Agreed. Which is why the design doesn't make any sense.
| Because in the scenario presented they're starting a job
| during a transaction.
| j45 wrote:
| Maybe it's not designed for that or all use cases and that
| can make sense.
|
| Personally, I need long running jobs.
| teraflop wrote:
| I don't understand what you mean. The job is "created" as
| part of the transaction, so it only becomes _visible_ (and
| hence eligible to be executed) when the transaction
| commits.
| Chris911 wrote:
| The job is queued as part of the transaction. It is
| executed by a worker outside the scope of the transaction.
| eximius wrote:
| That part is somewhat poorly explained. That is a
| motivating example of why having your job queue system be
| separate from your system of record can be bad.
|
| e.g.,
|
| 1. Application starts transaction
| 2. Application updates DB state (business details)
| 3. Application enqueues job in Redis
| 4. Redis job workers pick up job
| 5. Redis job workers error out
| 6. Application commits transaction
| 
| This _motivates_ placing the job-worker state in the same
| transaction, whereas non-DB-based job queues have issues like
| this.
| qaq wrote:
| The job is not dependent on an extant transaction. The
| bookkeeping of job state runs in the same transaction as your
| domain state manipulation, so you will never get into a
| situation where the domain mutation committed but the job
| state failed to update to complete.
| maherbeg wrote:
| I think you may be misunderstanding the design here. The
| transaction for initiating the job is only for queuing. The
| dequeue and execution of the job happens in a separate process.
|
| The example on the home page makes this clear where a user is
| created and a job is created at the same time. This ensures
| that the job is queued up with the user creation. If any part
| of that initial transaction fails, then the job queuing
| doesn't actually happen.
| latchkey wrote:
| If I were going to do my own job queue, I'd implement it more
| like GCP Tasks [0].
| 
| It is such a better model for the majority of queues. All
| you're doing is storing a message, hitting an HTTP endpoint,
| and deleting the message on success. This makes it so much
| easier to scale, reason about, and test task execution.
|
| Update: since multiple people seem confused: I'm talking about
| the implementation of a job queue system, not suggesting that
| they use the GCP Tasks product. That said, I would have just
| used GCP Tasks too (assuming the use case dictated it; it's a
| fantastic and rock-solid product).
|
| [0] https://cloud.google.com/tasks
| brandur wrote:
| There's a lot to be said about the correctness benefits of a
| transactional model.
|
| The trouble with hitting an HTTP API to queue a task is: what
| if it fails, or what if you're not sure about whether it
| failed? You can continue to retry in-band (although there's a
| definite latency disadvantage to doing so), but if you
| eventually give up, you can't be sure that no jobs were queued
| for which you didn't get a proper ack. In practice, this leads
| to a lot of uncertainty around the edges, and operators having
| to reconcile things manually.
|
| There are definite scaling benefits to throwing tasks into
| Google's limitless compute power, but there are a lot of cases
| where a smaller, more correct queue is plenty of power,
| especially where Postgres is already the database of choice.
| latchkey wrote:
| > what if it fails, or what if you're not sure about whether
| it failed?
|
| This is covered in the GCP Tasks documentation.
|
| > There are definite scaling benefits to throwing tasks into
| Google's limitless compute power, but there are a lot of cases
| where a smaller, more correct queue is plenty of power,
| especially where Postgres is already the database of choice.
|
| My post was talking about what I would implement if I was
| doing my own queue, as the authors were. Not about using GCP
| Tasks.
| politician wrote:
| Do you know that brandur's been writing about Postgres job
| queues since at least 2017? Cut him some slack.
|
| https://brandur.org/job-drain
|
| https://news.ycombinator.com/item?id=15294722
| latchkey wrote:
| "I'm into effective altruism and created the largest
| crypto exchange in the world. Cut me some slack."
|
| No, we don't operate like that. Call me out when I'm
| wrong technically, but don't tell me that because someone
| is some sort of celebrity that I should cut them some
| slack.
|
| Everything he pointed out is literally covered in the GCP
| Tasks documentation.
|
| https://cloud.google.com/tasks/docs/dual-overview
|
| https://cloud.google.com/tasks/docs/common-pitfalls
| robertlagrant wrote:
| > No, we don't operate like that. Call me out when I'm
| wrong technically
|
| You're being "called out" (ugh) incredibly politely
| mostly because you were being a bit rude; "tell me X
| without telling me" is just a bit unpleasant, and totally
| counterproductive.
|
| > because someone is some sort of celebrity that I should
| cut them some slack.
|
| No one mentioned a celebrity. You're not railing against
| the power of celebrity here; just a call for politeness.
|
| > Everything he pointed out is literally covered in the
| GCP Tasks documentation.
|
| Yes, e.g. as pitfalls.
| latchkey wrote:
| Sure, updated my comment to be less rude.
| bgentry wrote:
| 2015, even :) https://brandur.org/postgres-queues
| andrewstuart wrote:
| HTTP APIs are ideal for message queues with Postgres.
|
| The request to get a message returns a token that identifies
| this receive.
|
| You use that token to delete the message when you are done.
|
| Jobs that don't succeed after N retries get marked as dead
| and go into the dead letter list.
|
| This is the way AWS SQS works; it's tried and true.
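| 
| In Postgres terms that maps to something like this (sketch;
| the schema, lease duration, and column names are illustrative):
| 
|     // Receive: lease a message for 30s and hand back a one-time token.
|     const receiveSQL = `
|         UPDATE messages
|         SET leased_until  = now() + interval '30 seconds',
|             receive_token = gen_random_uuid(),
|             receive_count = receive_count + 1
|         WHERE id = (
|             SELECT id FROM messages
|             WHERE leased_until IS NULL OR leased_until < now()
|             ORDER BY id
|             LIMIT 1
|             FOR UPDATE SKIP LOCKED
|         )
|         RETURNING id, receive_token, payload`
| 
|     // Delete: only the holder of the current token may ack the message.
|     const deleteSQL = `
|         DELETE FROM messages WHERE id = $1 AND receive_token = $2`
| 
|     // A periodic sweep can move rows with receive_count > N into a
|     // dead-letter table.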
| jbverschoor wrote:
| >> Timeouts: for all HTTP Target task handlers the default
| timeout is 10 minutes, with a maximum of 30 minutes.
|
| Good luck with a long running batch.
| latchkey wrote:
| If you're going to implement your own queue, you can make it
| run for however long you want.
|
| Again, I'm getting downvoted. The whole point of my comment
| isn't about using GCP Tasks, it is about what I would do if I
| was going to implement my own queue system like the author
| did.
|
| By the way, that 30 minute limitation can be worked around
| with checkpoints or breaking up the task into smaller chunks.
| Something that isn't a bad idea to do anyway. I've seen long
| running tasks cause all sorts of downstream problems when
| they fail and then take forever to run again.
| jbverschoor wrote:
| Well, you can't really... If you're gonna use HTTP and expect
| a response, you're gonna be in for a fun ride. You'll have to
| deal with timeout settings for:
| 
|   - HTTP libraries
|   - web servers
|   - application servers
|   - load balancers
|   - reverse proxy servers
|   - the cloud platform you're running on
|   - WAFs
| 
| It might be alright for smaller "tasks", but not for "jobs".
| latchkey wrote:
| Have you ever used Cloud Tasks?
| victorbjorklund wrote:
| Looks great. For people wondering whether Postgres really is
| a good choice for a job queue, I can recommend checking out
| Oban in Elixir, which has been running in production for many
| years: https://github.com/sorentwo/oban
|
| Benchmark: peaks at around 17,699 jobs/sec for one queue on one
| node. Probably covers most apps.
|
| https://getoban.pro/articles/one-million-jobs-a-minute-with-...
| bgentry wrote:
| Oban is fantastic and has been a huge source of inspiration for
| us, showing what is possible in this space. In fact I think
| during my time at Distru we were one of Parker's first
| customers with Oban Web / Pro :)
|
| We've also had a lot of experience with other libraries like
| Que ( https://github.com/que-rb/que ) and Sidekiq
| (https://sidekiq.org/), which have certainly influenced us
| over the years.
| sorentwo wrote:
| The very first paying Pro customer, as a matter of fact =)
|
| You said back then that you planned on pursuing a Go client;
| now, four years later, here we are. River looks excellent,
| and the blog post does a fantastic job explaining all the
| benefits of job queues in Postgres.
| bgentry wrote:
| Hi HN, I'm one of the authors of River along with Brandur. We've
| been working on this library for a few months and thought it
| was about time we got it out into the world.
|
| Transactional job queues have been a recurring theme throughout
| my career as a backend and distributed systems engineer at
| Heroku, Opendoor, and Mux. Despite the problems with non-
| transactional queues being well understood, I keep
| encountering these same problems. I wrote a bit about them
| here in our docs:
| https://riverqueue.com/docs/transactional-enqueueing
|
| Ultimately I want to help engineers be able to focus their time
| on building a reliable product, not chasing down distributed
| systems edge cases. I think most people underestimate just how
| far you can get with this model--most systems will never outgrow
| the scaling constraints and the rest are generally better off not
| worrying about these problems until they truly need to.
|
| Please check out the website and docs for more info. We have a
| lot more coming but first we want to iron out the API design with
| the community and get some feedback on what features people are
| most excited for. https://riverqueue.com/
| tombh wrote:
| At the bottom of the page on riverqueue.com it appears there's
| a screenshot of a UI. But I can't seem to find any docs about
| it. Am I missing something or is it just not available yet?
| mosen wrote:
| Looks like it's underway:
|
| > We're hard at work on more advanced features including a
| self-hosted web interface. Sign up to get updates on our
| progress.
| bgentry wrote:
| The UI isn't quite ready for outside consumption yet but it
| is being worked on. I would love to hear more about what
| you'd like to see in it if you want to share.
| fithisux wrote:
| An Airflow for Gophers?
| codegeek wrote:
| If you could build a UI similar to Hangfire [0] or Laravel
| Horizon [1], that would be awesome.
|
| [0] https://hangfire.io
|
| [1] https://github.com/laravel/horizon
| cloverich wrote:
| So excited y'all created this. Through a few job changes I've
| been exposed to the most popular background job systems in
| Rails, Python, JS, etc, and have been shocked at how under
| appreciated their limitations are relative to what you get out
| of the box with relational systems. Often I see a lot of DIY
| add-ons to help close the gaps, but its a lot of work and often
| still missing tons of edge cases and useful functionality. I
| always felt going the other way, starting w/ a relational db
| where many of those needs are free, would make more sense for
| most start-ups, internal tooling, and smaller scale businesses.
|
| Thank you for this work, I look forward to taking it for a
| (real) test drive!
| dangoodmanUT wrote:
| How do you look at models like temporal.io (service in front
| of the DB) and go-workflows (direct to DB) in comparison? It
| seems like this is more of a step back towards the traditional
| queue model, like asynq, which is what the industry is moving
| away from toward the Temporal model.
| bgentry wrote:
| I don't think these approaches are necessarily mutually
| exclusive. There are some great things that can be layered on
| top of the foundation we've built, including workflows. The
| best part about doing this is that you can maintain full
| transactionality within a single primary data store and not
| introduce another 3rd party or external service into your
| availability equation.
| endorphine wrote:
| How does this compare to https://github.com/vgarvardt/gue?
| gregwebs wrote:
| Or neoq. https://news.ycombinator.com/item?id=38352778
| csarva wrote:
| Not familiar with either project, but it seems gue is a fork
| of the author's previous project,
| https://github.com/bgentry/que-go
| bgentry wrote:
| Yes, there's a note in the readme to that effect although I
| don't think they bear much resemblance anymore. que-go was
| an experiment I hacked up on a plane ride 9 years ago and
| thought was worth sharing. I was never happy with its
| technical design: holding a transaction for the duration of
| a job severely limits available use cases and worsens bloat
| issues. It was also never something I intended to continue
| developing alongside my other priorities at the time and I
| should have made that clearer in the project's readme from
| the start.
| radicalbyte wrote:
| I don't know why people even use libraries in those languages:
| assuming you stick to a database engine you understand well,
| the (main)-database-as-queue pattern is trivial to implement.
| Any time spent writing code is quickly won back by not having
| to debug weird edge cases, and sometimes you can highly
| optimize what you're doing (for example, it becomes easy to
| migrate jobs which are data-dominated to the DB server, which
| can cut processing time by 2-3 orders of magnitude).
|
| It's particularly suited to use cases such as background jobs,
| workflows, or other operations which occur within your
| application, and it scales well enough for what 99.9999% of us
| will be doing.
| linux2647 wrote:
| Is there a minimum version of Postgres needed to use this? I'm
| having trouble finding that information in the docs
| surprisetalk wrote:
| I love PG job queues!
|
| They're surprisingly easy to implement in plain SQL:
|
| [1] https://taylor.town/pg-task
|
| The nice thing about this implementation is that you can query
| within the same transaction window.
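| 
| For flavor, the core of it is just a table (illustrative, not
| taken from the linked post):
| 
|     const schemaSQL = `
|         CREATE TABLE IF NOT EXISTS jobs (
|             id         bigserial PRIMARY KEY,
|             kind       text        NOT NULL,
|             args       jsonb       NOT NULL DEFAULT '{}',
|             run_at     timestamptz NOT NULL DEFAULT now(),
|             attempts   int         NOT NULL DEFAULT 0,
|             last_error text
|         );
|         CREATE INDEX IF NOT EXISTS jobs_run_at_idx ON jobs (run_at)`
| 
|     // ...plus a dequeue query using FOR UPDATE SKIP LOCKED and an
|     // UPDATE/DELETE to finish or retry a job.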
| rockwotj wrote:
| Agreed. Shortwave [1] is built completely on this, but with the
| added layer of having a leasing system that is per user on top
| of the tasks. So you only need to `SKIP LOCKED` to grab a
| lease, then you can grab as many tasks as you want and process
| them in bulk. It allows higher throughput of tasks, and it was
| also required for the use case, as the leases were tied to a
| user and tasks for a single user must be processed in order.
|
| [1]: https://www.shortwave.com/
| RedShift1 wrote:
| All these job queue implementations do the same thing, right:
| SELECT ... FOR UPDATE SKIP LOCKED? Why does every programming
| language need its own variant?
| rockwotj wrote:
| To work with each language's drivers
| RedShift1 wrote:
| But it's the same thing every time. Turn autocommit off, run
| the SELECT, commit, repeat? Or am I missing something?
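| 
| I.e. the claim step is always some flavor of this (sketch;
| column names vary by library):
| 
|     // assumes: import ("context"; "database/sql")
|     func claim(ctx context.Context, db *sql.DB) (id int64, args []byte, err error) {
|         tx, err := db.BeginTx(ctx, nil)
|         if err != nil {
|             return 0, nil, err
|         }
|         defer tx.Rollback()
| 
|         err = tx.QueryRowContext(ctx, `
|             UPDATE jobs
|             SET state = 'running', attempts = attempts + 1
|             WHERE id = (
|                 SELECT id FROM jobs
|                 WHERE state = 'available' AND run_at <= now()
|                 ORDER BY priority, run_at
|                 LIMIT 1
|                 FOR UPDATE SKIP LOCKED
|             )
|             RETURNING id, args`).Scan(&id, &args)
|         if err != nil {
|             return 0, nil, err // includes sql.ErrNoRows when the queue is empty
|         }
|         return id, args, tx.Commit()
|     }
|     // ...then run the job and UPDATE state to 'completed' or 'retryable'.
|     // What differs between libraries is everything around this step:
|     // notifications vs. polling, rescuing stuck 'running' jobs,
|     // retries/backoff, uniqueness, and so on.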
| bennyp101 wrote:
| Nice, I've been using graphile-worker [0] for a while now, and it
| handles our needs perfectly, so I can totally see why you want
| something in the go world.
|
| Just skimming the docs, can you add a job directly via the DB? So
| a native trigger could add a job in? Or does it have to go via a
| client?
|
| [0] https://worker.graphile.org/
| rubenfiszel wrote:
| Looks cool, and thanks for sharing. Founder of windmill.dev
| here, an open-source, extremely fast workflow engine to run
| jobs in ts, py, go, sh, whose most important piece, the queue,
| is also just rust + postgresql (and mostly FOR UPDATE SKIP
| LOCKED).
|
| I'd be curious to compare performances once you guys are
| comfortable with that, we do them openly and everyday on:
| https://github.com/windmill-labs/windmill/tree/benchmarks
|
| I wasn't aware of the skip B-tree splits and the REINDEX
| CONCURRENTLY tricks. But I'm curious what you index in your
| jobs that uses those. We mostly rely on the tag/queue_name
| (which has a small cardinality), scheduled_for, and a running
| boolean, which don't seem like a good fit for B-trees.
| JoshGlazebrook wrote:
| I didn't really see this feature, but I think another good one
| would be a way to schedule a future job that is not periodic,
| i.e. "schedule a job in 1 hr", where it's either not enqueued
| or not available to be consumed until (at least) the scheduled
| time.
| bgentry wrote:
| You've found an underdocumented feature, but in fact River does
| already do what you're asking for! Check out `ScheduledAt` on
| the `InsertOpts`:
| https://pkg.go.dev/github.com/riverqueue/river#InsertOpts
|
| I'll try to work this into the higher level docs website later
| today with an example :)
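| 
| Quick sketch (roughly; see the InsertOpts godoc above for the
| exact shape, error handling elided):
| 
|     type ReminderArgs struct {
|         UserID int64 `json:"user_id"`
|     }
| 
|     func (ReminderArgs) Kind() string { return "reminder" }
| 
|     // ...
|     _, err := client.Insert(ctx, ReminderArgs{UserID: 123}, &river.InsertOpts{
|         ScheduledAt: time.Now().Add(time.Hour), // not worked before this time
|     })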
| andrewstuart wrote:
| The thing is, there are now vast numbers of queue solutions.
| 
| What's the goal for the project? Is it to be commercial? If
| so, you face a massive headwind because it's so incredibly
| easy to implement a queue now.
| youerbt wrote:
| Job queues in an RDBMS are always so controversial here on HN,
| which is kinda sad. Not everybody needs insane scale or
| whatever else dedicated solutions offer. Not to mention that
| if you already have an RDBMS lying around, you don't have to
| pay for extra complexity.
| sotraw wrote:
| If you are on Kafka already, there is an alternative to schedule
| a job without PG [0]
|
| [0] https://www.wgtwo.com/blog/kafka-timers/
| gregwebs wrote:
| We are looking right now to use a stable PG job queue built in
| Go. We have found 2 already existing ones:
|
| * neoq: https://github.com/acaloiaro/neoq
|
| * gue: https://github.com/vgarvardt/gue
|
| Neoq is new and we found it to have some features (like
| scheduling tasks) that were attractive. The maintainer has also
| been responsive to fixing our bug reports and addressing our
| concerns as we try it out.
|
| Gue has been around for a while and is probably serving its users
| well.
|
| Looking forward to trying out River now. I do wonder if neoq and
| river might be better off joining forces.
| sorentwo wrote:
| > Work in a transaction has other benefits too. Postgres' NOTIFY
| respects transactions, so the moment a job is ready to work a job
| queue can wake a worker to work it, bringing the mean delay
| before work happens down to the sub-millisecond level.
|
| Oban just went the opposite way, removing the use of database
| triggers for insert notifications and moving them into the
| application layer instead [1]. The prevalence of poolers like
| pgbouncer, which prevent NOTIFY from ever triggering, and the
| extra db load of trigger handling weren't worth it.
|
| [1]:
| https://github.com/sorentwo/oban/commit/7688651446a76d766f39...
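| 
| For anyone unfamiliar, the mechanism being discussed is roughly
| this (sketch with pgx v5; the channel name is illustrative).
| NOTIFY is transactional, so the notification only goes out if
| the enqueueing transaction commits; transaction-pooling proxies
| break it because LISTEN needs a session pinned to one server
| connection:
| 
|     // assumes: import ("context"; "github.com/jackc/pgx/v5")
|     func waitForJob(ctx context.Context, conn *pgx.Conn) error {
|         if _, err := conn.Exec(ctx, `LISTEN job_inserted`); err != nil {
|             return err
|         }
|         n, err := conn.WaitForNotification(ctx) // blocks until NOTIFY or ctx ends
|         if err != nil {
|             return err
|         }
|         _ = n.Payload // e.g. the queue name; now go fetch the job
|         return nil
|     }
| 
|     // Producer side, inside the same transaction as the job INSERT:
|     //   SELECT pg_notify('job_inserted', 'default');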
| bojanz wrote:
| This looks like a great effort and I am looking forward to trying
| it out.
|
| I am a bit confused by the choice of the LGPL 3.0 license. It
| requires one to dynamically link the library to avoid GPL's
| virality, but in a language like Go that statically links
| everything, it becomes impossible to satisfy the requirements of
| the license, unless we ignore what it says and focus just on its
| spirit. I see that this was discussed previously by the
| community in posts such as these [1][2][3].
|
| I am assuming that bgentry and brandur have strong thoughts on
| the topic since they avoided the default Go license choice of
| BSD/MIT, so I'd love to hear more.
|
| [1] https://www.makeworld.space/2021/01/lgpl_go.html [2]
| https://golang-nuts.narkive.com/41XkIlzJ/go-lgpl-and-static-...
| [3]
| https://softwareengineering.stackexchange.com/questions/1790...
| sorentwo wrote:
| The number of features lifted directly from Oban[1] is
| astounding, considering there isn't any attribution in the
| announcement post or the repo.
|
| Starting with the project's tagline, "Robust job processing in
| Elixir", let's see what else:
| 
|   - The same job states, including the British spelling for
|     `cancelled`
|   - Snoozing and cancelling jobs inline
|   - The prioritization system
|   - Tracking where jobs were attempted in an attempted_by
|     column
|   - Storing a list of errors inline on the job
|   - The same check constraints and the same compound indexes
|   - Almost the entire table schema, really
|   - Unique jobs with the exact same option names
|   - Table-backed leadership election
|
| Please give some credit where it's due.
|
| [1]: https://github.com/sorentwo/oban
| kamikaz1k wrote:
| ...how else do you spell cancelled? With one l? Wow, learning
| this on my keyboard as I type this...
| bgentry wrote:
| Hi Parker, I'm genuinely sorry it comes across as though we
| lifted this stuff directly from Oban. I do mean it when I say
| that Oban has been a huge inspiration, particularly around its
| technical design and clean UX.
|
| Some of what you've mentioned are cases where we surveyed a
| variety of our favorite job engines and concluded that we
| thought Oban's way was superior, whereas for others we cycled
| through a few different implementations before apparently
| landing in a similar place. I'm not quite sure what
| to say on the spelling of "cancelled" though, I've always
| written it that way and can't help but read "canceled" like
| "concealed" in my head :)
|
| As I think I mentioned when we first chatted years ago, this
| has been a hobby interest of mine for many years, so when a
| new database queue library pops up I tend to go see how it
| works.
| We've been in a bit of a mad dash trying to get this ready for
| release and didn't even think about crediting the projects that
| inspired us, but I'll sync with Brandur and make sure we can
| figure out the right way to do that.
|
| I really appreciate you raising your concerns here and would
| love to talk further if you'd like. I just sent you an email to
| reconnect.
| throwawaymaths wrote:
| Ok so maybe just put it on the github readme? "Inspired by
| Oban, and X, and Y..."
|
| JFC. One line of code you don't even have to test.
| sa46 wrote:
| I wrote our own little Go and Postgres job queue similar in
| spirit. Some tricks we used:
|
| - Use FOR NO KEY UPDATE instead of FOR UPDATE so you don't
| block inserts into tables with a foreign key relationship to
| the job table (sketch below). [1]
| 
| - We parallelize workers by tenant_id but process a single
| tenant sequentially. I didn't see anything in the docs about
| that use case; might be worth some design time.
|
| [1]: https://www.migops.com/blog/select-for-update-and-its-
| behavi...
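| 
| Concretely, the first trick is just a different lock strength
| in the usual claim query (sketch):
| 
|     // FOR NO KEY UPDATE takes a weaker row lock than FOR UPDATE, so it
|     // doesn't conflict with the FOR KEY SHARE locks that foreign-key
|     // checks take when inserting rows that reference jobs.id.
|     const claimSQL = `
|         SELECT id, args FROM jobs
|         WHERE state = 'available'
|         ORDER BY run_at
|         LIMIT 1
|         FOR NO KEY UPDATE SKIP LOCKED`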
| molszanski wrote:
| Would love to see an SQLite driver
| chuckhend wrote:
| Awesome! Seems like this would be a lot easier to work with
| and perhaps more performant than Skye's pg-queue? Queue
| workload is a lot like OLTP, which, IMO, makes Postgres great
| for it (but does require some extra tuning).
|
| Unlike https://github.com/tembo-io/pgmq, a project we've been
| working on at Tembo, many queue projects still require you to
| run and manage a process external to the database, like a
| background worker. Or they ship as a client library and live
| in your application, which will limit the languages you can
| choose to work with. PGMQ is a pure SQL API, so any language
| that can connect to Postgres can use it.
___________________________________________________________________
(page generated 2023-11-20 23:00 UTC)