[HN Gopher] Show HN: An SQS Alternative on Postgres
       ___________________________________________________________________
        
       Show HN: An SQS Alternative on Postgres
        
       Author : chuckhend
       Score  : 165 points
       Date   : 2024-05-09 12:21 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | conroy wrote:
       | I'm curious how this performs compared to River
       | https://riverqueue.com/
       | https://news.ycombinator.com/item?id=38349716
        
         | chuckhend wrote:
         | I think it would be tough to compare. There are client
         | libraries for several languages, but the project is mostly a
         | SQL API to the queue operations like send, read, archive,
         | delete using the same semantics as SQS/RSMQ.
         | 
         | Any language that can connect to Postgres can use PGMQ, whereas
         | it seems River is Go only?
        
       | thangngoc89 wrote:
       | I'm wondering if there are language agnostic queues where the
       | queue consumers and publishers could be written in different
       | languages?
        
         | rco8786 wrote:
         | That's exactly what this is. You write your own
         | consumer/publisher code however you want, and interact with the
         | queues via SQL queries.
        
         | SideburnsOfDoom wrote:
         | That's normal, yes. Name a queuing system, and with very few
         | exceptions it will have clients for a variety of languages.
         | 
          | It is also normal to exchange message content as JSON,
          | protobuf or a similar format, which again can be processed by
          | any language that aims to be widely used.
         | 
         | In fact, are there any queues that aren't language-agnostic? I
         | had the idea that ZeroMQ was a C/C++ only thing, but I checked
         | the docs and it's got the usual plethora of language bindings
         | https://zeromq.org/get-started/
         | 
          | So right now I can't name _any_ queue systems that are single
          | language. They're _all_ aimed at interop.
        
           | thangngoc89 wrote:
            | Interesting. I've been looking for a much simpler system
            | than Celery to publish jobs from Golang and consume jobs on
            | the Python side (AI/ML stuff). This thread gave a lot of
            | names and links for further investigation.
        
             | anamexis wrote:
             | As mentioned, there are a plethora of queuing systems that
             | have cross platform clients.
             | 
             | If you're interested specifically in a background job
             | system, you may want to check out Faktory. It's Mike
             | Perham's language-agnostic follow-up to Sidekiq.
             | 
             | https://github.com/contribsys/faktory
        
       | bastawhiz wrote:
       | > Guaranteed "exactly once" delivery of messages to a consumer
       | within a visibility timeout
       | 
       | That's not going to be true. It might be true when things are
       | running well, but when it fails, it'll either be at most once or
       | at least once. You don't build for the steady state, you build
       | against the failure mode. That's an important deciding factor in
       | whether you choose a system: you can accept duplicates gracefully
       | or you can accept some amount of data loss.
       | 
       | Without reviewing all of the code, it's not possible to say what
       | this actually is, but since it seems like it's up to the
       | implementor to set up replication, I suspect this is an at-most-
       | once queue (if the client receives a response before the server
       | has replicated the data and the server is destroyed, the data is
       | lost). But depending on the diligence of the developer, it could
       | be that this provides no real guarantees (0-N deliveries).
        
         | MuffinFlavored wrote:
         | > That's not going to be true. It might be true when things are
         | running well, but when it fails, it'll either be at most once
         | or at least once.
         | 
         | Silly question as somebody not very deep in the details on
         | this.
         | 
         | It's not easy to make distributed systems idempotent across the
         | board (POST vs PUT, etc.)
         | 
         | Distributed rollbacks are also hard once you reach interacting
         | with 3rd party APIs, databases, cache, etc.
         | 
         | What is the trick in your "on message received handler" from
         | the queue to achieve "exactly once"? Some kind of "message hash
         | ID" and then you check in Redis if it has already been
         | processed either fully successfully, partially, or with
         | failures? That has drawbacks/problems too, no? Is it an
         | impossible problem?
        
           | samtheprogram wrote:
           | You don't achieve exactly once at the protocol/queue level,
           | but at the service/consumer level. This is why SQS guarantees
           | at-least-once.
           | 
           | It's generally what you described, but the term I've seen is
           | "nonce" which is essentially an ID on the message that's
           | unique, and you can check against an in-memory data store or
           | similar to see if you've already processed the message, and
           | simply return / stop processing the message/job if so.
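The nonce-based deduplication described above can be sketched in a few lines. This is a minimal illustrative model, not SQS's or PGMQ's implementation: a real system would use Redis or a database table with a TTL rather than this in-process set, and `handle`, `processed_nonces`, and the message shape are all hypothetical names.

```python
# Minimal sketch of "exactly-once processing" on top of an at-least-once
# queue: skip any message whose nonce has already been seen. In production
# the seen-set would live in Redis or a DB table, not process memory.
processed_nonces = set()

def handle(message, process):
    """Process a message only if its nonce hasn't been seen before."""
    nonce = message["nonce"]
    if nonce in processed_nonces:
        return "skipped"           # duplicate delivery: do nothing
    result = process(message)      # must finish before we record the nonce
    processed_nonces.add(nonce)    # the gap between these two lines is why
    return result                  # true exactly-once remains impossible

# Simulate the same message being delivered twice:
msg = {"nonce": "abc-123", "body": "charge customer"}
outcomes = [handle(msg, lambda m: "processed"),
            handle(msg, lambda m: "processed")]
```

Note the comment in the middle: if the process crashes after `process()` succeeds but before the nonce is recorded, a redelivery will be processed again, which is the gap the next comment describes.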
        
             | mcqueenjordan wrote:
             | And just to add a small clarification since I had to double
             | take: this isn't exactly-once delivery (which isn't
             | possible), this is exactly-once processing. But even
             | exactly-once processing generally has issues, so it's
              | better to assume at-least-once processing as the thing to
             | design for and try to make everything within your
             | processing ~idempotent.
        
           | bastawhiz wrote:
           | > What is the trick in your "on message received handler"
           | from the queue to achieve "exactly once"?
           | 
           | There's no trick. The problem isn't "how to build a robust
           | enough system" it's "how can you possibly know whether the
           | message was successfully processed or not". If you have one
           | node that stores data for the queue, loss of that node means
           | the data is gone. You don't know you don't have it. If you
           | have multiple nodes running the queue and they stop talking
           | to each other, how do you know whether one of the other nodes
           | already allowed someone to process a particular message
            | (e.g., in the face of a network partition), or _didn't_ let
            | someone else process the message (e.g., that node experienced
            | hardware failure)?
           | 
           | At some point, you find yourself staring down the Byzantine
           | generals problem. You can make systems pretty robust (and
           | some folks have made extremely robust systems), but there's
           | not really a completely watertight solution. And in most
           | cases, the cost of watertightness simply isn't worth it
           | compared to just making your system idempotent or accepting
           | some reasonably small amount of data loss.
        
         | Justsignedup wrote:
          | To be honest, I like the at-least-once constraint. It forces
          | you to think of background jobs in an idempotent way and makes
          | for better designs. So ultimately I never found the removal of
          | it a mitzvah.
         | 
         | Also if you don't have the constraint and say it works and you
         | need to change systems for any reason, now you gotta rewrite
         | your workers.
        
           | chuckhend wrote:
           | I agree. But it can be useful to have a guarantee, even for a
           | specified period of time, that the message will only be seen
           | once. For example, if the processing of that message is very
           | expensive, such as if that message results in API requests to
           | a very expensive SaaS service. It may be idempotent to
           | process that message more times than necessary, but doing so
           | may be cost prohibitive if you are billed per request. I
           | think this is a case where using the VT to help you only
           | process that message one time could help out quite a bit.
        
         | cjen wrote:
          | It will be true because of the "within a visibility timeout",
          | right? Of course that makes the claim way less interesting.
         | 
         | I took a peek at the code and it looks like their visibility
         | timeout is pretty much a lock on a message. So it's not exactly
         | once for any meaningful definition, but it does prevent the
         | same message from being consumed multiple times within the
         | visibility timeout.
        
           | bastawhiz wrote:
           | > it does prevent the same message from being consumed
           | multiple times within the visibility timeout.
           | 
           | ... When there is no failure of the underlying system. The
           | expectation of any queue is that messages are only delivered
           | once. But that's not what's interesting: what matters is what
           | happens when there's a system failure: either the message
           | gets delivered more than once, the message gets delivered
           | zero times, or a little of column A and a little of column B
           | (which is the worst of both worlds and is a bug). If you have
           | one queue node, it can fail and lose your data. If you have
           | multiple queue nodes, you can have a network partition. In
              | all cases, it's possible to not know whether the message
              | was processed or not _at some point_.
        
             | thethimble wrote:
             | The Two Generals Problem is a great thought experiment that
             | describes how distributed consensus is impossible when
             | communication between nodes has the possibility of failing.
             | 
             | https://en.wikipedia.org/wiki/Two_Generals%27_Problem
        
         | bilekas wrote:
          | I'm curious about the same thing. I use a dead-letter queue in
          | SQS and even SNS, and here I can see an archiver, but it's not
          | clear if I need to roll out my own dead-letter behaviour here.
          | Not sure I would be too confident in it either.
          | 
          | A nice novel project, but I'm a bit skeptical of its
          | application, for me at least.
        
           | chuckhend wrote:
           | pgmq.archive() gives us an API to retain messages on your
            | queue; it's an alternative to pgmq.delete(). For me as a
            | long-time Redis user, message retention was always important
            | and was always extra work to implement.
           | 
            | A DLQ isn't a built-in feature of PGMQ yet. We run PGMQ in
            | our SaaS at Tembo.io, and the way we implement a DLQ is by
           | checking the message's read_ct value. When it exceeds some
           | value, we send the message to another queue rather than
           | processing it. Successfully processed messages end up getting
           | pgmq.archive()'d.
           | 
            | Soon, we will be integrating
            | https://github.com/tembo-io/pg_tier into pgmq so that the
            | archive table is put directly into cloud storage/S3.
        
         | chuckhend wrote:
         | If the message never reaches the queue (network error, database
         | is down, app is down, etc), then yes that is a 0 delivery
         | scenario. Once the message reaches the queue though, it is
         | guaranteed that only a single consumer can read the message for
         | the duration of the visibility timeout. FOR UPDATE guarantees
         | only a single consumer can read the record, and the visibility
         | timeout means we don't have to hold a lock. After that
         | visibility timeout expires, it is an at-least-once scenario.
         | Any suggestion for how we could change the verbiage on the
         | readme to make that more clear?
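The read semantics described above (a read "locks" a message by pushing its visibility time into the future) can be modeled without a database. This is a toy in-memory sketch of the behavior, not PGMQ's actual code; `ToyQueue` and its field names are illustrative, though `vt` and `read_ct` mirror the PGMQ column names mentioned in this thread.

```python
import time

# Toy model of PGMQ-style read semantics: reading a message sets its
# visible-at time (vt) into the future, so no other consumer sees it
# until the visibility timeout expires. Purely illustrative.
class ToyQueue:
    def __init__(self):
        self.messages = []   # each: {"id", "body", "vt", "read_ct"}
        self.next_id = 1

    def send(self, body):
        self.messages.append({"id": self.next_id, "body": body,
                              "vt": 0.0, "read_ct": 0})
        self.next_id += 1

    def read(self, vt_seconds):
        now = time.monotonic()
        for m in self.messages:
            if m["vt"] <= now:            # only visible messages qualify
                m["vt"] = now + vt_seconds
                m["read_ct"] += 1
                return m
        return None                       # everything is locked, or empty

    def delete(self, msg_id):
        """Consumer calls this on success; otherwise the message
        reappears after the timeout (the at-least-once case)."""
        self.messages = [m for m in self.messages if m["id"] != msg_id]

q = ToyQueue()
q.send("hello")
first = q.read(vt_seconds=30)    # consumer A gets the message
second = q.read(vt_seconds=30)   # consumer B sees nothing within the vt
```

If consumer A never deletes the message, it becomes visible again after 30 seconds and another consumer will read it, which is exactly the at-least-once scenario described above.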
        
           | Fire-Dragon-DoL wrote:
           | You are talking about distributed systems, nobody expects to
           | read "exactly once" delivery. If I read that on the docs, I
           | consider that a huge red flag.
           | 
           | And the fact is that what you describe is a performance
           | optimization, I still have to write my code so that it is
           | idempotent, so that optimization does not affect me in any
           | other way, because exactly once is not a thing.
           | 
           | All of this to say, I'm not even sure it's worth mentioning?
        
           | cryptonector wrote:
           | As u/Fire-Dragon-DoL says, you can have an exactly-once
           | guarantee within this small system, but you can't extend it
           | beyond that. And even then, I don't think you can have it. If
           | the DB is full, you can't get new messages and they might get
           | dropped, and if the threads picking up events get stuck or
           | die then messages might go unprocessed and once again the
           | guarantee is violated. And if processing an event requires
           | having external side-effects that cannot be rolled back then
           | you're violating the at-most-once part of the exactly-once
           | guarantee.
        
             | bastawhiz wrote:
             | > you can have an exactly-once guarantee within this small
             | system
             | 
             | It's actually harder to do this in a small system. I submit
             | a message to the queue, and it's saved and acknowledged.
             | Then the hard drive fails before the message is requested
             | by a consumer. It's gone. Zero deliveries, or "at most
             | once".
             | 
             | > If the DB is full, you can't get new messages and they
             | might get dropped
             | 
             | In this case, I'd expect the producer to receive a failure,
             | so technically there's nothing to deliver.
             | 
             | > if the threads picking up events get stuck or die
             | 
             | While this obviously affects delivery, this is a
             | concurrency bug and not a fundamental design choice. The
             | failures folks usually refer to in this context are ones
             | that are outside of your control, like hardware failures or
             | power outages.
        
           | bastawhiz wrote:
           | > Once the message reaches the queue though, it is guaranteed
           | that only a single consumer can read the message for the
           | duration of the visibility timeout.
           | 
           | But does the message get persisted and replicated before
           | _any_ consumer gets the message (and after the submission of
            | the message is acked)? If it's a single node, the answer is
            | simply "no": the hard drive can melt before anyone reads the
            | message and the data is lost. It's not "exactly once" if
           | nobody gets the message.
           | 
           | And if the message is persisted and replicated, but there's
           | subsequently a network partition, do multiple consumers get
           | to read the message? What happens if writing the confirmation
           | from the consumer fails, does the visibility timeout still
           | expire?
           | 
           | > After that visibility timeout expires, it is an at-least-
           | once scenario.
           | 
           | That's not really what "at least once" refers to. That's
           | normal operation, and sets the normal expectations of how the
           | system should work under normal conditions. What matters is
           | what happens under _abnormal_ conditions.
        
         | cryptonector wrote:
         | Quite. The problem with a visibility timeout is that there can
         | be delays in handling a message, and if the original winner
         | fails to do it in time then some other process/thread will,
         | yes, but if processing has external side-effects that can't be
         | undone then the at-most-once guarantee will definitely be
         | violated. And if you run out of storage space and messages have
         | to get dropped then you'll definitely violate the at-least-once
         | guarantee.
        
       | pyuser583 wrote:
       | What advantages does this have over RabbitMQ?
       | 
        | My experience is that Postgres queuing makes sense when you
        | must extract or persist data in the same Postgres instance.
        | 
        | Otherwise, there's no advantage over standard MQ systems.
        | 
        | Is there something I don't know?
        
         | vbezhenar wrote:
         | One big advantage of using queue inside DB is that you can
         | actually use queue operations in the same transaction as your
         | data operations. It makes everything incredibly simpler when it
         | comes to failure modes.
         | 
         | IMO 90% of software which uses external queues is buggy when it
         | comes to edge cases.
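The transactional advantage described above can be sketched with SQLite standing in for Postgres. This is an illustrative toy, not PGMQ: `place_order`, the table names, and the simulated crash are all hypothetical, but the point is the same, since the business row and the queued job commit or roll back together.

```python
import sqlite3

# Sketch of "queue operations in the same transaction as data operations",
# using SQLite as a stand-in for Postgres. The order row and its job row
# commit or roll back together: no window where one exists without the other.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE jobs   (id INTEGER PRIMARY KEY, order_id INTEGER, kind TEXT);
""")

def place_order(item, fail=False):
    try:
        with db:  # one transaction for the data write and the enqueue
            cur = db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
            db.execute("INSERT INTO jobs (order_id, kind) VALUES (?, ?)",
                       (cur.lastrowid, "send_email"))
            if fail:
                raise RuntimeError("crash before commit")
    except RuntimeError:
        pass  # rollback happened: neither row was written

place_order("book")              # commits: one order, one job
place_order("lamp", fail=True)   # rolls back: nothing written
orders = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
jobs = db.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
```

With an external broker, the failure case leaves either an order with no job or a job for an order that doesn't exist, which is the class of edge-case bugs the comment above refers to.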
        
           | zbentley wrote:
           | A fair point. If you do need an external queue for any reason
           | (legacy/already have one, advanced routing semantics,
           | integrations for external stream processors, etc.) the
           | "Transactional Outbox" pattern provides a way to have your
           | cake and eat it too here--but only for produce operations.
           | 
           | In this pattern, publishers write to an RDBMS table on
           | publish, and then best-effort publish to the message broker
           | after RDBMS transaction commit, deleting the row on publish
           | success (optionally doing this in a failure-
           | swallowing/background-threaded way). An external scheduled
           | job polls the "publishes" table and republishes any rows that
           | failed to make it to the message broker later on. When
           | coupled with inbound message deduplication (a feature many
           | message brokers now support to some degree) and/or consumer
           | idempotency, this is a pretty robust way to reduce the
           | reliability hit of an external message broker being in your
           | transaction processing path.
           | 
           | It's not a panacea, in that it doesn't help with
           | transactional processing/consumption and imposes some extra
           | DB load, but is fairly easy to adopt in an ad-hoc/don't-have-
           | to-rewrite-the-whole-app way.
           | 
           | https://microservices.io/patterns/data/transactional-
           | outbox....
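The outbox flow described above can be sketched with SQLite as the RDBMS and a plain list standing in for the external broker. All names here (`record_payment`, `relay`, the tables) are illustrative, and the "broker outage" is simulated with a flag.

```python
import sqlite3

# Sketch of the transactional-outbox pattern: the publish intent commits
# atomically with the business data; a relay job later drains the outbox
# into the (possibly flaky) external broker, deleting rows on success.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE outbox   (id INTEGER PRIMARY KEY, payload TEXT);
""")
broker = []  # stand-in for RabbitMQ/Kafka/SQS

def record_payment(amount):
    with db:  # payment and publish intent commit atomically
        db.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f"payment:{amount}",))

def relay(broker_up=True):
    """Scheduled job: republish outbox rows, delete on confirmed publish."""
    rows = db.execute("SELECT id, payload FROM outbox").fetchall()
    for row_id, payload in rows:
        if not broker_up:
            return  # broker down: rows survive, retried on the next run
        broker.append(payload)
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()

record_payment(100)
relay(broker_up=False)   # broker outage: message stays in the outbox
relay(broker_up=True)    # next scheduled run delivers it
```

Because the relay may publish a row and then crash before deleting it, the broker can still see duplicates, which is why the comment pairs this with consumer idempotency or broker-side deduplication.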
        
           | ilkhan4 wrote:
           | Sure, not having to spin up a separate server is nice, but
           | this aspect is underappreciated, imo. We eliminated a whole
           | class of errors and edge cases at my day job just by
           | switching the event enqueue to the same DB transaction as the
           | things that triggered them. It does create a bottleneck at
           | the DB, but as others have commented, you probably aren't
           | going to need the scalability as much as you think you do.
        
         | borplk wrote:
         | (Not the author)
         | 
         | The advantage of using your DB as a queue is that a traditional
         | DB is easier to interact with (for example using SQL queries to
         | view or edit the state of your queue).
         | 
         | In most business applications the message payload is a "job_id"
          | pointing to a DB table, so you always have to go back
         | to the database to do something useful anyway. With this setup
         | it's one less thing to worry about and you can take full
         | advantage of SQL and traditional database features.
         | 
          | The main downside and bottleneck of having your DB act as a
          | queue is the worker processes hitting the DB too frequently
          | to reserve their next job.
         | 
         | Most applications will not reach the level of scale for that to
         | be a problem.
         | 
          | However, if it does become a problem there is an elegant
          | solution: continuously populate a real queue by querying the
          | DB and putting items in it (a "feeder process"). Now you can
          | let the workers reserve their jobs from the real queue so the
          | DB is not being hit as frequently.
         | 
         | The workers will still interact with the DB as part of doing
         | their work of course. However they will not ask their "give me
         | my next job ID" question to the DB. They get it from the real
         | queue which is more efficient for that kind of QPOP operation.
         | 
          | This solution gives you the best of both worlds: you get the
          | best features of something like Postgres as the storage
          | backend for your jobs without the downside of hammering the
          | DB to get the next available job (though in general the DB
          | alone can scale quite well for 95% of the businesses out
          | there).
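The feeder-process idea above can be sketched with an in-memory queue standing in for the "real" queue and a list standing in for the jobs table. Everything here (`feeder`, `worker`, the job shape) is illustrative; in practice the feeder would poll the database on a schedule and workers would pop from Redis, SQS, or similar.

```python
import queue

# Sketch of a "feeder process": jobs live in the database, but workers
# pop job IDs from a fast queue so they don't hammer the DB with
# "give me my next job" queries.
db_jobs = [{"id": i, "state": "pending"} for i in range(1, 6)]  # fake table
fast_queue = queue.Queue()

def feeder():
    """Periodically move pending job IDs from the DB into the real queue."""
    for job in db_jobs:
        if job["state"] == "pending":
            job["state"] = "queued"   # mark it so it isn't fed twice
            fast_queue.put(job["id"])

def worker():
    """Workers pop IDs from the fast queue; the actual work (and the
    final state update) would still hit the DB as usual."""
    done = []
    while not fast_queue.empty():
        done.append(fast_queue.get())
    return done

feeder()
completed = worker()
```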
        
         | chuckhend wrote:
          | Simplicity is one of the reasons we started this project. IMO,
          | there is far less maintenance overhead to running Postgres
          | compared to RabbitMQ, especially if you are already running
          | Postgres in your application stack. If PGMQ fits your
          | requirements, then you do not need to introduce a new
          | technology.
         | 
          | There's definitely use cases where PGMQ won't compare to
          | RabbitMQ, or Kafka, though.
        
           | ethagnawl wrote:
            | > There's definitely use cases where PGMQ won't compare to
            | > RabbitMQ, or Kafka, though.
           | 
           | I'd be curious to know more about which sorts of use cases
           | fall into this category.
        
             | chuckhend wrote:
             | PGMQ doesn't give you a way to deliver the same message to
             | concurrent consumers the same way that you can with Kafka
             | via consumer groups.
             | 
             | To get this with PGMQ, you'd need to do something like
             | creating multiple queues, then send messages to all the
             | queues within a transaction. e.g. `begin;
             | pgmq.send('queue_a'...); pgmq.send('queue_b'...); commit;`
        
         | 0x457 wrote:
          | The advantage is: if you already have PostgreSQL running, you
          | don't have to add RabbitMQ or any other technology.
        
       | seveibar wrote:
        | People considering this project should also probably consider
        | Graphile Worker[1]. I've scaled Graphile Worker to 10M daily
        | jobs just fine.
       | 
       | The behavior of this library is a bit different and in some ways
       | a bit lower level. If you are using something like this, expect
       | to get very intimate with it as you scale- a lot of times your
       | custom workload would really benefit from a custom index and it's
       | handy to understand how the underlying system works.
       | 
       | [1] https://worker.graphile.org/
        
         | philefstat wrote:
          | I have also used/introduced this at several places I've worked
          | and it's been great each time. My only qualm is that it's not
          | particularly easy to modify the exponential back-off timing
          | without hacky solutions. Have you ever found a good way to do
          | that?
        
         | valenterry wrote:
         | Is there something like that in the jvm world?
        
           | RedShift1 wrote:
           | You don't need anything specific. SELECT ... FROM queue FOR
           | UPDATE SKIP LOCKED is the secret sauce.
        
             | valenterry wrote:
             | Would be nice to get some goodies for free, like overview,
             | pausing, state, statistics etc. :-)
        
           | wmfiv wrote:
           | https://www.jobrunr.io//en/ seems to be popular at the
           | moment.
        
       | rco8786 wrote:
       | This is neat. Would be cool if there was support for a dead
       | letter or retry queue. The idea of deleting an event
       | transactionally with the result of processing said event is
       | pretty nice.
        
         | jpambrun wrote:
          | Retries are baked in, as a message becomes visible again after
          | a configurable time. For dead letters you can move the message
          | to another queue on the nth retry.
        
           | chuckhend wrote:
           | This is exactly how we do this in our SaaS at Tembo.io. We
           | check read_ct, and move the message if >= N. I think it would
           | be awesome if this were a built-in feature though.
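The read_ct-based dead-letter routing described above can be sketched in a few lines. PGMQ does expose `read_ct` on each read, but the rest of this (`handle`, `MAX_READS`, the message dicts) is an illustrative stand-in, not Tembo's code.

```python
# Sketch of a DLQ built on read_ct: if a message has already been
# attempted MAX_READS times, park it in a dead-letter queue instead of
# processing it again. The threshold and names are illustrative.
MAX_READS = 3
dead_letter_queue = []

def handle(message, process):
    if message["read_ct"] >= MAX_READS:
        dead_letter_queue.append(message)   # poison message: park it
        return "dead-lettered"
    return process(message)                 # normal path

fresh = {"id": 1, "read_ct": 1, "body": "ok"}
poison = {"id": 2, "read_ct": 3, "body": "always fails"}
results = [handle(fresh, lambda m: "processed"),
           handle(poison, lambda m: "processed")]
```

In the real setup, "park it" would be a `pgmq.send()` to a second queue followed by deleting or archiving the original, so the poison message stops blocking healthy ones.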
        
       | ltbarcly3 wrote:
       | Another one of these! It's interesting how many times this has
       | been made and abandoned and made again.
       | 
       | https://wiki.postgresql.org/wiki/PGQ_Tutorial
        | https://github.com/florentx/pgqueue
        | https://github.com/cirello-io/pgqueue
       | 
       | Hundreds of them! We have a home grown one called PGQ at work
       | here also.
       | 
       | It's a good idea and easy to implement, but still valuable to
       | have implemented already. Cool project.
        
         | mattbillenstein wrote:
         | Ha, I wrote one too - https://github.com/mattbillenstein/pg-
         | queue/blob/main/pg-que...
         | 
         | Roughly follows the semantics of beanstalkd which I used once
         | upon a time and quite liked.
        
         | arecurrence wrote:
         | Yeah, I've written a few of these and should probably release a
         | package at some point but each version has been somewhat domain
         | specific.
         | 
          | The last time we measured, we saw an immediate 99% performance
          | improvement over SNS+SQS. It was so dramatic that we were able
         | to reduce job resources simply due to the queue implementation
         | change.
         | 
         | There's a lot of useful and almost trivial features you can
         | throw in as well. SQS hasn't changed much in a long time.
        
       | dostoevsky013 wrote:
        | I'm not sure what the benefits are for a microservice
        | architecture. Do you expect other services/domains to connect to
        | your database to listen for events? How does it scale if you
        | have several microservices that need to publish events?
        | 
        | Or do you expect a dedicated database to be maintained for this
        | queue? It's worth comparing with other queue systems that
        | persist messages and can help you scale message processing,
        | like Kafka with topic partitions.
       | 
       | Found this article on how Revolut uses Postgres for events
       | processing: https://medium.com/revolut/recording-more-events-but-
       | where-w...
        
         | chuckhend wrote:
         | We talk a little bit in https://tembo.io/blog/managed-postgres-
         | rust about how we use PGMQ to run our SaaS at Tembo.io. We
          | could have run a Redis instance and used RSMQ, but it
         | simplified our architecture to stick with Postgres rather than
         | bringing in Redis.
         | 
         | As for scaling - normal Postgres scaling rules apply.
         | max_connections will determine how many concurrent applications
         | can connect. The queue workload (many insert, read, update,
         | delete) is very OLTP-like IMO, and Postgres handles that very
         | well. We wrote some about dealing with bloat in this blog:
         | https://tembo.io/blog/optimizing-postgres-auto-vacuum
        
       | airocker wrote:
        | I think it would be better if you created events automatically
        | based on commit events in the WAL.
        
       | bdcravens wrote:
        | Note there are a number of background job processors for
        | specific languages/frameworks that use PostgreSQL as the broker,
        | for example GoodJob and the upcoming SolidQueue in Ruby and
        | Rails.
        
       | cynicalsecurity wrote:
       | But why? Why not have a proper SQS service? What's the obsession
       | with Postgres?
        
         | poisonborz wrote:
         | no dependence on a third party
        
         | chuckhend wrote:
         | IMO, it is most valuable when you are looking for ways of
         | reducing complexity. For a lot of projects, if you're already
         | running Postgres then it is maybe not worth the added
         | complexity of bringing in another technology.
        
         | jpambrun wrote:
         | Why use a service that comes with lock-in and poor developer
         | experience when I can use the database I already have?
        
         | cryptonector wrote:
         | See https://news.ycombinator.com/item?id=40307454#40311843
        
       | RedShift1 wrote:
       | This seems like a lot of fluff for basically SELECT ... FROM
        | queue FOR UPDATE SKIP LOCKED? Why is the extension needed when
       | all it does is run some management type SQL?
        
         | acaloiar wrote:
         | A similar argument can be made of many primitives and their
         | corresponding higher order applications.
         | 
         | Why build higher order concept Y when it's simply built on
         | primitive X?
         | 
         | Why build the C programming language when C compilers simply
         | generate assembly or machine code?
         | 
         | ---
         | 
         | A sensible answer is that new abstractions make lower level
         | primitives easier to manage.
        
       | andrewstuart wrote:
       | Seems an unusual choice that this does not have an HTTP
       | interface.
       | 
       | HTTP is really the perfect client agnostic super simple way to
       | interface with a message queue.
        
       | ComputerGuru wrote:
       | This is strictly polling, no push or long poll support?
        
       ___________________________________________________________________
       (page generated 2024-05-09 23:00 UTC)