[HN Gopher] Multi-tenant queues in Postgres
       ___________________________________________________________________
        
       Multi-tenant queues in Postgres
        
       Author : abelanger
       Score  : 80 points
       Date   : 2024-04-18 15:26 UTC (7 hours ago)
        
 (HTM) web link (docs.hatchet.run)
 (TXT) w3m dump (docs.hatchet.run)
        
       | andrewstuart wrote:
        | I wonder if Postgres row-level security (RLS) is another
        | solution to this.
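        | 
        | A minimal sketch of per-tenant RLS, assuming a tasks table
        | with a tenant_id column (names are illustrative). It isolates
        | which rows a session can see, though it doesn't by itself do
        | fair scheduling:
        | 
        |     ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;
        | 
        |     CREATE POLICY tenant_isolation ON tasks
        |         USING (tenant_id = current_setting('app.tenant_id')::bigint);
        | 
        |     -- each worker session scopes itself to one tenant:
        |     SET app.tenant_id = '42';
        |     SELECT * FROM tasks;  -- sees only tenant 42's rows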
        
       | ucarion wrote:
       | This is pretty fancy stuff! Sorry if I'm just not reading
       | carefully enough, but does this approach account for tenants
       | whose messages take longer to process, as opposed to a tenant
       | that sends a larger volume of messages?
        
         | abelanger wrote:
         | No, this assumes you've already split your tasks into quanta of
         | approximately the same duration - so all tasks are weighted
          | equally in terms of execution time. If the tasks have
          | different weights, you might be looking at something like
          | deficit round-robin [1]; I've looked at implementations of it
          | in RabbitMQ [2], and we're thinking about how to implement it
          | in PG as well.
         | 
         | [1]
         | https://www.cs.bu.edu/fac/matta/Teaching/cs655-papers/DRR.pd...
         | 
         | [2] https://nithril.github.io/amqp/2015/07/05/fair-consuming-
         | wit...
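          | 
          | A rough sketch of what deficit-style accounting could look
          | like in PG (illustrative only - table and column names are
          | assumptions, and the deficit check below is simplified, not
          | race-free across workers):
          | 
          |     -- per-tenant credit, topped up by a quantum each round
          |     CREATE TABLE tenant_deficit (
          |         tenant_id  bigint PRIMARY KEY,
          |         deficit_ms bigint NOT NULL DEFAULT 0,
          |         quantum_ms bigint NOT NULL DEFAULT 100
          |     );
          | 
          |     -- each scheduling round: grant every tenant its quantum
          |     UPDATE tenant_deficit
          |     SET deficit_ms = deficit_ms + quantum_ms;
          | 
          |     -- claim a task only if its tenant has enough credit,
          |     -- then charge the estimated cost and mark it running
          |     WITH claimed AS (
          |         SELECT t.id, t.tenant_id, t.estimated_ms
          |         FROM tasks t
          |         JOIN tenant_deficit d USING (tenant_id)
          |         WHERE t.status = 'QUEUED'
          |           AND d.deficit_ms >= t.estimated_ms
          |         LIMIT 1
          |         FOR UPDATE OF t SKIP LOCKED
          |     ), charge AS (
          |         UPDATE tenant_deficit d
          |         SET deficit_ms = d.deficit_ms - c.estimated_ms
          |         FROM claimed c
          |         WHERE d.tenant_id = c.tenant_id
          |     )
          |     UPDATE tasks
          |     SET status = 'RUNNING'
          |     FROM claimed c
          |     WHERE tasks.id = c.id
          |     RETURNING tasks.id;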
        
       | jperras wrote:
       | Is the `FOR UPDATE SKIP LOCKED` in the CTE necessary? Granted my
        | understanding of Postgres row-level locking and its interaction
       | with CTEs may be a bit lacking, but according to the docs[1]:
       | The sub-statements in WITH are executed concurrently with each
       | other and with the main query. Therefore, when using data-
       | modifying statements in WITH, the order in which the specified
       | updates actually happen is unpredictable. All the statements are
       | executed with the same snapshot (see Chapter 13), so they cannot
       | "see" one another's effects on the target tables.
       | 
       | 1. https://www.postgresql.org/docs/current/queries-
       | with.html#QU...
        
         | mslot wrote:
         | Read committed mode (PostgreSQL's default) can get pretty
         | funky.
         | 
          | If two transactions concurrently perform a SELECT (maybe in a
         | CTE) followed by an UPDATE, then they might see and try to
         | update the same rows. That's often undesirable, for instance in
         | the example of a queue where messages are supposed to arrive
          | ~once. Serializable mode would "solve" the problem by letting
          | one transaction fail, expecting the application to retry or
          | otherwise deal with the consequences.
         | 
         | FOR UPDATE is a precision tool for working around read
         | committed limitations. It ensures rows are locked by whichever
         | transaction reads them first, such that the second reader
         | blocks and (here's the funky part) when the first transaction
         | is done it actually reads the latest row version instead of the
         | one that was in the snapshot. That's semantically a bit weird,
         | but nonetheless very useful, and actually matches how updates
         | work in PostgreSQL.
         | 
         | The biggest issue with SELECT..FOR UPDATE is that it blocks
         | waiting for concurrent updaters to finish, even if the rows no
         | longer match its filter after the update. The SKIP LOCKED
         | avoids all that by simply skipping the locked rows in the
         | SELECT. Semantically even weirder, but very useful for queues.
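          | 
          | For concreteness, the usual dequeue shape is something like
          | this (a sketch, with assumed table and column names):
          | 
          |     -- claim one queued task without blocking on rows that
          |     -- another worker has already locked
          |     WITH next_task AS (
          |         SELECT id
          |         FROM tasks
          |         WHERE status = 'QUEUED'
          |         ORDER BY id
          |         LIMIT 1
          |         FOR UPDATE SKIP LOCKED
          |     )
          |     UPDATE tasks
          |     SET status = 'RUNNING'
          |     FROM next_task
          |     WHERE tasks.id = next_task.id
          |     RETURNING tasks.id;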
        
           | jperras wrote:
           | Ah, I see - the default transactional isolation level is what
           | I wasn't accounting for.
           | 
           | Thanks for the explanation! Very much appreciated.
        
       | plandis wrote:
       | At a previous job we did something similar but ended up having
       | workers first poll another table to determine which tenant to
       | query against. We called these items tokens and they represented
       | a finite amount of dedicated thread time for processing a
        | specific tenant's queue.
       | 
        | In practice, a worker thread would first query the token table
        | to find which tenant to process eligible tasks from, then
        | update the token to take a timed lock; for the duration of that
        | lock it would solely process eligible tasks from that tenant
        | (see the sketch after the list below).
       | 
       | This has some nice properties:
       | 
        | 1. You can scale different tenants using different numbers of
        | tokens, which means different but controlled amounts of thread
        | time.
       | 
        | 2. It allows for locality on your worker thread. Within a
        | specific tenant the processing was usually similar, so any
        | shared resources could be cached and reused after polling for
        | additional eligible tasks from the tenant's queue.
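        | 
        | Roughly, the claim query looked something like this (a sketch
        | from memory - table and column names are approximate):
        | 
        |     -- grab an unclaimed or expired token and take a timed
        |     -- lock on it ($1 = this worker's id)
        |     UPDATE tokens
        |     SET locked_by = $1,
        |         locked_until = now() + interval '30 seconds'
        |     WHERE id = (
        |         SELECT id
        |         FROM tokens
        |         WHERE locked_until IS NULL OR locked_until < now()
        |         LIMIT 1
        |         FOR UPDATE SKIP LOCKED
        |     )
        |     RETURNING tenant_id;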
        
       | ndriscoll wrote:
       | You've got something wonky going on with that query plan for the
        | 2nd partition by attempt. In particular, the seq scan on tasks to
       | do the `(tasks.id = eligible_tasks.id)` hash join seems odd. The
       | filter on queued status in `CTE eligible_tasks` (and not in the
       | last join) also seems weird. Is that plan for the same query in
       | the article?
       | 
       | If you add an index on `group_key, id WHERE status = 'queued'`
       | and remove the 2nd `WHERE tasks."status" = 'QUEUED'` (I believe
       | that's redundant?), you might get a better plan. You'd want
       | something to make sure you're not preferring one group_key as
       | well.
       | 
        | It's also useful in practice to record something like a
        | worker_id and a timestamp rather than just setting status to
        | RUNNING, in case a worker gets stuck or dies and you need to
        | unclaim the work.
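        | 
        | Concretely, something like this (assuming the tasks table from
        | the article, plus the two extra columns):
        | 
        |     -- partial index covering the hot dequeue path
        |     CREATE INDEX tasks_queued_idx
        |         ON tasks (group_key, id)
        |         WHERE status = 'QUEUED';
        | 
        |     -- track who claimed a task and when
        |     ALTER TABLE tasks
        |         ADD COLUMN worker_id  text,
        |         ADD COLUMN claimed_at timestamptz;
        | 
        |     -- unclaim work from a stuck or dead worker
        |     UPDATE tasks
        |     SET status = 'QUEUED', worker_id = NULL, claimed_at = NULL
        |     WHERE status = 'RUNNING'
        |       AND claimed_at < now() - interval '5 minutes';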
        
       ___________________________________________________________________
       (page generated 2024-04-18 23:00 UTC)