[HN Gopher] Building a fair multi-tenant queuing system
___________________________________________________________________
Building a fair multi-tenant queuing system
Author : tonyhb
Score : 97 points
Date : 2024-01-22 17:59 UTC (5 hours ago)
(HTM) web link (www.inngest.com)
(TXT) w3m dump (www.inngest.com)
| quantum_bits wrote:
| This is cool but pretty light on the details. Judging by how it
| reads, they create queues for each function, then create queues
| of queues for workers. It sounds similar to the QuiCK paper from
| Apple: https://www.foundationdb.org/files/QuiCK.pdf
| jawns wrote:
| This problem can be extended to systems like project management,
| where your workers are not computers but people.
|
| For instance: Suppose you have a team of six people, and you're
| being asked to tackle two projects for two clients, each of equal
| value to the business. One project will take two weeks to
| complete; another will take four months to complete.
|
| How do you determine a "fair" way to divide up those projects? Do
| you do one before the other? Do you divide up your team's
| capacity equally and do them in parallel?
|
| This is the same sort of queuing problem described in the post.
| So a solution to the computer queuing problem might have
| interesting implications for the broader problem.
| fabian2k wrote:
| There are plenty of articles about writing simple job queues in
| Postgres using SKIP LOCKED, and I personally like that version
| quite a bit for cases where you don't want to add another
| separate component just for the queue. But I haven't seen anyone
| addressing fairness in such a queue, e.g. for the multi-tenant
| case.
|
| Is there a simple way to get a very rough kind of fairness
| between tenants in this case? Doesn't have to be actually fair,
| just fair enough that no tenant gets fully stuck behind a large
| block of jobs from another tenant.
|
| The idea behind SKIP LOCKED is that you just get the first free
| job (potentially ordered by e.g. a priority column). Doing that
| in a way that switches between tenants doesn't seem
| straightforward. You could execute the query once per tenant, but
| that would be rather inefficient for many tenants. Is there a
| reasonably straightforward way to achieve this?
| dimitropoulos wrote:
  | SKIP LOCKED seems to be the way that new queueing packages are
  | going, though they don't solve fairness or multi-tenancy.
|
| It's nice to keep things simple, but how does that work at
| scale? If I have 1,000 users I don't think that these standard
| queueing systems can prevent one tenant from impacting another.
| It doesn't look like that's possible in SQS, either.
|
| On the other hand, is that even necessary? At what scale is
| this important?
| fabian2k wrote:
| I'm thinking about relatively simple systems and not "real"
| fairness. The case where I've seen this being a bit
| problematic is with such a queue in an application where jobs
| are sometimes inserted in large bursts. It doesn't actually
| matter much if they are truly fair, but in the cases where a
| large number of jobs is inserted at once the result is that
| for some tenants the queue doesn't seem to move at all.
|
| It's not a huge problem in that case because the regular
| priority system still puts the urgent jobs at the front of
| the queue. But it does sometimes cause confusion for someone
| looking at this from a single-tenant perspective because the
| queue looks stuck. There are other ways to fix the confusion,
| but it would be nice to have a very rudimentary fairness in
| this simple queue that would ensure that the queue iterates a
| bit over the tenants.
| vosper wrote:
| > On the other hand, is that even necessary? At what scale is
| this important?
|
| I can see it being helpful in front of a shared resource that
| doesn't have any way to control priority or fairness. My use
| case would be Elasticsearch. I don't want to block any
| searches from running if there's capacity, but also I don't
| want misbehaving clients to monopolise resources, and I want
| to make sure that all customers/users get to run their
| searches eventually, though I may prioritise some
| customers/users over others. And I may need my own internal
| jobs to run at a lower priority than a user search, but still
| within a guaranteed window (to avoid timeouts on the client
| side).
|
| Honestly for my use case this could be useful with just a
| handful of users - less than 10, even.
| whartung wrote:
| Seems to me a solution is to have a "generic" queue handler
| that will take any message, and then have tenant specific
| queue handlers.
|
| Now, mind, I'm not thinking of a dedicated handler for each
| tenant, rather one that can round robin, or rotate through
| the tenants to pull items specific to that tenant off the
| queue. This can be the "fairness" arbiter.
|
| With the Postgres SKIP LOCKED technique, you can be selective
| as to which messages you pull (you're not limited to the head
| of the queue, but can dig deep if you want).
|
| So, you have one general purpose "head of the line" handler,
| and then one (or several) that acts like the fellow at the
| Post Office "Anyone here just picking up mail?", and pull
| them out of the queue.
|
| Of course the generic version can just switch modes (every
| 1m, or 100 messages, go into "fair mode" for a short time),
| vs having different handlers.
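The mode-switching handler described above can be sketched in a few lines of Python. This is a toy in-memory model (a deque of (tenant, job) pairs, with `fair_every` as a made-up knob), not the Postgres implementation:

```python
from collections import deque

def run_consumer(jobs, fair_every=3):
    """Drain a FIFO of (tenant, job) pairs, but after every
    `fair_every` head-of-line pops, do one "fair mode" pass that
    digs into the queue for a tenant not yet served (the Post
    Office "anyone just picking up mail?" pass)."""
    queue = deque(jobs)
    served = []        # order in which tenants were served
    since_fair = 0
    while queue:
        if since_fair >= fair_every:
            since_fair = 0
            seen = set(served)
            # Prefer a job from an unserved tenant; fall back to head.
            pick = next((j for j in queue if j[0] not in seen), queue[0])
            queue.remove(pick)
        else:
            pick = queue.popleft()
            since_fair += 1
        served.append(pick[0])
    return served

# Tenant B floods the queue ahead of C and D.
jobs = [("B", i) for i in range(5)] + [("C", 0), ("D", 0)]
served = run_consumer(jobs)
```

With this arrival order, C is served fourth instead of sixth, because the fair-mode pass pulls it past B's backlog.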
| foofie wrote:
| > On the other hand, is that even necessary? At what scale is
| this important?
|
  | That's the critical question. To that I'd add that there may
  | be a myriad of articles on how someone used postgres to build
  | a message queue instead of using a dedicated message broker,
  | but there isn't a single article showing any form of
  | benchmarking, performance tests, or load test results, or,
  | more importantly, reporting the correlation between load and
  | performance degradation, eventual overload, and the resulting
  | failure modes. That information is of critical importance for
  | designing systems and picking the tool that is suited for the
  | job.
| xyzzy_plugh wrote:
| There is probably a better way (I no longer use Postgres for
| this) but I solved this in the past by evaluating priority
| after popping an item off the queue. If the item was low
| priority and there was higher priority work to do, or if a
| customer was over their quota, I would reschedule the work
| item.
|
| In that scenario I employed an earliest-deadline-first
| scheduler table so rescheduling was just re-inserting or
| updating with a new deadline.
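A rough sketch of this pop-then-reevaluate approach, with a heap standing in for the earliest-deadline-first scheduler table (`quota` and `penalty` are illustrative knobs, not anything from the comment):

```python
import heapq

def edf_run(jobs, quota=2, penalty=100):
    """Earliest-deadline-first with a post-pop quota check: an
    over-quota tenant's job is rescheduled by re-inserting it with
    a later deadline rather than run immediately."""
    heap = [(deadline, tenant, False) for tenant, deadline in jobs]
    heapq.heapify(heap)
    runs = {}
    executed = []
    while heap:
        deadline, tenant, rescheduled = heapq.heappop(heap)
        if runs.get(tenant, 0) >= quota and not rescheduled:
            # Over quota: reschedule instead of running now.
            heapq.heappush(heap, (deadline + penalty, tenant, True))
            continue
        runs[tenant] = runs.get(tenant, 0) + 1
        executed.append(tenant)
    return executed

# A holds the three earliest deadlines; with quota=2 its third job
# is pushed behind B's.
order = edf_run([("A", 1), ("A", 2), ("A", 3), ("B", 4)])
```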
| pphysch wrote:
| You have the full power of PostgreSQL at your disposal, so
| there are many ways you can effectively tackle this issue.
|
  | - Decrement the priority of all of a tenant's jobs by 1 each
  | time one of their jobs is executed, or increment the other
  | tenants' priorities (more ops but better behavior)
|
| - Maintain a separate tenant priority table and join it + use
| for ordering when fetching the next job
|
| And so on
| fabian2k wrote:
| I can think of plenty of inefficient ways to do this. The
| nice thing about the SKIP LOCKED queue is that it is very
| simple and pretty fast. Postgres just has to use an index to
| look at the jobs in some defined order and take the first one
| that isn't locked.
|
| The first option here would create an enormous amount of
| writes for each job fetched and likely slow it down to a
| crawl if enough jobs are in the queue.
| pphysch wrote:
| Many scheduling systems have a "time in queue increases
| priority" behavior; it is not an exotic proposition and
  | could be implemented efficiently in PostgreSQL.
|
| Having too many jobs in queue is a problem on its own that
| should be addressed. Each tenant should be rate-limited or
| have a reasonable cap on number of waiting jobs.
| karldcampbell wrote:
  | As long as you don't need a strict priority and a worker can
  | just grab the next waiting job:
  |
  |     SELECT * FROM jobs ORDER BY random()
  |     LIMIT 1 FOR UPDATE SKIP LOCKED;
  |
  | should do the trick
| gnud wrote:
| This will give more resources to tenants that schedule more
| jobs.
|
| If tenant A schedules 99 jobs and tenant B schedules 1 job,
| a "fair" algorithm would pick B's job either first or
| second, RANDOM() will not.
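A quick back-of-envelope check of this point, both exactly and by simulation:

```python
import random
from fractions import Fraction

# Tenant A queues 99 jobs, tenant B queues 1, as in the example.
jobs = ["A"] * 99 + ["B"]

# Under ORDER BY RANDOM(), B's job is equally likely to land in any
# of the 100 positions, so P(B served first or second) is just 2%,
# whereas a fair scheduler would serve B first or second for sure.
p_b_early = Fraction(2, len(jobs))

# Monte Carlo check of the same quantity: draw the first two served
# jobs uniformly without replacement and count how often B appears.
rng = random.Random(42)
trials = 20_000
hits = sum("B" in rng.sample(jobs, 2) for _ in range(trials))
estimate = hits / trials
```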
| paulddraper wrote:
| The solution is to maintain two queues.
|
  | One queue is your normal queue.
|
| A second queue that is your tenant queue (one record for each
| tenant that has queued items).
|
| You select from your tenant queue, then select from your normal
| queue. All the while keeping your tenant invariant (one record
| for each tenant that has queued items).
|
| ---
|
| It's certainly more complex but no one said that fair was easy.
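A minimal in-memory sketch of this two-queue scheme, with a dict and a deque standing in for the two database queues. The invariant is kept in `enqueue`/`dequeue`: one rotation entry per tenant that has pending jobs:

```python
from collections import deque

class TwoLevelQueue:
    """Two queues: a per-tenant FIFO of jobs, plus a tenant queue
    holding one entry per tenant with queued items."""

    def __init__(self):
        self.tenant_queue = deque()  # rotation of tenants with jobs
        self.jobs = {}               # tenant -> FIFO of jobs

    def enqueue(self, tenant, job):
        if tenant not in self.jobs or not self.jobs[tenant]:
            # Tenant had no pending work: add it to the rotation.
            self.jobs[tenant] = self.jobs.get(tenant, deque())
            self.tenant_queue.append(tenant)
        self.jobs[tenant].append(job)

    def dequeue(self):
        if not self.tenant_queue:
            return None
        tenant = self.tenant_queue.popleft()
        job = self.jobs[tenant].popleft()
        if self.jobs[tenant]:
            # Tenant still has work: back to the end of the rotation.
            self.tenant_queue.append(tenant)
        return tenant, job

q = TwoLevelQueue()
for tenant, job in [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)]:
    q.enqueue(tenant, job)
order = [q.dequeue()[0] for _ in range(5)]
```

Service rotates A, B, C, A, B: no tenant waits behind another's whole backlog.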
| 10000truths wrote:
| This depends largely on your requirements.
|
| * How well can you estimate job cost? How should overestimates
| and/or underestimates be treated?
|
| * Where, when and how often does resource contention occur
| between jobs? Is it from a single resource, or from multiple
| different resources?
|
| * What constraints or behaviors does the contended resource(s)
| possess (e.g. rate limits, pipelining, etc.) that would
| materially impact how you determine which job to run next?
|
| * What guarantees do you wish to make (e.g. execution latency,
| resource availability, etc.)?
|
| * How is resource usage quota determined? Are quotas updated
| frequently? Are there any circumstances under which you would
| like to allow users to exceed their quota (e.g. temporary burst
| workloads, idle periods of time)?
| nurettin wrote:
| First, query total time spent today for each tenant. Then, pick
| a job from the most unfortunate tenant's queue.
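A toy version of this least-attained-service rule, with per-job costs standing in for measured runtime (names and numbers here are illustrative):

```python
def pick_next(queues, time_spent):
    """Serve the 'most unfortunate' tenant: among tenants with
    queued work, pick the one with the least total runtime so far,
    run its next job, and charge the cost to its account."""
    candidates = [t for t, q in queues.items() if q]
    if not candidates:
        return None
    tenant = min(candidates, key=lambda t: time_spent.get(t, 0.0))
    job_cost = queues[tenant].pop(0)
    time_spent[tenant] = time_spent.get(tenant, 0.0) + job_cost
    return tenant

queues = {"A": [5.0, 5.0], "B": [1.0], "C": [1.0]}
spent = {"A": 10.0}  # A already used 10s of worker time today
order = []
while any(queues.values()):
    order.append(pick_next(queues, spent))
```

B and C, having consumed nothing today, are served before A's remaining jobs even though A enqueued first.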
| smallnamespace wrote:
| > Is there a simple way to get a very rough kind of fairness
| between tenants in this case?
|
| Yes, since the SKIP LOCKED query is just a special sort of
| SELECT that skips over locked rows, you can use all the normal
| SQL tools such as ORDER BY to accomplish what you want.
|
  | One way is to make the top row fair by randomizing over the
  | tenant ID, for example:
  |
  |     SELECT item_id FROM queue
  |     ORDER BY md5(tenant_id || current_timestamp::TEXT), item_id
  |     LIMIT 1 FOR UPDATE SKIP LOCKED;
|
| Here we append the ID to the timestamp and then hash it, so
| each time the SELECT is run there is effectively a different
| ordering between tenant IDs, hence fairness.
|
| Note that this will eventually have performance implications
| for a deep queue since it forces a scan on the whole table.
| Adding an index on tenant_id may mitigate that, but as always
| do your own profiling.
| btown wrote:
| Out of curiosity, is Postgres usually able to use a b-tree
| index's representation of the columnar data for its (set of)
| columns to calculate a derived value for those column(s),
| then scan through that in-memory derived data? Certainly not
| as fast as the index lookup itself, but avoids needing to
| have the entire table in memory, I would hope?
|
| The documentation under
| https://www.postgresql.org/docs/16/indexes-types.html implies
| that it can be used to access the sorted values - but does
| that feed through the query planner?
| ctvo wrote:
| Is "step function" a term of art I'm just ignorant of in computer
| science? AWS Step Functions, sure. How it's used here? No idea.
|
| > It enables developers to write declarative step functions...
|
| > With thousands of customers running thousands of step functions
| containing parallel steps....
|
| > If you're familiar with step functions, you probably already
| see what's happening here...
| hot_gril wrote:
| "Step function" is sort of a standard term, yeah. Each step is
| some piece of code that runs at some point in the near future
| and persists its result for the next step to ingest. It could
| randomly fail (crashes, preemption, etc) before persisting, in
| which case the step is retried soon, so it better not have
| consequential side effects. Most likely you're interacting with
| external systems (like credit card processors) if you're using
| step functions in the first place, so idempotency is key.
|
| I've done basically this sorta manually at a smaller scale.
| Each step was a Postgres table, like "user order initiated,"
| "order confirmed," "order fulfilled," with automated steps
| (sometimes cronjobs) in between. Sometimes I wanted to expose
| that status to users. I could see using Inngest at a larger
| scale.
| rmbyrro wrote:
  | I don't think that's exactly the meaning in this article in
  | particular, but I've seen people use "step function" as a
  | synonym for a step in a finite-state machine.
|
| SWE technical terms are used very loosely, especially on blog
| posts like this.
| ykhli wrote:
| Great job abstracting away so much complexity!
|
| > each step is a code-level transaction backed by its own job in
| the queue. If the step fails, it retries automatically. Any data
| returned from the step is automatically captured into the
| function run's state and injected on each step call.
|
| This is one thing I've seen so many companies spending tons of
| time implementing themselves, and happens _everywhere_ -- no code
| apps, finance software, hospitals, anything that deals with
| ordering system...the list goes on.
|
| Glad I no longer need to write this from scratch!
| hot_gril wrote:
| A "nuclear option" I've seen in the wild is Queue-It. Puts your
| entire flow behind a queue that's user-visible, with a waiting
| list page. Then presumably your own site's logic doesn't do any
| queuing.
| danfarrelly wrote:
| How do they handle multi-tenancy for their users? Any post on
| that?
| hot_gril wrote:
| I don't know, but the requirements are different. They manage
| a single huge queue for your website (or part of it), and
| your own backend is expected to be able to handle some
| trivially small number of concurrent users. The most common
| use case is highly contentious online ticket sales, where
| users might even wait an hour in line. Latency isn't much of
| a concern.
| andrewstuart wrote:
| >> with Postgres and SKIP LOCKED always cropping up
|
  | Note that SKIP LOCKED is also available in Microsoft SQL Server
  | and MySQL, and I believe possibly also in Oracle and IBM Db2.
|
  | As to the article, I suspect a lot of its details will be
  | challenged here.
|
| There's talk in here about "creating queues" to handle the
| fairness problem. I might be misunderstanding what the hard bit
| is, but with a database backed queue you don't have to create a
| queue - queues are virtual depending on what messages are in
| there and what other information is available such as username or
  | tag. If you want to be "fair" around username, then round robin
  | on that; for fairness around some other bit of data, tag the
  | message when it goes into the queue and round robin on that
  | when processing.
|
| What is the hard bit here, what am I missing?
| maerF0x0 wrote:
| btw "fair" is an entirely subjective and unproductive way to
| describe a system. There's simply properties and priorities. Many
| of us have learned fair to mean a great variance of things. From
| per capita equal load, to you get what you pay for, to anything
| goes just don't hurt anyone. In families we might see this as
| each child gets the same number of cookies. In capitalism fair is
| what you negotiate and agree to. The point isn't what is or isn't
| fair, but that there is no universal "fair".
|
| Should a user be able to jump the queue? What if they're your
| number one driver of revenue? Or they have a time-sensitive
| function w/ a deadline? Or if they've produced no load on the
| system in the past $TIME_UNIT. All of these are not fair, just
| properties to be designed into a product.
| 8n4vidtmkvmk wrote:
| Yeah I didn't understand what the author even meant in the
| article. How is first come first serve not fair?
| tonyhb wrote:
| Fair means that we should be giving users roughly the same
| capacity, or at least roughly the same chances to be worked
| on.
|
| In the case of this queue, where each letter represents a
| user:
|
| [A, Bx10000, C, D, E]
|
| We're being unfair to C, D, and E. Realistically, while
| working on B's jobs we should have some mechanism to know
| that latency for C, D, and E are increasing and that we can
| start assigning them to workers.
|
| Without that, latency for any step function you run is highly
| variable and impacted by other users. With multi-tenant
| fairness, latency is purely driven by how well we auto-scale
| and your own concurrency capacity.
|
| The post here is about multi-tenant fairness in particular,
| so the intent is that fairness is viewed from a multi-tenant
  | / PaaS lens.
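The difference can be made concrete with a small simulation of that exact queue, comparing plain FIFO against a simple tenant round-robin. This is an illustration of the fairness gap, not Inngest's actual algorithm:

```python
from collections import OrderedDict, deque

# The queue from the comment: [A, Bx10000, C, D, E].
arrivals = ["A"] + ["B"] * 10_000 + ["C", "D", "E"]

# FIFO: C is served only after all of B's backlog.
fifo_position = arrivals.index("C") + 1

# Tenant round-robin: rotate over tenants that have queued work.
per_tenant = OrderedDict()
for t in arrivals:
    per_tenant.setdefault(t, deque()).append(t)
rr_order = []
while per_tenant:
    for t in list(per_tenant):          # one pass = one job per tenant
        rr_order.append(per_tenant[t].popleft())
        if not per_tenant[t]:
            del per_tenant[t]
rr_position = rr_order.index("C") + 1
```

Under FIFO, C's job runs 10,002nd; under round-robin it runs 3rd, so its latency no longer depends on B's backlog.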
| maerF0x0 wrote:
| > or at least roughly the same chances to be worked on.
|
| Says who? What if B pays 10000x what ACDE do combined? What
| if B pays for that capacity?
|
| What if B was silent for the 10000 seconds prior? And is
| now bursting their load just like ACDE do at other times?
|
| There is no objective fair.
|
| Perhaps the only objectively unfair I could think of is if
| you paid for a service and it's not rendered, at all, ever.
| That's probably unfair. But everything else is a grey
| spectrum in between and depends on the agreed upon
| properties.
| sjansen wrote:
| The author defined fair in a pretty industry standard way:
|
| > One user should not be able to block another's work.
|
  | A multi-tenant architecture implies overcommitting resources
  | to achieve better economics. Once upon a time, decades ago,
  | when computers were so expensive that companies paid for batch
  | processing instead of owning their own, FIFO was acceptable.
| (And for a tiny part of the market it still is.) But ever
| since the rise of time-sharing operating systems, users have
| come to expect short delays for small tasks. If you let one
| customer monopolize your capacity because they submitted a
| mountain of work before a second customer submits a handful,
| you probably won't have the second customer much longer.
| 0xbadcafebee wrote:
| [delayed]
| Nilithus wrote:
| I found Amazon's Builders Library to have a very insightful list
| of how to handle building multi tenant queue systems
| https://aws.amazon.com/builders-library/avoiding-insurmounta...
|
| I think one of the big tools is Shuffle Sharding. The article
| talks about standard sharding by itself as not being enough to
  | provide robustness in multi-tenant queues. But shuffle sharding,
  | i.e. assigning users to virtual groups of underlying queues and
  | enqueueing to the queue with the smallest size, gets you pretty
| far. It can limit throughput for individual users but
| implementing some simple work stealing logic on the consumer
| helps make sure you keep your throughput up.
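A minimal sketch of shuffle sharding as described: hash each user to a fixed "virtual group" of underlying queues, then enqueue to the shortest queue in that group. The queue count and shard size are made-up numbers:

```python
import hashlib
import random

NUM_QUEUES = 8
SHARD_SIZE = 2  # each user's virtual group spans 2 underlying queues

def shard_for(user):
    """Deterministically map a user to SHARD_SIZE of the underlying
    queues by seeding a PRNG with a hash of the user id."""
    seed = int.from_bytes(hashlib.sha256(user.encode()).digest()[:8], "big")
    return random.Random(seed).sample(range(NUM_QUEUES), SHARD_SIZE)

def enqueue(queues, user, job):
    # Within the user's shard, pick the queue with the smallest size.
    target = min(shard_for(user), key=lambda q: len(queues[q]))
    queues[target].append((user, job))
    return target

queues = [[] for _ in range(NUM_QUEUES)]
for i in range(6):
    enqueue(queues, "noisy", i)      # a flooding user fills only its shard
enqueue(queues, "quiet", "job")
noisy_shard = set(shard_for("noisy"))
```

The flooding user's jobs are confined to its two queues; only the small fraction of users whose shards fully overlap with it are affected.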
| extractionmech wrote:
| That extra layer of indirection, virtual queues, is key.
| jll29 wrote:
| Fair queuing in multi-tenant scenarios is also what describes
| what is going on in some financial trading systems: these queuing
| mechanisms are often opened up to regulators to demonstrate the
| fairness.
| brightball wrote:
| This honestly sounds like a job for the BEAM (Erlang/Elixir).
|
| The preemptive scheduler would cover almost every need mentioned
| in the article.
| darwin67 wrote:
| Which part of BEAM are you talking about? I know some cases can
| be solved by it already, but not almost every need.
|
| The fairness of the BEAM scheduler is not the same as multi-
  | tenant fairness. I'm aware of lcnt in Erlang, which helps with
  | contention, but that will take a hit in throughput like any
  | other lock.
|
| Unless I'm missing something?
| wilgertvelinga wrote:
| Sounds a lot like Azure Durable functions. Which we started using
| recently. Curious to know if inngest could also fit our use case.
| siliconc0w wrote:
| I think you want controls around:
|
| 0) Load - I'm too loaded right now, I'm not processing your
| message. Maybe another worker will. This might be enqueued back
| into the low latency queue.
|
  | 1) Rate Limit - You've sent too many messages in too short a
  | time; I'm dequeueing this into the high latency queue.
  |
  | 2) Cost Limit - You've sent too many _expensive_ messages; I'm
  | dequeueing this into the high latency queue.
  |
  | 3) Deadletter - Your messages are failing to be processed.
  | We're going to dequeue them into deadletter and maybe stop
  | processing them for now.
|
| So you have the normal low latency queue, the high latency
| backlog queue, and the deadletter queue. Ideally you have SLOs
  | around end-to-end queue times and monitoring/alerting to let you
  | and your senders know when their expectations may not be met
  | (e.g. 99% of messages are processed within n ms).
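These controls could be wired together roughly as below. The class, thresholds, and routing rules are all illustrative assumptions, not anything from the post:

```python
from collections import defaultdict, deque

class Dispatcher:
    """Routes messages between the three tiers described above:
    normal (low latency), backlog (high latency), and deadletter."""

    RATE_LIMIT = 5      # messages per sender per window   (control 1)
    COST_LIMIT = 100    # cost units per sender per window (control 2)
    MAX_FAILURES = 3    # failures before dead-lettering   (control 3)

    def __init__(self):
        self.normal = deque()
        self.backlog = deque()
        self.deadletter = deque()
        self.sent = defaultdict(int)
        self.cost = defaultdict(int)
        self.failures = defaultdict(int)

    def submit(self, sender, msg, cost=1):
        if self.failures[sender] >= self.MAX_FAILURES:
            self.deadletter.append((sender, msg))
        elif (self.sent[sender] >= self.RATE_LIMIT
              or self.cost[sender] + cost > self.COST_LIMIT):
            self.backlog.append((sender, msg))    # demote to backlog
        else:
            self.normal.append((sender, msg))
        self.sent[sender] += 1
        self.cost[sender] += cost

d = Dispatcher()
for i in range(7):
    d.submit("chatty", i)            # 6th and 7th exceed the rate limit
d.submit("big", "report", cost=150)  # exceeds the cost limit outright
```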
| 1oooqooq wrote:
| i thought RH had solved that. just write a systemd unit file with
| the cpu quotas. bam. done. /s?
| mbreese wrote:
| Another place to look for inspiration is the HPC world. Fairshare
| job queues have been active on multi tenant HPC clusters for
| decades. Each job typically has a priority value assigned to it
| based on the submission time, expected resources required, and
| how much of the cluster the account/user has used recently.
| Priorities can be continuously updated until the job is released
| for processing.
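A toy version of such a fairshare priority function. The factors and weights are illustrative, not any real scheduler's formula (e.g. not Slurm's multifactor plugin):

```python
def priority(wait_seconds, requested_nodes, recent_usage_fraction,
             w_age=1.0, w_size=0.5, w_share=2.0):
    """Multi-factor job priority in the spirit of HPC fairshare
    schedulers: older jobs rise, huge requests sink, and heavy
    recent cluster usage lowers a tenant's priority."""
    age = min(wait_seconds / 3600.0, 1.0)   # saturates after an hour
    size = 1.0 / requested_nodes            # smaller requests rank higher
    share = 1.0 - recent_usage_fraction     # light recent users rank higher
    return w_age * age + w_size * size + w_share * share

# A light user's fresh job outranks a heavy user's identical job,
# but the heavy user's job catches up as it waits in the queue.
light = priority(0, 1, 0.05)
heavy_fresh = priority(0, 1, 0.60)
heavy_aged = priority(3600, 1, 0.60)
```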
___________________________________________________________________
(page generated 2024-01-22 23:00 UTC)