[HN Gopher] Queues don't fix overload (2014)
       ___________________________________________________________________
        
       Queues don't fix overload (2014)
        
       Author : akbarnama
       Score  : 166 points
       Date   : 2024-01-18 13:33 UTC (9 hours ago)
        
 (HTM) web link (ferd.ca)
 (TXT) w3m dump (ferd.ca)
        
       | throwawaaarrgh wrote:
       | (2014)
       | 
       | Fwiw, sinks have overflows: https://www.thespruce.com/the-
       | purpose-of-a-sink-overflow-271...
       | 
       | The computer equivalent is probably load shedding
        
         | cnity wrote:
         | The sink overflow goes to the same drain, so it doesn't really
         | work in this analogy! (Or it works very well, depending on your
         | perspective).
        
           | weego wrote:
           | The overflow goes to the same drain, but further down, past a
           | bottleneck that's designed to be there for a logical reason.
           | It's a good example of real world flow control considerations
           | vs paper theory.
        
             | tonyarkles wrote:
             | Continuing the analogy from the article though, the sink
              | overflow _doesn't_ go past the red arrow in their
             | diagrams. It's there to solve the specific problem of "the
             | sink drain is plugged, either because you left the water
             | running with the plug in or because there's a ball of hair
             | immediately below the drain".
        
       | jusssi wrote:
       | Depends.
       | 
        | An overflowing queue that drops jobs from the front while the
        | backend chugs along as fast as it can is in many cases a better
        | outcome than the backend overloading and everything grinding to
        | a halt.
        | 
        | Compare it to a physical service desk. The one with a queue
        | serves one person at a time, and people arriving will try again
        | another day if the queue is too long. The one without a queue
        | has people fistfighting over who gets to go first, and no one
        | ends up getting service.
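        | 
        | (As a sketch, Python's collections.deque gives you exactly that
        | drop-from-front policy; the size is illustrative:)
        | 
        |   from collections import deque
        | 
        |   # When full, appending on the right silently discards the
        |   # oldest job on the left: drop-from-front load shedding.
        |   jobs = deque(maxlen=1000)
        | 
        |   jobs.append("new job")  # producers never block
        |   # The backend drains with jobs.popleft() as fast as it can;
        |   # arrivals beyond capacity push out the stalest work.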
        
         | sparsely wrote:
          | That's load shedding.
        
         | teeray wrote:
         | So it allows the dev team to blame the queue. Instead of "we
         | decided to drop requests," it's "oh look, the queue is dropping
         | requests."
        
       | User23 wrote:
       | So far as I know there is no theoretical alternative to load
       | shedding or increasing handling capacity if your average request
       | arrival rate is greater than your average request handling rate.
       | At least, not if you want to handle every accepted request using
       | a finite queue[1]. It would appear that with an unbounded queue
       | every request will eventually be handled, but with an unbounded
       | latency guarantee. Which appears equivalent to "grinds to a halt"
       | for sufficient n.
       | 
       | However, that may very well change with fair queueing. If
       | unbounded queue storage is available and you can prioritize
        | requests in a timely fashion, then instead of shedding excess
        | requests they can go into unbounded queue limbo while still
        | meeting the SLA for priority requests. I imagine there is
       | developed queueing theory for that.
       | 
       | [1] I have been taught here though that classical queuing theory
       | isn't adequate for all cases. I think it is here, but I will
       | gratefully accept correction if I'm wrong.
        
         | gpderetta wrote:
         | > It would appear that with an unbounded queue every request
         | will eventually be handled
         | 
          | Of course there is no such thing as an unbounded queue.
        
         | thaumasiotes wrote:
         | > It would appear that with an unbounded queue every request
         | will eventually be handled, but with an unbounded latency
         | guarantee.
         | 
         | This might or might not be true. Depending on your strategy for
         | retrieving tasks from the queue, it is not necessarily the case
         | that every request will eventually be handled. FIFO will give
         | you that guarantee; a priority system won't.
        
         | Joker_vD wrote:
         | > If... you can prioritize requests in a timely fashion
         | 
         | Just want to point out that this still is processing -- and so
         | your sorting-proxy that would either route high-priority
         | requests forward for actual processing or put low-priority
         | requests into a queue with unbounded storage can still be
          | overloaded with incoming requests. And, of course, that queue
          | with unbounded storage needs to be able to put all of those
          | incoming requests into storage in a timely manner.
        
           | User23 wrote:
           | Sure, and at some level everything is processing. Even
           | putting an object at the end of a FIFO queue has overhead.
           | 
           | I would be very hesitant to introduce a sorting proxy unless
           | the need for it had been thoroughly demonstrated though. If
           | your prioritization isn't lightweight enough to do in the
           | server it's probably too heavy. Also you might accomplish it
            | at the network level. Regardless, your point stands that the
           | work has to be done somewhere, but that doesn't change the
           | fact that fair queuing is a good idea for the majority of
           | production services. It's absolutely great when a misbehaving
           | client only DoSes themselves and doesn't affect anyone else.
        
             | Joker_vD wrote:
             | > I would be very hesitant to introduce a sorting proxy
             | 
             | > fair queuing is a good idea for the majority of
             | production services
             | 
             | Huh? As you've said, fair queuing requires one to quickly
             | look at incoming messages, decide whether they are
             | high/low-priority and route accordingly (to the processing
             | server/to the queue). This is a sorting proxy, and so it
             | seems that having it in some form is required to implement
             | fair queuing. Right?
             | 
             | > It's absolutely great when a misbehaving client only
             | DoSes themselves and doesn't affect anyone else.
             | 
              | Yes, of course. But that's precisely the problem with
              | DoS/DDoS attacks: they are, inherently, cases of clients
              | refusing to obey the back-pressure demands from upstream.
              | 
              | Over a network, a misbehaving client also affects their
              | ISP at the very least, by the way.
        
         | LanceH wrote:
         | Their alternative:
         | 
         | "To make stuff usable, a proper idempotent API with end-to-end
         | principles in mind will make it so these instances of back-
         | pressure and load shedding should rarely be a problem for your
         | callers, because they can safely retry requests and know if
         | they worked."
         | 
         | Is to ignore requests and make the caller implement their own
         | queue, apparently.
        
           | xyzzy_plugh wrote:
           | It's turtles all the way down. At some point, it's _not
           | really_ a queue.
           | 
            | For example, imagine a computer program to fetch the contents
           | of a list of URLs. Perhaps it makes some number of parallel
           | requests.
           | 
            | If the URLs are hard coded, e.g.:
            | 
            |   fetch("https://news.ycombinator.com/")
            |   fetch("https://bbc.co.uk/")
            |   fetch("https://archive.org/")
            |   ...
           | 
           | When the maximum concurrency is reached (which could simply
           | be 1) and all requests are blocking, then back-pressure is
           | applied. The callers wait before submitting more work.
           | 
            | In this example the queue is the code, and the program
            | counter acts as the head of the queue. I wouldn't normally
            | refer to that as a queue.
           | 
            | Perhaps instead it reads a list of entries from a file. The
            | queue's state can be represented by a file offset. And so
            | on.
           | 
            | Computers by definition carry out sequences of operations.
            | A sequence can be viewed as a queue. Everything is a queue.
           | 
           | So yes, ignore requests and make the caller implement their
           | own queue _is a completely correct take_ in a sense, but I
           | don't find it productive.
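            | 
            | (A minimal sketch of that implicit queue, using Python's
            | asyncio; fetch here is a placeholder, not a real HTTP
            | client:)
            | 
            |   import asyncio
            | 
            |   MAX_CONCURRENCY = 4  # illustrative bound
            |   sem = asyncio.Semaphore(MAX_CONCURRENCY)
            | 
            |   async def fetch(url):  # stand-in for a real HTTP call
            |       await asyncio.sleep(0.1)
            |       return url
            | 
            |   async def bounded_fetch(url):
            |       # Once MAX_CONCURRENCY tasks are in flight, callers
            |       # suspend here: the "queue" is just the set of
            |       # waiting coroutines.
            |       async with sem:
            |           return await fetch(url)
            | 
            |   async def main(urls):
            |       return await asyncio.gather(
            |           *(bounded_fetch(u) for u in urls))
            | 
            |   asyncio.run(main(["https://news.ycombinator.com/"] * 10))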
        
         | Sohcahtoa82 wrote:
         | > However, that may very well change with fair queueing.
         | 
         | It doesn't matter what kind of queueing system you put in
         | place. No matter how fancy it is, and even if the queue adds
         | negligible overhead, if your average request arrival rate is
         | higher than the average handling rate, your queue _will_ grow
         | unbounded. Then, for some requests, the system will appear to
         | grind to a halt.
         | 
         | There's no way around it. A priority queue just means that
         | high-priority requests get processed faster. Low-priority
         | requests might end up never getting handled.
         | 
         | Here's a concrete example. I've got a web service that requires
         | a login. I'm using bcrypt to hash passwords with a sufficiently
         | large work factor that I can only hash 5 passwords per second
         | and have a queue for the password hashing. If 6 people are
         | trying to login every second, then after 1 minute, I will have
         | received 360 login requests. 300 of them will have completed,
         | but the queue has grown to 60. Anybody trying to log in at that
         | point will have to wait a minimum of 12 seconds to get their
         | login processed.
         | 
         | But what if I also have users trying to sign up? I have to hash
         | their password, so those get added to the queue as well. If I
         | prioritize logins over signups, then under my scenario, _a
         | signup will never complete_.
         | 
         | There's really nothing that can be done. An unbounded queue
         | means unbounded latency. Eventually, a login request would
         | _have_ to get rejected just because the queue is full. Of
          | course, in the real world, this creates frustration for the
          | user. They'll either try to log in again, or give up.
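          | 
          | (The arithmetic above as a toy loop, using the same assumed
          | rates; every second of overload adds one second's excess to
          | the backlog:)
          | 
          |   ARRIVALS_PER_SEC = 6  # login attempts
          |   HASHES_PER_SEC = 5    # bcrypt capacity
          | 
          |   backlog = 0
          |   for second in range(60):
          |       backlog += ARRIVALS_PER_SEC - HASHES_PER_SEC
          | 
          |   print(backlog)                   # 60 requests queued
          |   print(backlog / HASHES_PER_SEC)  # 12 s wait, and growing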
        
           | tanveergill wrote:
            | Aperture (https://github.com/fluxninja/aperture) takes a
            | slightly opinionated approach to queue size. Instead of
            | defining queue sizes, you put a timeout on each request: the
            | amount of time the request is willing to wait in the queue.
           | Then we run a weighted fair queuing algorithm that ensures
           | relative allocation across workloads, e.g. 90% capacity for
           | critical requests. But the capacity is allowed to burst when
           | there is low demand.
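            | 
            | (A rough sketch of that per-request timeout idea,
            | independent of Aperture's actual API; the names are made
            | up:)
            | 
            |   import time
            |   from collections import deque
            | 
            |   pending = deque()  # (enqueue_time, timeout_s, request)
            | 
            |   def take_next():
            |       # Serve the oldest request still willing to wait;
            |       # anything that timed out in the queue is shed.
            |       while pending:
            |           t0, timeout_s, request = pending.popleft()
            |           if time.monotonic() - t0 <= timeout_s:
            |               return request
            |       return None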
        
         | tanveergill wrote:
          | I am in the same school of thought as you: fair queuing is
          | the optimal solution while capacity is constrained. This is
         | the approach we took in Aperture, an open-source load
         | management system. Read more about our Scheduler which
         | implements a variant of WFQ (Weighted Fair Queuing) to ensure
         | desired capacity allocation across request workloads and SWFQ
         | (Stochastic Weighted Fair Queuing) to ensure fairness across
         | users within each request workload:
         | https://docs.fluxninja.com/concepts/scheduler
        
       | roenxi wrote:
        | The other thing to bear in mind about queues is that once they
        | start showing symptoms of something being wrong, collapse might
        | be just around the corner or it might not, depending on the
        | nature of the load.
        | 
        | When congestion spikes start showing, it is helpful to know
        | some queuing theory to estimate how close the situation is to
        | eating someone's weekend. Congestion collapses are an
        | interesting time: because most people don't know queueing
        | theory or how to reason using balance equations, it is possible
        | to misdiagnose the problem or waste a stressful few days trying
        | to work out a congestion situation by experiment.
        
         | domano wrote:
         | Hey, can you recommend something one might read to get up to
         | speed on queuing theory? I certainly am not aware of it, but
         | work with queues.
        
           | PaulHoule wrote:
           | Whenever I talk to operations research people about "how do I
           | learn X?" or "how do I calculate Y?" I usually get told to
           | write a Monte Carlo simulation despite there being a lot of
           | beautiful math involving stochastic processes, generating
           | functions and stuff like that. (Even if you are calculating
           | results in closed form it is still a slam dunk to have a
           | simulation to check the work except when you are dealing with
           | "exceptional event" distributions... That is, a Monte Carlo
           | simulation of a craps game will give you an accurate idea of
           | the odds in many N=10,000 samples, but simulating Powerball
           | takes more like N=1,000,000,000 samples.)
           | 
           | The single "uncommon sense" result you need to know about
           | queuing is
           | 
           | https://erikbern.com/2018/03/27/waiting-time-load-factor-
           | and...
           | 
            | that is, with random arrivals, a queue whose server runs at
            | slightly less than 100% utilization will grow stupendously
            | long. People look at a server running at less than 100%
            | utilization and often feel a moral disgust at the "waste",
            | but if you care about the experience of the customer and
            | the reliability of the system you don't let utilization get
            | above about 80% or so.
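            | 
            | (A Monte Carlo sketch of that result in plain Python; the
            | utilization values are illustrative. Expected waits are
            | roughly u/(1-u) mean service times: 1, 4, 9, 99.)
            | 
            |   import random
            | 
            |   def avg_wait(utilization, n=200_000, mean_service=1.0):
            |       # Single server, Poisson arrivals, exponential
            |       # service (M/M/1), via Lindley's recursion.
            |       t_arrive = t_free = total_wait = 0.0
            |       for _ in range(n):
            |           t_arrive += random.expovariate(
            |               utilization / mean_service)
            |           start = max(t_arrive, t_free)
            |           total_wait += start - t_arrive
            |           t_free = start + random.expovariate(
            |               1.0 / mean_service)
            |       return total_wait / n
            | 
            |   for u in (0.5, 0.8, 0.9, 0.99):
            |       print(u, round(avg_wait(u), 1))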
        
             | tonyarkles wrote:
             | I learned about this stuff in grad school. The course
             | wasn't mandatory for everyone but my supervisor made it
             | mandatory for me due to the nature of the research I was
             | doing: "Computer Systems and Performance Evaluation". It
             | was basically focused on queuing theory and state space
             | modelling.
             | 
             | Reading through this whole discussion thread really makes
             | me want to dig up my old notes and whip up a blog post with
             | a Jupyter notebook or something that people can use to
             | really dig into this and start to grok what's happening
             | because a lot of it really isn't that intuitive until
             | you've been steeped in it for a while.
        
               | PaulHoule wrote:
               | If I were you I'd consider using
               | 
               | https://simpy.readthedocs.io/en/latest/
               | 
               | inside a Jupyter notebook.
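                | 
                | (For instance, a minimal M/M/1 sketch in SimPy; the
                | rates are placeholders:)
                | 
                |   import random
                |   import simpy
                | 
                |   LAM, MU = 0.9, 1.0  # arrival and service rates
                |   waits = []
                | 
                |   def customer(env, server):
                |       arrived = env.now
                |       with server.request() as req:
                |           yield req  # time spent in the queue
                |           waits.append(env.now - arrived)
                |           yield env.timeout(random.expovariate(MU))
                | 
                |   def source(env, server):
                |       while True:
                |           yield env.timeout(random.expovariate(LAM))
                |           env.process(customer(env, server))
                | 
                |   env = simpy.Environment()
                |   server = simpy.Resource(env, capacity=1)
                |   env.process(source(env, server))
                |   env.run(until=100_000)
                |   # Mean wait ~ LAM / (MU * (MU - LAM)) = 9 here.
                |   print(sum(waits) / len(waits))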
        
               | tonyarkles wrote:
               | Hey that's awesome! Unfortunate that I'm going to get
               | even more confused now about whether I'm talking about
               | SimPy or SymPy but that's life :)
        
               | PaulHoule wrote:
                | Maybe use one for the closed form, and the other for
                | the simulation that confirms it!
        
           | jph wrote:
           | Queueing theory introduction:
           | https://github.com/joelparkerhenderson/queueing-theory
        
           | lanstin wrote:
           | There are a billion books on abstract or basic queueing
           | theory (worth reading) but a really good modern paper on
           | distributed software implications is
           | https://pure.psu.edu/en/publications/metastable-failures-
           | in-...
        
           | ibejoeb wrote:
           | Also, in this context, not a bad idea to revisit the broader
           | topic of scheduling.
           | 
           | https://csc-knu.github.io/sys-
           | prog/books/Andrew%20S.%20Tanen...
        
           | roenxi wrote:
           | I refer to _Fundamentals of Queueing Theory_ by Gross,
           | Shortle, Thompson  & Harris.
           | 
            | Although Wikipedia is enough. As far as insights go the
            | topic is relatively simple; it is just bad practice to be
            | re-deriving the first 100 pages of an intro-to-queueing
            | textbook in an emergency.
           | 
           | 80% of the time it is enough to assume the process is an
           | M/M/1 queue or consider how the queue would perform relative
           | to an M/M/1 queue. M/M/1 queues are the analog to fitting a
           | straight line, simple & technically incorrect. It is good to
           | move through that part of the day without thinking.
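            | 
            | (The M/M/1 back-of-envelope, for reference: with arrival
            | rate lambda, service rate mu and utilization rho =
            | lambda/mu, the mean queue length is rho/(1 - rho) and the
            | mean time in system is 1/(mu - lambda). Illustrative rates
            | in requests per second:)
            | 
            |   def mean_time_in_system(lam, mu):
            |       # M/M/1: W = 1/(mu - lam), valid only while lam < mu
            |       return 1.0 / (mu - lam)
            | 
            |   print(mean_time_in_system(80, 100))  # 0.05 s at 80%
            |   print(mean_time_in_system(99, 100))  # 1.00 s at 99%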
        
       | kasey_junk wrote:
       | This is a weird article because it points out that queues don't
       | solve overload but neither do load shedding or back pressure.
       | 
       | All 3 techniques are just different trade offs on what to do in
       | the face of overload. All 3 have negative ramifications for the
       | users of the system. Load shedding reduces availability, back
       | pressure increases complexity and queues increase latency.
       | 
       | In "critical" systems you need all 3. And all 3 can be
        | overloaded. Frankly, your load shedding or back pressure
        | system is probably implemented on a queue one layer down the
        | abstraction stack.
        
         | robertlagrant wrote:
         | > Frankly, your load shedding or back pressure system is
         | probably implemented on a queue one layer down the abstraction.
         | 
         | If you're building an API, you don't care how your clients cope
         | with your backpressure. You just want to avoid overloading your
         | services, and queues will indeed not help you with that past a
         | certain point, whereas backpressure probably will. Or at least
         | you can scale backpressure much more cheaply than you can scale
         | queues.
        
           | lanstin wrote:
           | Teach your clients to handle 429 with gradual back
           | off/retries and it works well. Even teach your web UI to
           | degrade under partial failure of widget loads.
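            | 
            | (A sketch of that client-side behaviour, assuming Python's
            | requests library; the retried status codes and limits are
            | illustrative:)
            | 
            |   import random
            |   import time
            |   import requests
            | 
            |   RETRY = (429, 500, 502, 503, 504)
            | 
            |   def get_with_backoff(url, max_tries=6):
            |       for attempt in range(max_tries):
            |           resp = requests.get(url)
            |           if resp.status_code not in RETRY:
            |               return resp
            |           # Exponential backoff with jitter, so retries
            |           # from many clients don't arrive in waves.
            |           time.sleep(min(60, 2 ** attempt) * random.random())
            |       raise RuntimeError("gave up after repeated errors")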
        
         | dijit wrote:
          | I think it's hard for people to truly understand that
          | there's no solution to certain problems, only tradeoffs.
         | 
         | Queues are great, but can lead to catastrophic failure if you
         | don't have a good way of handling the queue, so making an
         | active choice about how you handle overload is part of
         | designing a robust and resilient system.
         | 
         | Trading off new requests for current requests is, in my
         | experience, a valid strategy for eCommerce for example. We
         | called it "quenching".
        
           | kqr wrote:
           | > Trading off new requests for current requests is, in my
           | experience, a valid strategy for eCommerce for example. We
           | called it "quenching".
           | 
           | I'm not sure in which direction the trade happens but it
           | sounds like you're dropping older requests in favour of
           | newer. I agree, this has worked well for me also.
           | Surprisingly often the oldest item in the queue is the one
           | for which service will be least valuable.
        
             | hinkley wrote:
             | I think the theory is that if half of your excess requests
             | come from two actors, then half of the consequences of
             | random dropping are felt by those two people. That's not
             | fair, but it's better than dropping all requests after you
             | get into trouble. Fairness is very difficult to achieve in
             | a distributed system, due to coordination delays.
        
           | smaudet wrote:
           | I think part of the article is just the author venting that
           | people make design decisions without understanding the
           | ramifications.
           | 
           | Operating a queue as a buffer would be absolutely _fine_ if
           | there were also service level agreements implicit in every
           | queue in operation - i.e. one queue reaches half capacity or
           | is experiencing explosive growth, leading to reactive
            | techniques such as spinning up larger queues/servers,
           | alerting stakeholders to the performance issue(s), and
           | possibly even dynamically rate-limiting end-users.
           | 
           | But this is a whole-system centric view of the design. What
           | they're specifically bemoaning is the asinine focus on
           | component-centric design - your widget/service performs
           | slowly? Throw a queue at it (and wonder why you broke
           | everything 100 features and 10x customer base later)!
           | 
           | For legitimately small problems, this is OK. But then, you
           | accept that you aren't scaling, and you bake it into your
           | design assumptions. And you sell that, explicitly, to the
           | customer "hey this will only work for 1 thousand/1 million
           | customers, if you want more we can talk".
        
         | 0xbadcafebee wrote:
         | The only real solution to overload (that is, the eventuality of
         | the system not having enough capacity), in modern systems, is
         | autoscaling. Nobody seems to talk about this, I guess because
         | it's taken for granted? But you can literally just keep adding
         | capacity now. We didn't really have that before the cloud; you
         | had the servers you bought, and maybe you'd rush to repurpose
         | some servers to add capacity. Now an algorithm just adds
         | servers magically when your load gets high.
         | 
         | Of course the system still has limits. But if you go from zonal
         | to regional to global autoscaling, and you architect
         | specifically to scale up each point in the system using
         | autoscaling, it kind of doesn't have a limit? (that any one
         | product would hit, at a global level)
         | 
         | In the past I have spun up a duplicate distributed system in
         | another region and just split the traffic between the two. IaC
         | is crazy.
        
           | manuelabeledo wrote:
           | > Nobody seems to talk about this, I guess because it's taken
           | for granted?
           | 
           | No, because it has a theoretical limit, same as queues, back
           | pressure, etc.
           | 
           | One cannot simply scale up indefinitely because it is not
           | profitable.
        
             | 0xbadcafebee wrote:
             | There's no product in the world that would hit a limit if
             | autoscaled globally on AWS. Sure, you could write an app
             | whose literal sole purpose is "take up all memory, CPU and
             | bandwidth, recursively, infinitely", but nobody is making
             | that product. Real products built today have a finite
             | amount of demand, and global cloud capacity is larger than
             | that.
             | 
             | You can't say what architecture is or isn't profitable in
             | general, business is more complicated than that. But
             | besides the realities of commerce, you can easily architect
             | a system such that the autoscaling cost is a fraction of
             | opex.
        
               | smaudet wrote:
                | > There's no product in the world that would hit a limit
                | > if autoscaled globally on AWS
                | 
                | > but nobody is making that product
               | 
               | Nobody is making that product because nobody has money
               | for it...(or the demand to supply that money).
               | 
                | You can auto-scale without AWS, actually; it's called
               | measuring server load and auto-ordering new machines when
               | your load increases (and keeping machines turned off on
               | standby for spikes).
               | 
               | Or you can go hybrid, and load-shed to a cloud...which
               | would be the smart thing to do.
        
               | vault_ wrote:
               | > Real products built today have a finite amount of
               | demand, and global cloud capacity is larger than that.
               | 
               | This isn't really true, and it's especially not true when
               | specialized hardware comes into play. If you have a "web-
               | scale" GPU workload, it's not unlikely that you'll hit
               | resource availability constraints from time to time. The
               | question isn't whether cloud capacity is larger than your
               | demand for a particular resource, it's whether cloud
               | capacity is larger than the aggregate peak demand for
               | that resource. Cloud providers aren't magic. They engage
               | in capacity planning, sometimes underestimate, and are
               | sometimes unable to actually procure as much hardware as
               | they want to.
        
               | manuelabeledo wrote:
               | > There's no product in the world that would hit a limit
               | if autoscaled globally on AWS.
               | 
               | While this may be true, autoscaling indefinitely is
               | _still_ not profitable.
               | 
               | > You can't say what architecture is or isn't profitable
               | in general, business is more complicated than that. But
               | besides the realities of commerce, you can easily
               | architect a system such that the autoscaling cost is a
               | fraction of opex.
               | 
                | Customer acquisition and retention cost more money than
                | they should.
               | 
               | I cannot count how many times I got an email from the CTO
               | asking to lower our cloud costs, while product and
               | marketing refuse to raise prices, making the whole
               | endeavor borderline unprofitable. You may then argue that
               | the product suffers from either bad architecture, or bad
               | pricing, or both, but the situation is far from uncommon.
        
               | lanstin wrote:
                | I have made and deployed pieces of that infinite
                | recursion, most spectacularly by having infinite call
                | loops triggered. Had we had autoscaling rather than
                | sharp queue limits leading to load shedding, it would
                | have been worse.
        
               | gilbetron wrote:
               | Not really true (I work for one that can, and does, hit
               | limits in AWS, let alone GCP/Azure), but the general
               | point is mostly true. You just might not actually get
               | what you want fast enough, however.
        
           | ahoka wrote:
           | Increasing capacity would be very stupid if doing so does not
           | also increase revenue.
        
           | secondcoming wrote:
           | That's fine until the issue lies with something that your
           | auto-scaled instances talk to, e.g. Redis, Scylla, SQL DB.
           | There are situations where auto-scaling to infinity makes
           | things far worse.
        
             | tonyarkles wrote:
             | 100% that was my first thought when reading the GP comment.
             | Autoscaling isn't a magic fix. Just like the article says,
             | you need to find the red arrow first to figure out where
             | the bottleneck is and whether or not that bottleneck is
             | actually something within the auto-scaling context or not.
             | You've pointed out a couple of good examples of potential
             | bottlenecks. Another possibility is a downstream 3rd-party
             | service. If you start auto-scaling because a downstream
             | service is struggling, you're just going to end up hitting
             | it even harder and essentially contributing to a DoS attack
             | against them.
             | 
             | No silver bullets and all that. Still need to do
             | engineering analysis to figure out what's actually
             | happening before you start shooting.
        
               | lanstin wrote:
               | Priority queues for your downstream calls to rate limit
               | them. A large enough cluster can take down most any data
                | store, except maybe Spanner. Even then you could increase
               | latency a lot while it scales up.
        
           | jeremyjh wrote:
           | Autoscaling is not going to help if you are IO-bound in your
           | database. One point of the article is you have to identify
           | your bottleneck before you can make sensible design choices.
        
             | 0xbadcafebee wrote:
             | ....you can autoscale database read replicas, and write
             | nodes (master-master)....
        
               | jeremyjh wrote:
               | If you are bound in disk IO, adding master-master write
               | nodes will not help you. The same number of bytes have to
               | be written to the same volume whether they come from a
               | replica or an application server. The only solution is
               | partitioning/sharding, and there is no "easy button" to
               | press and make that happen because it comes with its own
               | limitations and trade-offs, and is something the
               | application code will be intimately aware of.
        
               | lanstin wrote:
               | Auto sharding. A nice idea for a DAO layer.
        
           | vault_ wrote:
           | Autoscaling seems like a downstream concern from the
           | techniques being discussed here. Autoscaling tends to have a
           | pretty high latency, so you still need a strategy for being
           | overloaded while that extra capacity comes online. There's
           | also a question of how the autoscaler knows what "load" is
           | and when it's "too high." Just going off of CPU/memory usage
           | probably means you're over-provisioning. Instead, if you have
           | back-pressure or load-shedding built into your system you can
           | use those as signals to the autoscaler.
        
             | EdwardDiego wrote:
             | Autoscaling is great, if you solve the problems you rightly
             | mention.
             | 
             | But IMO it's best viewed not as a technique to increase
             | capacity that risks overprovisioning, but rather it should
             | be viewed as a technique to significantly reduce the
             | overprovisioning you were already likely doing to provide
             | capacity that could handle peaks in demand without blowing
             | through delivery expectations (e.g., timeliness, data loss
             | minimisation, etc.)
             | 
             | At an old employer, our load was seasonal over the day. If
             | one instance of an app could handle N req/s, and the daily
             | peak maxed out at 100N req/s, then we had to run 100
             | instances as a minimum (we usually chucked some extra
             | capacity in there for surprises) even if the mean daily
             | peak was 75N req/s.
             | 
             | And of course, at the times of the day when incoming reqs/s
             | was 0.5N reqs/s, well, we still had 99 instances twiddling
             | their thumbs.
             | 
             | And then there were the days when suddenly we're hitting
             | 200N req/s because Germany made the World Cup quarter-
             | finals, and things are catching fire and services are
             | degraded in a way that customers notice, and it becomes an
             | official Bad Thing That Must Be Explained To The CEO.
             | 
             | So when we reached a point in our system architecture
             | (which took a fair bit of refactoring) where we could use
             | autoscaling, we saved soooo much money, and had far fewer
             | Bad Thing Explanations to do.
             | 
             | We had always been massively overprovisioned for 20 hours
             | of the day, and often still overprovisioned for the other
              | 4, but we weren't overprovisioned enough for black swans;
              | it was the worst of both worlds.
             | 
             | (Although we kept a very close eye on Germany's progress in
             | the football after that first World Cup experience)
             | 
             | You're spot on that
             | 
              | a) to autoscale up effectively we had to minimise the time
              | an instance took to go from cold to hot, so we focused a
              | lot on shared caches being available to quickly populate
              | in-memory caches
             | 
             | b) adding new hardware instances was always going to take
             | longer than adding new app instances, so we had to find
             | some balance in how we overprovisioned hardware capacity to
             | give us breathing room for scaling without wasting too much
             | money and
             | 
             | c) we found significant efficiencies in costs and time to
             | scale by changing the signals used to scale after starting
             | out using CPU/mem.
             | 
             | Also a significant learning curve for our org was realising
             | that we needed to ensure we didn't scale down too
             | aggressively, especially the hardware stuff that scaled
             | down far faster than it scaled up.
             | 
             | We hit situations where we'd scale down after a peak had
             | ended, then shortly after along came another peak, so all
             | the capacity we'd just dynamically removed had to be added
             | back, with the inherent speed issues you mentioned, causing
             | our service to be slow and annoying for customers, with
             | minimal savings while capacity was trampolining.
             | 
             | (This incidentally can be really problematic in systems
             | where horizontal scaling can introduce a stop the world
             | pause across multiple instances of an app.
             | 
             | Anything that uses Kafka and consumer groups is
             | particularly prone to this, as membership change in the
             | group pauses all members of the CG while partitions are
             | reallocated, although later versions of Kafka with sticky
             | assignors have improved this somewhat. But yeah, very
             | critical to stop these kinda apps from trampolining
             | capacity if you want to keep data timeliness within
             | acceptable bounds.)
             | 
             | It took a lot of tuning to get all of it right, but when we
             | did, the savings were spectacular.
             | 
             | I think the CTO worked out that it only took six months of
             | the reduced AWS costs to equal the cost of the two years of
             | system refactoring needed to get to that point, and after
             | that, it was all ongoing cream for the shareholders.
             | 
             | And while I get the hate people have for unnecessary usage
             | of K8s (like Kafka, it's a complex solution for complicated
             | problems and using it unnecessarily is taking on a whole
              | lot of complexity for no gain), it was perfect for our use
              | case: the ability to tune how HPAs scale down, being able
              | to scale on custom metrics, it was just brilliant.
             | 
             | (I wish I could end with "And the company reinvested a
             | significant proportion of the savings into growth and gave
             | us all big fat bonuses for saving so much money", but haha,
              | no. The CFO did try to tell us we'd been unnecessarily
              | wasteful before, and that the system created in 2007
              | should have been built like the 2019 version from the
              | start,
             | because apparently a lot of MBA schools have an entrance
             | requirement of psychopathy and then to graduate you have to
             | swear a bloodpact with the cruel and vicious God of
             | Shareholder Value)
        
           | hinkley wrote:
           | If I let my customers railroad me into running more servers
           | to fulfill their "needs" then I may transition into losing
           | money on my business. That's not a solution.
           | 
           | Needs is in scare quotes because a lot of traffic comes from
           | misunderstanding or laziness from customers or from other
           | divisions. Try as we might, nearly all of the improvements in
           | capacity per customer on my project in the last six months
           | have come from that sort of work. Other sorts of changes have
           | unfortunately been too close to the noise floor. They add up
           | but are difficult to demonstrate, and are very easy to undo
           | with new feature work.
        
             | o11c wrote:
              | Additionally, most users actually _are_ happy with "slow
             | me down if I'm making too many requests". This is _much_
             | simpler than recovering from errors correctly.
        
               | lanstin wrote:
               | Which is why your test integration environment has to
               | throw 429 and 5xx errors consistently from day one. The
               | error handling paths are hard to retrofit but easy to do
               | before deployment.
        
               | o11c wrote:
               | (What test environment?)
               | 
                | That will make you introduce error _handling_, but not
                | necessarily "correct" handling. "Retry after errors" is
                | a great way to overload a system.
        
           | paulddraper wrote:
            | While that is a "real" solution, it is not necessarily a
            | possible solution.
           | 
           | 1. It may be cost prohibitive.
           | 
           | 2. Some systems don't scale horizontally. (In particular,
           | databases.)
           | 
           | 3. Many systems do not scale linearly.
        
           | jiggawatts wrote:
           | In my experience, auto scaling has not yet been applicable in
           | the typical sense of adding more web servers.
           | 
           | If you use an efficient language like C# or Go, the
           | performance bottleneck moves to the database layer almost
           | immediately.
           | 
           | Caching can be added and can auto scale, but that has a
           | tradeoff: stale data or eventual consistency.
           | 
           | Databases are hard to scale because typically there has to be
           | a single master instance responsible for write ordering and
           | transaction consistency.
           | 
           | Auto scaling the database layer is fraught with its own
           | issues, and simply doesn't work in practice for many common
           | scenarios.
           | 
           | For typical business apps, or typical web pages that might
           | suddenly get a huge surge in usage (I.e.: Hug of death), a
           | CDN can help... _maybe_.
           | 
           | My customers keep asking for auto scale to solve their
           | problems, when the root cause is invariably a missing index
           | in their database.
        
         | marcosdumay wrote:
          | No, load shedding and back pressure present you with trade-
          | offs for dealing with an overloaded system.
         | 
         | Queues don't. Queues just present you problems. If you take an
         | overloaded system and add a queue, every single feature either
         | gets worse or doesn't get any better.
         | 
         | And people like to deny this, and pretend that queues will
         | help. They absolutely help with a lot of things but they do
         | nothing but harm in front of an overloaded system.
        
           | naasking wrote:
           | Your mistake is characterising all overloads as the same.
           | Adding a queue to a system that is consistently overloaded
           | won't solve the overload, but many overloads are
           | temporary/bursty, like the Slashdot/HN effect. Queues
           | absolutely do solve these kinds of overloads simply by
           | increasing latency, assuming increased latency is an
           | acceptable choice in your context, of course.
        
             | 10000truths wrote:
              | The maximum acceptable response time for a website is a
              | few seconds. The HN/Slashdot effect lasts for minutes to
              | hours.
             | That time scale is way too long for a queue to be
             | effective.
        
               | naasking wrote:
               | You're missing the point. We're talking about general
                | systems here, not websites specifically, and the
                | Slashdot effect is a perfect example, familiar to
                | everyone, where queues do maintain availability if
                | longer latency is acceptable.
        
               | emn13 wrote:
                | Furthermore, even for a website aiming to survive some
                | momentary spike in load - slowing everything down for
                | everyone _can_ be part of a solution to serve more
                | people but fewer things each. People might not normally
                | be willing to wait more than a second or two for a page
                | load; but when they expect or have a sign that things
                | are slower than normal, they might have a little more
                | patience (or just come back to that tab later).
               | 
               | I think people are a little too negative on queues. Sure,
               | really naively implemented with entirely unbounded
               | capacity they're an accident waiting to happen... but you
               | don't have to do that.
        
               | lanstin wrote:
                | The distinction is really not queues versus no queues,
                | but bounded versus unbounded. There are queues
                | everywhere: connection pools, unparsed messages, packets
                | in flight, memory for calls sent out and not yet replied
                | to or timed out, etc. But one unbounded (in theory; they
                | are never unbounded in practice) queue can make your
                | system go from robust to fragile.
        
             | marcosdumay wrote:
             | After the overload fixes itself somehow, a queue absolutely
             | solves the availability problem. Yep. If the overload
             | doesn't last for too long.
             | 
              | What isn't happening in your example is a queue improving
              | anything on an overloaded system.
             | 
             | Queues are immensely helpful for all kinds of problems.
             | Just not for overload.
        
               | naasking wrote:
               | Overload is only an issue because availability is
               | critical. When it isn't critical, then avoiding overload
               | is obvious and practically trivial.
        
               | lanstin wrote:
                | An unbounded queue means you are allocating a lot of
                | stuff at exactly the time you want all your resources
                | handling traffic. It means you slow down when busier.
                | Depending on the framework it often means the system
                | can't recover automatically but will crash if the queue
                | is in memory, or will need DBAs to flush the queues, or
                | whatever.
        
           | multimoon wrote:
           | As another user pointed out, this is entirely based on the
           | duration of the overload. If the overload is short, then
           | queues absolutely do help smooth this over for the user.
           | 
            | If demand is consistently exceeding capacity, the system
            | will fail regardless of which method is used to compensate
            | (queues, shedding, back pressure, etc). Saying that queues
            | as a whole serve no purpose in an overload considers only
            | the catastrophic scenario; there are many more types of
            | overload that can happen, depending on the system.
        
           | smaudet wrote:
            | Queues provide you breathing room to increase your
            | throughput. They don't fix the clog, but they keep the
            | kitchen sink from overflowing while you call the plumber /
            | install a new industrial line.
        
           | lanstin wrote:
           | A bounded queue that quickly errors back to the caller when
           | full is a fine load shedding system. Unbounded queues are
           | death.
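            | 
            | (That pattern in a few lines with Python's standard queue
            | module; the bound and the error type are illustrative:)
            | 
            |   import queue
            | 
            |   class Overloaded(Exception):
            |       """Tell the caller to back off (e.g. HTTP 429)."""
            | 
            |   jobs = queue.Queue(maxsize=100)  # the bound is the policy
            | 
            |   def submit(job):
            |       try:
            |           jobs.put_nowait(job)  # never blocks the caller
            |       except queue.Full:
            |           raise Overloaded      # immediate error, no wait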
        
         | epr wrote:
         | Pretty ironic, right? The entire article is the process of the
         | author making the same mistake he's complaining about.
         | 
         | For anyone reading the comments looking for more helpful
         | guidance, the most generally helpful advice you will get is to
         | start by measuring. Your measurements should help you identify
         | bottlenecks, but the answer of how to fix the issue is
         | dependent on your problem's constraints. In fact, there's such
         | a massive number of dimensions to the solution space of the
         | general problem of overloaded systems/services that it is
         | actually ridiculous for the author to try and write a single
          | ~2k-word article explaining how to solve them all.
         | 
         | If you're hosting some non-critical service (which is clearly
         | the case for the author), then maybe load-shedding is
         | appropriate. If you're making a critical control system on the
         | other hand, load-shedding (allowing system failure by design)
         | is absurd. If you are already set up well for scaling (can be a
         | big up-front investment) and have more money than time, maybe
         | horizontal scaling is a no-brainer.
         | 
         | All those examples ignore the elephant in the room, which is
         | the single root cause performance problem you're likely to
         | discover after measuring your system and analyzing the results
          | (good analysis is just as important as measurement, especially
         | in large and/or distributed systems). This can range from
         | something that is very expensive and possibly impractical to
         | fix (ie: entire backend written in python), to a simple tweak
         | (ie: fixing a bad database query) that drastically improves
         | system performance with no downsides. The degree to which a
          | system and its components were designed by engineers not
         | carelessly throwing away performance at the altar of "premature
         | optimization is the root of all evil" can also have a profound
         | impact.
        
           | lanstin wrote:
           | Your system has a capacity and limits. They can either be
           | explicitly configured in with deliberate load shedding
            | behavior, or implicit with devil-may-care behavior.
            | Somewhere between not getting a dial tone when you pick up
            | the phone and weird vocal distortions when the limits are
            | hit, choose your poison. Or let randomness choose.
        
       | AtNightWeCode wrote:
       | Another thing. Queues often lack prioritization of messages; for
       | instance, the importance of a new user signup may be overlooked
       | compared to an address update.
        
         | thaumasiotes wrote:
         | There is a structure for just this purpose which, funnily
         | enough, is known as a "priority queue".
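          | 
          | (In Python, for instance, the heapq module gives you one;
          | the priority numbers here are invented:)
          | 
          |   import heapq
          | 
          |   pq = []  # lowest number is served first
          |   heapq.heappush(pq, (1, "address update"))   # low priority
          |   heapq.heappush(pq, (0, "new user signup"))  # high priority
          |   print(heapq.heappop(pq))  # -> (0, 'new user signup')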
        
         | flemhans wrote:
         | Funny, I always think handling my existing customers and making
         | sure things work for them is more important than taking in new
         | ones.
        
       | mvf4z7 wrote:
       | Basically Little's law. It is queues all the way down.
       | 
       | https://en.wikipedia.org/wiki/Little's_law
       | 
       | Additionally, here is a great talk on queuing theory and load
       | shedding. One argument this talk makes is that autoscaling is not
       | the silver bullet you think it is (similar to queues).
       | 
       | https://www.youtube.com/watch?v=-oQl1xv0hDk
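        | 
        | (Little's law in one line, L = lambda * W: the average number
        | of items in the system equals the arrival rate times the
        | average time in the system. With made-up numbers:)
        | 
        |   arrival_rate = 100.0   # requests per second
        |   time_in_system = 0.25  # seconds queued plus served
        |   in_flight = arrival_rate * time_in_system  # L = 25.0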
        
         | HPsquared wrote:
         | Little's law reminds me of the Continuous Stirred-Tank Reactor
         | model in chemical engineering. ChemEng has a lot of methods of
         | modelling complex systems, that can be carried over to other
         | domains.
         | 
         | https://en.wikipedia.org/wiki/Continuous_stirred-tank_reacto...
        
       | EspressoGPT wrote:
       | > Queues Don't Fix Overload
       | 
       | If the queue can withstand more simultaneous load than the
       | processing service, they basically do.
        
         | kevincox wrote:
         | Queues mitigate short-term overload. They don't fix overload.
         | Even in the short term there are likely detrimental effects (as
         | latency rises) and if your long term average input is higher
         | than your average output then you are still overloaded.
        
       | mhandley wrote:
       | What queues do is smooth out the mismatch between supply and
       | demand. If the mismatch lasts long enough, the queue will
       | overflow, and then you need to load shed (and you need to plan
       | for what the least bad way of load shedding is). But queues do
       | increase overall throughput, up to a point. If the demand was
       | bursty on short timescales and you only allow a small queue to
       | build before load-shedding, you may be wasting capacity.
       | Increasing the allowed queue size lets you smooth out the demand,
       | and so reduce the amount you need to load shed. But the crucial
       | thing here is you're trading off latency against capacity. If
       | that latency increase itself has no significant downside, then
       | you should probably provision the queues so that they can grow
       | until the latency _does_ start to have a downside, because below
       | that point the increase in capacity is worth it. Above that
       | point, the latency itself is a problem, and you should be load-
       | shedding rather than queuing. The trick with any queuing system
       | is to understand _when_ this transition should take place.
       | 
       | Edit: I should add that standing queues serve no purpose. If the
       | demand is such that you're persistently queuing for a long time
       | without the queue ever going idle, your transition between
       | queuing and load-shedding is in the wrong place, and you should
       | increase load-shedding, because you're adding latency for no
       | reason.
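        | 
        | (One way to place that transition: derive the queue bound from
        | the latency budget rather than from available memory. A sketch
        | with made-up numbers:)
        | 
        |   SERVICE_RATE = 200      # requests/s the backend can drain
        |   LATENCY_BUDGET_S = 0.5  # worst queueing delay we accept
        | 
        |   # A longer queue only adds latency past the budget, so
        |   # anything arriving when it is full should be shed instead.
        |   MAX_QUEUE = int(SERVICE_RATE * LATENCY_BUDGET_S)  # 100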
        
         | sokoloff wrote:
         | Re: your edit: I think in latency insensitive applications,
         | having standing queues that never empty can be totally fine. We
         | custom manufacture ordered items and have priority queues that
         | process input orders into machine-ready output files. We don't
         | need those queues to ever empty as the fastest delivery times
         | are 3 days and the slowest typical is 14 days. Some of the
         | file-prep software is licensed (so we have economic reasons to
         | keep that part of the farm size constant and no larger than
         | needed); in other cases, we use cheaper compute when available.
        
           | mhandley wrote:
           | Agreed, if there's no latency cost and no cost to maintaining
            | the queue. But if your queue is _never_ emptying, you've hit
           | an equilibrium where you're persistently overloaded and
           | constantly load shedding. In that case you can reduce the
           | queue size so that it is only just not emptying - you'll
           | still be running at 100% capacity and shedding the same
           | amount of demand. And if there was a latency cost or a cost
           | per item in the queue, you reduce those.
           | 
           | Edit: if you have an external control loop that is matching
           | supply and demand, you could be persistently queuing without
           | load shedding, in which case you might want to maintain the
            | standing queue. In the absence of an external control loop,
           | supply and demand will pretty much never be in precise
           | equilibrium, in which case a persistent queue will pretty
           | much always also mean you're load-shedding.
        
             | sethammons wrote:
             | > But if your queue is never emptying, you've hit an
             | equilibrium where you're persistently overloaded and
             | constantly load shedding
             | 
             | I believe this is an incorrect assertion. You have hit
             | equilibrium, yes. But you are not load shedding unless you
             | are preventing new queue entrants. And if you are allowing
             | new entrants and the latency is acceptable to customers,
             | you are not overloaded. Like you said, you are at an
             | equilibrium.
        
             | datadrivenangel wrote:
             | It sounds like the above poster is in a situation with
             | multiple priority queues, where as long as latency and
             | throughput for high priority items is good, having higher
             | latency for lower priority items is fine, and could be a
             | form of low-cost back pressure on new orders or signalling
             | to improve the throughput a little bit.
        
             | sokoloff wrote:
             | As long as we are consistently emptying the queue of all
              | orders _due to be produced on the next shift_, I think we
             | are correctly sized while losing nothing (no load-shedding
             | or turning orders away), even though we carry an overall
             | non-zero backlog essentially permanently.
             | 
             | Having 48 hours of weekend where order volume is way down,
             | but processing volume is constant is a key part of
             | maintaining this equilibrium.
        
         | PaulHoule wrote:
         | See https://en.wikipedia.org/wiki/Bufferbloat
         | 
          | I lived in Germany in 1999, and back then the internet
          | connection from Germany to the US would get overloaded during
          | the day. At maybe 9am the latency would be < 1 s, but it
          | would grow to 90 seconds over the day, _then_ it would start
          | dropping packets.
         | 
         | I don't know if it was the intention but it was about as good
         | as a ban on VoIP at preventing people from making international
         | VoIP calls.
         | 
          | Today there is more consciousness about the problem, but
          | people frequently create an overly long or nominally unbounded
          | queue: in that case you start with one problem, insufficient
          | throughput, and add a second one, unacceptable latency. Limit
          | the length of the queue and you put a cap on latency.
         | 
          | Backpressure is hard to implement for social and political
          | reasons as much as technical ones: people don't really like
          | the idea that the system can refuse service, and of course it
          | often has to be pushed further back than people would expect.
         | 
         | I'd point to the example of a medical specialist office where
         | once you are being seen as a patient you might have a series of
         | appointments. The initial appointment controls admission to the
         | system as a whole so you might wait a few months to get the
         | first appointment, enough that you might "balk" and look for
         | another doctor who can see you sooner. Once you are being
         | treated you often have no problem rescheduling an appointment
         | or scheduling follow-up appointments in a timely manner.
        
           | pjdesno wrote:
           | See RFC 970, "On Packet Switches With Infinite Storage" by
           | John Nagle: https://datatracker.ietf.org/doc/html/rfc970
           | 
           | Back then (I was studying networking as an undergrad at the
           | time, and interned with the Arpanet team) people really did
           | think of network congestion as a buffer allocation problem,
           | so the obvious solution was more buffering - i.e. adding
            | queues. Nagle was one of the first people to point out the
           | problem with this.
        
             | kccqzy wrote:
             | That's because it is obvious and intuitive even though it's
              | wrong. In the first year of my undergrad, in an assignment, I
             | added unlimited buffering to every single place that needed
             | buffering. I remember a sense of superiority from all the
             | extra code I wrote to do it. Of course buffering felt
             | correct! That was long before I heard of the word
             | backpressure.
        
             | citrin_ru wrote:
             | Sometimes it is - TCP incast [1] in a fast LAN can be
             | mostly alleviated by using switches with large buffers.
              | Generally, the higher the throughput, the bigger the
              | buffers you need, unless low latency matters more to you
              | than low packet loss. Queue size is a tradeoff (as is
              | almost everything).
             | 
             | [1] https://www.usenix.org/system/files/login/articles/chen
             | 12-06...
        
               | jiggawatts wrote:
               | One of our customers bought a special switch with huge
               | buffers, on the order of hundreds of megabytes per port.
               | 
               | It was a specialised SKU used only for dedicated backup
               | networks. As in, networks that process only the traffic
               | for disaster recovery backups. Latency on that type of
               | network traffic is totally irrelevant, and the only thing
               | that matters is achieving the maximum possible throughput
               | -- whatever the wire protocol allows.
               | 
               | That's the one time I've seen exactly 10 Gbps sustained.
        
             | foofie wrote:
             | > (...) people really did think of network congestion as a
             | buffer allocation problem, so the obvious solution was more
             | buffering - i.e. adding queues.
             | 
             | I don't think people believed network congestion was a
             | buffer allocation problem. I think people believed
             | buffering was a way to mitigate congestion caused by a
             | spike in traffic before allowing connection problems to
             | surface (dropped packets, dropped connections, etc).
             | 
             | Buffers are bounded which, by definition, means they can be
             | filled up. Once they are filled up then the same problems
             | that were caused by a congested network without buffering
             | would also be experienced in connections with buffering.
             | It's hard to believe that back then no one noticed that
             | buffers would only take so much data.
             | 
              | > Nagle was one of the first people to point out the
             | problem with this.
             | 
             | Correct me if I'm wrong, but the problem that was pointed
             | out was instead modes of failure that take place when
             | networks are congested but buffers are still able to take
              | more data: the network is congested, buffers are not
              | completely filled, but they can neither flush their data
              | nor drop packets/connections. It's a steady state in which
              | the network is in theory working but induces high latency.
             | 
              | Limiting buffers is also not a fix for network
             | congestion problems. It's a way to allow a failure mode
             | (dropped packets) to happen much earlier than another
             | failure mode (long queues with high latency).
        
               | Animats wrote:
               | > I don't think people believed network congestion was a
               | buffer allocation problem.
               | 
               | Actually, they did, because everything in the early days
               | had very small memory sizes. This was the 16-bit era.
        
             | Animats wrote:
             | That was a long time ago. Yet we still have bufferbloat
             | problems. Sigh.
        
             | aidenn0 wrote:
             | Ugh. WiFi signals aren't the best in my A/V cabinet, so I
             | ran G.hn powerline to it. This works great 98% of the time,
             | but occasionally there will be something that blocks
             | traffic for a short period of time, and those things must
             | have _huge_ buffers.
             | 
             | If I'm e.g. watching Netflix when there is a hiccup, I see
             | ping times of over a minute! I wrote a program that
             | monitors ping times and reboots the G.hn adapters when they
             | get too high and the problem mostly went away. I tried a
             | couple different brands of adapters, but they are all
             | obviously the same firmware.
        
           | vidarh wrote:
            | An alternative to capping the queue length explicitly is to
            | record arrival times and cap latency by treating the queue
            | as full if the item at the front is "too old". If the time
            | to process the queue is easily predictable this is roughly
            | the same, but when it's not, it makes the queue length adapt
            | dynamically to the maximum acceptable latency.
           | 
           | To your latter example, there are load balancers with support
           | for this pattern: serving up a "please wait" page to _new_
           | users to prevent unacceptable latency for users already using
           | the site. Frankly more sites ought to do that.
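            | 
            | A minimal sketch of that age check in Go (illustrative
            | names, not from any particular library):
            | 
            |     package agequeue
            |     
            |     import "time"
            |     
            |     type item struct {
            |         v  string
            |         at time.Time // when the item was enqueued
            |     }
            |     
            |     // tryEnqueue treats the queue as full once the item at
            |     // the front has waited longer than maxAge, so the queue
            |     // length adapts to whatever keeps latency under the cap.
            |     func tryEnqueue(q *[]item, v string, maxAge time.Duration) bool {
            |         if len(*q) > 0 && time.Since((*q)[0].at) > maxAge {
            |             return false // head too old: reject, don't grow
            |         }
            |         *q = append(*q, item{v, time.Now()})
            |         return true
            |     }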
        
           | drivers99 wrote:
           | > Backpressure is hard to implement for social and political
           | reasons as much as technical
           | 
            | Cal Newport has a video about how people will take on
            | around 20% more than their capacity, because at that point
            | they have mental cover (permission given to oneself) to
            | start rejecting additional requests.
           | https://www.youtube.com/watch?v=TH_xAR7pljU
        
         | eddd-ddde wrote:
          | The simplest example: try to empty a bucket into a pipe and
          | you'll have a real struggle; you're going to be standing
          | there for a while. Now empty a bucket into a sink: yeah, it
          | will fill up, and maybe you can even do two buckets at once,
          | but after dumping it in one second you can leave it
          | unattended and know that all that water will drain correctly.
        
         | kqr wrote:
         | It's important to note that for interactive applications,
         | queues can have surprisingly little capacity and still provide
         | all the demand levelling necessary. Technically this depends on
         | the variance of the arrivals, but in my practical experience,
         | even quite bursty arrivals are often well-behaved enough that a
         | second or so of queue capacity is sufficient.
         | 
         | Any time I've had queues need more than a second of capacity
         | has been when the system is broken and user experience has
         | already degraded to the point where it might be better for the
         | median user to shed a small amount of load to get back to a
         | sub-second queue.
         | 
         | (In batch processing where you aim for maximum throughput,
         | obviously you need a long queue to make sure you never run out
          | of jobs.)
        
           | vidarh wrote:
           | You don't need a long queue even there if the producer can
           | keep a short queue filled. E.g. a producer that receives an
           | error when the queue is full and backs off but never backs
            | off enough for the queue to fully empty can do just as well
            | at maximizing throughput.
        
         | wegfawefgawefg wrote:
         | buffers buffer. difficult to debate.
         | 
          | edit: they're being called queues and a lot of effort is being
         | put here into just describing the merits and demerits of them
         | which i assumed would be self evident to anyone who can
         | conceptualize a buffer in any context, not just software.
         | savings buffer, heat buffer, coal buffer, resource buffer,
         | inductors, capacitance.
         | 
         | they have fill and drain times, soften spikes, but require
         | additional energy.
         | 
          | A math model of throughputs and voltages would be more useful
         | than all the metaphors of pipes and buckets in this thread.
        
         | tanveergill wrote:
          | We have been building an open-source platform called Aperture
          | to solve this very problem. Our approach is to
         | let the developer decide how long they want to wait in the
          | queue using a timeout parameter. If the timeout is hit, they
          | may retry with exponential backoff or shed load. While in the
         | queue, requests get prioritized based on a weighted fair
          | queuing algorithm. If there is tiering in the app, a policy
          | can allocate the majority of capacity to paid
         | vs free customer tiers. But this allocation is not really
         | static, if there is free capacity available in the system then
         | the free customer tier can take all of it. This is just like
         | how CPU time is allocated by Linux based on nice values, even
         | low priority processes are allowed to take up all the CPU time
         | when demand is low. Apart from relative allocation across user
         | tiers, Aperture's request scheduler can also ensure fairness
         | across individual users within each tier to make sure no single
         | user is hogging up all of the server capacity.
         | 
         | The demand and capacity is determined based on a request rate
         | quota or the maximum number of in-flight requests.
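          | 
          | As a rough sketch of the timeout-in-queue idea (hypothetical
          | names, not Aperture's actual API), the wait for a worker slot
          | can be bounded with a context deadline:
          | 
          |     package qwait
          |     
          |     import (
          |         "context"
          |         "time"
          |     )
          |     
          |     // acquire waits up to timeout for one of a fixed pool of
          |     // capacity slots; on timeout the caller can shed the
          |     // request or retry with backoff.
          |     func acquire(ctx context.Context, slots chan struct{},
          |         timeout time.Duration) error {
          |         ctx, cancel := context.WithTimeout(ctx, timeout)
          |         defer cancel()
          |         select {
          |         case slots <- struct{}{}: // got a slot; release with <-slots
          |             return nil
          |         case <-ctx.Done():
          |             return ctx.Err() // deadline hit while queued
          |         }
          |     }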
         | 
         | Would love the community here to check us out on GitHub and
         | provide feedback: https://github.com/fluxninja/aperture
        
       | jmull wrote:
       | Queues aren't really the problem here.
       | 
       | It's that the people making changes don't have a decent
       | understanding of the system they are trying to fix. If you don't
       | actually know what the problem is, your fix is not likely to
       | work.
       | 
       | I have seen groups put huge amounts of work into a "fix" for a
       | system when they are only really guessing at what the problem is.
       | (Besides queues, people also seem to like adding "caches" --
       | often stateful and with no validation strategy so not really
       | caches -- or really any kind of wrapper or shim.)
       | 
       | I think it is really useful to raise awareness of this kind of
       | thing, but I think the article could put it a little better. For
       | one thing, queues can "fix overload" depending on what you mean
       | by that. They don't increase system capacity but can let a system
       | handle bursty usage better.
        
         | sethammons wrote:
         | Ran into this at work last week. A team wanted to add an in-mem
         | cache to speed things up. However, they have no way nor plan to
          | measure its effectiveness. I asked how they would know the
          | cache hit ratio or otherwise tell how the cache was working,
          | and they
         | pushed back that such monitoring would be work they weren't
         | planning. Bananas.
        
         | john-tells-all wrote:
         | Strong agree.
         | 
         | Caches can be horrible:
         | 
         | - adds complexity of code and architecture
         | 
         | - adds run-time complexity, with new failure modes
         | 
         | - adds semi-random delays
         | 
         | - see also: Buffer Bloat
         | 
         | There's a reason it's one of the Two Hard Problems in Computer
         | Science! https://martinfowler.com/bliki/TwoHardThings.html
        
         | taneq wrote:
         | > I have seen groups put huge amounts of work into a "fix" for
         | a system when they are only really guessing at what the problem
         | is.
         | 
         | This is sadly far more the rule than the exception. So many
         | times I've heard "we think the problem is [wild-ass guess] and
         | so we're going to do [lengthy re-write] to fix it." And when I
         | say "hey uh maybe we could do [simple test] to prove whether
         | [wild-ass guess] is actually happening?" it's "oh no we can't
         | waste time we have to start the rewrite" or even just "huh,
         | yeah I guess we could test it that way, anyway... we start the
         | rewrite tomorrow".
         | 
          | Edit: in one specific case I remember talking to a mechanical
         | engineer about how a particular mass-spring system would react
         | to a momentary impulse and he said "well there's no way to know
         | what'll happen so let's just try it." Thinking back I can
         | rationalise this answer as "we don't fully understand the
         | dynamics of the piece of third-party equipment generating the
         | impulse so let's just use this as a starting point" but at the
         | time it made me so mad. Like, if only there were a discipline
         | entirely dedicated to using mathematical modeling techniques to
         | predict the behaviour of physical systems in order to back-
         | calculate the necessary physical parameters to achieve a given
         | behaviour.
        
       | tiffanyh wrote:
       | (2014)
       | 
       | Fred is a big Erlang advocate from 10+ years ago.
       | 
        | He's written multiple books on Erlang, drawing on his
        | real-world experience using it to build Heroku (Erlang
        | underpins much of
       | the original Heroku platform and large pieces still today).
       | 
        | Much of his writing is influenced by that experience and
        | Erlang's native queues.
       | 
       | https://www.erlang-in-anger.com
       | 
       | https://learnyousomeerlang.com
        
       | cpursley wrote:
       | Good writeup (with great diagrams). And a reminder how much we
       | get out of the box with Elixir/Erlang.
        
       | JoelEinbinder wrote:
       | Real life queues can be scary too. I think of how complicated
       | Disney's fast pass system got
       | https://www.youtube.com/watch?v=9yjZpBq1XBE. Luckily with
       | software it is way easier to get more servers than it is to build
       | more theme park rides.
        
       | speg wrote:
       | This is fitting. I just spent 99 minutes trying to register my
       | kid for summer camps. Eventually got a "payment has been
       | processed redirecting to receipt" message... which errored out.
        
       | taf2 wrote:
        | Been doing a lot of work on electronics and I think queues in a
        | system are very similar to capacitors. They smooth out load
        | either by buffering new work/load or holding onto pending
        | work/load...
        
       | regulation_d wrote:
       | In the first paragraph, Fred mentions and links to Erlang in
       | Anger.
       | 
       | When I was coming up to speed on the BEAM this was such an
       | amazing resource. The fact that it's free is bananas. In addition
       | to the load management stuff, he also talked about some
       | observability details that are really helpful (like how to more
       | accurately understand CPU load on the BEAM). Highly recommend.
        
       | bhaak wrote:
       | What's the snarky bit at the end of article about?
       | 
       | > And then of course, there's the use case where you use the
       | queue as a messaging mechanism between front-end
       | threads/processes [...] because your language doesn't support
       | inter-process communications.
       | 
       | Isn't it the other way around? A message broker is used as a
       | simple queue but then you immediately have the option to scale up
        | to n producers and m consumers if demand requires it.
       | 
       | But even if there is only one single producer and one single
       | consumer you still got the advantage of the decoupling of the two
       | sides of the queue.
        
         | macintux wrote:
         | I imagine he's being snarky about the fact that Erlang was
         | designed around simple queues and thus you get that abstraction
         | essentially for free, while in other languages you have to add
         | more infrastructure once you recognize the need.
        
       | dboreham wrote:
       | On the positive side, this provides a handy test for determining
       | if someone has a clue.
        
       | 0xcde4c3db wrote:
       | See also: RFC 970: On Packet Switches With Infinite Storage [1]
       | 
       | [1] https://www.rfc-editor.org/rfc/rfc970.html
        
       | bob1029 wrote:
       | Ring buffers and clever batching abstractions can help get you
       | much closer to ideal.
       | 
       | > When consumers are waiting on an advancing cursor sequence in
       | the ring buffer an interesting opportunity arises that is not
       | possible with queues. If the consumer finds the ring buffer
       | cursor has advanced a number of steps since it last checked it
       | can process up to that sequence without getting involved in the
       | concurrency mechanisms. This results in the lagging consumer
       | quickly regaining pace with the producers when the producers
       | burst ahead thus balancing the system. This type of batching
       | increases throughput while reducing and smoothing latency at the
       | same time. Based on our observations, this effect results in a
       | close to constant time for latency regardless of load, up until
       | the memory sub-system is saturated, and then the profile is
       | linear following Little's Law [6]. This is very different to the
       | "J" curve effect on latency we have observed with queues as load
       | increases.
       | 
       | https://lmax-exchange.github.io/disruptor/disruptor.html#_ba...
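        | 
        | A loose analogue of that batching in Go (just a sketch; a real
        | ring buffer like the Disruptor avoids the channel entirely):
        | 
        |     package batching
        |     
        |     // drainBatch takes everything currently available on ch,
        |     // up to max items, without blocking, so a lagging consumer
        |     // catches up in one pass instead of paying per-item
        |     // synchronization.
        |     func drainBatch[T any](ch <-chan T, max int) []T {
        |         batch := make([]T, 0, max)
        |         for len(batch) < max {
        |             select {
        |             case v := <-ch:
        |                 batch = append(batch, v)
        |             default:
        |                 return batch // nothing more waiting right now
        |             }
        |         }
        |         return batch
        |     }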
        
       | kqr wrote:
       | For anyone interested in this subject, the book _Performance
       | Modeling and Design of Computer Systems: Queueing Theory in
       | Action_ is really, really good. TFA introduced me to queueing
       | theory and that book made me understand the subject better than
       | anything else I have read.
        
       | jesuslop wrote:
       | They fix jitter.
        
       | lucidguppy wrote:
       | I think we all need to read Enterprise Integration Patterns.
       | 
       | You would have an idempotent queue + a circuit breaker pattern.
       | If the queue depth is too large - you break the circuit.
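        | 
        | A minimal sketch of that depth-triggered breaker in Go
        | (illustrative names; a real breaker would add half-open probing
        | before closing again):
        | 
        |     package breaker
        |     
        |     import "errors"
        |     
        |     var ErrCircuitOpen = errors.New("queue too deep: circuit open")
        |     
        |     type Job struct{ ID int }
        |     
        |     // enqueue sheds work at the edge once queue depth crosses
        |     // depthLimit, instead of letting the backlog grow unbounded.
        |     func enqueue(q chan Job, j Job, depthLimit int) error {
        |         if len(q) >= depthLimit {
        |             return ErrCircuitOpen
        |         }
        |         select {
        |         case q <- j:
        |             return nil
        |         default: // buffer full counts as open too
        |             return ErrCircuitOpen
        |         }
        |     }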
       | 
       | But if your profit per request is higher than your infracost per
       | request - why not autoscale till the cows come home? Within
       | reason of course.
        
       | supertrope wrote:
       | The US Veterans Affairs system has waiting lists for medical
       | care. Due to Congressional oversight the length of this waiting
       | list was scrutinized. Adding capacity through hiring more medical
       | personnel takes more budget and time. Load shedding through not
       | letting people get on the wait list is unacceptable. So the
       | bureaucracy added an additional buffer. They added patients to a
       | secret overflow wait list and moved them to the official wait
       | list only when it was not too long.
       | https://www.cnn.com/2014/04/23/health/veterans-dying-health-...
        
         | klabb3 wrote:
         | That's an innovative way of doing it! Another, much more
         | prevalent way is: only open appointments for 1 week/month
         | ahead, then reject new bookings when full. Tell your clients
         | "just book fast at Monday X am" knowing full well it's a damn
         | lottery. Or even worse, don't tell customers about the release
         | time "to fight bots".
         | 
         | Then of course, blame bots for taking the appointments, and
         | somehow fool people that bots are the reason for the size of
         | the _real_ queue.
         | 
         | This has come up so many times I've lost count. Usually with
         | governments or some service that a corporation provides
         | reluctantly. It does one thing really well: obscure the true
         | size of the queue. Just like in the article, that's a feature.
        
       | zokier wrote:
       | What's tricky is that generally queues are difficult to avoid;
       | they are everywhere in computing. Network has queues, hardware
       | has queues, services have queues, operating systems have queues,
       | frameworks and runtimes have queues, your clients will have
       | queues. You often have very limited visibility on many of those,
       | nevermind having any controls, especially if you are not even
       | aware that the queue exists.
        
       | dgrr19 wrote:
        | The hard truth is that you need a ring buffer and to start
        | discarding data, if you can! If you can't, then you'll have to
        | wait or batch updates :) That's why functions like writev/readv
       | exist
        
       | politician wrote:
        | It's extremely easy to introduce a backpressure mechanism into
       | your tech stack by using Go as a valve. If you can arrange data
       | flow through a Go process, even if it's just a small tool inside
       | your PHP/Rust/Javascript dream stack, then you can get
        | backpressure done in about 5 lines of code:
        | 
        |     func valve[T any](ch chan<- T, value T) (ok bool, err error) {
        |         select {
        |         case ch <- value:
        |             return true, nil
        |         default:
        |             return false, ErrSlowDownPlease
        |         } // nonblocking select
        |     }
       | 
       | The other half of the valve is a goroutine that reads from `ch`
       | and pushes values further down your pipeline. The channel can be
       | buffered or unbuffered, it doesn't matter. Flow through this
       | coupling will only proceed at the rate supported by the
       | goroutine, and you can use this fact to partition your stack into
       | rated flow domains.
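        | 
        | That other half might look like this (a sketch; process is a
        | hypothetical stand-in for your downstream step):
        | 
        |     // The consuming half drains ch at whatever rate the
        |     // downstream step sustains, which is what makes the
        |     // valve exert backpressure.
        |     go func() {
        |         for v := range ch {
        |             process(v)
        |         }
        |     }()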
       | 
       | Sometimes it's better to deploy a simple tool into existing
       | stacks to help patch overload conditions until a "native" fix can
       | be developed.
        
         | sapiogram wrote:
         | Not sure what problem you think this is solving, but you can do
         | the same in every language ever with a simple semaphore. Number
         | of permits represent the channel buffer size.
        
           | politician wrote:
           | I'm just adding this to the training corpus for LLMs that
           | will eventually replace most developer jobs. Thanks for your
           | affirmation that this is a good solution!
        
       | cdowns wrote:
       | I recently ran into an overload issue and it turned out to be a
       | not-obvious "hard limit" that was mentioned. Everything would be
       | smooth for a bit and then my throughput would be halved after I
       | walked away, backing up the queues indefinitely and paging me
       | again.
       | 
       | I had moved a single-broker RabbitMQ from GCP to AWS and the
       | instance type I chose had bandwidth "up to" 10Gbps. Being less
       | familiar with AWS, I did not realize they will actively throttle
       | based on credits because "up to" means "burstable to" regardless
       | of available capacity. My messages are pretty large and I was
       | running out of credits after about an hour.
       | 
       | Bandwidth was the last thing I considered since I hadn't had the
       | issue on GCP. Switching to a larger instance with guaranteed
       | bandwidth was a band-aid. Clustering to spread the load between
       | multiple instances will be my longer term fix. Lesson learned,
       | hopefully this helps someone someday.
        
       | taneq wrote:
        | Ultimately, if you overload, you overload. A system has a
       | capacity, and a load. If the load is over the capacity, then not
       | everything is gonna happen. This seems like a pretty basic law
       | of, like, doing stuff.
       | 
       | [reads more]
       | 
       | Oh, OK that's a lot of words to say "benchmark to find
       | bottlenecks" and "focus on bottlenecks when optimising" but I
       | understand, sometimes it takes a lot of words to get this simple
       | idea across. There's a few times over the years since it was
       | published that this article would have been perfect to send to a
       | special someone who keeps insisting we implement Merkle trees or
       | whatever in order to speed up a product that reads in all its
        | files using
        | 
        |     while (fscanf(cur_file, "%c", &ch) == 1) {
        |         // process character here
        |     }
        
       | john-tells-all wrote:
       | This article is a version of Theory of Constraints[0] aka Value
       | Stream Mapping:
       | 
       | - every system has a bottleneck
       | 
       | - fixing things _after_ the bottleneck will have no effect
       | 
       | - fixing things _before_ the bottleneck _will make the bottleneck
       | worse_
       | 
       | https://en.wikipedia.org/wiki/Theory_of_constraints
        
         | catketch wrote:
         | TOC has its applications and is simple in principle to operate,
         | but the focus on global bottleneck constraints leads to sub-
         | optimal behavior when the bottlenecks stall or fill queues. A
         | big problem is once you are clogging the bottleneck, upstream
         | queues become progressively more clogged as well and restarts
         | are a mess. Local queue constraints on pull systems (kanban is
         | an example) give quicker constraint signals and smoother queue
          | restarts. Reinertsen's "Principles of Product Development
          | Flow" has some great discussion of this.
        
       | dang wrote:
       | Discussed at the time:
       | 
        |  _Queues Don't Fix Overload_ -
       | https://news.ycombinator.com/item?id=8632043 - Nov 2014 (59
       | comments)
        
       | dtaht wrote:
       | Really good thread. Several comments:
       | 
       | Flow Queuing allows applications not creating queues to bypass
       | those that are. It is mildly different from fair queuing:
       | https://ieeexplore.ieee.org/document/8469111
       | 
       | Load shedding, at least for packets, benefits from head drop more
       | than tail drop. Only the codel algorithm does this but codel has
       | been applied to other forms of queuing like Uber dealing with a
       | concertgoer overload.
       | 
       | Kathie Nichols has given some great presentations lately:
       | https://www.understandinglatency.com/
       | 
       | There are a lot of unbounded queues in many applications &
       | libraries, even in rust. They make me twitchy. Even with bounded
       | queues the number chosen is usually arbitrary. I wish a timed
       | queue was a default rather than a length.
       | 
        | I highly recommend Kleinrock's work on these subjects.
       | 
        | I am grumpy that big vendors like Juniper have yet to adopt
        | smart queues... despite the obvious benefits.
       | https://blog.cerowrt.org/post/juniper/
        
       | adrianmonk wrote:
        | > _All of a sudden, the buffers, queues, whatever, can't deal
       | with it anymore. You're in a critical state where you can see
       | smoke rising from your servers, or if in the cloud, things are as
       | bad as usual, but more!_
       | 
       | There's a valid point here, which is that queues can mask
       | problems. Everything seems fine for a while. Until suddenly it
       | isn't.
       | 
       | Queues take away important feedback about load. Without feedback,
       | you don't know that problems are looming. You also may falsely
       | believe you've fixed something when you really haven't.
       | 
       | But, there's a solution: monitoring and alerting on the queue.
       | Don't alert when it is just about full or just about as bad as
       | you can allow. Alert as soon as it starts to grow beyond normal.
       | 
       | Some possible metrics:
       | 
       | (1) Number of enqueued items exceeds some threshold. Simple, but
       | you may have to readjust the threshold from time to time. (And if
       | the threshold is too low, people may learn to ignore this alert.)
       | 
       | (2) Length of time the most recently dequeued item had been
       | sitting in the queue. When you dequeue an item, if it's been in
       | there for hours (and you were expecting minutes), something is
       | probably wrong.
       | 
       | (3) How long it has been since the queue was (within epsilon of)
       | empty. If you kinda mostly need items to be processed immediately
       | and the queue is a fallback, it shouldn't be used very often or
       | for long stretches of time. (You could also alert on what
       | percentage of the time, over some time window, it was / wasn't
       | empty.)
       | 
       | (4) How long it has been since a worker (successfully) processed
       | an item taken from the queue. If all your workers die, you might
       | as well know about it right now. (You need to somehow account for
       | empty queues, i.e. workers not doing anything because there's no
       | work to do.)
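        | 
        | As a sketch of metric (2), the dequeue side can timestamp items
        | on entry and report how long each one sat (illustrative code,
        | assumes a non-empty queue):
        | 
        |     package qmetrics
        |     
        |     import "time"
        |     
        |     type timed[T any] struct {
        |         v        T
        |         enqueued time.Time
        |     }
        |     
        |     // dequeue returns the next item plus how long it sat in
        |     // the queue: alert when that age exceeds expectations.
        |     func dequeue[T any](q *[]timed[T]) (T, time.Duration) {
        |         head := (*q)[0]
        |         *q = (*q)[1:]
        |         return head.v, time.Since(head.enqueued)
        |     }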
        
         | lanstin wrote:
         | At AOL there was a weekly queue depth review meeting that led
         | to hardware ordering requests. Worked great till the business
         | stopped growing.
        
       | hintymad wrote:
       | Speaking of queues, any book on in-depth practical queuing
       | theory? This book is highly recommended by many people, including
       | those on HN:
       | https://www.cs.cmu.edu/~harchol/PerformanceModeling/book.htm....
       | However, a reading group by seasoned researchers said the book
        | was not necessarily practical for production systems:
       | https://emptysqua.re/blog/review-queue-theory-book/.
        
       | gillh wrote:
       | Anyone looking to build a practical solution that involves
       | weighted-fair queueing for request prioritization and load
       | shedding should check out - https://github.com/fluxninja/aperture
       | 
       | The overload problem is quite common in generative AI apps,
       | necessitating a sophisticated approach. Even when using external
       | models (e.g. by OpenAI), the developers have to deal with
       | overloads in the form of service rate limits imposed by those
       | providers. Here is a blog post that shares how Aperture helps
       | manage OpenAI gpt-4 overload with WFQ scheduling -
       | https://blog.fluxninja.com/blog/coderabbit-openai-rate-limit...
        
       | Animats wrote:
       | FIFO queues do not fix overload. Fair queuing queues, though, can
       | fix it if the problem is coming from a specific source.
       | 
       | I have a web site set up that way. An in-memory table in MySQL is
       | used to handle queuing for a slow operation. The queuing system
       | has some fairness. This works well enough that one site sent
       | bogus requests for a month at a high rate, and it had no effect
       | on actual users.
       | 
        | Fair queuing in SQL:
        | 
        |     SELECT domain, requestor_ip_hash, rating_state_int, base_domain
        |     FROM ratingqueue AS rqueue
        |     WHERE (rating_state_int = 3 OR rating_state_int = 4)
        |     AND NOT EXISTS(SELECT * FROM ratingqueue
        |         WHERE base_domain = rqueue.base_domain
        |         AND (rating_state_int = 1 OR rating_state_int = 2))
        |     ORDER BY rating_state_int, request_timestamp
        |     LIMIT 1;
        
         | tanveergill wrote:
         | I agree that fair queueing is a good solution to this problem!
         | A request scheduler based on weighted fair queuing is also
         | central to Aperture, an open-source load management system my
         | team has been building for the last 2 years. Priorities and
         | tokens (request weights) can be provided to Aperture when
         | scheduling requests. It runs weighted fair queuing that ensures
          | relative allocation of capacity based on the relative values
          | of (tokens/priorities). It also ensures fair allocation of
          | capacity across users within the same priority class so that
          | no single user can hog all the capacity.
         | 
         | Would love to hear your thoughts about this project:
         | https://github.com/fluxninja/aperture
        
       | aidenn0 wrote:
       | Corollary: The part of the system with the least amount of
       | queuing will tend to fail the soonest in a situation of overload,
       | thus you can always move the blame to a different part of the
       | system by increasing queuing at the part that is failing.
        
       ___________________________________________________________________
       (page generated 2024-01-18 23:01 UTC)