[HN Gopher] Latency Exists, Cope (2007)
___________________________________________________________________
Latency Exists, Cope (2007)
Author : sunainapai
Score : 22 points
Date : 2021-10-18 04:26 UTC (2 days ago)
(HTM) web link (web.archive.org)
(TXT) w3m dump (web.archive.org)
| spsesk117 wrote:
| I definitely agree with the line of thinking posited; what I'm
| less clear on is what a concrete implementation of these ideas
| looks like.
|
| I once worked at a company with a huge monolith system, mainly
| revolving around a relational database. We tried for years to
| break out of this, and if my understanding is correct that
| organization is still on the monolith today. There were a number
| of challenges, and we tried a number of different solutions
| (NoSQL models/object stores/etc.), but there were base-level
| assumptions about the availability of data in the core
| application that seemed impossible to address without a
| full-scale rewrite and a reevaluation of all previous
| assumptions.
|
| Perhaps I've answered my own question here -- it just needed to
| be completely redesigned from the ground up. Short of doing
| that, however, would anyone care to provide high-level insight
| into how they'd break down this problem, and what technology
| they might use to address it?
| dwheeler wrote:
| "Monolith" is not a problem. It's just a description of an
| architectural approach. Often switching to alternatives, like
| microservices, is a terrible idea:
| https://medium.com/swlh/stop-you-dont-need-microservices-dc7...
|
| The question is: What do you actually need to do? I.e., what
| are your requirements?
|
| It's all about trade-offs. If you can't identify at least one
| pro and one con of an approach you're considering, you don't
| adequately understand it yet.
| dragontamer wrote:
| What's wrong with the monolith? If that monolith was designed
| 30 years ago on RAIDed hard drives, upgrading to an all-SSD,
| 4TB-RAM system today would probably solve all of those issues.
|
| Really: is the monolith problem so difficult that a $400,000
| computer can't solve it? Is there any developer project you
| can fund for, say, $1 million ($400k for the computer, $600k
| for IT time to transfer the data to the new machine) that
| would give the same return on investment?
|
| $400k gets you RAIDed PCIe 4.0 SSDs, 4TB of RAM, and a
| 128-core dual-socket EPYC or something. That __probably__ can
| run your monolith.
| Animats wrote:
| The author clearly hasn't talked to game developers. Their users
| care about latency and will scream about 30ms delays. So game
| developers obsess over it.
|
| Much of this involves getting things off the critical path.
| Background updating isn't as time-sensitive.
| dragontamer wrote:
| Latency exists, but no one cares about it until it crosses a
| critical threshold!
|
| Throughput is the number most people care about. As long as
| latency remains below the critical threshold (which is
| application dependent), throughput remains the more important
| figure.
|
| As such: we can perform latency/throughput tradeoffs in practice.
| Maybe even latency/throughput/simplicity tradeoffs.
|
| * The absolute lowest latency is a single-threaded system that
| blocks on everything. Wait until X is ready and immediately
| start doing Y. This happens to be very simple and elegant code
| in practice, but it can only do one thing at a time.
|
| * However, people want more throughput. You can use pthreads, or
| golang goroutines / fibers / cooperative threads, to convert the
| "single-threaded" code into higher-throughput code. (Get the CPU
| to "work on something else" while waiting for X.) This makes
| latency worse, but increases throughput dramatically. Multi-core
| accelerates this pattern very naturally.
|
| * For the highest levels of throughput and lowest latency, you
| need an event-driven loop: the GetMessageW() loop in Win32, game
| loops in video games, poll() in Linux, and the like (see the
| sketch after this list). This is a bit difficult to use in
| practice, so people use async to help decompose the "big loop".
| Generalizing to multicore is difficult, however. But virtually
| every "high performance" system I've ever seen comes down to
| some glorified event loop / poll / epoll / async / GetMessageW()
| pattern. Literally all of them.
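|
| A minimal sketch of that event-loop pattern using Linux epoll
| (illustrative only: error handling and the actual per-event
| work are elided, and the fd names here are made up for the
| example):
|
|     /* Single-threaded epoll event loop (Linux).       */
|     /* listen_fd is a non-blocking listening socket.   */
|     #include <sys/epoll.h>
|     #include <sys/socket.h>
|     #include <unistd.h>
|
|     void event_loop(int listen_fd)
|     {
|         int ep = epoll_create1(0);
|         struct epoll_event ev = { 0 };
|         ev.events = EPOLLIN;
|         ev.data.fd = listen_fd;
|         epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);
|
|         struct epoll_event ready[64];
|         for (;;) {                    /* the "big loop" */
|             /* block until at least one fd has work */
|             int n = epoll_wait(ep, ready, 64, -1);
|             for (int i = 0; i < n; i++) {
|                 int fd = ready[i].data.fd;
|                 if (fd == listen_fd) {
|                     int c = accept(listen_fd, 0, 0);
|                     ev.events = EPOLLIN;
|                     ev.data.fd = c;
|                     epoll_ctl(ep, EPOLL_CTL_ADD, c, &ev);
|                 } else {
|                     char buf[4096];
|                     ssize_t r = read(fd, buf, sizeof buf);
|                     if (r <= 0) close(fd);
|                     /* else: handle the r bytes read */
|                 }
|             }
|         }
|     }
|
| Every ready event is dispatched from a single thread; nothing
| blocks except the epoll_wait() call itself.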
|
| -------
|
| I'd say: work on throughput until latency becomes an issue.
| pthreads / fibers are, IMO, the easiest tools to reach for, and
| they have adequate performance for ~100,000 events per second
| or so.
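|
| And a hedged sketch of the pthread route (the worker body and
| the thread count are placeholders; real code would pull work
| off a shared queue):
|
|     /* Fork/join throughput with plain pthreads. */
|     #include <pthread.h>
|     #include <stdio.h>
|
|     #define NWORKERS 8
|
|     static void *worker(void *arg)
|     {
|         long id = (long)arg;
|         /* stand-in for real per-worker request handling */
|         printf("worker %ld handling its share\n", id);
|         return NULL;
|     }
|
|     int main(void)
|     {
|         pthread_t t[NWORKERS];
|         for (long i = 0; i < NWORKERS; i++)
|             pthread_create(&t[i], NULL, worker, (void *)i);
|         for (int i = 0; i < NWORKERS; i++)
|             pthread_join(t[i], NULL);   /* wait for all */
|         return 0;
|     }
|
| A blocking call stalls only one worker, so throughput scales
| with the thread count while the code stays straightforward.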
| bob1029 wrote:
| > However, people want more throughput.
|
| This is where the dragons enter the room. Nine times out of
| ten, the thing the business wants to make go faster is a
| serialized narrative of events that has to occur in a certain
| order.
|
| For problems that do not demand a globally serialized narrative
| of events, we can certainly throw Parallel.ForEach at them and
| call it a day.
|
| For everything else, you want a ring buffer with spin-waiting
| and (micro)batch processing. 100k events per second is paltry
| compared to what is possible if you take advantage of
| pipelining and caches in modern CPUs. I have personally tested
| practical code that can handle ~14 million events per second
| _with_ persistence to disk, using only one thread the entire
| time. Impractical academic test cases can get upwards of half
| a billion events per second on a single thread:
|
| https://medium.com/@ocoanet/improving-net-disruptor-performa...
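|
| As a rough sketch of the single-producer / single-consumer
| version of that idea (C11 atomics; the capacity, the empty
| spin loop, and the handler callback are placeholders, and
| seq_cst atomics are used for brevity where acquire/release
| would do):
|
|     /* SPSC ring: producer spin-waits when full, the   */
|     /* consumer drains whatever is there as one batch. */
|     #include <stdatomic.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     #define CAP  1024u               /* power of two */
|     #define MASK (CAP - 1)
|
|     typedef struct {
|         uint64_t slots[CAP];
|         _Atomic uint64_t head;       /* next write index */
|         _Atomic uint64_t tail;       /* next read index  */
|     } ring_t;
|
|     typedef void (*handler_fn)(uint64_t);
|
|     static void produce(ring_t *r, uint64_t v)
|     {
|         uint64_t h = atomic_load(&r->head);
|         while (h - atomic_load(&r->tail) == CAP)
|             ;                        /* full: spin-wait */
|         r->slots[h & MASK] = v;
|         atomic_store(&r->head, h + 1);
|     }
|
|     /* Drain the current contents as one micro-batch;
|        returns how many events were handled. */
|     static size_t consume_batch(ring_t *r, handler_fn handle)
|     {
|         uint64_t t = atomic_load(&r->tail);
|         uint64_t h = atomic_load(&r->head);
|         for (uint64_t i = t; i < h; i++)
|             handle(r->slots[i & MASK]);
|         atomic_store(&r->tail, h);
|         return (size_t)(h - t);
|     }
|
| Everything stays on two threads at most and the hot path never
| takes a lock; the batch size is simply whatever had accumulated
| when the consumer looked.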
|
| You would think low latency and "batch" processing would not go
| hand-in-hand, but they certainly do when you are dealing with
| things at scale and need to take advantage of aggregate effects
| across all users. This technique also has the effect of
| substantially reducing jitter. It's amazing what is possible if
| you can keep everything warmed up.
| dragontamer wrote:
| > For everything else, you want a ring buffer with spinwait &
| (micro)batch processing.
|
| Note: ordering is only globally defined with single-consumer /
| single-producer. Which... probably should just be a function
| call in most situations. (Notable exception: the consumer
| wants to stay "hot" in consumer code, and the producer wants
| to stay "hot" in producer code. Two threads: one for the
| producer, one for the consumer.)
|
| If you even go to single-consumer / multi-producer, then the
| order is no longer defined. Ex: Producer A creates A1, A2,
| A3, A4. Producer B creates B1, B2, B3.
|
| Single-consumer/multi-producer can consume in many orders:
| A1, B1, A2, B2, A3, B3, A4, for example. Or maybe even A1,
| A2, A3, A4, B1, B2, B3. Or maybe B1, B2, B3, A1, A2, A3, A4.
|
| Multi-consumer/multi-producer is even "better". Multi-
| consumer means work can _complete_ in the order A4, A3, A2,
| A1, B3, B2, B1. (Let's say you have 4 consumers, all working
| on A1, A2, A3, and A4. Consumer1 takes A1, but Consumer1 is
| slower because its L1 cache wasn't warm, while Consumer4 had
| its L1 cache warmed up just right. That means Consumer4,
| executing A4, will finish first.)
|
| As such: a sequentially-consistent spinwait is good, but
| suboptimal. The answer is to allow __race conditions__ to
| pick the order, because no one cares about ordering anymore
| if they're asking for multi-consumer/multi-producer.
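|
| A tiny demo of that non-determinism, with two producer threads
| standing in for A and B (the tags and counts are made up; the
| printed interleaving differs from run to run):
|
|     #include <pthread.h>
|     #include <stdio.h>
|
|     static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
|
|     static void *producer(void *tag)
|     {
|         for (int i = 1; i <= 3; i++) {
|             pthread_mutex_lock(&lk);
|             /* stand-in for "enqueue, then consume" */
|             printf("%s%d ", (const char *)tag, i);
|             pthread_mutex_unlock(&lk);
|         }
|         return NULL;
|     }
|
|     int main(void)
|     {
|         pthread_t a, b;
|         pthread_create(&a, NULL, producer, "A");
|         pthread_create(&b, NULL, producer, "B");
|         pthread_join(a, NULL);
|         pthread_join(b, NULL);
|         printf("\n");   /* e.g. "A1 B1 B2 A2 A3 B3" */
|         return 0;
|     }
|
| Per-producer order survives, but the A/B interleaving is up to
| the scheduler.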
|
| ---------
|
| > 100k events per second is paltry compared to what is
| possible if you take advantage of pipelining and cache in
| modern CPUs.
|
| I agree. But 100k events per second is more than enough for a
| great number of tasks. There's a degree of simplicity to
| fork/join or pthread_create / pthread_joins. Especially if
| you avoid async-style code.
|
| Keeping all your logic together (instead of spread out across
| a variety of async functions) can be beneficial. Sure, it's
| suboptimal, but you don't want to make things harder on
| yourself unless you actually need that performance.
| bob1029 wrote:
| > I agree. But 100k events per second is more than enough
| for a great number of tasks. There's a degree of simplicity
| to fork/join or pthread_create / pthread_joins. Especially
| if you avoid async-style code.
|
| 100%. There is certainly an additional complexity cost to be
| paid if you want (or need) to go beyond seven-figure
| events-per-second numbers. Redefining a problem to fit a ring
| buffer is a lot harder than throwing basic threading
| primitives at it.
___________________________________________________________________
(page generated 2021-10-20 23:02 UTC)