[HN Gopher] Latency Exists, Cope (2007)
       ___________________________________________________________________
        
       Latency Exists, Cope (2007)
        
       Author : sunainapai
       Score  : 22 points
       Date   : 2021-10-18 04:26 UTC (2 days ago)
        
 (HTM) web link (web.archive.org)
 (TXT) w3m dump (web.archive.org)
        
       | spsesk117 wrote:
        | I definitely agree with the line of thinking posited; what I'm
        | less clear on is what the concrete implementation of these ideas
        | looks like.
       | 
        | I once worked at a company with a huge monolith system, mainly
        | revolving around a relational database. We tried for years to
        | break out of this, and if my understanding is correct the
        | organization is still on that monolith today. There were a number
        | of challenges, and we tried a number of different solutions
        | (NoSQL models/object stores/etc.), but there were base-level
        | assumptions about the availability of data in the core
        | application that felt impossible to address without a full-scale
        | rewrite and a reevaluation of all previous assumptions.
       | 
       | Perhaps I've answered my own question here -- it just needed to
       | be completely redesigned from the ground up. Short of doing that
       | however, would anyone care to provide high level insight on how
       | they'd break down this problem, and what technology they might
       | use to address it?
        
         | dwheeler wrote:
         | "Monolith" is not a problem. It's just a description of an
         | architectural approach. Often switching to alternatives, like
         | microservices, is a terrible idea:
         | https://medium.com/swlh/stop-you-dont-need-microservices-dc7...
         | 
         | The question is: What do you actually need to do? I.e., what
         | are your requirements?
         | 
          | It's all about trade-offs. If you can't identify at least one
          | pro and one con of an approach you're considering, you don't
          | adequately understand it.
        
         | dragontamer wrote:
          | What's wrong with the monolith? If that monolith was designed
          | 30 years ago on RAIDed hard drives, upgrading to an all-SSD,
          | 4 TB RAM system today would probably solve all of those issues.
          | 
          | Really: is the monolith problem so difficult that a $400,000
          | computer can't solve it? Is there any developer project you
          | can fund for, say, $1 million ($400k for the computer, $600k
          | for IT time to transfer the data to the new computer) that
          | would give the same return on investment?
          | 
          | $400k gets you RAIDed PCIe 4.0 SSDs, 4 TB of RAM, and a
          | 128-core dual-socket EPYC, or something like that. That
          | __probably__ can run your monolith.
        
       | Animats wrote:
        | The author clearly hasn't talked to game developers. Their users
        | care about latency and will scream about 30 ms delays. So game
        | developers obsess over this.
       | 
       | Much of this involves getting things off the critical path.
       | Background updating isn't as time-sensitive.
        
       | dragontamer wrote:
       | Latency exists, but no one cares about it until it crosses over a
       | critical threshold!!
       | 
        | Throughput is the number most people care about. As long as
        | latency remains "below the critical threshold" (which is
        | application-dependent), throughput remains the more important
        | figure.
       | 
       | As such: we can perform latency/throughput tradeoffs in practice.
       | Maybe even latency/throughput/simplicity tradeoffs.
       | 
        | * The absolute lowest latency is a single-threaded system that
        | blocks on everything. Wait until X is ready and immediately start
        | doing Y. This happens to be very simple and elegant code in
        | practice, but it can only do one thing at a time.
       | 
       | * However, people want more throughput. You can use pthreads, or
       | golang / fibers / cooperative threads, to convert the "single
       | threaded" code into higher-throughput code. (Get the CPU to "work
       | on something else" while waiting for X). This makes latency
       | worse, but increases throughput dramatically. Multi-core
       | accelerates this pattern very naturally.
       | 
        | * For the highest levels of throughput and the lowest latency,
        | you need an event-driven loop. Yes, the GetMessageW() loop in
        | Win32, game loops in video games, poll() in Linux, and the like.
        | This is a bit difficult to use in practice, so people use async
        | to help decompose the "big loop". Generalizing to multicore is
        | difficult, however. But virtually every "high performance"
        | system I've ever seen comes down to some glorified event loop /
        | poll / epoll / async / GetMessageW() pattern. Literally all of
        | the ones I've ever seen.
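        | 
        | A minimal sketch of that pattern, in C, using poll() on Linux.
        | The handle_event() callback here is a hypothetical stand-in for
        | whatever dispatch logic the application actually needs:
        | 
        |     #include <poll.h>
        |     #include <unistd.h>
        |     
        |     /* Hypothetical per-fd handler: drain whatever arrived.
        |        A real handler would act on the data. */
        |     static void handle_event(int fd) {
        |         char buf[4096];
        |         read(fd, buf, sizeof buf);
        |     }
        |     
        |     /* The "big loop": block in poll() until any fd is ready,
        |        dispatch every ready fd, then go back to sleep. */
        |     static void event_loop(struct pollfd *fds, nfds_t nfds) {
        |         for (;;) {
        |             if (poll(fds, nfds, -1) < 0)  /* -1: wait forever */
        |                 break;  /* real code would also handle EINTR */
        |             for (nfds_t i = 0; i < nfds; i++)
        |                 if (fds[i].revents & POLLIN)
        |                     handle_event(fds[i].fd);
        |         }
        |     }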
       | 
       | -------
       | 
        | I'd say: work on throughput until latency becomes an issue.
        | pthreads / fibers are, IMO, the easiest tool to reach for, and
        | they have adequate performance up to ~100,000 events per second
        | or so.
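        | 
        | For what that's worth, a minimal sketch of the fork/join pthread
        | pattern in C (the worker function and the thread count are
        | hypothetical placeholders):
        | 
        |     #include <pthread.h>
        |     #include <stdio.h>
        |     
        |     /* Hypothetical unit of work: each thread blocks on its own
        |        work/IO without stalling the others. */
        |     static void *worker(void *arg) {
        |         int id = *(int *)arg;
        |         printf("worker %d: doing blocking work\n", id);
        |         return NULL;
        |     }
        |     
        |     int main(void) {
        |         pthread_t threads[4];
        |         int ids[4];
        |     
        |         /* Fork: one blocking, single-threaded-style worker each. */
        |         for (int i = 0; i < 4; i++) {
        |             ids[i] = i;
        |             pthread_create(&threads[i], NULL, worker, &ids[i]);
        |         }
        |         /* Join: wait for all of them, then carry on. */
        |         for (int i = 0; i < 4; i++)
        |             pthread_join(threads[i], NULL);
        |         return 0;
        |     }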
        
         | bob1029 wrote:
         | > However, people want more throughput.
         | 
          | This is where the dragons enter the room. 9/10 times, the
          | thing the business wants to make go faster is a serialized
          | narrative of events that has to occur in a certain order.
         | 
          | For problems that do not demand a global serialized narrative
          | of events, we can certainly throw Parallel.ForEach at them and
          | call it a day.
         | 
         | For everything else, you want a ring buffer with spinwait &
         | (micro)batch processing. 100k events per second is paltry
         | compared to what is possible if you take advantage of
          | pipelining and cache in modern CPUs. I have personally tested
          | practical code paths that can handle ~14 million events per
          | second _with_ persistence to disk using only 1 thread the
         | entire time. Impractical academic test cases can get to upwards
         | of half a billion events per second on a single thread:
         | 
         | https://medium.com/@ocoanet/improving-net-disruptor-performa...
         | 
         | You would think low latency and "batch" processing would not go
         | hand-in-hand, but they certainly do when you are dealing with
         | things at scale and need to take advantage of aggregate effects
         | across all users. This technique also has the effect of
         | substantially reducing jitter. It's amazing what is possible if
         | you can keep everything warmed up.
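          | 
          | As a rough illustration (not the Disruptor itself), here is a
          | single-producer / single-consumer ring buffer sketch in C
          | where the consumer spin-waits, then drains everything that is
          | already available as one micro-batch. struct event and the
          | handle() callback are hypothetical placeholders:
          | 
          |     #include <stdatomic.h>
          |     #include <stddef.h>
          |     
          |     #define RING_SIZE 1024  /* power of two, so masking works */
          |     
          |     struct event { long payload; };
          |     
          |     static struct event ring[RING_SIZE];
          |     static _Atomic size_t head;  /* next slot producer writes */
          |     static _Atomic size_t tail;  /* next slot consumer reads  */
          |     
          |     /* Single producer: spin while the ring is full, then
          |        publish one event. */
          |     void publish(struct event e) {
          |         size_t h = atomic_load(&head);
          |         while (h - atomic_load(&tail) == RING_SIZE)
          |             ;  /* spin-wait: ring is full */
          |         ring[h & (RING_SIZE - 1)] = e;
          |         atomic_store(&head, h + 1);
          |     }
          |     
          |     /* Single consumer: spin until something is published,
          |        process the whole available batch, then advance the
          |        consumer cursor once for the entire batch. */
          |     void consume_loop(void (*handle)(struct event *)) {
          |         for (;;) {
          |             size_t t = atomic_load(&tail);
          |             size_t h;
          |             while ((h = atomic_load(&head)) == t)
          |                 ;  /* spin-wait: nothing published yet */
          |             for (size_t i = t; i != h; i++)
          |                 handle(&ring[i & (RING_SIZE - 1)]);
          |             atomic_store(&tail, h);
          |         }
          |     }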
        
           | dragontamer wrote:
           | > For everything else, you want a ring buffer with spinwait &
           | (micro)batch processing.
           | 
            | Note: ordering is only globally defined with single-consumer
            | / single-producer. Which... probably should just be a
            | function call in most situations. (Notable exception: the
            | consumer wants to stay "hot" in consumer code, and the
            | producer wants to stay "hot" in producer code. Two threads,
            | one for the producer, one for the consumer.)
           | 
            | Even if you only go to single-consumer / multi-producer, the
            | order is no longer fully defined. Ex: Producer A creates A1,
            | A2, A3, A4. Producer B creates B1, B2, B3.
           | 
            | Single-consumer/multi-producer can consume in many orders:
            | A1, B1, A2, B2, A3, B3, A4, for example. Or A1, A2, A3, A4,
            | B1, B2, B3 is also valid. Or maybe B1, B2, B3, A1, A2, A3,
            | A4.
           | 
            | Multi-consumer/multi-producer is even "better". Multi-
            | consumer means that execution can complete in the order A4,
            | A3, A2, A1, B3, B2, B1. (Let's say you have four consumers,
            | all working on A1, A2, A3, and A4. Consumer1 takes A1, but
            | Consumer1 is slower because its L1 cache wasn't warm, and
            | Consumer4 had its L1 cache warmed up just right. That means
            | Consumer4, executing A4, will finish first.)
           | 
            | As such: a sequentially-consistent spinwait is good, but
            | suboptimal. The answer is to allow __race conditions__ to
            | pick the order, because no one cares about ordering anymore
            | if they're asking for multi-consumer/multi-producer.
           | 
           | ---------
           | 
           | > 100k events per second is paltry compared to what is
           | possible if you take advantage of pipelining and cache in
           | modern CPUs.
           | 
           | I agree. But 100k events per second is more than enough for a
           | great number of tasks. There's a degree of simplicity to
           | fork/join or pthread_create / pthread_joins. Especially if
           | you avoid async-style code.
           | 
            | Keeping all your logic together (instead of spread out across
            | a variety of async functions) can be beneficial. Sure, it's
            | suboptimal, but you don't want to make things harder on
            | yourself unless you actually need that performance.
        
             | bob1029 wrote:
             | > I agree. But 100k events per second is more than enough
             | for a great number of tasks. There's a degree of simplicity
             | to fork/join or pthread_create / pthread_joins. Especially
             | if you avoid async-style code.
             | 
              | 100%. There is certainly an additional complexity cost to
              | be paid if you want (or need) to go beyond
              | seven-figure-per-second numbers. Redefining a problem to
              | fit a ring buffer is a lot harder than throwing basic
              | threading primitives at it.
        
       ___________________________________________________________________
       (page generated 2021-10-20 23:02 UTC)