[HN Gopher] Are efficiency and horizontal scalability at odds?
       ___________________________________________________________________
        
       Are efficiency and horizontal scalability at odds?
        
       Author : todsacerdoti
       Score  : 30 points
       Date   : 2025-02-12 18:27 UTC (4 hours ago)
        
 (HTM) web link (buttondown.com)
 (TXT) w3m dump (buttondown.com)
        
       | xzyyyz wrote:
        | not convincing. (horizontal) scalability comes at a cost, but it
        | changes the size of the problem we can handle considerably.
        
       | datadrivenangel wrote:
       | "The downside is that for the past couple of decades computers
       | haven't gotten much faster, except in ways that require recoding
       | (like GPUs and multicore)."
       | 
        | This is false? Computers have gotten a lot faster, even if the
        | clock speed is not that much higher. A single modern CPU core
        | turboing at ~5 GHz is going to be significantly faster than a
        | 20-year-old CPU overclocked to ~4.5 GHz.
        
         | jeffbee wrote:
          | Yeah, that detail sinks the rest of it. Even if we restrict
          | ourselves to datacenter CPUs, where the market preference has
          | been for more cores operating at the same ~2400 MHz speed for a
          | long time, what you get for 1 CPU-second these days is
          | ridiculous compared to what you could have gotten 20 years ago.
          | We're talking about NetBurst Xeons as a baseline.
        
         | gopalv wrote:
         | > Computers have gotten a lot faster, even if the clock speed
         | is not that much faster
         | 
          | We're not stagnating, but the same code I thought was too slow
          | in 1998 was good enough in 2008, which is probably not true for
          | code I would've thrown away in 2015.
          | 
          | The only place where that has happened in the last decade is
          | IOPS - old IOPS-heavy code which would have been rewritten with
          | group-commit tricks is probably slower than a naive
          | implementation that fsync'd all the time. A 2015 first cut of
          | IO code probably beats the spinning-disk-optimized version from
          | the same year on modern hardware.
         | 
          | The clock-speed comment is totally on the money though - a lot
          | of the clocks were spent waiting on memory latencies, and those
          | have improved significantly over the years, particularly with
          | Apple Silicon-style on-package memory, which is physically
          | closer (a tighter light cone) than the DIMMs of the past.
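The group-commit trick mentioned above can be sketched as follows (a minimal illustration; the class and parameter names are hypothetical, not from any real library):

```python
import os

# Minimal sketch of "group commit": amortize one fsync over a batch of
# writes instead of paying a durability barrier per record. On spinning
# disks this was a huge win; on modern NVMe the naive fsync-per-write
# version is far less painful. All names here are illustrative.
class GroupCommitLog:
    def __init__(self, path, batch_size=8):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.pending = 0

    def append(self, record: bytes):
        self.f.write(record + b"\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        self.f.flush()
        os.fsync(self.f.fileno())  # one barrier covers the whole batch
        self.pending = 0

    def close(self):
        self.flush()
        self.f.close()
```

The point of the batching is that durability cost is paid once per `batch_size` records rather than once per record; on hardware where fsync is cheap, the naive version closes much of that gap.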
        
           | Legend2440 wrote:
           | A lot of clocks are _still_ spent waiting for memory. GPUs in
           | particular are limited by memory bandwidth despite a memory
           | bus that runs at terabytes per second.
           | 
           | Back when I started programming, it was reasonable to
           | precompute lookup tables for multiplications and trig
           | functions. Now you'd never do that - it's far cheaper to
           | recompute it than to look it up from memory.
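For reference, the precomputed-table approach described above looks like this (a sketch; the table size and nearest-entry indexing are arbitrary illustrative choices):

```python
import math

# Precomputed sine table of the kind that made sense when a memory read
# was cheaper than evaluating sin() -- today recomputing usually wins.
# Table size (1024) and nearest-entry lookup are illustrative choices.
TABLE_SIZE = 1024
SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def sin_lut(x: float) -> float:
    """Approximate sin(x) with a single table read (no interpolation)."""
    idx = round(x / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SIN_TABLE[idx]
```

The trade is accuracy and cache pressure for a single memory access; with 1024 entries the nearest-entry error stays on the order of a few thousandths.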
        
         | paulsutter wrote:
         | Could you share some numbers on this? Lots of folks would be
         | interested I'm sure
        
         | PaulKeeble wrote:
          | An Intel 12900K (Gen 12) compared to a 2600K (Gen 2, launched
          | 2011) is about 120% faster - a bit over 2x - in single-threaded
          | applications. Those +5-15% uplifts every generation add up over
          | time, but it's nothing like the earlier years, when performance
          | might double in a single generation.
          | 
          | It really depends whether the application uses AES-256 and
          | other modern instructions. The 12900K has 16 cores vs the
          | 2600K's 4, although 8 of those extra cores are E-cores. This
          | performance increase doesn't necessarily come for free, given
          | the application may need to be adjusted to utilise those extra
          | cores - especially when half of them are slower - to ensure the
          | workload is distributed properly.
          | 
          | Even with vertical scaling by buying a new processor for purely
          | single-threaded applications, it's interesting that much of the
          | big benefit comes from targeting the new instructions and then
          | the new cores - both of which may require source updates to get
          | a significant performance uplift.
         | 
         | https://www.cpu-monkey.com/en/compare_cpu-intel_core_i7_1270...
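As a quick check of the compounding arithmetic behind those per-generation uplifts (illustrative; "ten steps" assumes one uplift per generation from Gen 2 to Gen 12):

```python
# A bit over 2x single-threaded performance across the ten generational
# steps from Gen 2 to Gen 12 averages out to roughly 8% per generation,
# squarely inside the +5-15% range quoted above (illustrative arithmetic).
total_speedup = 2.2   # "about 120% faster"
steps = 10            # Gen 2 -> Gen 12
per_gen = total_speedup ** (1 / steps) - 1
print(f"average uplift per generation: {per_gen:.1%}")
```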
        
           | einpoklum wrote:
           | > is about 120% faster or a bit over 2 times in single
           | threaded applications
           | 
           | 1. Doesn't that also account for speedups in memory and I/O?
           | 
           | 2. Even if the app is single-threaded, the OS isn't, so
           | unless it's very very inactive other than the foreground
           | application (which is possible), there might still be an
           | effect of the higher core count.
        
             | jaggederest wrote:
             | Unless you're multitasking, the OS on a separate thread
             | gets you about 5-10% speedup. It's not really noteworthy.
             | 
             | Unless you lived through the 1990s I don't think you
             | understand how fast things were improving. Routine doubling
              | of scores every 18 months is an insane thing. In 1990 the
              | state of the art was 8 MHz chips. By 2002, the state of
              | the art was a 5 GHz chip. So almost a thousand times
              | faster in a decade.
             | 
             | Are chips now a thousand times faster than they were in
             | 2015? No they are not.
        
               | sidewndr46 wrote:
               | What does "the OS on a separate thread" mean? I'm also
               | not aware of any consumer chips running 5 GHz in 2002
        
             | no_wizard wrote:
              | Funnily enough, most apps aren't taking enough advantage
              | of the multi-core, multi-threaded environments that are
              | common across all major platforms.
              | 
              | The single biggest bottleneck to improvement is the
              | general lack of developers using the APIs to the fullest
              | extent when designing applications. It's not really
              | hardware anymore.
             | 
             | Though, to the points being made, we aren't seeing the 18
             | month doubling like we did in the earlier decades of
             | computing.
        
         | bee_rider wrote:
         | I think it is often the case that people want to describe the
         | problem as "single core performance has stagnated for decades"
         | because it makes it look like their solution is _necessary to
         | make any progress at all_.
         | 
         | Actually, single core performance has been improving. Not as
         | fast as it was in the 90's maybe, but it is improving.
         | 
          | However, we can speed things up even more by using multiple
          | computers. And it is a really interesting problem where you get
          | to worry about all sorts of fun things, like hiding MPI
          | communication behind compute.
         | 
         | Nobody wants to say "I have found that if I can make an already
         | fast process even faster by putting in a lot of effort, which I
         | will do because my job is actually really fun." Technical jobs
         | are supposed to be stressful and serious. The world is doomed
         | and science will stop... unless I come up with a magic trick!
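Hiding communication behind compute reduces to a simple pattern: start the transfer, do useful local work, then wait. A generic sketch, with a thread pool standing in for MPI's nonblocking Isend/Wait and a sleep standing in for the network (all names are made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Generic sketch of overlapping communication with computation, the idea
# behind nonblocking MPI calls: kick off the transfer, do useful local
# work, then block on completion. A sleep stands in for the network.
def send_halo(halo):
    time.sleep(0.05)              # pretend network transfer
    return len(halo)              # "items sent" (illustrative)

def overlap_step(halo, interior):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(send_halo, halo)      # nonblocking "Isend"
        local = sum(x * x for x in interior)    # compute while it flies
        sent = fut.result()                     # "Wait": transfer done
    return local, sent
```

When the compute takes at least as long as the transfer, the communication cost disappears from the critical path entirely.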
        
           | Legend2440 wrote:
           | Single-core performance looks pretty stagnant on this graph,
           | especially in the last ten years: https://imgur.com/DrOvPZt
           | 
           | Transistor count has continued to increase exponentially, but
           | single-threaded performance has improved slowly and appears
           | to be leveling off. We may never get another 100x or even 10x
           | improvement in single-threaded performance.
           | 
           | It is going to be necessary to parallelize to see gains in
           | the future.
        
             | achierius wrote:
             | But it's not flat? 10% growth a year is still growth.
        
         | Ygg2 wrote:
         | > This is false? Computers have gotten a lot faster
         | 
          | Depends what you mean by "much". Single-threaded performance
          | is no longer 2x as fast after a year. I mean, even in the GPU
          | sector, you get graphics that look slightly better for 2-4x
          | the cost (see street prices of the 2080 vs 3080 vs 4080).
          | 
          | Computing has hit the point of diminishing returns; exponential
          | growth for linear prices is no longer possible.
        
         | foota wrote:
         | I think this is meant to be read as, "over the past decade, you
         | haven't been able to wait a year and buy a new CPU to solve
         | your vertical scalability issues.", not necessarily to claim
         | that there hasn't been significant growth when compared over
         | the entire window.
        
       | jeeyoungk wrote:
        | DuckDB would've been a good example to include, because it tries
        | to eliminate the need for horizontal scalability with an
        | efficient single-node implementation. If your use case stays
        | below the point where horizontal scalability becomes necessary
        | (which, in the modern world, a mixture of clever implementation
        | and crazy powerful computers often allows), then you can tackle
        | quite a large workload.
        
         | memhole wrote:
         | And even then you have things like this:
         | 
         | https://www.boilingdata.com/
        
       | awkward wrote:
       | I suppose if you're doing one you're not doing the other - the
       | promise of future horizontal scale definitely justifies a lot of
       | arguments about premature optimization.
       | 
       | However, they aren't necessarily opposed. Optimization is usually
       | subtractive - it's slicing parts off the total runtime.
       | Horizontal scale is multiplicative - you're doing the same thing
       | more times. Outside some very specific limits, usually efficiency
       | means horizontal scaling is more effective. A slightly shorter
       | runtime many times over means a much shorter runtime.
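A back-of-the-envelope version of that multiplicative point (all numbers made up):

```python
# If optimization trims each task from 1.0s to 0.9s, the saving
# multiplies across every task and survives scale-out unchanged: with
# perfect horizontal scaling you still finish 10% sooner.
tasks, workers = 10_000, 100
naive_wall = tasks * 1.0 / workers        # 100.0s of wall-clock
optimized_wall = tasks * 0.9 / workers    # 90.0s of wall-clock
print(optimized_wall / naive_wall)        # the 10% saving is preserved
```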
        
       | Joel_Mckay wrote:
       | Depends what you are optimizing, and whether your design uses
       | application layer implicit load-balancing. Thus, avoiding
       | constraints within the design patterns before they hit the
       | routers can often reduce traditional design cost by 37 times or
       | more.
       | 
       | YMMV, depends if your data stream states are truly separable. =3
        
       | einpoklum wrote:
       | I'd say they're not fundamentally at odds, but they're at odds
       | with a "greedy approach". That is, it is much easier to scale out
       | when you're willing to make constraining assumptions about your
       | program; and willing to pay a lot of overhead for distributed
       | resource management, migrating pieces of work etc. If you want to
       | scale while maintaining efficiency, you have to be aware of more
       | things about the work that's being distributed; you have to
       | struggle much harder to avoid different kinds of overhead and
       | idleness; and if you really want to go the extra mile you need to
       | think of how to turn the distribution partially to your _benefit_
       | (example: Using the overhead you pay for fault-tolerance or high-
       | availability by storing copies of your data in different formats,
       | allowing different computations to prefer one format over the
       | other; while on a single machine you wouldn't even have the extra
       | copies).
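The dual-format replica idea in that example can be sketched like so (a toy illustration; the data and field names are made up):

```python
# Toy sketch of keeping the fault-tolerance replica in a different
# layout: point lookups hit the row-oriented copy, aggregates hit the
# column-oriented copy. Data and field names are made up.
rows = [{"id": 1, "price": 10.0}, {"id": 2, "price": 12.5}]  # row replica
cols = {"id": [1, 2], "price": [10.0, 12.5]}                 # column replica

def lookup(rec_id):
    # A point query prefers the row copy: one record, all fields together.
    return next(r for r in rows if r["id"] == rec_id)

def total_price():
    # A scan/aggregate prefers the column copy: one contiguous field.
    return sum(cols["price"])
```

On a single machine you would keep only one copy; distribution forces the redundancy anyway, so each query routing to the cheaper layout turns that overhead partially into a benefit.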
        
       ___________________________________________________________________
       (page generated 2025-02-12 23:00 UTC)