[HN Gopher] AMD Disables Zen 4's Loop Buffer
       ___________________________________________________________________
        
       AMD Disables Zen 4's Loop Buffer
        
       Author : luyu_wu
       Score  : 65 points
       Date   : 2024-11-30 20:47 UTC (2 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | syntaxing wrote:
       | Interesting read. One thing I don't understand is how much space
       | the loop buffer takes on the die. With it disabled, could future
       | chips use that space for something more useful, like a bigger L2
       | cache?
        
         | progbits wrote:
         | It says 144 micro-op entries per core. Not sure how many bytes
         | that is, but L2 caches these days are around 1MB per core, so
         | assuming the loop buffer die space is mostly storage (sounds
         | like it) then it wouldn't make a notable difference.
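
As a rough check on the storage estimate above, here is a minimal sketch that sizes 144 entries against a 1 MiB L2. The entry widths are hypothetical, since the thread does not give a micro-op size; only the 144-entry count and the roughly 1 MB L2 figure come from the comments.

```python
# Rough size estimate for a 144-entry loop buffer versus a 1 MiB L2.
# ASSUMPTION: the per-entry widths tried below are hypothetical; the
# thread does not state how many bits one micro-op entry occupies.
LOOP_BUFFER_ENTRIES = 144        # per core, from the comment above
L2_BYTES = 1 * 1024 * 1024       # "around 1MB per core", from the comment above


def loop_buffer_bytes(bits_per_entry: int) -> int:
    """Storage needed for all entries at a given (assumed) entry width."""
    return LOOP_BUFFER_ENTRIES * bits_per_entry // 8


for bits in (64, 128, 256):      # hypothetical entry widths
    size = loop_buffer_bytes(bits)
    print(f"{bits:3d} bits/entry: {size:5d} bytes = {size / L2_BYTES:.2%} of L2")
```

Even at the widest assumed entry, the result stays under half a percent of the L2 capacity, which is consistent with the comment's conclusion. The sketch counts storage bits only and ignores control logic and wiring.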
        
         | Remnant44 wrote:
         | My understanding is that it's a pretty small optimization on
         | the front end. It doesn't have a lot of entries to begin with
         | (144) so the amount of space saved is probably negligible.
         | Theoretically, the loop buffer would let you save power or
         | improve performance in a tight loop. In practice, it doesn't
         | seem to do either, and AMD removed it completely for Zen 5.
        
         | akira2501 wrote:
         | I think most modern chips are routing constrained and not
         | floorspace constrained. You can build tons of features but
         | getting them all power and normalized signals is an absolute
         | chore.
        
       | eqvinox wrote:
       | > Strangely, the game sees a 5% performance loss with the loop
       | buffer disabled when pinned to the non-VCache die. I have no
       | explanation for this, [...]
       | 
       | With more detailed power measurements, it might be possible to
       | determine whether this is thermal/power budget related. It does
       | sound like the feature was intended to conserve power...
        
       | Pannoniae wrote:
       | From another article:
       | 
       | "Both the fetch+decode and op cache pipelines can be active at
       | the same time, and both feed into the in-order micro-op queue.
       | Zen 4 could use its micro-op queue as a loop buffer, but Zen 5
       | does not. I asked why the loop buffer was gone in Zen 5 in side
       | conversations. They quickly pointed out that the loop buffer
       | wasn't deleted. Rather, Zen 5's frontend was a new design and the
       | loop buffer never got added back. As to why, they said the loop
       | buffer was primarily a power optimization. It could help IPC in
       | some cases, but the primary goal was to let Zen 4 shut off much
       | of the frontend in small loops. Adding any feature has an
       | engineering cost, which has to be balanced against potential
       | benefits. Just as with having dual decode clusters service a
       | single thread, whether the loop buffer was worth engineer time
       | was apparently "no"."
        
       | londons_explore wrote:
       | The article seems to suggest that the loop buffer provides no
       | performance benefit and no power benefit.
       | 
       | If so, it might be a classic case of "Team of engineers spent
       | months working on new shiny feature which turned out to not
       | actually have any benefit, but was shipped anyway, possibly so
       | someone could save face".
       | 
       | I see this in software teams when someone suggests it's time to
       | rewrite the codebase to get rid of legacy bloat and increase
       | performance. Yet, when the project is done, there are more lines
       | of code and performance is worse.
       | 
       | In both cases, the project shouldn't have shipped.
        
         | adgjlsfhk1 wrote:
         | > but was shipped anyway, possibly so someone could save face
         | 
         | No. Once the core has it and you realize it doesn't help much,
         | it absolutely is a risk to remove it.
        
         | akira2501 wrote:
         | > but was shipped anyway, possibly so someone could save face
         | 
         | It was shipped anyway because it can be disabled with a firmware
         | update and because drastically altering physical hardware
         | layouts mid-design was likely to have worse impacts.
        
       | londons_explore wrote:
       | In the "power" section, it seems the analysis doesn't divide by
       | the number of instructions executed per second.
       | 
       | Energy used per instruction is almost certainly the metric that
       | should be considered to see the benefits of this loop buffer, not
       | energy used per second (power, watts).
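
The energy-per-instruction metric suggested above can be sketched as power divided by instruction throughput. The function name and all numbers below are hypothetical and chosen only for illustration; they are not measurements from the article.

```python
def energy_per_instruction_nj(power_watts: float, ipc: float, clock_ghz: float) -> float:
    """Energy per instruction in nanojoules.

    Joules per instruction = watts / (instructions per second), where
    instructions per second = IPC * clock frequency.
    """
    instructions_per_second = ipc * clock_ghz * 1e9
    return power_watts / instructions_per_second * 1e9


# ASSUMPTION: made-up inputs. Same core power in both cases, slightly
# lower IPC in the second case.
case_a = energy_per_instruction_nj(power_watts=10.0, ipc=4.0, clock_ghz=5.0)
case_b = energy_per_instruction_nj(power_watts=10.0, ipc=3.8, clock_ghz=5.0)

print(f"case A: {case_a:.3f} nJ/instruction")
print(f"case B: {case_b:.3f} nJ/instruction")
```

With these inputs the two cases draw identical power, yet case B spends more energy on each instruction because it retires fewer of them per second. That is the kind of difference a watts-only comparison would hide.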
        
       | rasz wrote:
       | Anecdotally, one of the very few differences between the 1979
       | 68000 and the 1982 68010 was the addition of "loop mode", a
       | 6-byte loop buffer :)
        
       ___________________________________________________________________
       (page generated 2024-11-30 23:00 UTC)