[HN Gopher] It's Not Always iCache
       ___________________________________________________________________
        
       It's Not Always iCache
        
       Author : tjalfi
       Score  : 61 points
       Date   : 2021-07-12 14:15 UTC (8 hours ago)
        
 (HTM) web link (matklad.github.io)
 (TXT) w3m dump (matklad.github.io)
        
       | [deleted]
        
       | yepthatsreality wrote:
       | Off topic, but shouldn't it be ICache or Icache rather than
       | iCache? This seems like a misnomer from too much Apple product
       | consumption/dogma.
       | 
       | I could only find references to ICache or ICACHE via a quick
       | search.
        
         | gpderetta wrote:
         | I$ is the usual cute shorthand.
        
           | Espressosaurus wrote:
           | the point being that lower-case i is an Apple-ism, while the
           | correct casing is upper-case I (I've also seen all lower case
           | when used informally, e.g. icache, iram, etc.)
        
         | ndesaulniers wrote:
         | I have trouble distinguishing between uppercase i vs lowercase
         | L in my system fonts, and am too lazy to change them, so I
         | appreciate the author using iCache for readability.
        
       | gpderetta wrote:
       | Interesting article.
       | 
       | Because of branch prediction and deep instruction windows, the
       | instruction prefetcher is significantly more effective than the
       | data prefetcher.
       | 
       | There are of course second order effects of code bloat: L2 for
       | example is typically shared with data and wasting more of it for
       | code can have negative effects.
        
       | steerablesafe wrote:
       | One optimization I never saw is to adjust the stack within the
       | function, other than at the beginning and end. Because of this,
       | inlining can significantly blow up stack space.
       | 
       | https://godbolt.org/z/Gaa4MEMnK
        
         | neerajsi wrote:
         | The optimization exists in MSVC, LLVM, and GCC. It's called
         | "shrink wrapping". MSVC may do it more aggressively when
         | profile information is available.
         | 
         | See https://github.com/gcc-mirror/gcc/blob/master/gcc/shrink-
         | wra... or https://llvm.org/doxygen/ShrinkWrap_8cpp_source.html.
        
           | leni536 wrote:
           | None of gcc, clang or msvc deallocate the frame or part of
           | the frame in `baz2` before calling `bar` the first time here:
           | 
           | https://godbolt.org/z/rvx36sM74
           | 
           | What I would like to see is something like the generated
           | call of `call_foo1` substituted verbatim into the generated
           | code of `baz1` in the place of the function call. That way at
           | the point of calling `bar` there is much less stack space
           | allocated, minimizing stack usage.
           | 
           | But maybe this would pessimize other things, or for some
           | weird reason is actually incorrect.
           | 
           | gcc at least deallocates the stack before tail-calling `baz`,
           | I don't know if that is "shrink wrapping" or just plain TCO.
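
A toy model of the stack arithmetic in the comment above, written in Python for brevity. Only the names `baz1`, `call_foo1`, and `bar` come from the comment; the frame sizes are invented for illustration and are not taken from the godbolt link. It assumes that an inlined callee's locals are folded into the caller's frame, which is allocated once in the prologue and freed in the epilogue.

```python
# Toy model: peak stack bytes reached by a caller that first uses a big
# temporary buffer (through call_foo1) and then calls bar.
# All sizes below are hypothetical.

def peak_not_inlined(caller_frame, foo_frame, bar_peak):
    """call_foo1 is a real call, so its frame is popped before bar runs."""
    during_foo = caller_frame + foo_frame
    during_bar = caller_frame + bar_peak
    return max(during_foo, during_bar)

def peak_inlined_single_frame(caller_frame, foo_frame, bar_peak):
    """call_foo1 is inlined and its buffer stays allocated while bar runs."""
    merged_frame = caller_frame + foo_frame
    return merged_frame + bar_peak

CALLER_FRAME = 32   # hypothetical locals of baz1
FOO_FRAME = 4096    # hypothetical buffer used by call_foo1
BAR_PEAK = 8192     # hypothetical peak stack used by bar and its callees

separate = peak_not_inlined(CALLER_FRAME, FOO_FRAME, BAR_PEAK)
merged = peak_inlined_single_frame(CALLER_FRAME, FOO_FRAME, BAR_PEAK)
print(separate, merged)
```

In this model the inlined version pays for the buffer and for `bar` at the same time, which is the extra cost that releasing part of the frame before the call would avoid.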
        
             | ndesaulniers wrote:
             | You can get into weird cases with DWARF-based unwinders
             | when two paths through a function have different stack
             | depths, which makes it impossible to reliably unwind.
             | 
             | LLVM has bugs in this regard with calls to variadic
             | functions, since after a certain number of arguments have
             | been passed in registers, you start pushing parameters on
             | the stack (ABI dependent).
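
To make the register-versus-stack point concrete, here is a small Python sketch for one specific ABI, System V x86-64, restricted to integer and pointer arguments. It is a simplification: floating-point arguments, structs, and the extra rules for variadic calls are ignored, and the helper name `place_int_args` is made up for this example.

```python
# System V x86-64: the first six integer/pointer arguments are passed in
# these registers, in order; any further ones are passed on the stack.
INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def place_int_args(n_args):
    """Return where each of n_args integer arguments is passed.

    Stack slots are given relative to rsp at function entry, where
    [rsp] holds the return address and each slot is 8 bytes wide.
    """
    places = []
    for i in range(n_args):
        if i < len(INT_ARG_REGS):
            places.append(INT_ARG_REGS[i])
        else:
            offset = 8 + 8 * (i - len(INT_ARG_REGS))
            places.append(f"[rsp+{offset}]")
    return places

print(place_int_args(8))
```

A call with more than six such arguments therefore changes the stack depth at the call site, which is the kind of case an unwinder has to track.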
        
       | electricshampo1 wrote:
       | Very often people are looking at icache misses instead of
       | something more precise when assessing perf effects due to code
       | size/layout, etc. That more precise thing is frontend stalls: you
       | only care about misses when they cause stalls; otherwise they are
       | overlapped with actual work being done by the execution units.
       | 
       | You can measure frontend stalls on many recent Intel chips with
       | 
       | IDQ_UOPS_NOT_DELIVERED.CORE
       | 
       | https://perfmon-events.intel.com/
       | 
       | Neoverse N1 from Arm has STALL_FRONTEND; see
       | 
       | https://developer.arm.com/documentation/PJDOC-466751330-5476...
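
A rough sketch of how that Intel counter is usually turned into a number: the top-down method divides undelivered uops by the total number of issue slots. The Python below assumes a 4-wide core (Skylake-class) and uses CPU_CLK_UNHALTED.THREAD for cycles. The counter readings are made up, and the helper name `frontend_bound_fraction` is hypothetical.

```python
PIPELINE_WIDTH = 4  # assumption: issue slots per cycle on a Skylake-class core

def frontend_bound_fraction(uops_not_delivered, core_cycles, width=PIPELINE_WIDTH):
    """Fraction of issue slots left empty because the frontend delivered no uop.

    uops_not_delivered: reading of IDQ_UOPS_NOT_DELIVERED.CORE
    core_cycles:        reading of CPU_CLK_UNHALTED.THREAD
    """
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    slots = width * core_cycles
    return uops_not_delivered / slots

# Made-up counter readings for illustration only.
fraction = frontend_bound_fraction(uops_not_delivered=1_000_000, core_cycles=1_000_000)
print(f"{fraction:.0%} of slots were frontend-bound")
```

A high icache miss count with a low fraction here would suggest the misses are mostly overlapped with useful work.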
        
         | infberg wrote:
         | I agree with you that one can very often get distracted by
         | single events; however, knowing that you are frontend/backend
         | bound isn't all that much more helpful either.
         | 
         | For frontend you can guess that PGO, BOLT, or huge pages
         | might help, but it's still a blind guess without knowing
         | what to look at next.
         | 
         | Intel's TMA is the only helpful thing here really. Bit sad that
         | AMD and ARM don't provide a way to calculate something TMA-like
         | themselves.
        
       | superdimwit wrote:
       | Could a difference in alignment of the hot loop also have an
       | effect here?
        
       | tjalfi wrote:
       | (submitter)
       | 
       | This is a follow-up to Inline in Rust[0], which was submitted a
       | couple of times earlier this week.
       | 
       | [0] https://matklad.github.io//2021/07/09/inline-in-rust.html
        
       | CalChris wrote:
       | Is the result for this benchmark different on an ARMv8 CPU with
       | its ~31 registers vs an x86_64 CPU with its ~15 registers? For
       | example, M1 vs Skylake?
        
       ___________________________________________________________________
       (page generated 2021-07-12 23:01 UTC)