[HN Gopher] It's Not Always iCache
___________________________________________________________________
It's Not Always iCache
Author : tjalfi
Score : 61 points
Date : 2021-07-12 14:15 UTC (8 hours ago)
(HTM) web link (matklad.github.io)
(TXT) w3m dump (matklad.github.io)
| [deleted]
| yepthatsreality wrote:
| Off topic, but shouldn't it be ICache or Icache, not iCache? This
| seems like a misnomer from too much Apple product
| consumption/dogma.
|
| I could only find references to ICache or ICACHE via a quick
| search.
| gpderetta wrote:
| I$ is the usual cute shorthand.
| Espressosaurus wrote:
| The point is that lower-case i is an Apple-ism, while the
| correct casing is upper-case I (I've also seen all lower case
| when used informally, e.g. icache, iram, etc.)
| ndesaulniers wrote:
| I have trouble distinguishing between uppercase i vs lowercase
| L in my system fonts, and am too lazy to change them, so I
| appreciate the author using iCache for readability.
| gpderetta wrote:
| Interesting article.
|
| Because of branch prediction and deep instruction windows, the
| instruction prefetcher is significantly more effective than the
| data prefetcher.
|
| There are of course second order effects of code bloat: L2 for
| example is typically shared with data and wasting more of it for
| code can have negative effects.
| steerablesafe wrote:
| One optimization I have never seen is adjusting the stack pointer
| within a function, other than at the beginning and end. Because of
| this, inlining can significantly blow up stack usage.
|
| https://godbolt.org/z/Gaa4MEMnK
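|
| A minimal sketch of the concern, not taken from the linked example.
| The names `fill_scratch` and `walk` are hypothetical, and whether
| the buffer really stays allocated across the recursive call depends
| on the compiler, flags, and inlining decisions:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: needs a large scratch buffer, but only briefly.
static inline std::size_t fill_scratch(unsigned char seed) {
    unsigned char scratch[4096];
    std::memset(scratch, seed, sizeof scratch);
    std::size_t sum = 0;
    for (unsigned char b : scratch) sum += b;  // sum of all bytes
    return sum;
}

// Hypothetical caller: the scratch buffer is dead by the time we recurse.
// If fill_scratch is inlined, its 4 KiB buffer may become part of walk's
// own frame, which is typically sized once in the prologue, so every
// level of recursion could pay for the buffer (compiler dependent).
std::size_t walk(unsigned depth) {
    std::size_t local = fill_scratch(1);
    if (depth == 0) return local;
    return local + walk(depth - 1);  // addition after the call: not a tail call
}
```

| Comparing the frame size of `walk` with and without inlining (for
| example with a noinline attribute on the helper) is one way to see
| the effect in a compiler explorer.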
| neerajsi wrote:
| The optimization exists in MSVC, LLVM, and GCC. It's called
| "shrink wrapping". MSVC may do it more aggressively when
| profile information is available.
|
| See https://github.com/gcc-mirror/gcc/blob/master/gcc/shrink-
| wra... or https://llvm.org/doxygen/ShrinkWrap_8cpp_source.html.
| leni536 wrote:
| None of gcc, clang or msvc deallocate the frame or part of
| the frame in `baz2` before calling `bar` the first time here:
|
| https://godbolt.org/z/rvx36sM74
|
| What I would like to see is something like the generated
| call of `call_foo1` substituted verbatim into the generated
| code of `baz1` in the place of the function call. That way at
| the point of calling `bar` there is much less stack space
| allocated, minimizing stack usage.
|
| But maybe this would pessimize other things, or for some
| weird reason is actually incorrect.
|
| gcc at least deallocates the stack before tail-calling `baz`,
| I don't know if that is "shrink wrapping" or just plain TCO.
| ndesaulniers wrote:
| You can get into weird cases with DWARF-based unwinders
| when two paths through a function with different stack
| depths make it impossible to reliably unwind.
|
| LLVM has bugs in this regard with calls to variadic
| functions, since after a certain number of arguments have
| been passed in registers, you start pushing parameters on
| the stack (ABI dependent).
| electricshampo1 wrote:
| Very often people look at icache misses instead of something more
| precise when considering perf effects due to code size/layout, etc.
| That more precise thing is frontend stalls: you only care about
| misses when they cause stalls; otherwise they are overlapped with
| actual work being done by the execution units.
|
| You can measure frontend stalls on many recent Intel chips with
|
| IDQ_UOPS_NOT_DELIVERED.CORE
|
| https://perfmon-events.intel.com/
|
| Neoverse N1 from Arm has STALL_FRONTEND; see
|
| https://developer.arm.com/documentation/PJDOC-466751330-5476...
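|
| As a rough sketch of how the Intel counter is usually turned into a
| ratio: the function below assumes the top-down formula for 4-wide
| cores (Skylake and earlier), where undelivered uops are divided by
| 4 * unhalted core cycles (CPU_CLK_UNHALTED.THREAD). The function
| name and the `pipeline_width` parameter are hypothetical; newer
| cores use a different width and different events.

```cpp
#include <cstdint>

// Fraction of issue slots in which the frontend failed to deliver a uop.
// uops_not_delivered: IDQ_UOPS_NOT_DELIVERED.CORE count
// cycles:             unhalted core cycles over the same interval
// pipeline_width:     issue slots per cycle (assumed 4 here)
double frontend_bound(std::uint64_t uops_not_delivered,
                      std::uint64_t cycles,
                      unsigned pipeline_width = 4) {
    if (cycles == 0 || pipeline_width == 0) return 0.0;  // avoid dividing by zero
    const double slots =
        static_cast<double>(pipeline_width) * static_cast<double>(cycles);
    return static_cast<double>(uops_not_delivered) / slots;
}
```

| For example, 1000 undelivered uops over 1000 cycles on a 4-wide
| core means a quarter of the issue slots were lost to the frontend.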
| infberg wrote:
| I agree with you that one can very often get distracted by
| single events; however, knowing that you are frontend/backend
| bound isn't all that much more helpful either.
|
| For the frontend you can guess that PGO, BOLT, or huge pages
| might help, but it's still a blind guess without knowing what
| to look at next.
|
| Intel's TMA is the only helpful thing here really. Bit sad that
| AMD and ARM don't provide a way to calculate something TMA-like
| themselves.
| superdimwit wrote:
| Could a difference in alignment of the hot loop also have an
| effect here?
| tjalfi wrote:
| (submitter)
|
| This is a follow-up to Inline in Rust[0], which was submitted a
| couple of times earlier this week.
|
| [0] https://matklad.github.io//2021/07/09/inline-in-rust.html
| CalChris wrote:
| Is the result different for this benchmark for an ARMv8 cpu with
| its ~31 registers vs an x86_64 CPU with its ~15 registers? For
| example, M1 vs Skylake?
___________________________________________________________________
(page generated 2021-07-12 23:01 UTC)