[HN Gopher] Beating the L1 cache with value speculation
___________________________________________________________________
Beating the L1 cache with value speculation
Author : rostayob
Score : 116 points
Date : 2021-07-23 11:44 UTC (11 hours ago)
(HTM) web link (mazzo.li)
(TXT) w3m dump (mazzo.li)
| asoylu wrote:
| This works when list nodes are allocated consecutively, with no
| unrelated allocations intervening. If your case doesn't fit
| this constraint, value speculation doesn't seem to work. Any
| comments?
| drewg123 wrote:
| How realistic is it to have a linked list in contiguous memory
| that stays nicely ordered? In my experience, lists are
| normally cobbled together from nodes that are allocated at
| widely separated times, and hence there is no knowledge I can
| use to predict the address of the next pointer.
|
| If you're in a scenario where this optimization works, it seems
| like it would be better to just use an array.
| dragontamer wrote:
| Are you familiar with Donald Knuth's "Dancing Links"?
|
| It's a covering-problem solver (NP-complete) that uses linked
| lists highly efficiently. All nodes are "statically allocated"
| (at least, with respect to the algorithm), because the
| covering problem takes up a constant amount of space.
|
| The linked list "dances", pointing at other nodes to represent
| which combination of covers is being attempted. So any link
| could potentially point to any other link later in its list.
| However, the number of nodes is constant and set before the
| algorithm runs.
|
| As such, the linked list is fully X1->X2->X3->X4... at the
| start of the algorithm (where X1, X2, X3 are contiguously
| allocated in memory). If X3 is proven to "not be a good guess"
| for our covering problem, then it is cut out of the linked
| list (X1->X2->X4).
|
| All links remain in order, and are ordered in two dimensions
| (so not only X1->X2->X3... but also X1->Y1->Z1...), where X, Y
| and Z are the compound elements trying to cover locations 1,
| 2, 3, ...
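|
| The "dance" itself is just this pair of operations in C (a
| minimal sketch with a hypothetical doubly-linked node type;
| Knuth's real nodes link in four directions):
|
|       typedef struct Node {
|           struct Node *prev, *next;
|       } Node;
|
|       /* Cut x out of the list. x itself keeps its prev/next
|          pointers, which is what makes undoing this cheap. */
|       void unlink_node(Node *x) {
|           x->prev->next = x->next;
|           x->next->prev = x->prev;
|       }
|
|       /* Put x back exactly where it was. */
|       void relink_node(Node *x) {
|           x->prev->next = x;
|           x->next->prev = x;
|       }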
|
| ------------
|
| Most simple malloc implementations also keep the free list as
| a plain linked list that is, in fact, ordered in memory. (The
| first node in the free list sits at the lowest address, while
| the final node sits at the highest address.)
|
| The FAT32 linked-list across a hard drive is also nicely
| ordered (and when it isn't, the user will perform
| "defragmentation" to reorder the list and optimize the hard
| drive).
|
| -----------
|
| IMO, the "nicely ordered linked list" is uncommon, but...
| common enough that it's worth discussing. And sure, maybe we
| don't use FAT32 filesystems very often today, but that's the
| technology I used growing up, lol. I know the methodology
| works!
| virtue3 wrote:
| I think you need to be aware of why such an implementation
| (FAT32) had to exist. I believe (I also grew up with FAT32)
| that it basically vanished once HDDs with caches in the MB
| range started to show up. Journaling became a thing; actual
| caches became a thing, so the HDD could lay things out well by
| utilizing said cache.
|
| NTFS still needed to be defragmented occasionally, but it
| allocated more of a buffer to try to avoid having to do so at
| all.
|
| I think, in general, you're spot on about utilizing contiguous
| arrays of memory for linked lists, as they will almost always
| result in better performance characteristics (unless you
| really need in-order insertion/removal and it's very
| write-heavy).
| ot wrote:
| You can get help from the allocator. For example jemalloc has
| an experimental API called "batch allocation" where it returns
| a list of allocations of the same size, and it makes its best
| effort to make them contiguous (not guaranteed though):
| https://github.com/jemalloc/jemalloc/blob/12cd13cd418512d9e7...
|
| The idea is that you may have a data structure that you want to
| be able to modify (in which case you need to free individual
| nodes) but in most cases you do not.
|
| Then you could use batch allocation, and pair it with this
| optimization.
| Filligree wrote:
| It's fairly standard for a compacting GC to rearrange lists
| this way, because memory latency was a concern even back in the
| 80s.
|
| If you're using linked lists in a language without a compacting
| GC, then you _almost certainly_ want something else instead.
|
| If you're using them in a language with one, then you still
| usually want something else, but at least the cost isn't as
| high. Also you might be using Haskell or something, where other
| constraints make data-structures like arrays painful to use.
| adrianN wrote:
| That depends on the allocator you use. If you allocate your
| nodes from an arena that belongs to the list, the resulting
| layout is usually pretty good.
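|
| A minimal bump-arena sketch (hypothetical names) that hands
| nodes out back-to-back, so the article's node++ guess holds:
|
|       #include <stddef.h>
|
|       typedef struct Node { long value; struct Node *next; } Node;
|
|       typedef struct {
|           Node *slots;        /* one contiguous block */
|           size_t used, cap;
|       } Arena;
|
|       /* Consecutive allocations are adjacent in memory, so
|          speculating "next node = this node + 1" mostly wins. */
|       Node *arena_alloc(Arena *a) {
|           if (a->used == a->cap) return NULL; /* no growth here */
|           return &a->slots[a->used++];
|       }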
| thereare5lights wrote:
| Would this introduce Spectre-like vulnerabilities?
| kardos wrote:
| This works because we've arranged for the nodes to be
| sequential in memory. If we have the luxury of doing that,
| then we could use an array instead of a linked list. Is there
| a scenario where this value speculation trick would work,
| where we /don't/ have this luxury?
| twoodfin wrote:
| Given the nature of branch prediction, I'd expect it would
| still help dramatically in the case where the list was _almost_
| entirely sequential, which seems a bit more plausible.
| kardos wrote:
| Agree; if a mispredict costs 15-20 cycles and a correct
| predict saves 4 cycles, does that mean we break even at
| around one mispredict per 4-5 correct guesses?
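|
| (Back of the envelope, with those numbers: h correct guesses
| save 4h cycles while m misses cost ~17.5m, so break-even is
| at 4h = 17.5m, i.e. roughly one miss per four-or-five hits.)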
| munchbunny wrote:
| I could imagine behaviors like this happening with object
| pools, though it does still feel a bit contrived.
| codetrotter wrote:
| How do they get the cycles/elem and instrs/cycle counts? Is it
| via the perf utility? Something they calculate? Or do they get
| it some other way?
| mhh__ wrote:
| Some tools you can do this with:
|
| The perf command-line utility, the perf_event_open system
| call, the libpapi high-level API, pmc.h on FreeBSD (IIRC).
|
| Now, for more clever analysis, i.e. tools that can tell you
| the why rather than just the what: they collect the same
| information but can present it graphically and compute, for
| example, TMAM statistics (top-down microarchitecture analysis
| method).
|
| vTune, Intel Advisor, AMD uProf, and a few others. vTune is
| the best for x86, but it only fully supports Intel.
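|
| For example, a bare-bones cycle counter via perf_event_open
| on Linux (a sketch; error handling elided):
|
|       #define _GNU_SOURCE
|       #include <linux/perf_event.h>
|       #include <stdint.h>
|       #include <stdio.h>
|       #include <string.h>
|       #include <sys/ioctl.h>
|       #include <sys/syscall.h>
|       #include <unistd.h>
|
|       int main(void) {
|           struct perf_event_attr attr;
|           memset(&attr, 0, sizeof(attr));
|           attr.type = PERF_TYPE_HARDWARE;
|           attr.size = sizeof(attr);
|           attr.config = PERF_COUNT_HW_CPU_CYCLES;
|           attr.disabled = 1;
|           attr.exclude_kernel = 1;
|
|           /* no glibc wrapper; invoke the syscall directly */
|           int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
|
|           ioctl(fd, PERF_EVENT_IOC_RESET, 0);
|           ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
|           /* ... code under test ... */
|           ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
|
|           uint64_t cycles;
|           read(fd, &cycles, sizeof(cycles));
|           printf("cycles: %llu\n", (unsigned long long)cycles);
|           close(fd);
|       }
|
| Divide by the element count to get cycles/elem; pair it with a
| PERF_COUNT_HW_INSTRUCTIONS counter for instrs/cycle.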
| dragontamer wrote:
| Personally, I prefer to just use the RDTSC assembly instruction
| (Read TimeStamp Counter), which provides the 64-bit number of
| clocks your core has ticked.
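|
| E.g. a minimal sketch with the GCC/Clang intrinsic:
|
|       #include <stdio.h>
|       #include <x86intrin.h>   /* __rdtsc() */
|
|       int main(void) {
|           unsigned long long start = __rdtsc();
|           /* ... code under test ... */
|           unsigned long long end = __rdtsc();
|           printf("delta: %llu TSC ticks\n", end - start);
|       }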
|
| Both Windows and Linux provide more robust timers (in
| particular: if your thread takes longer than 10ms, there's a
| chance your thread will sleep to give other threads a shot at
| the CPU). So if you're timing something longer than 10ms, you
| probably want to use OS timers instead.
|
| -------------
|
| There were a few programs where I couldn't add rdtsc easily to
| the code (in particular: I was trying to test something so fast
| that rdtsc took up the bulk of the time). In these cases, I
| went into the BIOS, disabled "turbo" on my CPU, locking my
| computer to 3.4GHz.
|
| From there, I took the Windows timer and measured 1-billion
| events, and then divided by 3.4-Billion (3.4GHz == 3.4-billion
| clocks per second).
|
| ---------
|
| I don't know the specific methodology that the blogpost used.
| But there's many easy ways to do this task.
| secondcoming wrote:
| There's a bit more to it than turning a flag off in your
| BIOS.
|
| Intel have a document about it [0]
|
| [0] https://www.intel.com/content/dam/www/public/us/en/docume
| nts...
| mhh__ wrote:
| > , which provides the 64-bit number of clocks your core has
| ticked.
|
| Not quite
| dragontamer wrote:
| I mean... computers are complex these days. I could type up
| three paragraphs that describe more specifically what is going
| on, but is that really helpful?
|
| Yeah, pipelining and out-of-order execution make the
| definition a bit difficult. If you want to ensure that all
| previous instructions are done executing, you need lfence,
| and if you want to prevent future instructions from filling
| in the pipelines, you'll need an mfence.
|
| There are many clocks (even within a core). The turbo clock
| is different from the standard clock. I forget exactly which
| clock rdtsc uses, but I do know that on some processors,
| under certain conditions, you'll get weird results.
|
| Different processors may have different interpretations of
| "clock" (mostly due to turbo and/or sleeping behavior), etc.
| I don't recall the details, but these different clock states
| could vary by as much as 2.2GHz to 4GHz on my processor
| (P1? Turbo? I forget the exact name...)
|
| ---------------
|
| But all in all, you get a 64-bit number that describes the
| number of clock ticks, for some "definition" of clock tick
| that differs between processors... and for some definition of
| "now" (with out-of-order and/or pipelined execution, "now" is
| a bit ambiguous, as previous instructions may not have
| finished executing yet and future instructions may already be
| executing).
|
| If you really want to know, read the processor manual specific
| to the microarchitecture (since different microarchitectures
| can change these definitions).
| titzer wrote:
| > If you want to ensure that all previous instructions
| are done executing, you need lfence, and if you want to
| prevent future instructions from filling in the pipelines
| you'll need an mfence.
|
| LFENCE does not serialize, nor does MFENCE. CPUID, however,
| is documented as a serializing instruction and is the
| recommended way to serialize, particularly with RDTSC.
|
| > I don't recall the details, but these different clock
| states could vary as much as 2.2GHz to 4GHz on my
| processor (P1? Turbo? I forget the exact name...)
|
| Oh heck, it's way more than that. I've measured a ~5x
| difference in clock-cycle counts for short loops using RDTSC.
| Supposedly RDTSC returns "nominal" cycles that advance at the
| same rate relative to the wall clock, but TBH that doesn't
| smell right. OSes also try to synchronize the absolute values
| across the various processors, so jumping between CPUs isn't
| that bad.
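|
| Roughly the pattern from Intel's benchmarking whitepaper
| mentioned upthread (a 64-bit inline-asm sketch):
|
|       #include <stdint.h>
|
|       /* CPUID serializes, then RDTSC reads the counter. */
|       static inline uint64_t tsc_begin(void) {
|           uint32_t lo, hi;
|           __asm__ volatile("xor %%eax, %%eax\n\t"
|                            "cpuid\n\t"
|                            "rdtsc"
|                            : "=a"(lo), "=d"(hi)
|                            :: "rbx", "rcx");
|           return ((uint64_t)hi << 32) | lo;
|       }
|
|       /* RDTSCP waits for earlier instructions to retire;
|          the trailing CPUID keeps later ones from creeping
|          into the timed region. */
|       static inline uint64_t tsc_end(void) {
|           uint32_t lo, hi;
|           __asm__ volatile("rdtscp"
|                            : "=a"(lo), "=d"(hi)
|                            :: "rcx");
|           __asm__ volatile("cpuid"
|                            ::: "rax", "rbx", "rcx", "rdx");
|           return ((uint64_t)hi << 32) | lo;
|       }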
| jeffbee wrote:
| The program is reading perf events.
|
| The code is at
| https://gist.github.com/bitonic/78887f5d3238bab5e31f3c5a41d4...
| solidangle wrote:
|       if (node != next) { node = node->next; }
|
| How does this work? Shouldn't this be:
|
|       if (node != next) { node = next; }
| [deleted]
| xfer wrote:
|       while (node) {
|           value += node->value;
|           next = node->next;
|
| So it is the same thing.
| tinus_hn wrote:
| That looks weird, why would you need to test if two values are
| equal if you're going to assign them to be equal anyway?
| xfer wrote:
| That is the trick.
|
| In the happy path you are not assigning (`node = next`).
|
| It is taken care of by `node++`, which removes the loop-
| carried dependency, so the processor can exploit the full
| instruction-level parallelism.
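|
| Putting it together, the loop looks like this (a sketch,
| assuming a Node with a value and a next pointer):
|
|       #include <stdint.h>
|
|       typedef struct Node {
|           uint64_t value;
|           struct Node *next;
|       } Node;
|
|       uint64_t sum(Node *node) {
|           uint64_t value = 0;
|           Node *next = NULL;
|           while (node) {
|               value += node->value;
|               next = node->next;  /* load off the critical path */
|               node++;             /* guess: next node is adjacent */
|               if (node != next) { /* rarely taken if guess holds */
|                   node = next;
|               }
|           }
|           return value;
|       }
|
| Note that, as pointed out elsewhere in the thread, GCC/clang
| will fold the node++ trick back into a plain pointer chase
| unless you stop them (e.g. with the article's asm hack).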
| ectopod wrote:
| It looks like a bug. The unhappy path contains both
| `node++` and `node=node->next`. Note that this is in the
| code following "Let's go back to the code we showed for
| value speculation in C:", which is actually different from
| the preceding code it's supposed to be a copy of. I guess
| it's a typo.
| ectopod wrote:
| The author has quietly fixed this discrepancy. You're
| welcome, no thank you required!
| tinus_hn wrote:
| It looks as if the difference is in delaying the check for
| next == NULL.
| andy_ppp wrote:
| I like it as an exercise for understanding the CPU, but is
| this stuff really useful in the real world? The example given
| is "optimised" away by GCC/clang without hacks. My guess is
| that if this were a good way to iterate over linked lists,
| compilers would have optimised for it by now, either through
| cleverness or hints added to your code.
| 10000truths wrote:
| For a compiler to properly do the optimization detailed here
| would require it to have knowledge of runtime execution/memory
| access patterns. This might be done through profile-guided
| optimization, or by providing hints like __builtin_expect.
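|
| E.g. (a sketch) annotating the speculation check so the
| fallback is laid out as the cold path:
|
|       if (__builtin_expect(node != next, 0)) {
|           node = next;  /* guess was wrong: take the real pointer */
|       }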
|
| That said, compilers are not magic boxes. Relying on your
| compiler for optimal codegen has diminishing returns as you
| approach the point where microarchitectural details start
| making a difference. Most people don't get to that point, so
| it's rarely an issue.
| dragontamer wrote:
| Compilers are 'just' graph transformations applied to code
| that probably lead to improved performance.
|
| A compiler can therefore optimize away unnecessary
| loads/stores, or recognize implicit parallelism and transform a
| loop into a SIMD loop.
|
| But a compiler will not change the logic of your code or add
| extra steps.
| matheusmoreira wrote:
| > But a compiler will not change the logic of your code or
| add extra steps.
|
| Compilers can and do delete checks for null pointers, integer
| overflow... The stuff they do when they're faced with simple
| aliasing is honestly pretty crazy.
| andy_ppp wrote:
| I mean this is quite easy to disprove - what do you think
| -funroll-loops is doing under the hood if not adding extra
| steps?
| dragontamer wrote:
| A "loop" is just a cycle in a graph.
|
| A -> B -> C -> A, hey look, its a cycle. Eventually A->D
| (where D exits the loop). So really node A has an edge
| going to B and D.
|
| Knowing that A has an edge to B and D, an equivalent
| transformation is A1->B1->C1->A2->B2->C2->A1, but also
| A1->D and A2->D.
|
| Both A1 and A2 have to still point at D. But A1 points to
| B1, and A2 points to B2. Etc. etc. That's all your compiler
| is doing, its reasoning about the graph and transforming it
| with actually... very simple rules. (Unfortunately, those
| rules are NP-complete so heuristics are used to make the
| compile times reasonable. NP-complete seems to keep coming
| about transformations associated with "simple" rules...)
|
| Then, we might see that some logic could make A2 not
| necessarily point to D (cutting out one possible jump
| statement). And that's where we get our optimization from:
| a simple bit of reasoning about the graph which removes one
| instruction per two iterations. Unroll to 8 and we save
| seven cmp/jump instructions over every 8 iterations.
|
| An "optimizing" compiler just reasons about this graph and
| all possible equivalent transformations (or at least uses a
| heuristic to reason over a portion of the equivalent
| transformations), and chooses a graph transformation that
| minimizes various cost estimates.
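|
| In code, the transformation is just (a sketch, unrolled by
| two, assuming n is even):
|
|       /* before: one cmp/jump per addition */
|       for (int i = 0; i < n; i++)
|           sum += a[i];
|
|       /* after: one cmp/jump per two additions */
|       for (int i = 0; i < n; i += 2) {
|           sum += a[i];
|           sum += a[i + 1];
|       }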
|
| -------
|
| That's the thing: compilers can only transform the code graph
| in ways that are "equivalent". Now yes, things get a bit
| weird with undefined behavior (and pointer aliasing: assuming
| a lack of aliasing is needed for C/C++ compilers to make even
| the simplest of transformations, so that's a bit of a corner
| case). But that's the gist of what they're doing.
| sly010 wrote:
| This is not a 'static' optimization. It relies on knowledge
| that the compiler cannot derive by looking at the code
| (namely, whether the linked list is _mostly_ laid out
| contiguously).
|
| It might still be a very useful optimization in a JIT
| runtime: if your loop variable keeps increasing by the same
| amount after every lookup, maybe rewrite the loop to add
| speculation...
| bob1029 wrote:
| A quick googling brought me to the following paper:
|
| "Data value speculation in superscalar processors"
|
| https://www.sciencedirect.com/science/article/abs/pii/S01419...
|
| It would appear there is an entire section dedicated to
| "Value prediction in realistic scenarios", but I am not
| curious enough yet to pay money to access this document.
|
| Edit: Found another that is accessible.
|
| "Practical Data Value Speculation for Future High-end Processors"
|
| http://class.ece.iastate.edu/tyagi/cpre581/papers/HPCA14Data...
| swiley wrote:
| This seems like it would cause a lot of those "I just wish
| the damn machine did what I told it to" situations.
| st_goliath wrote:
| Not necessarily. After putting a lot of brain grease into
| writing the code, it will probably work just fine.
|
| But it will lead to a bunch of serious "WTF!?" situations when
| other people look at the code and/or are supposed to make
| changes at a later point.
|
| And possibly stuff breaking in spectacular ways after
| "minor cleanup" patches are introduced. Or some weird
| performance regressions after touching completely unrelated
| code that happens to affect the temporal order of the memory
| allocations in question, and so on...
|
| Besides improving performance, doing this kind of trickery in
| production code will probably also improve your job security.
| pranith wrote:
| Very well explained... it's a really interesting idea.
| mhh__ wrote:
| I was under the impression that high end CPUs already perform
| value prediction while speculating?
| gpderetta wrote:
| There are some restricted forms of value prediction in the
| wild: indirect branch prediction is fairly common; so is
| alias prediction, which speculates on whether two data
| addresses (at least one of which is a store) alias; AMD's
| memory renaming also makes assumptions around stack
| addresses. Outside of these special cases, I'm not aware of
| any CPU that will generally speculate on actual values or
| addresses.
|
| On machines with relaxed memory models, naive value
| prediction would break dependency ordering (i.e.
| memory_order_consume), so things get very complicated.
| titzer wrote:
| What you posted is right, but I wouldn't classify indirect
| branch prediction as value prediction, because it doesn't
| supply a value to any other part of the processor (e.g.
| substitute a predicted value into a register). It just steers
| the frontend to where instructions should be fetched from, so
| it is just branch prediction.
| xfer wrote:
| So the solution is to write assembly for an extremely common
| microarchitecture design. Maybe compilers should provide more
| tools to exploit such parallelism. That asm block looks like
| a weird hack.
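|
| For reference, the hack in question is essentially an empty
| asm block that makes the speculated pointer opaque to the
| optimizer (a sketch; the article's exact incantation may
| differ):
|
|       next = node->next;
|       node++;                        /* speculate */
|       asm volatile("" : "+r"(node)); /* compiler must keep the
|                                         speculated value */
|       if (node != next) node = next;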
___________________________________________________________________
(page generated 2021-07-23 23:00 UTC)