[HN Gopher] The Case of the Missing Increment
___________________________________________________________________
The Case of the Missing Increment
Author : eigenform
Score : 69 points
Date : 2024-09-27 13:36 UTC (4 days ago)
(HTM) web link (www.computerenhance.com)
(TXT) w3m dump (www.computerenhance.com)
| vardump wrote:
| Just when you get used to features like x86 CPUs combining two
| instructions into one micro-op (macro-op fusion), you get
| something like this.
|
| I guess immediate addressing mode addition is a good choice to
| execute at rename / allocation stage, as it's common, relatively
| simple and can't generate exceptions.
| eigenform wrote:
| > immediate addressing mode addition
|
| Well, except for the fact that you need to read from a register
| before adding the immediate displacement to it. You'd have to
| know the physical register and do the read very early (before
| renaming), or predict the value!
| eigenform wrote:
| I just realized you were probably referring to the example
| given from the AnandTech article with `lea r64, [r64+imm8]`.
|
| Caveat is just that [presumably] the source and destination
| registers have to be matching (since `lea rax, [rax+imm]` is
| just `add rax, imm`).
| Taniwha wrote:
| This isn't really combining, as the result of the first
| increment is needed by the intermediate compare; it is a
| rewriting that removes a dependency (or moves it further back
| in the stream).
| vardump wrote:
| Maybe it rewrites multiple immediate additions into one.
| Taniwha wrote:
| Thinking about this - this may be a pattern that's designed to
| match something that expands from a string instruction.
|
| While the loop he's testing is a useless bit of code that does
| nothing, the optimisation he's discovered may help speed up things
| like scasb/stosb, allowing portions of 2 unrolled copies to be
| processed per clock.
| buttocks wrote:
| Deep thoughts: why aren't "increment" and "excrement" opposites?
| Joker_vD wrote:
| Because "increase" and "excrete" have completely different
| roots that only coincidentally coincide when the verbal nouns
| corresponding to those words are formed.
| knodi123 wrote:
| now do "progress" and "congress"!
| Joker_vD wrote:
| You mean, the difference between "going forward" and
| "coming together"? It's in the prefix, "pro-" (for,
| forward) versus "con-" (with, together) which give you
| different shades of the meaning. Can't really say what the
| verb of movement was, though.
| oersted wrote:
| I think he meant it as an absurdist joke, but this is a
| great response!
|
| I looked it up, "gress" comes from "gradi" in Latin which
| directly translates to "walk". More specifically:
| con(pro) + gradi -> congredi (verb) -> congressus (noun)
|
| Edit: Knowing this, "gradient" has an interesting flavour
| :)
|
| Edit: It looks like the path is more indirect for
| "gradient"
|
| "gradi" (walk) -> "gradus" (step) -> "grade" (french
| influence) + "salient" -> "gradient". I like that in
| Latin "walk" is "to step", or perhaps "step" is "the unit
| of walking"? "A walking"? Etymology is fun!
| randomdata wrote:
| now do "flammable" and "inflammable"!
| dpkirchner wrote:
| What a country!
| Joker_vD wrote:
| > I like that in Latin "walk" is "to step", or perhaps
| "step" is "the unit of walking"? "A walking"?
|
| Consider the verb "to pace", and the corresponding noun
| "pace": the analogy is almost perfect. Of course, Latin
| also had other words for going places.
| IWeldMelons wrote:
| Your name checks out. You should be an expert in such
| (excremental) matters.
| leiroigh wrote:
| That's pretty cool.
|
| Normally it would be either the programmer's or the
| compiler's job to unroll a loop and then reduce dependency chain
| lengths.
|
| But it's nice if the renamer can do that as well.
|
| Presumably Intel has real-world data suggesting that
| significant real workloads can profit from this.
|
| I wonder whether that points to specific software issues, like
| hypothetically "oh yeah, openjdk8 hotspot was a little too timid
| at loop unrolling. It won't get that JIT improvement backported,
| but our customers will use java8 forever. Better fix that in
| silicon".
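|
| A rough sketch of the dependency-chain point, in Python. It assumes
| every instruction has a 1-cycle latency and that an add-immediate
| absorbed by the renamer adds no latency at all. The function name
| `critical_path` and the tuple layout are hypothetical, made up for
| this illustration; nothing here comes from the article or from Intel.

```python
def critical_path(instrs, fold_increments=False):
    """Longest dependency chain in cycles, assuming 1-cycle latency.

    instrs is a list of (dst, srcs, is_add_imm) tuples (hypothetical layout).
    """
    ready = {}    # register name -> cycle at which its value is available
    longest = 0
    for dst, srcs, is_add_imm in instrs:
        # An instruction can start once all of its sources are available.
        start = max((ready.get(s, 0) for s in srcs), default=0)
        if fold_increments and is_add_imm:
            # Assumption: the renamer absorbs the add, so the result is
            # available as soon as the source is.
            done = start
        else:
            done = start + 1
        ready[dst] = done
        longest = max(longest, done)
    return longest


# Four dependent increments of rax, then one consumer of rax.
chain = [("rax", ["rax"], True)] * 4 + [("rbx", ["rax"], False)]
print(critical_path(chain))
print(critical_path(chain, fold_increments=True))
```

| Without folding, the consumer waits behind all four increments; with
| folding, only the consumer itself contributes latency in this model.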
| pkhuong wrote:
| I believe I first saw this on IACA; uops.info has the
| measurements for zero-latency inc, add, etc on Alder Lake
| https://uops.info/html-instr/INC_R64.html . These adds by
| immediate are nicely closed, so I've been assuming renamed values
| are uniformly represented in Golden Cove as register+increment.
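|
| One way to picture the "register+increment" guess is a toy rename
| table, sketched below in Python. This is only a model of the
| assumption stated above, not Golden Cove's actual design: the class
| `ToyRenamer` and its methods are hypothetical, and the model ignores
| any limit real hardware would place on the size of the offset.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRenamer:
    """Hypothetical rename table: arch register -> (physical reg, offset)."""
    table: dict = field(default_factory=dict)
    next_phys: int = 0
    issued: list = field(default_factory=list)  # uops sent to execution

    def _alloc(self):
        phys = self.next_phys
        self.next_phys += 1
        return phys

    def _lookup(self, reg):
        # First use of a register gets a fresh physical register, offset 0.
        if reg not in self.table:
            self.table[reg] = (self._alloc(), 0)
        return self.table[reg]

    def add_imm(self, reg, imm):
        # Assumed behaviour: the add is absorbed into the table entry,
        # so no uop is issued and no new physical register is allocated.
        phys, off = self._lookup(reg)
        self.table[reg] = (phys, off + imm)

    def op(self, name, dst, *srcs):
        # Any other instruction reads its sources as (physical reg, offset)
        # pairs and writes a fresh physical register with offset 0.
        sources = [self._lookup(s) for s in srcs]
        new_phys = self._alloc()
        self.issued.append((name, new_phys, sources))
        self.table[dst] = (new_phys, 0)


r = ToyRenamer()
r.add_imm("rax", 1)   # inc rax
r.add_imm("rax", 1)   # inc rax
r.add_imm("rax", 3)   # add rax, 3
r.op("imul", "rbx", "rax")
print(r.table, r.issued)
```

| Adds by immediate being "closed" shows up here as `add_imm` always
| mapping a (register, offset) pair to another (register, offset) pair.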
| zokier wrote:
| > Since the only Alder Lake machine I had access to was a remote
| Windows machine that didn't belong to me, I more-or-less had to
| choose option 3, which meant subjecting myself to The Ultimate
| Sadness
|
| Well, you can pick up Sapphire Rapids instances from your
| preferred cloud provider and avoid the sadness.
| deater wrote:
| do cloud providers give full, unrestricted access to hardware
| performance counters?
| zokier wrote:
| It depends. On AWS you can get "metal" instances where afaik
| you get pretty much unrestricted access. In addition on
| certain instance types/sizes you get access to virtualized
| counters (vPMU). See Q11 here
| https://github.com/intel/pcm/blob/master/doc/FAQ.md#q11 or
| tables here https://www.intel.com/content/www/us/en/developer
| /articles/t...
|
| dunno about others
| mzs wrote:
| You have to use an instruction like cpuid with rdtsc so that the
| TSC is not read before the loop terminates. There have been
| changes to the Intel docs and there are more options now:
|
| https://stackoverflow.com/a/58146426
|
| Also in the bad old days SMM would interfere on some CPUs.
___________________________________________________________________
(page generated 2024-10-01 23:02 UTC)