[HN Gopher] Why xor eax, eax?
___________________________________________________________________
Why xor eax, eax?
Author : hasheddan
Score : 441 points
Date : 2025-12-01 12:22 UTC (10 hours ago)
(HTM) web link (xania.org)
(TXT) w3m dump (xania.org)
| daeken wrote:
| Back in 2005 or 2006, I was working at a little startup with "DVD
| Jon" Johansen and we'd have Quake 3 tournaments to break up the
| monotony of reverse-engineering and juggling storage
| infrastructure. His name was always "xor eax,eax" and I always
| just had to laugh at the idea of getting zeroed out by someone
| with that name. (Which happened a lot -- I was good, but he was
| much better!)
| VectorLock wrote:
| I was there but never got in on the Quake 3 fun; mp3t**
| OgsyedIE wrote:
| The page crashes after 3 seconds, 100% of the time, on the latest
| version of Android Chrome and works fine on Brave, fyi.
| robmccoll wrote:
| This is not my experience on the latest version of Chrome
| Android (142.0.7444.171). It did not crash for me.
| pansa2 wrote:
| > _Unlike other partial register writes, when writing to an e
| register like eax, the architecture zeros the top 32 bits for
| free._
|
| I'm familiar with 32-bit x86 assembly from writing it 10-20 years
| ago. So I was aware of the benefit of xor in general, but the
| above quote was new to me.
|
| I don't have any experience with 64-bit assembly - is there a
| guide anywhere that teaches 64-bit specifics like the above?
| Something like "x64 for those who know x86"?
| veltas wrote:
| Chapter 3 of volume 1, ctrl+f for "64-bit mode", has a lot of
| the essentials including e.g. the stuff about zeroing out the
| top half of the register.
|
| https://www.intel.com/content/www/us/en/developer/articles/t...
| matt_d wrote:
| See
| https://github.com/MattPD/cpplinks/blob/master/assembly.x86....
| - mostly focused on x86-64 (and some of the talks/tutorials
| offer pretty good overview)
| sparkie wrote:
| It's not only xor that does this, but most 32-bit operations
| zero-extend the result of the 64-bit register. AMD did this for
| backward compatibility, so existing programs would mostly
| continue working, unlike Intel's earlier attempt at 64-bits
| which was an entirely new design.
|
| The reason `xor eax,eax` is preferred to `xor rax,rax` is due
| to how the instructions are encoded - it saves one byte which
| in turn reduces instruction cache usage.
|
| When using 64-bit operations, a REX prefix is required on the
| instruction (byte 0x40..0x4F), which serves two purposes - the
| MSB of the low nybble (W) being set (ie, REX prefixes
| 0x48..0x4f) indicates a 64-bit operation, and the low 3 bits of
| low nybble allow using registers r8-r15 by providing an extra
| bit for the ModRM register field and the base and index fields
| in the SIB byte, as only 3-bits (8-registers) are provided by
| x86.
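|
| A minimal sketch of the encoding rule above, in Python. It assumes
| the register-to-register form of XOR (opcode 0x31 /r) and only
| 32-bit or 64-bit operands; the name `encode_xor_same` is made up
| for illustration.
|
```python
def encode_xor_same(reg: int, wide: bool) -> bytes:
    """Encode `xor r, r` for general-purpose register number 0..15.

    wide=False encodes the 32-bit form, wide=True the 64-bit form.
    """
    if not 0 <= reg <= 15:
        raise ValueError("register number must be in 0..15")
    high = reg >> 3          # extra register bit, carried by the REX prefix
    low = reg & 0b111        # the 3 bits that fit in the ModRM byte
    # REX layout is 0100WRXB: W selects 64-bit operand size,
    # R extends ModRM.reg, B extends ModRM.rm (X is unused here).
    rex = 0x40 | (int(wide) << 3) | (high << 2) | high
    modrm = 0xC0 | (low << 3) | low   # mod=11: register-to-register
    prefix = bytes([rex]) if rex != 0x40 else b""  # a bare 0x40 adds nothing here
    return prefix + bytes([0x31, modrm])


if __name__ == "__main__":
    for name, reg, wide in [("xor eax, eax", 0, False),
                            ("xor rax, rax", 0, True),
                            ("xor r8d, r8d", 8, False),
                            ("xor r8, r8", 8, True)]:
        print(f"{name:14} {encode_xor_same(reg, wide).hex(' ')}")
```
|
| Under these assumptions `xor eax, eax` comes out as 31 c0 and
| `xor rax, rax` as 48 31 c0, which is the one-byte difference.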
|
| A recent addition, APX, adds an additional 16 registers
| (r16-r31), which need 2 additional bits. There's a REX2 prefix
| for this (0xD5 ...), which is a two byte prefix to the
| instruction. REX2 replaces the REX prefix when accessing
| r16-r31, still contains the W bit, but it also includes an `M0`
| bit, which says which of the two main opcode maps to use, which
| replaces the 0x0F prefix, so it has no additional cost over the
| REX prefix when accessing the second opcode map.
| cesarb wrote:
| > It's not only xor that does this, but most 32-bit
| operations zero-extend the result of the 64-bit register. AMD
| did this for backward compatibility.
|
| It's not just that, zero-extending or sign-extending the
| result is also better for out-of-order implementations. If
| parts of the output register are preserved, the instruction
| needs an extra dependency on the original value.
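|
| A toy Python model of the write rules being discussed, assuming
| only the low 8-, 16-, 32- and 64-bit views of a register (no
| ah-style high-byte registers); `write_gpr` is a hypothetical
| helper, not anything from a real emulator.
|
```python
MASK64 = (1 << 64) - 1


def write_gpr(old: int, value: int, width: int) -> int:
    """Return the new 64-bit register contents after a write of `width` bits."""
    if width == 64:
        return value & MASK64
    if width == 32:
        # 32-bit writes zero the upper half: the result does not depend on `old`.
        return value & 0xFFFFFFFF
    if width in (8, 16):
        # Narrow writes merge into the old value: the result depends on `old`.
        mask = (1 << width) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("width must be 8, 16, 32 or 64")


if __name__ == "__main__":
    old = MASK64
    print(hex(write_gpr(old, 0, 32)))  # upper 32 bits cleared
    print(hex(write_gpr(old, 0, 16)))  # upper 48 bits preserved
```
|
| Only the 8- and 16-bit branches read `old`, which is the extra
| input dependency described above.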
| ychen306 wrote:
| This. It's for renaming.
| nickelpro wrote:
| Except for `xchg eax, eax`, which was the canonical nop on
| x86. Because it was supposed to do nothing, having it zero
| out the top 32-bits of rax would be quite surprising. So it
| doesn't.
|
| Instead you need to use the multi-byte, general purpose
| encoding of `xchg` for `xchg eax, eax` to get the expected
| behavior.
| snvzz wrote:
| Because, unlike RISC-V, x86 has no x0 register.
| jabl wrote:
| From your past posting history, I presume that you're implying
| this makes RISC-V better?
|
| Do we have any data showing that having a dedicated zero
| register is better than a short and canonical instruction for
| zeroing an arbitrary register?
| kevin_thibedeau wrote:
| It's a definite liability on a machine with only 8 general
| purpose registers. Losing 12% of the register space for a
| constant would be a waste of hardware.
| menaerus wrote:
| 8 registers? Ever heard of register renaming?
| Polizeiposaune wrote:
| Ever heard of a loop that needed to keep more than 7
| variables live? Register renaming helps with pipelining
| and out-of-order execution, but instructions in the
| program can only reference the architectural registers -
| go beyond that and you end up needing to spill some
| values to (architectural) memory.
|
| There's a reason why AMD added r8-r15 to the
| architecture, and why Intel is adding r16-r31.
| menaerus wrote:
| I have but that was not the point? My first point was
| exactly that there are more ISA registers and not only 8,
| and therefore the question mark. My second point was
| about register renaming which, contrary to what you say,
| does mitigate the artifacts of running out of registers
| by spilling the variables to the stack memory. It does it
| by eliminating the false dependencies between
| variables/registers and xor eax, eax is a great candidate
| for that.
| saagarjha wrote:
| Register renaming does not let you avoid spills.
| menaerus wrote:
| Ok, it obviously doesn't increase the number of ISA
| registers. What I am suggesting is something else -
| imagine a situation in which the compiler understands
| that the spill over will take place, and therefore
| rearranges the code such that it reduces the pressure on
| the registers. It can do that if it can break the data
| dependencies between the variables for instance. Or it
| can do that by unrolling the loops or by moving the
| initialization closer to where the variable is being
| used, no? I am pretty certain that compilers are already
| doing these kinds of transformations, and in a sense this
| is taking advantage of register renaming but indirectly.
| account42 wrote:
| That's irrelevant, the zero register would be taking a
| slot in the limited register addressing bits in
| instructions, not replace a physical register on the
| chip.
| kevin_thibedeau wrote:
| 8086 doesn't have that.
| dooglius wrote:
| I think one could just pick a convention where a particular
| GP register is zeroed at program startup and just make your
| own zero register that way, getting all the benefits at very
| small cost. The microarchitecture AIUI has a dedicated zero
| register so any processor-level optimizations would still
| apply.
| pklausler wrote:
| That's what was done on the CDC 6600 with two handy values,
| B0 (0) and B1 (1).
| gruez wrote:
| It's basically the eternal debate of RISC vs CISC (x86). RISC
| proponents claim RISC is better because it's simpler to
| decode. CISC proponents retort that CISC means code can be
| more compact, which helps with cache hits.
| bluGill wrote:
| In the real world there is no CISC or RISC anymore. RISC is
| always extended to some new feature and suddenly becomes
| more complex. Meanwhile CISC is just a decoder over a RISC
| processor. Either way you get the best of both worlds:
| simple hardware (the RISC internals) and CISC instructions
| that do what you need.
|
| Don't get too carried away in the above, x86 is still a lot
| more complex than ARM or RISC-V. However the complexity is
| only a tiny part of a CPU and so it doesn't matter.
| phire wrote:
| The zero register helps RISC-V (and MIPS before it) really
| cut down on the number of instructions, and hardware
| complexity.
|
| You don't need a mov instruction, you just OR with $zero. You
| don't need a load immediate instruction you just ADDI/ORI
| with $zero. You don't need a Neg instruction, you just SUB
| with $zero. All your Compare-And-Branch instructions get a
| compare with $zero variant for free.
|
| I refuse to say this "zero register" approach is better, it
| is part of a wide design with many interacting features. But
| once you have 31 registers, it's quite cheap to allocate one
| register to be zero, and may actually save encoding space
| elsewhere. (And encoding space is always an issue with fixed
| width instructions).
|
| AArch64 takes the concept further, they have a register that
| sometimes acts as the zero register (when used in ALU
| instructions) and other times is the stack pointer (when used
| in memory instructions and a few special stack instructions).
| phkahler wrote:
| >> The zero register helps RISC-V (and MIPS before it)
| really cut down on the number of instructions, and hardware
| complexity.
|
| Which is funny because IMHO RISC-V instruction encoding is
| garbage. It was all optimized around the idea of fixed
| length 32-bit instructions. This leads to weird sized
| immediates (12 bits?) and 2 instructions to load a 32 bit
| constant. No support for 64 bit immediates. Then they
| decided to have "compressed" instructions that are 16 bits,
| so it's somewhat variable length anyway.
|
| IMHO once all the vector, AI and graphics instructions are
| nailed down they should make RISC-VI where it's almost the
| same but re-encoding the instructions. Have sensible 16-bit
| ones, 32-bit, and use immediate constants after the
| opcodes. It seems like there is a _lot_ they could do to
| clean it up - obviously not as much as x86 ;-)
| zozbot234 wrote:
| There's not a strong case for redoing the RISC-V encoding
| with a new RISC-VI unless they run out of 32-bit encoding
| space outright, due to e.g. extensive new vector-like or
| AI-like instructions. And then they could free up a huge
| amount of encoding space trivially by moving to a
| 2-address format throughout with Rd=Rs1 and using a
| simple instruction fusion approach MOV Rd <- Rs1; OP Rd <-
| etc. for the former 3-address case.
|
| (Any instruction that can be similarly rephrased as a
| composition of more restricted elementary instructions is
| also a candidate for this macro-insn approach.)
| phkahler wrote:
| >> Any instruction that can be similarly rephrased as a
| composition of more restricted elementary instructions is
| also a candidate for this macro-insn approach.
|
| I really like the idea of composition or standard
| prefixes. My favorite is the idea of replacing cmp/branch
| with "if". Where the condition is a predicate for the
| following instruction. For RISC-V it would eat a large
| part of the 16bit opcodes. Some form of load/store might
| be a good use for the remaining 16bit ops. Other things
| that might be a good prefix could be encoding data types
| (8,16,32,64 bit, sign extended, float, double) or a
| source/destination register. It might be interesting to
| see how a full ISA might be decomposed into smaller
| instruction fragments.
| zozbot234 wrote:
| > "if". Where the condition is a predicate for the
| following instruction
|
| This is just a forward skip, which is optimized to a
| predicated insn already in some implementations.
| adgjlsfhk1 wrote:
| IMO the riscv decoding is really elegant (arguably
| excepting the C extension). Things like 64 bit immediates
| are almost certainly a bad idea (as opposed to just
| having a load from memory). Most 64 bit constants in use
| can be sign extended from much smaller values, and for
| those that can't, supporting 72 bit (or bigger)
| instructions just to be able to load a 64 bit immediate
| will necessarily bloat instruction cache, stall your
| instruction decoder (or limit parallelism), and will only
| be 2 cycles faster than a L1 cache load (if the
| instruction is hot). 32 bit immediate would be kind of
| nice, but the benefit is pretty small. An x86 instruction
| with 32 bit immediate is 6 bytes, while the 2 RISC-V
| instructions are 8 bytes. There have been proposals to
| add 48 bit instructions, which would let RISC-V have 32
| bit immediate support in the same 6 bytes as x86 (and
| 64 bit loads in two instructions totalling 12 bytes, vs 10
| bytes for x86, in the very rare situations where doing so
| will be faster than a load).
|
| ISA design is always a tradeoff,
| https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le...
| has some good details, but the TLDR is that RISC-V makes
| reasonable choices for a fairly "boring" ISA.
| Tuna-Fish wrote:
| > Things like 64 bit immediates are almost certainly a
| bad idea (as opposed to just having a load from memory)
|
| Strongly disagree. Throughput is cheap, latency is
| expensive. Any time you can fit a constant in the
| instruction fetch stream is a win. This is especially
| true for jump targets, because getting them resolved
| faster both saves power and improves performance.
|
| > Most 64 bit constants in use can be sign extended from
| much smaller values
|
| You should obviously also have smaller load instructions.
|
| > will necessarily bloat instruction cache, stall your
| instruction decoder (or limit parallelism)
|
| No, just have more fetch throughput.
|
| > and will only be 2 cycles faster than a L1 cache load
|
| Only on tiny machines will L1 cache load be 2 cycles. On
| a reasonable high-end machine it will be 4-5 cycles, and
| more critically (because the latency would usually be
| masked well by OoO), the energy required to engage the
| load path is orders of magnitude more than just getting
| it from the fetch.
|
| And that's when it's not a jump target, when it's a jump
| target suddenly loading it using a load instruction adds
| 12+ cycles of latency.
|
| > TLDR is that RISC-V makes reasonable choices for a
| fairly "boring" ISA.
|
| No. Not even talking about constants, RISC-V makes insane
| choices for essentially religious reasons. Can you
| explain to me why, exactly, would you ever make jal take
| a register operand, instead of using a fixed link
| register and putting the spare bits into the address
| immediate?
| adgjlsfhk1 wrote:
| > No, just have more fetch throughput.
|
| Fetch throughput isn't unlimited. Modern x86 CPUs only
| have ~16-32B/cycle (from L2 once you're out of the uop
| cache). If you decode a single 10 byte instruction you're
| already using up a huge amount of the available decode
| bandwidth.
|
| There absolutely are cases where a 64 bit load
| instruction would be an advantage, but ISA design is
| always a case of tradeoffs. Allowing 10 byte instructions
| has real cost in decode complexity, instruction bandwidth
| requirements, ensuring cacheline/pageline alignment etc.
| You have to weigh against that how frequent the
| instruction would be as well as what your alternative
| options are. Most immediates are small, and many others
| can be efficiently synthesized via 2 other instructions
| (e.g. shifts/xors/nots) and any synthesis that is 2
| instructions or fewer will be cheaper than doing a load
| anyway. As a result you would end up massively
| complicating your architecture/decoders to benefit a
| fairly rare instruction which probably isn't worthwhile.
| It's notable that aarch64 makes the same tradeoff here
| and Apple's M series processors have an IPC advantage
| over the best x86.
|
| > Can you explain to me why, exactly, would you ever make
| jal take a register operand, instead of using a fixed
| link register and putting the spare bits into the address
| immediate?
|
| This mostly seems like a mistake to me. The rationale
| probably is that you need the other instructions anyway
| (not all jumps are returns), so adding a jal that doesn't
| take a register would take a decent percentage of the
| opspace, but the extra 5 bits would be very nice.
| kruador wrote:
| ARM64 also has fixed length 32-bit instructions. Yes,
| immediates are normally small and it's not particularly
| orthogonal as to how many bits are available.
|
| The largest MOV available is 16 bits, but those 16 bits
| can be shifted by 0, 16, 32 or 48 bits, so the worst case
| for a 64-bit immediate is 4 instructions. Or the compiler
| can decide to put the data in a PC-relative pool and use
| ADR or ADRP to calculate the address.
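|
| A rough Python sketch of that chunked approach, assuming the only
| tools are MOVZ (set one 16-bit chunk, zero the rest) and MOVK
| (overwrite one chunk, keep the rest). Real assemblers and
| compilers may pick other encodings (MOVN, bitmask immediates), so
| this is an upper bound rather than what a toolchain emits;
| `mov_imm64` is a made-up name.
|
```python
def mov_imm64(value: int, reg: str = "x0") -> list[str]:
    """Build a MOVZ/MOVK sequence that materialises a 64-bit constant."""
    value &= (1 << 64) - 1
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    # Zero chunks need no instruction because MOVZ clears the rest of the register.
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk] or [(0, 0)]
    lines = []
    for i, (shift, chunk) in enumerate(nonzero):
        op = "movz" if i == 0 else "movk"
        lines.append(f"{op} {reg}, #0x{chunk:x}, lsl #{shift}")
    return lines


if __name__ == "__main__":
    for line in mov_imm64(0x1234_5678_9ABC_DEF0):
        print(line)
```
|
| A constant with all four 16-bit chunks non-zero takes the full
| four instructions; one with a single non-zero chunk takes one.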
|
| ADD immediate is 12 bits but can optionally apply a
| 12-bit left-shift to that immediate, so for immediates up
| to 24 bits it can be done in two instructions.
|
| ARM64 decoding is also pretty complex, _far_ less
| orthogonal than ARM32. Then again, ARM32 was designed to
| be decodable on a chip with 25,000 transistors, not where
| you can spend thousands of transistors to decode a single
| instruction.
| wongarsu wrote:
| MIPS for example also has one, along with a similar number of
| registers (~32). So it's not like RISC-V took a radical new
| position here, they were able to look back at what worked and
| what didn't, and decided that for their target a zero
| register was the right tradeoff. It's certainly the more
| "elegant" solution. A zero register is useful as input or
| output register for all kinds of operations, not just for
| zeroing
| crote wrote:
| And the other way around: RISC-V doesn't have a move
| instruction so that's done as "dst = src + 0", and it doesn't
| have a nop instruction so that's done as "x0 = x0 + 0". There's
| like a dozen of them.
|
| It's quite interesting what neat tricks roll out once you've
| got a guaranteed zero register - it greatly reduces the number
| of distinct instructions you need for what is basically the
| same operation.
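|
| To make that concrete, a small Python sketch that builds the
| 32-bit encodings of a few such pseudo-instructions from a single
| ADDI encoder. The helper names are invented for illustration, and
| `li_small` only handles constants that fit in 12 signed bits.
|
```python
def addi(rd: int, rs1: int, imm: int) -> int:
    """Encode RISC-V `addi rd, rs1, imm` (I-type, opcode 0x13, funct3 0)."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate must fit in 12 signed bits")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (0b000 << 12) | (rd << 7) | 0x13


X0 = 0  # the hard-wired zero register


def nop() -> int:
    return addi(X0, X0, 0)      # x0 = x0 + 0


def mv(rd: int, rs: int) -> int:
    return addi(rd, rs, 0)      # rd = rs + 0


def li_small(rd: int, imm: int) -> int:
    return addi(rd, X0, imm)    # rd = 0 + imm


if __name__ == "__main__":
    print(f"nop        {nop():08x}")
    print(f"mv a0, a1  {mv(10, 11):08x}")
    print(f"li a0, 1   {li_small(10, 1):08x}")
```
|
| All three are the same instruction with different operands, which
| is the saving being described.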
| dist1ll wrote:
| Another one is "jalr x0, imm(x0)", which turns an indirect
| branch into a direct jump to address "imm" in a single
| instruction w/o clobbering a register. Pretty neat.
| Findecanor wrote:
| There is a `c.mv` instruction in the compressed set, which
| most RISC-V processors implement.
|
| That, _and_ `add rd, rs, x0` could (like the zeroing idiom on
| x86), run entirely in the decoding and register-renaming
| stages of a processor.
|
| RISC-V does actually have quite a few idioms. Some idioms are
| multi-instruction sequences ("macro ops") that could get
| folded into single micro-ops ("macro-op fusion"/"instruction
| fusion"): for example `lui` followed by `addi` for loading a
| 32-bit constant, and left shift followed by right shift for
| extracting a bitfield.
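|
| A hedged Python sketch of how a 32-bit constant splits into the
| `lui` and `addi` halves. Because `addi` sign-extends its 12-bit
| immediate, the upper part has to be bumped when the low part is
| negative. This models 32-bit arithmetic only; `split_lui_addi`
| and `rebuild` are hypothetical names.
|
```python
def split_lui_addi(value: int) -> tuple[int, int]:
    """Return (hi20, lo12) such that lui hi20 followed by addi lo12 gives `value`."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000            # addi treats the 12-bit immediate as signed
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo


def rebuild(hi: int, lo: int) -> int:
    """What the lui/addi pair computes, modulo 2**32."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


if __name__ == "__main__":
    for v in (0x12345678, 0xDEADBEEF):
        hi, lo = split_lui_addi(v)
        print(f"lui a0, 0x{hi:x}; addi a0, a0, {lo}  -> 0x{rebuild(hi, lo):08x}")
```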
| kruador wrote:
| ARM64 assembly has a MOV instruction, but for most of the
| ways it's used, it's an alias in the assembler to something
| else. For example, MOV between two registers actually
| generates ORR rd, rZR, rm, i.e. rd := (zero-register) OR rm.
| Or, a MOV with a small immediate is ORR rd, rZR, #imm.
|
| If trying to set the stack pointer, or copy the stack
| pointer, instead the underlying instruction is ADD SP, Xn, #0
| i.e. SP = Xn + 0. This is because the stack pointer and zero
| register are both encoded as register 31 (11111). Some
| instructions allow you to use the zero register, others the
| stack pointer. Presumably ORR uses the zero register and ADD
| the stack pointer.
|
| NOP maps to HINT #0. There are 128 HINT values available;
| anything not implemented on this processor executes as a NOP.
|
| There are other operations that are aliased like CMP Xm, Xn
| is really an alias for SUBS XZR, Xm, Xn: subtract Xn from Xm,
| store the result in the zero register [i.e. discard it], and
| set the flags. RISC-V doesn't have flags, of course. ARM Ltd
| clearly considered them still useful.
|
| There are other oddities, things like 'rotate right' is
| encoded as 'extract register from pair of registers', but it
| specifies the same source register twice.
|
| Disassemblers do their best to hide this from you. ARM list a
| 'preferred decoding' for any instruction that has aliases, to
| map back to a more meaningful alias wherever possible.
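|
| A small Python sketch of two of the aliases above, assuming the
| 64-bit shifted-register forms of ORR and SUBS with no shift
| applied. The function names are made up; the point is only that
| MOV and CMP reuse another instruction with register 31 plugged in.
|
```python
XZR = 31  # register number 31 reads as zero in these instruction forms


def orr_shifted_reg64(rd: int, rn: int, rm: int) -> int:
    """ORR Xd, Xn, Xm (64-bit, shifted register, shift amount 0)."""
    return 0xAA000000 | (rm << 16) | (rn << 5) | rd


def subs_shifted_reg64(rd: int, rn: int, rm: int) -> int:
    """SUBS Xd, Xn, Xm (64-bit, shifted register, shift amount 0)."""
    return 0xEB000000 | (rm << 16) | (rn << 5) | rd


def mov_reg(rd: int, rm: int) -> int:
    return orr_shifted_reg64(rd, XZR, rm)    # rd = zero OR rm


def cmp_reg(rn: int, rm: int) -> int:
    return subs_shifted_reg64(XZR, rn, rm)   # flags from rn - rm, result discarded


if __name__ == "__main__":
    print(f"mov x0, x1  {mov_reg(0, 1):08x}")
    print(f"cmp x0, x1  {cmp_reg(0, 1):08x}")
```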
| gpderetta wrote:
| x86 doesn't need a zero register as it can encode constants in
| the instruction itself.
| Findecanor wrote:
| x86 has no architectural zero register, but a x86 CPU could
| have a _microarchitectural_ zero register.
|
| And when the instruction decoder in such a CPU with register
| renaming sees `xor eax, eax`, it just makes `eax` point to the
| zero register for instructions after it. It does not have to
| put any instruction into the pipeline, and it takes effectively
| 0 cycles. That is what makes the "zeroing idiom" so powerful.
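|
| A deliberately tiny Python model of that idea. It assumes a
| rename table mapping architectural names to physical register
| numbers, with physical register 0 hard-wired to zero; there is no
| free list, flags, or retirement. The `Renamer` class is purely
| illustrative and not how any particular CPU is built.
|
```python
class Renamer:
    ZERO_PREG = 0  # assumed: a physical register that always reads as zero

    def __init__(self, arch_regs):
        # Give every architectural register its own physical register to start.
        self.table = {name: i + 1 for i, name in enumerate(arch_regs)}
        self.next_free = len(arch_regs) + 1
        self.issued = []  # micro-ops actually sent on to execute

    def rename(self, op, dst, src):
        if op == "xor" and dst == src:
            # Zeroing idiom: just repoint the name, issue nothing.
            self.table[dst] = self.ZERO_PREG
            return
        sources = (self.table[dst], self.table[src])
        new_preg = self.next_free
        self.next_free += 1
        self.table[dst] = new_preg
        self.issued.append((op, new_preg, sources))


if __name__ == "__main__":
    r = Renamer(["eax", "ebx"])
    r.rename("xor", "eax", "eax")
    r.rename("add", "ebx", "eax")
    print(r.table, r.issued)
```
|
| In this model the `xor` never reaches `issued`; the later `add`
| simply reads physical register 0 as one of its sources.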
| omnicognate wrote:
| It happens to be the first instruction of the first snippet in
| the wonderful xchg rax,rax.
|
| https://www.xorpd.net/pages/xchg_rax/snip_00.html
| dooglius wrote:
| Not sure what I am looking at here. Is this just a bunch of
| different ways to zero registers?
| omnicognate wrote:
| It's a collection of interesting assembly snippets ("gems and
| riddles" in the author's words) presented without commentary.
| People have posted annotated "solutions" online, but figuring
| out what the snippets do and why they are interesting is the
| fun of it.
|
| It's also available as an inscrutable printed book on Amazon.
| mubou2 wrote:
| That music when you click "int" is awesome. Reminds me of the
| good ol' days of keygens.
| Audiophilip wrote:
| It's a chiptune-style xm module, "Funky Stars" by Quazar:
| https://soundcloud.com/scene_music/funky-stars
| therein wrote:
| Keygen music will always have a special place in my heart.
| This is a good one.
|
| I do wonder who was the first cracker that thought of
| including a keygen music that started the tradition.
|
| I also miss how different groups competed with each other and
| boasted about theirs while dissing others in readmes.
|
| Readmes would have a .NFO suffix, and that would try to load in
| some Windows tool but you had to open them in notepad. Good
| times.
| eb0la wrote:
| I remember a lot of code zeroing registers, dating back
| _at least_ to the IBM PC XT days (before the 80286).
|
| If you decode the instruction, it makes sense to use XOR:
|
| - mov ax, 0 - needs 4 bytes (66 b8 00 00)
| - xor ax,ax - needs 3 bytes (66 31 c0)
|
| This extra byte in a machine with less than 1 Megabyte of memory
| _did_ matter.
|
| In 386 processors it was also:
|
| - mov eax,0 - needs 5 bytes (b8 00 00 00 00)
| - xor eax,eax - needs 2 bytes (31 c0)
|
| Here Intel made the decision to use only 2 bytes. I bet this
| helps both the instruction decoder and (of course) saves more
| memory than the old 8086 instruction.
| vardump wrote:
| > - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs
| 3 bytes (66 31 c0)
|
| You don't need operand size prefix 0x66 when running 16 bit
| code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax"
| is just 2 bytes.
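|
| The byte counts in this subthread can be tabulated with a short
| Python sketch. It assumes the `mov ax/eax, imm` form (opcode b8)
| and `xor r, r` (31 c0), and that a 66 prefix is needed exactly
| when the operand size differs from the mode's default;
| `zeroing_encodings` is a hypothetical helper.
|
```python
def zeroing_encodings(operand_bits: int, mode_bits: int) -> dict[str, bytes]:
    """Encodings for zeroing (e)ax with mov vs xor in 16- or 32-bit code."""
    if operand_bits not in (16, 32) or mode_bits not in (16, 32):
        raise ValueError("only 16- and 32-bit sizes are modelled")
    # The operand-size prefix switches away from the mode's default size.
    prefix = b"" if operand_bits == mode_bits else b"\x66"
    imm = (0).to_bytes(operand_bits // 8, "little")
    return {
        "mov": prefix + b"\xb8" + imm,
        "xor": prefix + b"\x31\xc0",
    }


if __name__ == "__main__":
    for operand, mode in [(16, 16), (16, 32), (32, 32)]:
        enc = zeroing_encodings(operand, mode)
        print(f"{operand}-bit operand in {mode}-bit code: "
              f"mov={enc['mov'].hex(' ')}  xor={enc['xor'].hex(' ')}")
```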
| eb0la wrote:
| My fault: I just compiled the instruction with an assembler
| instead of looking up the actual instruction from
| documentation.
|
| It makes much more sense: resetting ax and bx (xor ax,ax ;
| xor bx,bx) will be 4 octets, DWORD aligned, and a bit faster
| to fetch by the x86 than the 3-octet version I wrote before.
| RHSeeger wrote:
| > the IBM PC XT days (before the 80286)
|
| Fun fact - the IBM PC XT also came in a 286 model (the XT 286).
| eb0la wrote:
| You're right. I forgot that!
| Anarch157a wrote:
| I don't know enough of the 8086 so I don't know if this works
| the same, but on the Z80 (which means it was probably true for
| the 8080 too), XOR A would also clear pretty much all bits on
| the flag register, meaning the flags would be in a known state
| before doing something that could affect them.
| vanderZwan wrote:
| Which I guess is the same reason why modern Intel CPU
| pipelines can rely on it for pipelining.
| Someone wrote:
| > If you decode the instruction, it makes sense to use XOR:
|
| > - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs
| 3 bytes (66 31 c0)
|
| Except, apparently, on the Pentium Pro, according to this
| comment:
| https://randomascii.wordpress.com/2012/12/29/the-surprising-...,
| which says:
|
| _"But there was at least one out-of-order design that did not
| recognize xor reg, reg as a special case: the Pentium Pro. The
| Intel Optimization manuals for the Pentium Pro recommended
| "mov" to zero a register."_
| Sharlin wrote:
| As the author says, a couple of extra bytes _still_ matter,
| perhaps more than 20ish years ago. There are vast amounts of
| RAM, sure, but it's glacially slow, and there's only a few
| tens of kBs of L1 instruction cache.
|
| Never mind the fact that, as the author also mentions, the xor
| idiom takes essentially zero cycles to execute because nothing
| actually _happens_ besides assigning a new pre-zeroed physical
| register to the logical register name early on in the pipeline,
| after which the instruction is retired.
| cogman10 wrote:
| L1 instruction cache is backed by L2 and L3 caches.
|
| For the AMD 9950, we are talking about 1280kb of L1 (per
| core). 16MB of L2 (per core) and 64MB of L3 (shared, 128 if
| you have the X3D version).
|
| I won't say it doesn't matter, but it doesn't matter as much
| as it once did. CPU caches have gotten huge while the
| instructions remain the same size.
|
| The more important part, at this point, is it's idiomatic.
| That means hardware designers are much more likely to put in
| specialty logic to make sure it's fast. It's a common enough
| operation to deserve its own special cases. You can fit a
| lot of 8 byte instructions into 1280kb of memory. And as it
| turns out, it's pretty common for applications to spend a lot
| of their time in small chunks of instructions. The slow part
| of a lot of code will be that `for loop` with the 30 AVX
| instructions doing magic. That's why you'll often see
| compilers burn `NOP` instructions to align a loop. That's to
| avoid splitting a cache line.
| Sharlin wrote:
| > For the AMD 9950, we are talking about 1280kb of L1 (per
| core). 16MB of L2 (per core)
|
| Ryzen 9 CPUs have 1280kB of L1 _in total_. 80kB (48+32) per
| core, and the 9 series is the first in the _entire history_
| of Ryzens to have some other number than 64 (32+32)
| kilobytes of L1 per core. The 16MB L2 figure is also
| _total_. 1MB per core, same as the 7 series. AMD obviously
| touts the total, not per-core, amounts in their marketing
| materials because it looks more impressive.
| kbolino wrote:
| Also, rather importantly, the L1 _i_ (instruction) cache
| is still only 32 kB. The part that got bigger, the 48 kB
| of L1 _d_ (data) cache, does not count for this purpose.
| monocasa wrote:
| Yeah, the reason for that is that it's expensive in PPA
| for the size of an L1 cache to exceed number of ways
| times page size. The jump to 48kB was also a jump to 12
| way set associative.
|
| As an aside, zen 1 did actually have a 64kB (and only 4
| way!) L1I cache, but changed to the page size times way
| count restriction with zen 2, reducing the L1 size by
| half.
|
| You can also see this on the Apple side, where their
| giant 192kB L1I caches are 12 ways with a 16kB page size.
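|
| The "ways times page size" rule is simple arithmetic; a Python
| sketch using the figures quoted in this subthread. The usual
| explanation is that each way must be indexable using only
| page-offset bits, which caps one way at one page. The 4 kB x86
| page size is an assumption here, and the helper names are invented.
|
```python
KB = 1024


def max_l1_bytes(ways: int, page_bytes: int) -> int:
    """Largest L1 size where each way still fits within one page."""
    return ways * page_bytes


def within_limit(cache_bytes: int, ways: int, page_bytes: int) -> bool:
    return cache_bytes <= max_l1_bytes(ways, page_bytes)


if __name__ == "__main__":
    # (label, cache size, ways, page size): numbers as quoted in the thread
    examples = [
        ("48kB, 12-way, 4kB pages", 48 * KB, 12, 4 * KB),
        ("64kB, 4-way, 4kB pages (zen 1 L1I)", 64 * KB, 4, 4 * KB),
        ("192kB, 12-way, 16kB pages (Apple L1I)", 192 * KB, 12, 16 * KB),
    ]
    for label, size, ways, page in examples:
        print(f"{label}: limit {max_l1_bytes(ways, page) // KB}kB, "
              f"within limit: {within_limit(size, ways, page)}")
```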
| gpderetta wrote:
| Instruction caches also prefetch very well, as long as
| branch prediction is good. Of course on a misprediction you
| might also suffer a cache miss in addition to the normal
| penalty.
| umanwizard wrote:
| > nothing actually happens besides assigning a new pre-zeroed
| physical register to the logical register name early on in
| the pipeline, after which the instruction is retired.
|
| This is slightly inaccurate -- instructions retire in order,
| so it doesn't necessarily retire immediately after it's
| decoded and the new zeroed register is assigned. It has to
| sit in the reorder buffer waiting until all the instructions
| ahead of it are retired as well.
|
| Thus in workloads where reorder buffer size is a bottleneck,
| it could contribute to that. However I doubt this describes
| most workloads.
| Sharlin wrote:
| Thanks, that makes sense.
| chasd00 wrote:
| > - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs
| 3 bytes (66 31 c0)
|
| iirc doesn't word alignment matter? I have no idea if this is
| how the IBM PC XT was aligned but if you had 4 byte words then
| it doesn't matter if you save a byte with xor because you
| wouldn't be able to use it for anything else anyway. again,
| iirc.
| deadcore wrote:
| Matt Godbolt also uploads to his self-titled YouTube channel:
| https://www.youtube.com/watch?v=eLjZ48gqbyg
| vanderZwan wrote:
| Not sure why you got downvoted for pointing that out - it might
| be linked at the end of the article but people can still miss
| that.
| deadcore wrote:
| *shrugs* the internet being the internet I suppose.
|
| There was "See the video that accompanies this post." but NGL
| was just posting in case anyone didn't have time to read or
| missed it.
| pclmulqdq wrote:
| In modern CPUs, a lot of these are recognized as zeroing idioms
| and they end up doing the same thing (often a register renaming
| trick). Using the shortest one makes sense. If you use a really
| weird zeroing pattern, you can also see it as a backend uop while
| many of these zeroing idioms are elided by the frontend on some
| cores.
| Dwedit wrote:
| Because "sub eax,eax" looks stupid. (and also clears the carry
| flag, unlike "xor eax, eax")
| tom_ wrote:
| xor clears the carry as well? In fact, looks like xor and sub
| affect the same set of flags!
|
| xor:
|
| > The OF and CF flags are cleared; the SF, ZF, and PF flags are
| set according to the result. The state of the AF flag is
| undefined.
|
| sub:
|
| > The OF, SF, ZF, AF, PF, and CF flags are set according to the
| result.
|
| (I don't have an x64 system handy, but hopefully the reference
| manual can be trusted. I dimly remembered this, or something
| like it, tripping me up after coming from programming for the
| 6502.)
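|
| A Python sketch of the two flag rules quoted above, applied to
| the zeroing case. It models AF after XOR as `None` to mark it
| undefined; the function names and the dictionary layout are
| illustrative choices, not taken from the manual.
|
```python
def parity_flag(result: int) -> int:
    """PF is set when the low byte has an even number of 1 bits."""
    return int(bin(result & 0xFF).count("1") % 2 == 0)


def flags_sub(a: int, b: int, bits: int = 32) -> dict:
    mask, sign = (1 << bits) - 1, 1 << (bits - 1)
    a, b = a & mask, b & mask
    r = (a - b) & mask
    return {
        "CF": int(a < b),                              # borrow out
        "OF": int(bool((a ^ b) & (a ^ r) & sign)),     # signed overflow
        "SF": int(bool(r & sign)),
        "ZF": int(r == 0),
        "PF": parity_flag(r),
        "AF": int((a & 0xF) < (b & 0xF)),              # borrow out of bit 3
    }


def flags_xor(a: int, b: int, bits: int = 32) -> dict:
    mask, sign = (1 << bits) - 1, 1 << (bits - 1)
    r = (a ^ b) & mask
    return {
        "CF": 0, "OF": 0,                              # always cleared
        "SF": int(bool(r & sign)),
        "ZF": int(r == 0),
        "PF": parity_flag(r),
        "AF": None,                                    # undefined
    }


if __name__ == "__main__":
    print("xor eax, eax:", flags_xor(5, 5))
    print("sub eax, eax:", flags_sub(5, 5))
```
|
| For equal operands the two agree on CF, OF, SF, ZF and PF; only
| AF differs (defined as 0 for sub, undefined for xor).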
| trollbridge wrote:
| This is a good thing since the pipeline now doesn't have to
| track the state of the flags since they all got zero'd.
| sfink wrote:
| Strangely, the only difference on the flags is that AF
| (auxiliary carry) is undefined for `xor eax, eax` but
| guaranteed to be zeroed for `sub eax, eax`. I don't know what
| that means in practice, though I'm guessing that at the very
| least the hardware would not treat it as a dependency on the
| previous value.
| HackerThemAll wrote:
| If I remember correctly, sub used to be slower than xor on some
| ancient architectures.
| sylware wrote:
| Remnant of RISC attempt without a zero register.
| fooker wrote:
| It's funny how machine code is a high level language nowadays,
| for this example the CPU recognizes the zeroing pattern and does
| something quite a bit different.
| Reubensson wrote:
| What do you mean that the CPU does something different? Isn't
| the CPU doing what is being asked, that being xor with
| consequence of zeroing when given two same values?
| IsTom wrote:
| I think OP means that it has come a long way from the simple
| mental model of uops being a direct execution of operations,
| what with all the register renamings and so on.
| dooglius wrote:
| FTA:
|
| > And, having done that it removes the operation from the
| execution queue - that is the xor takes zero execution
| cycles! It's essentially optimised out by the CPU
| fooker wrote:
| Same consequence yes.
|
| But it will not execute xor, nor will it actually zero out
| eax in most cases.
|
| It'll do something similar to constant propagation with the
| information that whenever xor eax, eax occurs; all uses of
| eax go through a simpler execution path until eax is
| overwritten.
| 12_throw_away wrote:
| > with consequence of zeroing when given two same values
|
| Right, it has the same consequence, but it doesn't actually
| perform the stated operation. ASM is now just a high
| level language that tells the computer to "please give me the
| same state that a PDP-11-like computer would give me upon
| executing these instructions."
| dheatov wrote:
| It's really impressive how powerful and efficient it has
| become. However, I find it so much more difficult to build a
| mental model of it. I've been struggling with atomics and r/w
| barrier as there are sooo many ways the instructions could've
| been executed (or not executed!).
| fooker wrote:
| It's a consequence of keeping our general purpose single
| threaded programming model the same for five decades.
|
| It has its merits, but the underlying hardware has changed.
|
| Intel tried to push this responsibility to the compiler with
| Itanium but that failed catastrophically, so we're back to
| the CPU pretending it's 1985.
| silverfrost wrote:
| Back on the Z80 'xor a' is the shortest sequence to zero A
| fortran77 wrote:
| Back when I did IBM 370 BAL Assembly Language, we did the same
| thing to clear a register to zero.
|
|     XR 15,15        XOR REGISTER 15 WITH REGISTER 15
|
| vs
|
|     L 15,=F'0'      LOAD REGISTER 15 WITH 0
|
| This was alleged to be faster on the 370 because XR operated
| entirely within the CPU registers, and L (Load) fetched data from
| memory (i.e., the constant came from program memory).
| vanderZwan wrote:
| > _In my 6502 hacking days, the presence of an exclusive OR was a
| sure-fire indicator you'd either found the encryption part of the
| code, or some kind of sprite routine._
|
| Meanwhile, people like me who got started with a Z80 instead
| immediately knew why, since XOR A is the smallest _and_ fastest
| way to clear the accumulator and flag register. Funny how that
| also shows how specific this is to a particular CPU lineage or
| its offshoots.
| jgrahamc wrote:
| _In my 6502 hacking days, the presence of an exclusive OR was a
| sure-fire indicator you'd either found the encryption part of the
| code, or some kind of sprite routine._
|
| Yeah, sadly the 6502 didn't allow you to do EOR A; while the Z80
| did allow XOR A. If I remember correctly XOR A was AF and LD A, 0
| was 3E 01[1]. So saved a whole byte! And I think the XOR was 3
| clock cycles faster than the LD. So less space taken up by the
| instruction and faster.
|
| I have a very distinct memory in my first job (writing x86
| assembly) of the CEO walking up behind my desk and pointing out
| that I'd done MOV AX, 0 when I could have done XOR AX, AX.
|
| [1] 3E 00
| vanderZwan wrote:
| Hah, we commented on the exact same paragraph within a minute
| of each other! My memory agrees with your memory, although I
| think that should be 3E 00. Let me look that up:
|
| https://jnz.dk/z80/ld_r_n.html
|
| https://jnz.dk/z80/xor_r.html
|
| Yep, if I'm reading this right that's 3E 00, since the second
| byte is the immediate value.
|
| One difference between XOR and LD is that LD A, 0 does not
| affect flags, which sometimes mattered.
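|
| To make the size and timing comparison concrete, here is a minimal
| Python sketch. The opcode bytes are the ones quoted above; the
| T-state counts (4 for XOR A, 7 for LD A, n) are taken from the
| usual Z80 timing tables, and the table and helper names are made
| up purely for illustration.

```python
# Two ways to clear the Z80 accumulator, as discussed above.
# The dictionary layout and the function name are illustrative assumptions.
Z80_CLEAR_A = {
    "XOR A":   {"encoding": bytes([0xAF]),       "t_states": 4, "affects_flags": True},
    "LD A, 0": {"encoding": bytes([0x3E, 0x00]), "t_states": 7, "affects_flags": False},
}


def savings(short_form: str, long_form: str) -> tuple[int, int]:
    """Return (bytes saved, T-states saved) by choosing short_form over long_form."""
    a, b = Z80_CLEAR_A[short_form], Z80_CLEAR_A[long_form]
    return (len(b["encoding"]) - len(a["encoding"]), b["t_states"] - a["t_states"])


if __name__ == "__main__":
    saved_bytes, saved_cycles = savings("XOR A", "LD A, 0")
    print(f"XOR A saves {saved_bytes} byte(s) and {saved_cycles} T-state(s)")
```

| The flag side effect is the catch: XOR A rewrites the flags, so LD
| A, 0 is still the right choice when a flag has to survive.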
| jgrahamc wrote:
| You're right. Of course, it's 3E 00. Not sure how I
| remembered 3E 01. My only excuse is that it was 40 years ago!
| sfink wrote:
| What is this "LD A, 0" syntax? Is it a z80 thing?
|
| One of the random things burned into my memory for 6502
| assembly is that LDA is $A9. I never separated the
| instruction from the register; it's not like they were
| general purpose. But that might be because I learned
| programming from the 2 books that came with my C64, a BASIC
| manual and a machine code reference manual, and that's how
| they did it.
|
| I learned assembly programming by reading through the list of
| supported instructions. That, and typing in games from
| Compute's Gazette and manually disassembling the DATA
| instructions to understand how they worked. Oh, and the zero-
| page reference.
|
| Good times.
| Narishma wrote:
| > One of the random things burned into my memory for 6502
| assembly is that LDA is $A9. I never separated the
| instruction from the register; it's not like they were
| general purpose.
|
| You had LDA and LDX and LDY as separate instructions while
| the Z80 assembler had a single LD instruction with
| different operands. It's the same thing really.
| wavemode wrote:
| > CEO walking up behind my desk and pointing out that I'd done
| MOV AX, 0 when I could have done XOR AX, AX
|
| Now that's what I call micromanagement.
|
| (sorry couldn't resist)
| jgrahamc wrote:
| He was right though. We were memory and cycle constrained and
| I'd wasted both!
| xigoi wrote:
| The real joke is that a CEO had actual technical knowledge
| instead of just being there for decoration.
| mkornaukhov wrote:
| Similarly, the CEO couldn't resist the outstanding
| optimization of memory and execution speed!
| 6510 wrote:
| No one believes this story.
| jgrahamc wrote:
| I am sad you don't believe this story. The CEO was very
| technical and this is exactly the sort of thing he would
| spot.
| bombcar wrote:
| People don't realize that in the era of dinosaurs where
| MASM ruled and assembly walked the earth, there basically
| WEREN'T CEOs who didn't know the details, because all the
| companies doing this stuff were pretty small at the time
| (and the CEO may have been writing it himself a few years
| before).
| Analemma_ wrote:
| There was a time when Bill Gates wrote code for
| Microsoft, and he was actually quite good at it.
| nomel wrote:
| Not sure why this was voted down. He was very technical,
| especially for the time:
| https://www.thecrimson.com/article/2025/6/7/bill-gates-
| reuni...
| OrderlyTiamat wrote:
| My first part time dev job as a student featured me
| walking in on our CEO who showed me he was recompiling
| his kernel to enable some features. I'm quite sure he was
| just doing that to impress the students, but at least he
| knew how to!
| 6510 wrote:
| Similarly, if you told people in the 80's that it would
| be the opposite in the future no one would believe it
| either.
|
| Not even the developers are very technical in the future!
|
| Woah, really? And they still manage to write good
| software?
|
| Of course not, if good software would be standing next to
| their bed at 4 am they would scream who are you what are
| you doing here? help! help! Someone, make it go away!
| crest wrote:
| I had to pad the code for alignment reasons. ;-)
| ksherlock wrote:
| I mean, he IS the Chief EORfficer
| anonzzzies wrote:
| 3E 00 : I was on MSX and never had an assembler when young, so I
| only remember the hex and never actually knew the instructions; I
| wrote programs/games as data 3E,00,CD,etc. without comments
| saying LD A, as I never knew those at the time.
| unnah wrote:
| Umm... how did you manage to learn those hex codes? You just
| read a lot of machine code and it started to make sense?
| jgrahamc wrote:
| I started out writing machine code without an assembler and
| so had to hand assemble a lot of stuff. After a while you
| end up just knowing the common codes and can write your
| program directly. This was also useful because it was
| possible to write or modify programs directly through an
| interface sometimes called a "front panel" where you could
| change individual bytes in memory.
|
| Back in 1985 I did some hand-coding like this because I
| didn't have access to an assembler:
| https://blog.jgc.org/2013/04/how-i-coded-in-1985.html and I
| typed the whole program in through the keypad.
| stevekemp wrote:
| Same here. On/For the ZX Spectrum, looking up the hex-
| codes in the back of the orange book. At least it was
| spiral-bound to make it easier.
|
| Later still I'd be patching binaries to ensure their
| serial-checks passed, on Intel.
| kragen wrote:
| The instruction sets were a lot simpler at the time. The
| 8080 instruction set listing is only a few pages, and some
| of that is instructions you rarely use like RRC and DAA.
| The operand fields are always in the same place. My own
| summary of the instruction set is at
| https://dercuano.github.io/notes/8080-opcode-
| map.html#addtoc....
| amirhirsch wrote:
| I implemented a PDP-11 in 2007-10 and I can still read
| PDP-11 Octal
| af78 wrote:
| I had a similar experience of writing machine code for
| Z80-based computers (Amstrad CPC) in the 90's, as a
| teenager. I didn't have an assembler so I manually
| converted mnemonics to hex. I still remember a few opcodes:
| CD for CALL, C9 for RET, 01 for LD BC, 21 for LD HL...
| Needless to say, the process was tedious and error-prone.
| Calculating relative jumps was a pain. So was keeping track
| of offsets and addresses of variables and jump targets. I
| tended to insert nops to avoid having to recalculate
| everything in case I needed to modify some code... I can't
| say I miss these times.
|
| I'm quite sure none of my friends knew any CPU opcode;
| however, people usually remembered a few phone numbers.
| senderista wrote:
| It wasn't unusual in the 80s to type in machine code
| listings to a PC; I remember doing this as an 8-year-old
| from magazines, but I didn't understand any of the stuff I
| was typing in.
| anonzzzies wrote:
| Typing from mags, getting interested in how the magic works
| by learning to use a hex monitor and trying out things. I
| was a kid so time enough.
|
| I didn't know you could do it differently for years after I
| started.
| stevefan1999 wrote:
| > In my 6502 hacking days, the presence of an exclusive OR was
| a sure-fire indicator you'd either found the encryption part of
| the code, or some kind of sprite routine.
|
| Correct. Most ciphers of that era were Feistel ciphers in the
| likes of DES/3DES, and even RC4 uses XOR too. Later, AES/Rijndael,
| CRC and ECC (Elliptic Curve Cryptography) also made heavy use of
| XOR, but in finite-field terms: the arithmetic is over GF(2),
| where addition (in theory, addition mod 2) effectively reduces to
| XOR.
| OhMeadhbh wrote:
| I was going to say "but RC4 and AES were published well after
| the 6502's heyday," but NESes were completely rocking it in
| '87 (and I'm told 65XX cores were used as the basis for
| several hard drive controllers of the era.) Alas, the closest
| I ever came to encryption on a (less than 32-bit system) was
| lucifer on an IBM channel controller in the forever-ago and
| debugging RC5 on an 8085.
| kjs3 wrote:
| _I'm told 65XX cores were used as the basis for several
| hard drive controllers of the era_
|
| Western Design Center is still (apparently) making a profit
| at least in part licensing 6502 core IP for embedded stuff.
| There's probably a 6502 buried and unrecognized in all
| sorts of low-cost control applications laying around you.
|
| _RC5 on an 8085_
|
| Oof. Well played.
| PaulHoule wrote:
| I dunno. The 6502 has been a $2 part for a long time but
| needs RAM and some glue logic, for a similar price you
| can get an AVR-8 [1] or ESP-32 [2] and get some RAM and
| GPIO.
|
| [1] faster, more registers than the IBM 360, << 64k RAM
|
| [2] much faster, 32bit, >> 64k RAM
| rzzzt wrote:
| There are uC versions like the W65C134S:
| https://www.westerndesigncenter.com/wdc/w65c134s-chip.php
| kjs3 wrote:
| _I dunno._
|
| You don't know what, exactly? You can go to the web site
| and see what they are selling.
|
| _The 6502 has been a $2 part for a long time_
|
| I doubt that for an IP license at any volume such a thing
| would make sense.
|
| _but needs RAM and some glue logic_
|
| Sure? Embedded in whatever you're building.
|
| _for a similar price you can get..._
|
| Oh, sorry...my bad. You were doing it the HN way: "Don't
| actually read what was written for comprehension...just
| take your first knee jerk and tell them how you would
| obviously do it better.".
| ASalazarMX wrote:
| Reading that cryptography was that advanced at the time, I'm even
| more surprised that the venerable Norton Utilities for MS-DOS
| required a password, that was simply XORed with some constant
| and embedded in the executables. If the reserved space was
| zeroes, it considered it a fresh install and demanded a new
| password.
|
| If it had been properly encrypted my young cracker self would
| have had no opportunity.
| mmphosis wrote:
| Try to keep the value 0 in the Y register.
|
|     echo tya|asm|mondump -r|6502
|     A=AA X=00 Y=00 S=00 P=22 PC=0300 0
|     0300- 98 TYA
|     A=00 X=00 Y=00 S=00 P=22 PC=0301 2
| favorited wrote:
| "Prefer `xor a` instead of `ld a, 0`" is basically the first
| optimization that you learn when doing SM83 assembly.
|
| https://github.com/pret/pokecrystal/wiki/Optimizing-assembly...
| bitwize wrote:
| Because mov eax, 0 requires fetching a constant and prolongs
| instruction fetching/execution. XOR A was a trick I learned back
| in the Z80 days.
| dintech wrote:
| My brain read this as "Why not ear wax?"
| kragen wrote:
|     xor wax, wax ; clear wax
|     xor sax, sax ; clear sax
|     xor fax, fax ; tru tru
| jabedude wrote:
| similarly IIRC, on (some generations of) x86 chips, NOP is sugar
| around `XCHG EAX, EAX` which is effectively a do-nothing
| operation
| bitwize wrote:
| This is pretty much all x86 chips as far as I'm aware: opcode
| 0x90 which is equivalent to XCHG AX,AX.
|
| The 8080 and Z80's NOP was at opcode 0. Which was neat because
| you could make a "NOP slide" simply by zeroing out memory.
| kccqzy wrote:
| There are multiple variants of nop mainly because you sometimes
| need the nop instruction to take up a certain number of bytes
| for alignment purposes. You have the 1-byte nop, but there is
| also the 9-byte nop.
| sixthDot wrote:
| I've written a lot of `xor al,al` in my youth.
| flohofwoe wrote:
| The actually surprising part to me is that such an important
| instruction uses a two byte encoding instead of one byte :)
| kccqzy wrote:
| Even supporting just 8 registers that would take up
| 8/256=0.03125 of the instruction encoding space.
| BiraIgnacio wrote:
| Also cool that this got to be the top item on the HN front page
| HackerThemAll wrote:
| > Interestingly, when zeroing the "extended" numbered registers
| (like r8), GCC still uses the d (double width, ie 32-bit)
| variant.
|
| Of course. I might have some data stored in the higher dword of
| that register.
| rfl890 wrote:
| Which will still be zeroed.
| Tuna-Fish wrote:
| Clearing r8d also clears the upper half.
|
| Partial register updates are kryptonite to OoO engines. For
| people used to low-level programming on weak machines, it seems
| natural to just update part of a register, but the way every
| modern OoO CPU works that is literally not a possible
| operation. Registers are written to exactly once, and this
| operation also frees every subsequent instruction waiting for
| that register to be executed. Dirty registers don't get written
| to again, they are garbage collected and reset for next
| renaming.
|
| The only way to implement partial register updates is to add
| 3-operand instructions, and have the old register state be
| the third input. This is also more expensive than it sounds
| like, and on many modern CPUs you can execute only one
| 3-operand integer instruction per clock, vs 4+ 2-operand ones.
| charles_f wrote:
| > By using a slightly more obscure instruction, we save three
| bytes every time we need to set a register to zero
|
| Meanwhile, most "apps" we get nowadays contain half of npmjs
| neatly bundled in electron. I miss the days when default was
| native and devs had constraints to how big their output could be.
| Filligree wrote:
| JS is just easier and takes less code.
|
| Which isn't an excuse anymore. UI coding isn't that hard; if
| someone can't do it, well, Claude certainly can.
| charles_f wrote:
| I'm fine with that, but keeping _some_ consideration to
| optimization should still be something, even in environments
| when constraints are low. The problem is when no-one cares
| and includes 4 versions of jquery in their app so that they
| don't have to do const $=document.getElementById, everything
| grows to weigh 1Gb, use 1Gb of ram and 10% of your CPU, and
| your system is as sluggish nowadays as (or even more sluggish
| than) it was 10y ago, with 10x the ram and processing power.
| anticrymactic wrote:
| > so that they don't have to do const
| $=document.getElementById,
|
| ``` window.$ = (q)=>document.querySelector(q); ```
| Emulates the behavior much better. This is already set on
| modern versions of browsers[1]
|
| [1] https://firefox-source-docs.mozilla.org/devtools-
| user/web_co...
| saagarjha wrote:
| Claude is pretty bad at coding UIs.
| int_19h wrote:
| It's not even true that HTML/JS is easier than something
| like, say, WPF.
| grimgrin wrote:
| I'd like to learn about the earliest pronunciations of these
| instructions. Only because watching a video earlier, I heard
| "MOV" pronounced "MAUV" not "MOVE"
|
| Not sure exactly how I could dig up pronunciations, except
| finding the oldest recordings
| jmmv wrote:
| > It gets better though! Since this is a very common operation,
| x86 CPUs spot this "zeroing idiom" early in the pipeline and can
| specifically optimise around it: the out-of-order tracking
| systems knows that the value of "eax" (or whichever register is
| being zeroed) does not depend on the previous value of eax, so it
| can allocate a fresh, dependency-free zero register renamer slot.
|
| While this is probably true ("probably" because I haven't checked
| it myself, but it makes sense), the CPU could do the exact same
| thing for "mov eax, 0", couldn't it? (Does it?)
| electroly wrote:
| Sure, lots of longer instructions have this effect. "xor
| eax,eax" is interesting because it's short. That zero immediate
| in "mov eax,0" is bigger than the entire "xor eax,eax"
| instruction.
| addaon wrote:
| Yes, "mov r, imm" also breaks dependencies -- but the immediate
| needs to be encoded, so the instruction is longer.
| MobiusHorizons wrote:
| I believe it does in some newer CPUs. It takes extra silicon to
| recognize the pattern though, and compilers emit the xor
| because the instruction is smaller, so I doubt there is much
| speed up in real workloads.
| lucozade wrote:
| > couldn't it? (Does it?)
|
| It could of course. It can do pretty much any pattern matching
| it likes. But I doubt very much it would because that pattern
| is way less common.
|
| As the article points out, the XOR saves 3 bytes of
| instructions for a really, really common pattern (to zero a
| register, particularly the return register).
|
| So there's very good reason to perform the XOR preferentially
| and hence good reason to optimise that very common idiom.
|
| Other approaches eg add a new "zero <reg>" instruction are
| basically worse as they're not backward compatible and don't
| really improve anything other than making the assembly a tiny
| bit more human readable.
| adrian_b wrote:
| Most Intel/AMD CPUs do the same thing for a few alternative
| instructions, e.g. "sub rax, rax".
|
| I do not think that anyone bothers to do this for a "mov eax,
| 0", because neither assembly programmers nor compilers use such
| an instruction. Either "xor reg,reg" or "sub reg,reg" have been
| the recommended instructions for clearing registers since 1978,
| i.e. since the launch of Intel 8086, because Intel 8086 lacked
| a "clear" instruction, like that of the competing CPUs from DEC
| or Motorola.
|
| One should remember that what is improperly named "exclusive
| or" in computer jargon is actually simultaneously addition
| modulo 2 and subtraction modulo 2 (because these 2 operations
| are identical; the different methods of carry and borrow
| generation distinguish addition from subtraction only for
| moduli greater than 2).
|
| The subtraction of a thing from itself is null, which is why
| clearing a register is done by subtracting it from itself,
| either with word subtraction or with bitwise modulo-2
| subtraction, a.k.a. XOR.
|
| (The true "exclusive or" operation is a logical operation
| distinct from the addition/subtraction modulo 2. These 2
| distinct operations are equivalent only for 2 operands. For 3
| or more operands they are different, but programmers still use
| incorrectly the term XOR when they mean the addition modulo 2
| of 3 or more operands. The true "exclusive" or is the function
| that is true only when exactly one of its operands is true,
| unlike "inclusive" or, which is true when at least one of its
| operands is true. To these 2 logical "or" functions correspond
| the 2 logical quantifiers "There exists a unique ..." and
| "There exists a ...".)
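|
| The two-versus-three-operand distinction can be checked by brute
| force. The following is only a sketch: `parity` stands for
| addition modulo 2 (what programmers usually call XOR) and
| `exactly_one` for the strict "exactly one operand is true"
| reading; both names, and `disagreements`, are invented here.

```python
from itertools import product


def parity(*bits: int) -> int:
    """Addition modulo 2 of all operands (the usual n-ary 'XOR')."""
    return sum(bits) % 2


def exactly_one(*bits: int) -> int:
    """1 only when exactly one operand is 1 (the strict 'exclusive or')."""
    return int(sum(bits) == 1)


def disagreements(n: int) -> list[tuple[int, ...]]:
    """All n-bit inputs on which the two functions give different answers."""
    return [bits for bits in product((0, 1), repeat=n)
            if parity(*bits) != exactly_one(*bits)]


if __name__ == "__main__":
    print(disagreements(2))  # no differences for two operands
    print(disagreements(3))  # the all-ones input is where they part ways
```

| For two operands the list is empty; for three, the only input
| where they differ is 1, 1, 1 (odd parity, but not "exactly one").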
| rhaps0dy wrote:
| No RSS? I want to subscribe :'(
| sph wrote:
| "Who cares about RSS, no one uses it any more"
|
| There's dozens of us! By the way, totally unaffiliated, but I
| have used fetchrss for those websites that have no feed.
| ethin wrote:
| > In this case, even though rax is needed to hold the full 64-bit
| long result, by writing to eax, we get a nice effect: Unlike
| other partial register writes, when writing to an e register like
| eax, the architecture zeros the top 32 bits for free. So xor eax,
| eax sets all 64 bits to zero.
|
| I had no idea this happened. Talk about a fascinating bit of X86
| trivia! Do other architectures do this too? I'd imagine so, but
| you never know.
| 201984 wrote:
| AArch64 also zeroes the upper 32 bits of the destination
| register when you use a 32 bit instruction.
| flykespice wrote:
| I'm curious, why is that?
|
| I know x86-64 zeroes the upper part of the register for
| backwards compatibility and to improve instruction cache usage
| (no need for REX prefix), but AArch64 is unclear to me.
| umanwizard wrote:
| I don't know either, but why wouldn't backwards
| compatibility apply to aarch64? It too is based on a pre-
| existing 32-bit architecture.
| 201984 wrote:
| It's to break dependencies for register renaming. If you
| have an instruction like
|
|     mov w5, w6 // move low 32 bits of register 6 into low 32
|                // bits of register 5
|
| This instruction only depends on the value of register 6.
| If instead of zeroing the upper half it left it
| unchanged, then it would depend on w6 and also the previous
| value of register 5. That would constrain the renamer and
| consequently out-of-order execution.
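|
| A toy Python model of the two write rules may make the dependency
| argument easier to see. It is only a sketch of the architectural
| behaviour described above, not of any real renamer, and the
| function names are invented for illustration.

```python
MASK64 = (1 << 64) - 1


def write32(old: int, value: int) -> int:
    """32-bit write on x86-64 or AArch64: the result is the value
    zero-extended to 64 bits, so `old` is never consulted."""
    return value & 0xFFFF_FFFF


def write16(old: int, value: int) -> int:
    """16-bit write on x86-64 (e.g. to ax): the upper 48 bits are kept,
    so the result depends on the previous register contents."""
    return (old & ~0xFFFF & MASK64) | (value & 0xFFFF)


if __name__ == "__main__":
    old_rax = 0xDEAD_BEEF_CAFE_BABE
    print(hex(write32(old_rax, 0)))  # whole register cleared
    print(hex(write16(old_rax, 0)))  # upper bits of old value survive
```

| Because `write32` ignores its `old` argument, a renamer can hand
| out a fresh register without waiting for the previous value;
| `write16` cannot be treated that way.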
| zeuxcg wrote:
| You really want to avoid a dependency on prior content of
| the destination register, to allow renaming and maximize
| out of order scheduling.
| monocasa wrote:
| A lot of the RISC architectures do something similar (sign
| extend rather than zero extend) when using 32-bit ops on a 64-bit
| processor. MIPS and PowerPC come to mind off of the top of my
| head. Being careful about that in the spec basically lets them
| treat 32-bit mode on a 64-bit processor as just 'mask off the
| top bits on any memory access'. Some of these processors will
| even let you use 64bit ops in 32bit mode, and really only just
| truncate memory addresses.
|
| So the real question is why does x86 zero extend rather than
| sign extend in these cases, and the answer is probably that by
| zero extending, with an implementation that treats a 64bit
| architectural register as a pair of 32-bit renamed physical
| registers, you can statically set the architectural upper
| register back on the free pool by marking it as zero rather
| than the sign extended result of an op.
| Quitschquat wrote:
| At some point I could disassemble 8086 (16 bit x86/real mode) as
| a kid. Byte sequences like 31 C9 or 31 C0 were a sure way to know
| if a loop of some kind was being initialized. Even simple
| compilers at the time made the mov xx, 0 -> xor xx, xx
| optimization.
| kstrauser wrote:
| Why wasn't that a standard assembler macro, like ZEROAX or
| something? It seems to come up enough that it seems like there'd
| be a common shortcut for it.
|
| (Not suggesting it _should_ be. Maybe that's a terrible idea,
| but I don't know why.)
| sfink wrote:
| I don't know, but one reason might be that with 8-bit opcodes
| you only have 256 instructions to play with, and many of those
| encode registers, so ZEROAX is burning a meaningful percentage
| of your total opcode space. And if you're not encoding it into
| a single byte, then it's pure waste: you already need XOR (and
| SUB), so you'd just be adding a redundant way of achieving the
| same thing. (Note that this argument doesn't completely hold
| up, since eg the 6502 had a fair number of undocumented opcodes
| largely because they didn't need all of them.)
|
| Though technically you said "assembler macro", not opcode. For
| that, I suspect the argument is more psychological: we had such
| limited resources of all sorts back then that being
| parsimonious with everything was a required mindset. The
| mindset didn't just mean you made everything as short as
| possible, it also meant you reused everything you possibly
| could. So reusing XOR just felt more fitting and natural than
| carving out a separate assembler instruction name. (Also, there
| would be the question of what effect ZEROAX should have on the
| flags, which can be somewhat inferred when reusing an existing
| instruction.)
| flustercan wrote:
| As a longtime developer currently pursuing their first computer
| science degree, it makes me happy that I understood this article.
| Nearly makes all the trouble seem worth it.
| JuniperMesos wrote:
| > In this case, even though rax is needed to hold the full 64-bit
| long result, by writing to eax, we get a nice effect: Unlike
| other partial register writes, when writing to an e register like
| eax, the architecture zeros the top 32 bits for free. So xor eax,
| eax sets all 64 bits to zero.
|
| Huh, news to me. Although the amount of x86-64 assembly
| programming I've personally done is extremely minimal. Frankly,
| this is exactly the sort of architecture-specific detail I'm
| happy to let an ASM-generating library know for me rather than
| know myself.
| timonoko wrote:
| I had an Atari Portfolio, sold without software, except basic
| command.com. I could only copy raw ASCII bytes from the keyboard
| to a file. When that file is a COM file, it is directly
| executable. So: what is the simplest program I could make from
| ASCII bytes that reads bytes in hexadecimal? I thought about that
| for a week, but finally solved it.
|
| @grok: This is a classic and brilliant puzzle from the early days
| of DOS. The challenge is to write a self-displaying hex dumper
| (.COM file) using only the copy con command, which means every
| byte of the program must correspond to a standard, typeable ASCII
| character. This is a form of code-golfing that relies on a deep
| understanding of 8086 machine code.
|
| The solution is a clever piece of self-modifying code. Since many
| essential 8086 opcodes (like MOV, INT, RET) don't have printable
| ASCII equivalents, the program must construct these instructions
| in memory and then execute them.
|
| Here is a well-known 128-byte solution. I will first show you the
| ASCII characters you would type, and then explain how it works.
|
| The Program to Type
|
| You would have started by typing copy con hex.com and then
| carefully typed the following 128 characters (including spaces
| and case-sensitive characters) and finally pressed Ctrl+Z to save
| the file.
|
|     1 j!>1!QhH-@0X,jHj.Ph1PhD0Ph 0@h, j,
|     h.0@h, h 0@h.H-@,jH1X,h.H.@,H1j j X,h.H.@,H1j j
|     X.H-@H-@,jHj.Ph1PhE0Ph@0@h, j, h.0@h, h 0@h.
|
| How It Works: A High-Level Explanation
|
| This program is a marvel of 8086 trickery. Here is a simplified
| breakdown of what's happening:
|
| etc.etc
| timonoko wrote:
| My program was definitively shorter. I think I did not bother
| with real hexadecimals. Just used last four bytes of characters
| to make a full byte. Used it as a bootstrap program.
|
| @grok: While your exact code is lost to time, it would have
| looked something like one of the ultra-small ASCII "dropper"
| programs that were once passed around. Here is a plausible
| 32-byte example of what the program you typed might have looked
| like.
|
| You would have run copy con nibbler.com, typed the following
| line, and hit Ctrl+Z:
|
|     j%1!PZYfX0f1Xf1f1AYf1E_j%1!PZ
|
| This looks like nonsense, but to the 8088/8086 processor, it's a
| dense set of instructions that does the following:
|
| etc etc.
| timonoko wrote:
| 97% of these millennials of HN do not understand the problem
| and its _brilliant_ solution. That is why I was truly
| astonished @grok grokked it right away.
|
| BTW. It is not beyond possibility that this nibbler or
| dropper was made by myself and published in Usenet by me
| myself in 1989. Who else would have such a problem.
|
| It was a bankruptcy sale and the machine was sold as
| "inactivated".
| kwertyoowiyop wrote:
| In this thread, we have found all the programmers born before
| 1975!
| vanderZwan wrote:
| Hey, some of us are younger and happened to get into
| programming via making games on their TI-83 graphing calculator
| in Z80!
| wildlogic wrote:
| I learned this trick writing shellcode - the shellcode has to be
| null byte (0x00) free, or it will terminate and not progress past
| the null byte, since it is the string terminator. of course, when
| you xor something with itself, the result is zero. the byte code
| generated by the instruction xor eax, eax doesn't contain null
| bytes, whereas mov eax, 0 does.
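|
| As a rough illustration in Python: the byte strings below are the
| standard 32-bit encodings of the two instructions, while the
| helper names and the C-string truncation model are assumptions
| made for this sketch.

```python
# Standard 32-bit x86 encodings of the two ways to zero eax.
XOR_EAX_EAX = bytes([0x31, 0xC0])                 # xor eax, eax
MOV_EAX_0 = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])  # mov eax, 0


def has_null_byte(code: bytes) -> bool:
    """True if the machine code contains a 0x00 byte."""
    return 0 in code


def as_c_string_copy(code: bytes) -> bytes:
    """Model a strcpy-style copy: everything from the first 0x00 on is lost."""
    return code.split(b"\x00", 1)[0]


if __name__ == "__main__":
    for name, code in (("xor eax, eax", XOR_EAX_EAX), ("mov eax, 0", MOV_EAX_0)):
        copied = as_c_string_copy(code)
        print(f"{name}: {code.hex(' ')} -> survives as {copied.hex(' ')}")
```

| The xor form passes through a string copy intact, while the mov
| form is cut off right after its opcode byte.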
| ternaryoperator wrote:
| The origin AFAIK stems from the mainframe days. When using BAL
| (the assembly language for the IBM/360 family and its
| descendants), xoring was faster than moving 0 to the variable.
| Many of the early devs who wrote assembly for PCs came from
| mainframe backgrounds and so the idiom was carried over.
___________________________________________________________________
(page generated 2025-12-01 23:00 UTC)