hngopher.com

       [HN Gopher] Why xor eax, eax?
       ___________________________________________________________________
        
       Why xor eax, eax?
        
       Author : hasheddan
       Score  : 441 points
       Date   : 2025-12-01 12:22 UTC (10 hours ago)
        
 (HTM) web link (xania.org)
 (TXT) w3m dump (xania.org)
        
       | daeken wrote:
       | Back in 2005 or 2006, I was working at a little startup with "DVD
       | Jon" Johansen and we'd have Quake 3 tournaments to break up the
       | monotony of reverse-engineering and juggling storage
       | infrastructure. His name was always "xor eax,eax" and I always
       | just had to laugh at the idea of getting zeroed out by someone
       | with that name. (Which happened a lot -- I was good, but he was
       | much better!)
        
         | VectorLock wrote:
         | I was there but never got in on the Quake 3 fun; mp3t**
        
       | OgsyedIE wrote:
       | The page crashes after 3 seconds, 100% of the time, on the latest
       | version of Android Chrome and works fine on Brave, fyi.
        
         | robmccoll wrote:
         | This is not my experience on the latest version of Chrome
         | Android (142.0.7444.171). It did not crash for me.
        
       | pansa2 wrote:
       | > _Unlike other partial register writes, when writing to an e
       | register like eax, the architecture zeros the top 32 bits for
       | free._
       | 
       | I'm familiar with 32-bit x86 assembly from writing it 10-20 years
       | ago. So I was aware of the benefit of xor in general, but the
       | above quote was new to me.
       | 
       | I don't have any experience with 64-bit assembly - is there a
       | guide anywhere that teaches 64-bit specifics like the above?
       | Something like "x64 for those who know x86"?
        
         | veltas wrote:
         | Chapter 3 of volume 1, ctrl+f for "64-bit mode", has a lot of
         | the essentials including e.g. the stuff about zeroing out the
         | top half of the register.
         | 
         | https://www.intel.com/content/www/us/en/developer/articles/t...
        
         | matt_d wrote:
         | See
         | https://github.com/MattPD/cpplinks/blob/master/assembly.x86....
         | - mostly focused on x86-64 (and some of the talks/tutorials
         | offer pretty good overview)
        
         | sparkie wrote:
         | It's not only xor that does this, but most 32-bit operations
         | zero-extend the result of the 64-bit register. AMD did this for
         | backward compatibility, so existing programs would mostly
         | continue working, unlike Intel's earlier attempt at 64-bits
         | which was an entirely new design.
         | 
         | The reason `xor eax,eax` is preferred to `xor rax,rax` is due
         | to how the instructions are encoded - it saves one byte which
         | in turn reduces instruction cache usage.
         | 
         | When using 64-bit operations, a REX prefix is required on the
         | instruction (byte 0x40..0x4F), which serves two purposes - the
         | MSB of the low nybble (W) being set (ie, REX prefixes
         | 0x48..0x4f) indicates a 64-bit operation, and the low 3 bits of
         | low nybble allow using registers r8-r15 by providing an extra
         | bit for the ModRM register field and the base and index fields
         | in the SIB byte, as only 3-bits (8-registers) are provided by
         | x86.
         | 
         | A recent addition, APX, adds an additional 16 registers
         | (r16-r31), which need 2 additional bits. There's a REX2 prefix
         | for this (0xD5 ...), which is a two byte prefix to the
         | instruction. REX2 replaces the REX prefix when accessing
         | r16-r31, still contains the W bit, but it also includes an `M0`
         | bit, which says which of the two main opcode maps to use, which
         | replaces the 0x0F prefix, so it has no additional cost over the
         | REX prefix when accessing the second opcode map.
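
A minimal sketch of the REX bit layout described above, assuming the
standard 0100WRXB layout of the prefix byte. The helper name `decode_rex`
is made up for illustration; the two byte strings are the usual encodings
of `xor eax, eax` and `xor rax, rax`.

```python
def decode_rex(byte):
    """Split a REX prefix byte (0x40..0x4F) into its W, R, X and B bits."""
    if not 0x40 <= byte <= 0x4F:
        raise ValueError("not a REX prefix")
    return {
        "W": (byte >> 3) & 1,  # 1 selects a 64-bit operand size
        "R": (byte >> 2) & 1,  # extra bit for the ModRM register field
        "X": (byte >> 1) & 1,  # extra bit for the SIB index field
        "B": byte & 1,         # extra bit for the ModRM r/m or SIB base field
    }

XOR_EAX_EAX = bytes([0x31, 0xC0])        # xor eax, eax
XOR_RAX_RAX = bytes([0x48, 0x31, 0xC0])  # xor rax, rax (same opcode plus REX.W)

# The 64-bit form costs exactly one extra byte: the REX prefix.
extra_bytes = len(XOR_RAX_RAX) - len(XOR_EAX_EAX)
```

Here `decode_rex(0x48)` reports only the W bit set, which is why 0x48 is
the prefix that turns the 32-bit form into the 64-bit one.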
        
           | cesarb wrote:
           | > It's not only xor that does this, but most 32-bit
           | operations zero-extend the result of the 64-bit register. AMD
           | did this for backward compatibility.
           | 
           | It's not just that, zero-extending or sign-extending the
           | result is also better for out-of-order implementations. If
           | parts of the output register are preserved, the instruction
           | needs an extra dependency on the original value.
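
The dependency point can be sketched with a toy model of a 64-bit
register write. This is only an illustration: `write_reg` is a
hypothetical helper, and the high-byte registers (ah and friends) are not
modelled.

```python
MASK64 = (1 << 64) - 1

def write_reg(old, value, width):
    """Return the new 64-bit register contents after a write of `width` bits."""
    if width == 64:
        return value & MASK64
    if width == 32:
        # 32-bit writes zero-extend, so the result never depends on `old`.
        return value & 0xFFFFFFFF
    if width in (8, 16):
        # Low-byte and low-word writes keep the upper bits, so `old` is an input.
        mask = (1 << width) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("unsupported width")
```

Only the 8-bit and 16-bit branches read `old`, which is the extra
dependency an out-of-order core would have to track.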
        
             | ychen306 wrote:
             | This. It's for renaming.
        
           | nickelpro wrote:
           | Except for `xchg eax, eax`, which was the canonical nop on
           | x86. Because it was supposed to do nothing, having it zero
           | out the top 32-bits of rax would be quite surprising. So it
           | doesn't.
           | 
           | Instead you need to use the multi-byte, general purpose
           | encoding of `xchg` for `xchg eax, eax` to get the expected
           | behavior.
        
       | snvzz wrote:
       | Because, unlike RISC-V, x86 has no x0 register.
        
         | jabl wrote:
         | From your past posting history, I presume that you're implying
         | this makes RISC-V better?
         | 
         | Do we have any data showing that having a dedicated zero
         | register is better than a short and canonical instruction for
         | zeroing an arbitrary register?
        
           | kevin_thibedeau wrote:
           | It's a definite liability on a machine with only 8 general
           | purpose registers. Losing 12% of the register space for a
           | constant would be a waste of hardware.
        
             | menaerus wrote:
             | 8 registers? Ever heard of register renaming?
        
               | Polizeiposaune wrote:
               | Ever heard of a loop that needed to keep more than 7
               | variables live? Register renaming helps with pipelining
               | and out-of-order execution, but instructions in the
               | program can only reference the architectural registers -
               | go beyond that and you end up needing to spill some
               | values to (architectural) memory.
               | 
               | There's a reason why AMD added r8-r15 to the
               | architecture, and why intel is adding r16-r31..
        
               | menaerus wrote:
               | I have but that was not the point? My first point was
               | exactly that there are more ISA registers and not only 8,
               | and therefore the question mark. My second point was
               | about register renaming which, contrary to what you say,
               | does mitigate the artifacts of running out of registers
               | by spilling the variables to the stack memory. It does it
               | by eliminating the false dependencies between
               | variables/registers and xor eax, eax is a great candidate
               | for that.
        
               | saagarjha wrote:
               | Register renaming does not let you avoid spills.
        
               | menaerus wrote:
               | Ok, it obviously doesn't increase the number of ISA
               | registers. What I am suggesting is something else -
               | imagine a situation in which the compiler understands
               | that the spill over will take place, and therefore
               | rearranges the code such that it reduces the pressure on
               | the registers. It can do that if it can break the data
               | dependencies between the variables for instance. Or it
               | can do that by unrolling the loops or by moving the
               | initialization closer to where the variable is being
               | used, no? I am pretty certain that compilers are already
               | doing these kinds of transformations, and in a sense this
               | is taking advantage of register renaming but indirectly.
        
               | account42 wrote:
               | That's irrelevant, the zero register would be taking a
               | slot in the limited register addressing bits in
               | instructions, not replace a physical register on the
               | chip.
        
               | kevin_thibedeau wrote:
               | 8086 doesn't have that.
        
           | dooglius wrote:
           | I think one could just pick a convention where a particular
           | GP register is zeroed at program startup and just make your
           | own zero register that way, getting all the benefits at very
           | small cost. The microarchitecture AIUI has a dedicated zero
           | register so any processor-level optimizations would still
           | apply.
        
             | pklausler wrote:
             | That's what was done on the CDC 6600 with two handy values,
             | B0 (0) and B1 (1).
        
           | gruez wrote:
           | It's basically the eternal debate of RISC vs CISC (x86). RISC
           | proponents claim RISC is better because it's simpler to
           | decode. CISC proponents retort that CISC means code can be
           | more compact, which helps with cache hits.
        
             | bluGill wrote:
             | In the real world there is no CISC or RISC anymore. RISC is
             | always extended to some new feature and suddenly becomes
             | more complex. Meanwhile CISC is just a decoder over a RISC
             | processor. Either way you get the best of both worlds:
             | simple hardware (the RISC internals) and CISC instructions
             | that do what you need.
             | 
             | Don't get too carried away in the above, x86 is still a lot
             | more complex than ARM or RISC-V. However the complexity is
             | only a tiny part of a CPU and so it doesn't matter.
        
           | phire wrote:
           | The zero register helps RISC-V (and MIPS before it) really
           | cut down on the number of instructions, and hardware
           | complexity.
           | 
           | You don't need a mov instruction, you just OR with $zero. You
           | don't need a load immediate instruction you just ADDI/ORI
           | with $zero. You don't need a Neg instruction, you just SUB
           | with $zero. All your Compare-And-Branch instructions get a
           | compare with $zero variant for free.
           | 
           | I refuse to say this "zero register" approach is better, it
           | is part of a wide design with many interacting features. But
           | once you have 31 registers, it's quite cheap to allocate one
           | register to be zero, and may actually save encoding space
           | elsewhere. (And encoding space is always an issue with fixed
           | width instructions).
           | 
           | AArch64 takes the concept further, they have a register that
           | sometimes acts as the zero register (when used in ALU
           | instructions) and other times is the stack pointer (when used
           | in memory instructions and a few special stack instructions).
        
             | phkahler wrote:
             | >> The zero register helps RISC-V (and MIPS before it)
             | really cut down on the number of instructions, and hardware
             | complexity.
             | 
             | Which is funny because IMHO RISC-V instruction encoding is
             | garbage. It was all optimized around the idea of fixed
             | length 32-bit instructions. This leads to weird sized
             | immediates (12 bits?) and 2 instructions to load a 32 bit
             | constant. No support for 64 bit immediates. Then they
             | decided to have "compressed" instructions that are 16 bits,
             | so it's somewhat variable length anyway.
             | 
             | IMHO once all the vector, AI and graphics instructions are
             | nailed down they should make RISC-VI where it's almost the
             | same but re-encoding the instructions. Have sensible 16-bit
             | ones, 32-bit, and use immediate constants after the
             | opcodes. It seems like there is a _lot_ they could do to
             | clean it up - obviously not as much as x86 ;-)
        
               | zozbot234 wrote:
               | There's not a strong case for redoing the RISC-V encoding
               | with a new RISC-VI unless they run out of 32-bit encoding
               | space outright, due to e.g. extensive new vector-like or
               | AI-like instructions. And then they could free up a huge
               | amount of encoding space trivially by moving to a
               | 2-address format throughout with Rd=Rs1 and using a
               | simple instruction fusion approach MOV Rd <- Rs1; OP Rd <-
               | etc. for the former 3-address case.
               | 
               | (Any instruction that can be similarly rephrased as a
               | composition of more restricted elementary instructions is
               | also a candidate for this macro-insn approach.)
        
               | phkahler wrote:
               | >> Any instruction that can be similarly rephrased as a
               | composition of more restricted elementary instructions is
               | also a candidate for this macro-insn approach.
               | 
               | I really like the idea of composition or standard
               | prefixes. My favorite is the idea of replacing cmp/branch
               | with "if". Where the condition is a predicate for the
               | following instruction. For RISC-V it would eat a large
               | part of the 16bit opcodes. Some form of load/store might
               | be a good use for the remaining 16bit ops. Other things
               | that might be a good prefix could be encoding data types
               | (8,16,32,64 bit, sign extended, float, double) or a
               | source/destination register. It might be interesting to
               | see how a full ISA might be decomposed into smaller
               | instruction fragments.
        
               | zozbot234 wrote:
               | > "if". Where the condition is a predicate for the
               | following instruction
               | 
               | This is just a forward skip, which is optimized to a
               | predicated insn already in some implementations.
        
               | adgjlsfhk1 wrote:
               | IMO the riscv decoding is really elegant (arguably
               | excepting the C extension). Things like 64 bit immediates
               | are almost certainly a bad idea (as opposed to just
               | having a load from memory). Most 64 bit constants in use
               | can be sign extended from much smaller values, and for
               | those that can't, supporting 72 bit (or bigger)
               | instructions just to be able to load a 64 bit immediate
               | will necessarily bloat instruction cache, stall your
               | instruction decoder (or limit parallelism), and will only
               | be 2 cycles faster than a L1 cache load (if the
               | instruction is hot). 32 bit immediate would be kind of
               | nice, but the benefit is pretty small. An x86 instruction
               | with 32 bit immediate is 6 bytes, while the 2 RISC-V
               | instructions are 8 bytes. There have been proposals to
               | add 48 bit instructions, which would let RISC-V have 32
               | bit immediate support with the same 6 bytes as x86 (and
               | 12-byte, 2-instruction 64 bit loads vs 10 bytes for x86 in
               | the very rare situations where doing so will be faster
               | than a load).
               | 
               | ISA design is always a tradeoff;
               | https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le...
               | has some good details, but the TLDR is that RISC-V makes
               | reasonable choices for a fairly "boring" ISA.
        
               | Tuna-Fish wrote:
               | > Things like 64 bit immediates are almost certainly a
               | bad idea (as opposed to just having a load from memory)
               | 
               | Strongly disagree. Throughput is cheap, latency is
               | expensive. Any time you can fit a constant in the
               | instruction fetch stream is a win. This is especially
               | true for jump targets, because getting them resolved
               | faster both saves power and improves performance.
               | 
               | > Most 64 bit constants in use can be sign extended from
               | much smaller values
               | 
               | You should obviously also have smaller load instructions.
               | 
               | > will necessarily bloat instruction cache, stall your
               | instruction decoder (or limit parallelism)
               | 
               | No, just have more fetch throughput.
               | 
               | > and will only be 2 cycles faster than a L1 cache load
               | 
               | Only on tiny machines will L1 cache load be 2 cycles. On
               | a reasonable high-end machine it will be 4-5 cycles, and
               | more critically (because the latency would usually be
               | masked well by OoO), the energy required to engage the
               | load path is orders of magnitude more than just getting
               | it from the fetch.
               | 
               | And that's when it's not a jump target, when it's a jump
               | target suddenly loading it using a load instruction adds
               | 12+ cycles of latency.
               | 
               | > TLDR is that RISC-V makes reasonable choices for a
               | fairly "boring" ISA.
               | 
               | No. Not even talking about constants, RISC-V makes insane
               | choices for essentially religious reasons. Can you
               | explain to me why, exactly, would you ever make jal take
               | a register operand, instead of using a fixed link
               | register and putting the spare bits into the address
               | immediate?
        
               | adgjlsfhk1 wrote:
               | > No, just have more fetch throughput.
               | 
               | Fetch throughput isn't unlimited. Modern x86 CPUs only
               | have ~16-32B/cycle (from L2 once you're out of the uop
               | cache). If you decode a single 10 byte instruction you're
               | already using up a huge amount of the available decode
               | bandwidth.
               | 
               | There absolutely are cases where a 64 bit load
               | instruction would be an advantage, but ISA design is
               | always a case of tradeoffs. Allowing 10 byte instructions
               | has real cost in decode complexity, instruction bandwidth
               | requirements, ensuring cacheline/pageline alignment etc.
               | You have to weigh against that how frequent the
               | instruction would be as well as what your alternative
               | options are. Most immediates are small, and many others
               | can be efficiently synthesized via 2 other instructions
               | (e.g. shifts/xors/nots) and any synthesis that is 2
               | instructions or fewer will be cheaper than doing a load
               | anyway. As a result you would end up massively
               | complicating your architecture/decoders to benefit a
               | fairly rare instruction which probably isn't worthwhile.
               | It's notable that aarch64 makes the same tradeoff here
               | and Apple's M series processors have an IPC advantage
               | over the best x86.
               | 
               | > Can you explain to me why, exactly, would you ever make
               | jal take a register operand, instead of using a fixed
               | link register and putting the spare bits into the address
               | immediate?
               | 
               | This mostly seems like a mistake to me. The rationale
               | probably is that you need the other instructions anyway
               | (not all jumps are returns), so adding a jal that doesn't
               | take a register would take a decent percentage of the
               | opspace, but the extra 5 bits would be very nice.
        
               | kruador wrote:
               | ARM64 also has fixed length 32-bit instructions. Yes,
               | immediates are normally small and it's not particularly
               | orthogonal as to how many bits are available.
               | 
               | The largest MOV available is 16 bits, but those 16 bits
               | can be shifted by 0, 16, 32 or 48 bits, so the worst case
               | for a 64-bit immediate is 4 instructions. Or the compiler
               | can decide to put the data in a PC-relative pool and use
               | ADR or ADRP to calculate the address.
               | 
               | ADD immediate is 12 bits but can optionally apply a
               | 12-bit left-shift to that immediate, so for immediates up
               | to 24 bits it can be done in two instructions.
               | 
               | ARM64 decoding is also pretty complex, _far_ less
               | orthogonal than ARM32. Then again, ARM32 was designed to
               | be decodable on a chip with 25,000 transistors, not where
               | you can spend thousands of transistors to decode a single
               | instruction.
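
The "worst case is 4 instructions" claim can be sketched as follows. This
is a simplified illustration, not what any particular compiler emits:
real toolchains also consider MOVN and logical-immediate encodings, and
`movz_movk_sequence` is a made-up helper name.

```python
def movz_movk_sequence(value, reg="x0"):
    """Build a MOVZ/MOVK sequence that loads a 64-bit constant into `reg`."""
    value &= (1 << 64) - 1
    # Split the constant into four 16-bit chunks at shifts 0, 16, 32 and 48.
    chunks = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    nonzero = [(shift, chunk) for shift, chunk in chunks if chunk != 0]
    if not nonzero:
        return [f"movz {reg}, #0x0"]
    lines = []
    for i, (shift, chunk) in enumerate(nonzero):
        # MOVZ zeroes the other bits; MOVK keeps them and patches one chunk.
        op = "movz" if i == 0 else "movk"
        lines.append(f"{op} {reg}, #{hex(chunk)}, lsl #{shift}")
    return lines
```

A constant with all four chunks non-zero needs four instructions, while a
constant such as 0x10000 needs a single MOVZ with a 16-bit shift.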
        
           | wongarsu wrote:
           | MIPS for example also has one, along with a similar number of
           | registers (~32). So it's not like RISC-V took a radical new
           | position here, they were able to look back at what worked and
           | what didn't, and decided that for their target a zero
           | register was the right tradeoff. It's certainly the more
           | "elegant" solution. A zero register is useful as input or
           | output register for all kinds of operations, not just for
           | zeroing
        
         | crote wrote:
         | And the other way around: RISC-V doesn't have a move
         | instruction so that's done as "dst = src + 0", and it doesn't
         | have a nop instruction so that's done as "x0 = x0 + 0". There's
         | like a dozen of them.
         | 
         | It's quite interesting what neat tricks roll out once you've
         | got a guaranteed zero register - it greatly reduces the number
         | of distinct instructions you need for what is basically the
         | same operation.
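
A toy model of the idea, assuming a 32-bit register file in which
register 0 always reads as zero and ignores writes. The class and
function names are hypothetical; the expansions mirror the standard
RISC-V pseudo-instructions (mv, li for small immediates, neg, nop).

```python
class RegFile:
    """32 registers of 32 bits each; register 0 is hardwired to zero."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self._regs[n]

    def write(self, n, value):
        if n != 0:  # writes to x0 are silently discarded
            self._regs[n] = value & 0xFFFFFFFF

# Two real instructions.
def addi(rf, rd, rs1, imm):
    rf.write(rd, rf.read(rs1) + imm)

def sub(rf, rd, rs1, rs2):
    rf.write(rd, rf.read(rs1) - rf.read(rs2))

# Pseudo-instructions built from them using x0.
def mv(rf, rd, rs):
    addi(rf, rd, rs, 0)   # dst = src + 0

def li(rf, rd, imm):
    addi(rf, rd, 0, imm)  # dst = x0 + imm (small immediates only)

def neg(rf, rd, rs):
    sub(rf, rd, 0, rs)    # dst = x0 - src

def nop(rf):
    addi(rf, 0, 0, 0)     # x0 = x0 + 0
```

Four operations fall out of two instructions here, with no dedicated
opcode for any of them.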
        
           | dist1ll wrote:
           | Another one is "jalr x0, imm(x0)", which turns an indirect
           | branch into a direct jump to address "imm" in a single
           | instruction w/o clobbering a register. Pretty neat.
        
           | Findecanor wrote:
           | There is a `c.mv` instruction in the compressed set, which
           | most RISC-V processors implement.
           | 
           | That, _and_ `add rd, rs, x0` could (like the zeroing idiom on
           | x86), run entirely in the decoding and register-renaming
           | stages of a processor.
           | 
           | RISC-V does actually have quite a few idioms. Some idioms are
           | multi-instruction sequences ("macro ops") that could get
           | folded into single micro-ops ("macro-op fusion"/"instruction
           | fusion"): for example `lui` followed by `addi` for loading a
           | 32-bit constant, and left shift followed by right shift for
           | extracting a bitfield.
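
The `lui` + `addi` idiom has one wrinkle worth sketching: `addi`
sign-extends its 12-bit immediate, so the upper 20 bits sometimes need a
+1 adjustment. A minimal sketch for 32-bit registers, with made-up helper
names `split_lui_addi` and `combine`:

```python
def split_lui_addi(value):
    """Split a 32-bit constant into (hi20, lo12) for `lui` followed by `addi`."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000  # addi will see this as a negative 12-bit immediate
    # Subtracting the signed low part bumps hi by one when lo is negative.
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo

def combine(hi, lo):
    """Model what lui (hi << 12) followed by addi (+ lo) leaves in a 32-bit register."""
    return ((hi << 12) + lo) & 0xFFFFFFFF
```

For 0xDEADBEEF the low 12 bits are 0xEEF, which `addi` reads as -0x111,
so the `lui` immediate becomes 0xDEADC rather than 0xDEADB.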
        
           | kruador wrote:
           | ARM64 assembly has a MOV instruction, but for most of the
           | ways it's used, it's an alias in the assembler to something
           | else. For example, MOV between two registers actually
           | generates ORR rd, rZR, rm, i.e. rd := (zero-register) OR rm.
           | Or, a MOV with a small immediate is ORR rd, rZR, #imm.
           | 
           | If trying to set the stack pointer, or copy the stack
           | pointer, instead the underlying instruction is ADD SP, Xn, #0
           | i.e. SP = Xn + 0. This is because the stack pointer and zero
           | register are both encoded as register 31 (11111). Some
           | instructions allow you to use the zero register, others the
           | stack pointer. Presumably ORR uses the zero register and ADD
           | the stack pointer.
           | 
           | NOP maps to HINT #0. There are 128 HINT values available;
           | anything not implemented on this processor executes as a NOP.
           | 
           | There are other operations that are aliased like CMP Xm, Xn
           | is really an alias for SUBS XZR, Xm, Xn: subtract Xn from Xm,
           | store the result in the zero register [i.e. discard it], and
           | set the flags. RISC-V doesn't have flags, of course. ARM Ltd
           | clearly considered them still useful.
           | 
           | There are other oddities, things like 'rotate right' is
           | encoded as 'extract register from pair of registers', but it
           | specifies the same source register twice.
           | 
           | Disassemblers do their best to hide this from you. ARM list a
           | 'preferred decoding' for any instruction that has aliases, to
           | map back to a more meaningful alias wherever possible.
        
         | gpderetta wrote:
         | x86 doesn't need a zero register as it can encode constants in
         | the instruction itself.
        
         | Findecanor wrote:
         | x86 has no architectural zero register, but a x86 CPU could
         | have a _microarchitectural_ zero register.
         | 
         | And when the instruction decoder in such a CPU with register
         | renaming sees `xor eax, eax`, it just makes `eax` point to the
         | zero register for instructions after it. It does not have to
         | put any instruction into the pipeline, and it takes effectively
         | 0 cycles. That is what makes the "zeroing idiom" so powerful.
        
       | omnicognate wrote:
       | It happens to be the first instruction of the first snippet in
       | the wonderful xchg rax,rax.
       | 
       | https://www.xorpd.net/pages/xchg_rax/snip_00.html
        
         | dooglius wrote:
         | Not sure what I am looking at here. Is this just a bunch of
         | different ways to zero registers?
        
           | omnicognate wrote:
           | It's a collection of interesting assembly snippets ("gems and
           | riddles" in the author's words) presented without commentary.
           | People have posted annotated "solutions" online, but figuring
           | out what the snippets do and why they are interesting is the
           | fun of it.
           | 
           | It's also available as an inscrutable printed book on Amazon.
        
         | mubou2 wrote:
         | That music when you click "int" is awesome. Reminds me of the
         | good ol' days of keygens.
        
           | Audiophilip wrote:
           | It's a chiptune-style xm module, "Funky Stars" by Quazar:
           | https://soundcloud.com/scene_music/funky-stars
        
           | therein wrote:
           | Keygen music will always have a special place in my heart.
           | This is a good one.
           | 
           | I do wonder who was the first cracker that thought of
           | including a keygen music that started the tradition.
           | 
           | I also miss how different groups competed with each other and
           | boasted about theirs while dissing others in readmes.
           | 
           | Readmes would have a .NFO suffix, and that would try to load in
           | some Windows tool but you had to open them in notepad. Good
           | times.
        
       | eb0la wrote:
       | I remember a lot of code zeroing registers, dating _at least_
       | back from the IBM PC XT days (before the 80286).
       | 
       | If you decode the instruction, it makes sense to use XOR:
       | 
       | - mov ax, 0 - needs 4 bytes (66 b8 00 00)
       | - xor ax,ax - needs 3 bytes (66 31 c0)
       | 
       | This extra byte in a machine with less than 1 Megabyte of memory
       | _did_ matter.
       | 
       | In 386 processors it was also:
       | 
       | - mov eax,0 - needs 5 bytes (b8 00 00 00 00)
       | - xor eax,eax - needs 2 bytes (31 c0)
       | 
       | Here Intel made the decision to use only 2 bytes. I bet this
       | helps both the instruction decoder and (of course) saves more
       | memory than the old 8086 instruction.
        
         | vardump wrote:
         | > - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs
         | 3 bytes (66 31 c0)
         | 
         | You don't need operand size prefix 0x66 when running 16 bit
         | code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax"
         | is just 2 bytes.
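
The byte counts discussed here can be tabulated directly. A small sketch
using only the encodings quoted in this subthread; the dictionary and
function names are made up for illustration.

```python
# 16-bit code (Real Mode): no operand-size prefix needed for ax.
ENCODINGS_16BIT = {
    "mov ax, 0": bytes.fromhex("b80000"),
    "xor ax, ax": bytes.fromhex("31c0"),
}

# 32-bit code: ax needs the 0x66 prefix, eax does not.
ENCODINGS_32BIT = {
    "mov ax, 0": bytes.fromhex("66b80000"),
    "xor ax, ax": bytes.fromhex("6631c0"),
    "mov eax, 0": bytes.fromhex("b800000000"),
    "xor eax, eax": bytes.fromhex("31c0"),
}

def bytes_saved(table, mov, xor):
    """How many bytes the xor form saves over the mov form in one table."""
    return len(table[mov]) - len(table[xor])
```

The saving is 1 byte for ax in either mode and 3 bytes for eax in 32-bit
code, because the mov form carries the full-width zero immediate.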
        
           | eb0la wrote:
           | My fault: I just compiled the instruction with an assembler
           | instead of looking up the actual instruction from
           | documentation.
           | 
           | It makes much more sense: resetting ax and bx (xor ax,ax ;
           | xor bx,bx) will be 4 octets, DWORD aligned, and a bit faster
           | to fetch by the x86 than the 3-octet version I wrote before.
        
         | RHSeeger wrote:
         | > the IBM PC XT days (before the 80286)
         | 
         | Fun fact - the IBM PC XT also came in a 286 model (the XT 286).
        
           | eb0la wrote:
           | You're right. I forgot that!
        
         | Anarch157a wrote:
         | I don't know enough of the 8086 so I don't know if this works
         | the same, but on the Z80 (which means it was probably true for
         | the 8080 too), XOR A would also clear pretty much all bits on
         | the flag register, meaning the flags would be in a known state
         | before doing something that could affect them.
        
           | vanderZwan wrote:
           | Which I guess is the same reason why modern Intel CPU
           | pipelines can rely on it for pipelining.
        
         | Someone wrote:
         | > If you decode the instruction, it makes sense to use XOR:
         | 
         | > - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs
         | 3 bytes (66 31 c0)
         | 
         | Except, apparently, on the Pentium Pro, according to this
         | comment:
         | https://randomascii.wordpress.com/2012/12/29/the-surprising-...,
         | which says:
         | 
         |  _"But there was at least one out-of-order design that did not
         | recognize xor reg, reg as a special case: the Pentium Pro. The
         | Intel Optimization manuals for the Pentium Pro recommended
         | "mov" to zero a register."_
        
         | Sharlin wrote:
         | As the author says, a couple of extra bytes _still_ matter,
         | perhaps more than 20ish years ago. There are vast amounts of
         | RAM, sure, but it's glacially slow, and there's only a few
         | tens of kBs of L1 instruction cache.
         | 
         | Never mind the fact that, as the author also mentions, the xor
         | idiom takes essentially zero cycles to execute because nothing
         | actually _happens_ besides assigning a new pre-zeroed physical
         | register to the logical register name early on in the pipeline,
         | after which the instruction is retired.
        
           | cogman10 wrote:
           | L1 instruction cache is backed by L2 and L3 caches.
           | 
           | For the AMD 9950, we are talking about 1280kb of L1 (per
           | core). 16MB of L2 (per core) and 64MB of L3 (shared, 128 if
           | you have the X3D version).
           | 
           | I won't say it doesn't matter, but it doesn't matter as much
           | as it once did. CPU caches have gotten huge while the
           | instructions remain the same size.
           | 
           | The more important part, at this point, is it's idiomatic.
           | That means hardware designers are much more likely to put in
           | specialty logic to make sure it's fast. It's a common enough
           | operation to deserve its own special cases. You can fit a
           | lot of 8 byte instructions into 1280kb of memory. And as it
           | turns out, it's pretty common for applications to spend a lot
           | of their time in small chunks of instructions. The slow part
           | of a lot of code will be that `for loop` with the 30 AVX
           | instructions doing magic. That's why you'll often see
           | compilers burn `NOP` instructions to align a loop. That's to
           | avoid splitting a cache line.
        
             | Sharlin wrote:
             | > For the AMD 9950, we are talking about 1280kb of L1 (per
             | core). 16MB of L2 (per core)
             | 
             | Ryzen 9 CPUs have 1280kB of L1 _in total_. 80kB (48+32) per
             | core, and the 9 series is the first in the _entire history_
             | of Ryzens to have some other number than 64 (32+32)
             | kilobytes of L1 per core. The 16MB L2 figure is also
             | _total_. 1MB per core, same as the 7 series. AMD obviously
             | touts the total, not per-core, amounts in their marketing
             | materials because it looks more impressive.
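
The per-core versus total arithmetic in the comment above can be checked with a short sketch. The per-core and total figures are the ones quoted in the thread; the constant and function names are illustrative only.

```python
# Per-core figures quoted in the thread for the Ryzen 9 9000 series.
L1I_KB_PER_CORE = 32   # instruction cache
L1D_KB_PER_CORE = 48   # data cache
L2_MB_PER_CORE = 1

# Marketing totals quoted in the thread.
TOTAL_L1_KB = 1280
TOTAL_L2_MB = 16


def implied_core_count(total: int, per_core: int) -> int:
    """Return how many cores a total implies, insisting on an exact fit."""
    cores, remainder = divmod(total, per_core)
    if remainder:
        raise ValueError("total is not a whole multiple of the per-core size")
    return cores


l1_per_core_kb = L1I_KB_PER_CORE + L1D_KB_PER_CORE
cores_from_l1 = implied_core_count(TOTAL_L1_KB, l1_per_core_kb)
cores_from_l2 = implied_core_count(TOTAL_L2_MB, L2_MB_PER_CORE)

# Only the instruction cache matters for code size, so total it separately.
total_l1i_kb = cores_from_l1 * L1I_KB_PER_CORE

print(f"L1 per core: {l1_per_core_kb} kB, cores implied: {cores_from_l1}")
print(f"L1i across all cores: {total_l1i_kb} kB")
```

Both totals imply the same core count, which is what you would expect if they are sums over cores rather than per-core sizes.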
        
               | kbolino wrote:
               | Also, rather importantly, the L1 _i_ (instruction) cache
               | is still only 32 kB. The part that got bigger, the 48 kB
               | of L1 _d_ (data) cache, does not count for this purpose.
        
               | monocasa wrote:
               | Yeah, the reason for that is that it's expensive in PPA
               | for the size of an L1 cache to exceed number of ways
               | times page size. The jump to 48kB was also a jump to 12
               | way set associative.
               | 
               | As an aside, zen 1 did actually have a 64kB (and only 4
               | way!) L1I cache, but changed to the page size times way
               | count restriction with zen 2, reducing the L1 size by
               | half.
               | 
               | You can also see this on the apple side, where their
               | giant 192kB L1I caches are 12 ways with a 16kB page size.
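
The "ways times page size" rule of thumb from the comment above, as a small sketch. The cache sizes and way counts are the ones quoted in the thread; the 4 kB page size for the x86 examples is an assumption (the usual base page size), and the function names are hypothetical.

```python
import math


def max_l1_size_kb(ways: int, page_kb: int) -> int:
    """Largest L1 size that stays within ways * page size."""
    return ways * page_kb


def within_limit(size_kb: int, ways: int, page_kb: int) -> bool:
    """True if the cache respects the ways * page size restriction."""
    return size_kb <= max_l1_size_kb(ways, page_kb)


def min_ways_needed(size_kb: int, page_kb: int) -> int:
    """Fewest ways that keep a cache of this size within the limit."""
    return math.ceil(size_kb / page_kb)


X86_PAGE_KB = 4      # assumption: standard 4 kB base pages
APPLE_PAGE_KB = 16   # page size quoted in the thread

print(within_limit(48, 12, X86_PAGE_KB))     # 48 kB, 12-way
print(within_limit(192, 12, APPLE_PAGE_KB))  # 192 kB, 12-way, 16 kB pages
print(within_limit(64, 4, X86_PAGE_KB))      # Zen 1's 64 kB, 4-way L1I
print(min_ways_needed(32, X86_PAGE_KB))      # ways needed for a 32 kB cache
```

Under these assumptions the 48 kB and 192 kB designs sit exactly at the limit, while the 64 kB 4-way cache exceeds it.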
        
             | gpderetta wrote:
             | Instruction caches also prefetch very well, as long as
             | branch prediction is good. Of course on a misprediction you
             | might also suffer a cache miss in addition to the normal
             | penalty.
        
           | umanwizard wrote:
           | > nothing actually happens besides assigning a new pre-zeroed
           | physical register to the logical register name early on in
           | the pipeline, after which the instruction is retired.
           | 
           | This is slightly inaccurate -- instructions retire in order,
           | so it doesn't necessarily retire immediately after it's
           | decoded and the new zeroed register is assigned. It has to
           | sit in the reorder buffer waiting until all the instructions
           | ahead of it are retired as well.
           | 
           | Thus in workloads where reorder buffer size is a bottleneck,
           | it could contribute to that. However I doubt this describes
           | most workloads.
        
             | Sharlin wrote:
             | Thanks, that makes sense.
        
         | chasd00 wrote:
         | > - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs
         | 3 bytes (66 31 c0)
         | 
         | iirc doesn't word alignment matter? I have no idea if this is
         | how the IBM PC XT was aligned but if you had 4 byte words then
         | it doesn't matter if you save a byte with xor because you
         | wouldn't be able to use it for anything else anyway. again,
         | iirc.
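
A rough sketch of the size comparison quoted above. The 16-bit encodings are the bytes from the thread; the 32-bit rows are the standard encodings added for comparison. x86 instructions are variable-length and packed back to back at byte granularity, so the hypothetical `pack` helper below simply concatenates them without padding.

```python
# Machine-code bytes. The 0x66 operand-size prefix selects the 16-bit
# form when running in 32-bit or 64-bit mode.
ENCODINGS = {
    "mov ax, 0": bytes.fromhex("66 b8 00 00"),      # quoted in the thread
    "xor ax, ax": bytes.fromhex("66 31 c0"),        # quoted in the thread
    "mov eax, 0": bytes.fromhex("b8 00 00 00 00"),  # added for comparison
    "xor eax, eax": bytes.fromhex("31 c0"),         # added for comparison
}


def encoded_length(mnemonic: str) -> int:
    """Number of bytes the instruction occupies."""
    return len(ENCODINGS[mnemonic])


def bytes_saved(long_form: str, short_form: str) -> int:
    """Bytes saved by choosing short_form over long_form."""
    return encoded_length(long_form) - encoded_length(short_form)


def pack(mnemonics: list[str]) -> bytes:
    """Concatenate instructions with no alignment padding between them."""
    return b"".join(ENCODINGS[m] for m in mnemonics)


print(bytes_saved("mov ax, 0", "xor ax, ax"))
print(bytes_saved("mov eax, 0", "xor eax, eax"))
print(pack(["xor eax, eax", "xor ax, ax"]).hex(" "))
```

Because nothing forces instructions onto word boundaries, every byte saved is available to the next instruction in the stream.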
        
       | deadcore wrote:
       | Matt Godbolt also uploads to his self-titled YouTube channel:
       | https://www.youtube.com/watch?v=eLjZ48gqbyg
        
         | vanderZwan wrote:
         | Not sure why you got downvoted for pointing that out - it might
         | be linked at the end of the article but people can still miss
         | that.
        
           | deadcore wrote:
           | *shrugs* the internet being the internet I suppose.
           | 
           | There was "See the video that accompanies this post." but NGL
           | was just posting in case anyone didn't have time to read or
           | missed it.
        
       | pclmulqdq wrote:
       | In modern CPUs, a lot of these are recognized as zeroing idioms
       | and they end up doing the same thing (often a register renaming
       | trick). Using the shortest one makes sense. If you use a really
       | weird zeroing pattern, you can also see it as a backend uop while
       | many of these zeroing idioms are elided by the frontend on some
       | cores.
        
       | Dwedit wrote:
       | Because "sub eax,eax" looks stupid. (and also clears the carry
       | flag, unlike "xor eax, eax")
        
         | tom_ wrote:
         | xor clears the carry as well? In fact, looks like xor and sub
         | affect the same set of flags!
         | 
         | xor:
         | 
         | > The OF and CF flags are cleared; the SF, ZF, and PF flags are
         | set according to the result. The state of the AF flag is
         | undefined.
         | 
         | sub:
         | 
         | > The OF, SF, ZF, AF, PF, and CF flags are set according to the
         | result.
         | 
         | (I don't have an x64 system handy, but hopefully the reference
         | manual can be trusted. I dimly remembered this, or something
         | like it, tripping me up after coming from programming for the
         | 6502.)
        
           | trollbridge wrote:
           | This is a good thing since the pipeline now doesn't have to
           | track the state of the flags since they all got zero'd.
        
           | sfink wrote:
           | Strangely, the only difference on the flags is that AF
           | (auxiliary carry) is undefined for `xor eax, eax` but
           | guaranteed to be zeroed for `sub eax, eax`. I don't know what
           | that means in practice, though I'm guessing that at the very
           | least the hardware would not treat it as a dependency on the
           | previous value.
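
To make the flag comparison above concrete, here is a small model of the two manual excerpts quoted earlier in this subthread. It is a sketch of the documented flag rules, not of any real hardware; `None` stands for "undefined", and the function names are made up for illustration.

```python
def parity_flag(result: int) -> int:
    """PF is set when the low byte of the result has an even number of 1 bits."""
    return int(bin(result & 0xFF).count("1") % 2 == 0)


def flags_after_xor(a: int, b: int, bits: int = 32) -> dict:
    """OF and CF cleared; SF, ZF, PF from the result; AF undefined."""
    mask = (1 << bits) - 1
    r = (a ^ b) & mask
    return {"OF": 0, "CF": 0, "SF": r >> (bits - 1), "ZF": int(r == 0),
            "PF": parity_flag(r), "AF": None}


def flags_after_sub(a: int, b: int, bits: int = 32) -> dict:
    """All six flags are set according to the result of a - b."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    r = (a - b) & mask
    sign = bits - 1
    return {
        "OF": ((a ^ b) & (a ^ r)) >> sign & 1,  # signed overflow
        "CF": int(a < b),                       # unsigned borrow
        "SF": r >> sign,
        "ZF": int(r == 0),
        "PF": parity_flag(r),
        "AF": int((a & 0xF) < (b & 0xF)),       # borrow out of the low nibble
    }


eax = 0xDEADBEEF  # arbitrary previous register value
print(flags_after_xor(eax, eax))
print(flags_after_sub(eax, eax))
```

When both operands are the same register, the two dictionaries agree on every flag except AF, whatever the previous value was.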
        
         | HackerThemAll wrote:
         | If I remember correctly, sub used to be slower than xor on some
         | ancient architectures.
        
       | sylware wrote:
       | Remnant of RISC attempt without a zero register.
        
       | fooker wrote:
       | It's funny how machine code is a high level language nowadays,
       | for this example the CPU recognizes the zeroing pattern and does
       | something quite a bit different.
        
         | Reubensson wrote:
         | What do you mean that the CPU does something different? Isn't
         | the CPU doing what is being asked, that being xor with
         | consequence of zeroing when given two same values?
        
           | IsTom wrote:
           | I think OP means that it has come a long way from the simple
           | mental model of uops being a direct execution of operations,
           | what with all the register renaming and so on.
        
           | dooglius wrote:
           | FTA:
           | 
           | > And, having done that it removes the operation from the
           | execution queue - that is the xor takes zero execution
           | cycles! It's essentially optimised out by the CPU
        
           | fooker wrote:
           | Same consequence yes.
           | 
           | But it will not execute xor, nor will it actually zero out
           | eax in most cases.
           | 
           | It'll do something similar to constant propagation with the
           | information that whenever xor eax, eax occurs; all uses of
           | eax go through a simpler execution path until eax is
           | overwritten.
        
           | 12_throw_away wrote:
           | > with consequence of zeroing when given two same values
           | 
           | Right, it has the same consequence, but it doesn't actually
           | perform the stated operation. ASM is now just a high
           | level language that tells the computer to "please give me the
           | same state that a PDP-11-like computer would give me upon
           | executing these instructions."
        
         | dheatov wrote:
         | It's really impressive how powerful and efficient it has
         | become. However, I find it so much more difficult to build a
         | mental model of it. I've been struggling with atomics and r/w
         | barriers as there are sooo many ways the instructions could've
         | been executed (or not executed!).
        
           | fooker wrote:
           | It's a consequence of keeping our general purpose single
           | threaded programming model the same for five decades.
           | 
           | It has its merits, but the underlying hardware has changed.
           | 
           | Intel tried to push this responsibility to the compiler with
           | Itanium but that failed catastrophically, so we're back to
           | the CPU pretending it's 1985.
        
       | silverfrost wrote:
       | Back on the Z80 'xor a' is the shortest sequence to zero A
        
       | fortran77 wrote:
       | Back when I did IBM 370 BAL Assembly Language, we did the same
       | thing to clear a register to zero.
       | 
       |     XR   15,15         XOR REGISTER 15 WITH REGISTER 15
       | 
       | vs
       | 
       |     L    15,=F'0'      LOAD REGISTER 15 WITH 0
       | 
       | This was alleged to be faster on the 370 because XR operated
       | entirely within the CPU registers, and L (Load) fetched data
       | from memory (i.e., the constant came from program memory).
        
       | vanderZwan wrote:
       | > _In my 6502 hacking days, the presence of an exclusive OR was a
       | sure-fire indicator you'd either found the encryption part of the
       | code, or some kind of sprite routine._
       | 
       | Meanwhile, people like me who got started with a Z80 instead
       | immediately knew why, since XOR A is the smallest _and_ fastest
       | way to clear the accumulator and flag register. Funny how that
       | also shows how specific this is to a particular CPU lineage or
       | its offshoots.
        
       | jgrahamc wrote:
       | _In my 6502 hacking days, the presence of an exclusive OR was a
       | sure-fire indicator you'd either found the encryption part of the
       | code, or some kind of sprite routine._
       | 
       | Yeah, sadly the 6502 didn't allow you to do EOR A; while the Z80
       | did allow XOR A. If I remember correctly XOR A was AF and LD A, 0
       | was 3E 01[1]. So saved a whole byte! And I think the XOR was 3
       | clock cycles faster than the LD. So less space taken up by the
       | instruction and faster.
       | 
       | I have a very distinct memory in my first job (writing x86
       | assembly) of the CEO walking up behind my desk and pointing out
       | that I'd done MOV AX, 0 when I could have done XOR AX, AX.
       | 
       | [1] 3E 00
        
         | vanderZwan wrote:
         | Hah, we commented on the exact same paragraph within a minute
         | of each other! My memory agrees with your memory, although I
         | think that should be 3E 00. Let me look that up:
         | 
         | https://jnz.dk/z80/ld_r_n.html
         | 
         | https://jnz.dk/z80/xor_r.html
         | 
         | Yep, if I'm reading this right that's 3E 00, since the second
         | byte is the immediate value.
         | 
         | One difference between XOR and LD is that LD A, 0 does not
         | affect flags, which sometimes mattered.
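
The Z80 trade-off discussed above, summarised as a small table. The opcode bytes are the ones confirmed in the thread; the T-state counts (4 and 7) come from the standard Z80 timing tables rather than from the thread, and the field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Z80Instruction:
    mnemonic: str
    opcode: bytes        # full encoding, including any immediate
    t_states: int        # clock cycles per standard Z80 timing tables
    affects_flags: bool  # whether the flag register is modified


XOR_A = Z80Instruction("XOR A", bytes([0xAF]), 4, True)
LD_A_0 = Z80Instruction("LD A, 0", bytes([0x3E, 0x00]), 7, False)


def compare(short: Z80Instruction, long: Z80Instruction) -> tuple[int, int]:
    """Return (bytes saved, T-states saved) by using short instead of long."""
    return (len(long.opcode) - len(short.opcode),
            long.t_states - short.t_states)


bytes_saved, cycles_saved = compare(XOR_A, LD_A_0)
print(f"XOR A saves {bytes_saved} byte(s) and {cycles_saved} T-state(s)")
print(f"LD A, 0 preserves flags: {not LD_A_0.affects_flags}")
```

The flag column is the reason `LD A, 0` still had its uses: it is the one to pick when a preceding comparison result has to survive the clear.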
        
           | jgrahamc wrote:
           | You're right. Of course, it's 3E 00. Not sure how I
           | remembered 3E 01. My only excuse is that it was 40 years ago!
        
           | sfink wrote:
           | What is this "LD A, 0" syntax? Is it a z80 thing?
           | 
           | One of the random things burned into my memory for 6502
           | assembly is that LDA is $A9. I never separated the
           | instruction from the register; it's not like they were
           | general purpose. But that might be because I learned
           | programming from the 2 books that came with my C64, a BASIC
           | manual and a machine code reference manual, and that's how
           | they did it.
           | 
           | I learned assembly programming by reading through the list of
           | supported instructions. That, and typing in games from
           | Compute's Gazette and manually disassembling the DATA
           | instructions to understand how they worked. Oh, and the zero-
           | page reference.
           | 
           | Good times.
        
             | Narishma wrote:
             | > One of the random things burned into my memory for 6502
             | assembly is that LDA is $A9. I never separated the
             | instruction from the register; it's not like they were
             | general purpose.
             | 
             | You had LDA and LDX and LDY as separate instructions while
             | the Z80 assembler had a single LD instruction with
             | different operands. It's the same thing really.
        
         | wavemode wrote:
         | > CEO walking up behind my desk and pointing out that I'd done
         | MOV AX, 0 when I could have done XOR AX, AX
         | 
         | Now that's what I call micromanagement.
         | 
         | (sorry couldn't resist)
        
           | jgrahamc wrote:
           | He was right though. We were memory and cycle constrained and
           | I'd wasted both!
        
           | xigoi wrote:
           | The real joke is that a CEO had actual technical knowledge
           | instead of just being there for decoration.
        
           | mkornaukhov wrote:
           | Similarly, the CEO couldn't resist the outstanding
           | optimization of memory and execution speed!
        
             | 6510 wrote:
             | No one believes this story.
        
               | jgrahamc wrote:
               | I am sad you don't believe this story. The CEO was very
               | technical and this is exactly the sort of thing he would
               | spot.
        
               | bombcar wrote:
               | People don't realize that in the era of dinosaurs where
               | MASM ruled and assembly walked the earth, there basically
               | WEREN'T CEOs who didn't know the details, because all the
               | companies doing this stuff were pretty small at the time
               | (and the CEO may have been writing it himself a few years
               | before).
        
               | Analemma_ wrote:
               | There was a time when Bill Gates wrote code for
               | Microsoft, and he was actually quite good at it.
        
               | nomel wrote:
               | Not sure why this was voted down. He was very technical,
               | especially for the time:
               | https://www.thecrimson.com/article/2025/6/7/bill-gates-
               | reuni...
        
               | OrderlyTiamat wrote:
               | My first part time dev job as a student featured me
               | walking in on our CEO who showed me he was recompiling
               | his kernel to enable some features. I'm quite sure he was
               | just doing that to impress the students, but at least he
               | knew how to!
        
               | 6510 wrote:
               | Similarly, if you told people in the 80's that it would
               | be the opposite in the future no one would believe it
               | either.
               | 
               | Not even the developers are very technical in the future!
               | 
               | Woah, really? And they still manage to write good
               | software?
               | 
               | Of course not, if good software would be standing next to
               | their bed at 4 am they would scream who are you what are
               | you doing here? help! help! Someone, make it go away!
        
           | crest wrote:
           | I had to pad the code for alignment reasons. ;-)
        
           | ksherlock wrote:
           | I mean, he IS the Chief EORfficer
        
         | anonzzzies wrote:
         | 3E 00 : I was on MSX and never had an assembler when young so I
         | only remember the Hex, never actually knew the instructions; I
         | wrote programs/games by data 3E,00,CD,etc without comments
         | saying LD A as I never knew those at the time.
        
           | unnah wrote:
           | Umm... how did you manage to learn those hex codes? You just
           | read a lot of machine code and it started to make sense?
        
             | jgrahamc wrote:
             | I started out writing machine code without an assembler and
             | so had to hand assemble a lot of stuff. After a while you
             | end up just knowing the common codes and can write your
             | program directly. This was also useful because it was
             | possible to write or modify programs directly through an
             | interface sometimes called a "front panel" where you could
             | change individual bytes in memory.
             | 
             | Back in 1985 I did some hand-coding like this because I
             | didn't have access to an assembler:
             | https://blog.jgc.org/2013/04/how-i-coded-in-1985.html and I
             | typed the whole program in through the keypad.
        
               | stevekemp wrote:
               | Same here. On/For the ZX Spectrum, looking up the hex-
               | codes in the back of the orange book. At least it was
               | spiral-bound to make it easier.
               | 
               | Later still I'd be patching binaries to ensure their
               | serial-checks passed, on Intel.
        
             | kragen wrote:
             | The instruction sets were a lot simpler at the time. The
             | 8080 instruction set listing is only a few pages, and some
             | of that is instructions you rarely use like RRC and DAA.
             | The operand fields are always in the same place. My own
             | summary of the instruction set is at
             | https://dercuano.github.io/notes/8080-opcode-
             | map.html#addtoc....
        
             | amirhirsch wrote:
             | I implemented a PDP-11 in 2007-10 and I can still read
             | PDP-11 Octal
        
             | af78 wrote:
             | I had a similar experience of writing machine code for
             | Z80-based computers (Amstrad CPC) in the 90's, as a
             | teenager. I didn't have an assembler so I manually
             | converted mnemonics to hex. I still remember a few opcodes:
             | CD for CALL, C9 for RET, 01 for LD BC, 21 for LD HL...
             | Needless to say, the process was tedious and error-prone.
             | Calculating relative jumps was a pain. So was keeping track
             | of offsets and addresses of variables and jump targets. I
             | tended to insert nops to avoid having to recalculate
             | everything in case I needed to modify some code... I can't
             | say I miss these times.
             | 
             | I'm quite sure none of my friends knew any CPU opcode;
             | however, people usually remembered a few phone numbers.
        
             | senderista wrote:
             | It wasn't unusual in the 80s to type in machine code
             | listings to a PC; I remember doing this as an 8-year-old
             | from magazines, but I didn't understand any of the stuff I
             | was typing in.
        
             | anonzzzies wrote:
             | Typing from mags, getting interested in how the magic works
             | by learning to use a hex monitor and trying out things. I
             | was a kid so time enough.
             | 
             | I didn't know you could do it differently for years after I
             | started.
        
         | stevefan1999 wrote:
         | > In my 6502 hacking days, the presence of an exclusive OR was
         | a sure-fire indicator you'd either found the encryption part of
         | the code, or some kind of sprite routine.
         | 
         | Correct. Most ciphers of that era were Feistel ciphers in the
         | likes of DES/3DES, and even RC4 uses XOR too. Later
         | AES/Rijndael, CRC and ECC (Elliptic Curve Cryptography) also
         | make heavy use of XOR, but in finite field terms: arithmetic
         | over GF(2), where addition (in theory, addition mod 2)
         | effectively reduces to XOR.
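
A minimal sketch of the "addition mod 2 reduces to XOR" point above. Polynomials over GF(2) are represented here as integers whose bits are the coefficients, which is a common convention but an assumption of this example; the function names are hypothetical.

```python
def gf2_add_bit(a: int, b: int) -> int:
    """Add two GF(2) elements the textbook way: integer addition mod 2."""
    return (a + b) % 2


def poly_add_mod2(a: int, b: int, width: int = 8) -> int:
    """Add two GF(2) polynomials coefficient by coefficient, mod 2."""
    result = 0
    for i in range(width):
        coeff = gf2_add_bit((a >> i) & 1, (b >> i) & 1)
        result |= coeff << i
    return result


def poly_add_xor(a: int, b: int) -> int:
    """The same addition expressed as a single XOR."""
    return a ^ b


a, b = 0b10110101, 0b01100111
print(f"{poly_add_mod2(a, b):08b}")
print(f"{poly_add_xor(a, b):08b}")
```

The two functions agree for every pair of inputs, which is why field addition in this kind of code shows up as a plain XOR instruction.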
        
           | OhMeadhbh wrote:
           | I was going to say "but RC4 and AES were published well after
           | the 6502's heyday," but NESes were completely rocking it in
           | '87 (and I'm told 65XX cores were used as the basis for
           | several hard drive controllers of the era.) Alas, the closest
           | I ever came to encryption on a (less than 32-bit system) was
           | lucifer on an IBM channel controller in the forever-ago and
           | debugging RC5 on an 8085.
        
             | kjs3 wrote:
             | _I'm told 65XX cores were used as the basis for several
             | hard drive controllers of the era_
             | 
             | Western Design Center is still (apparently) making a profit
             | at least in part licensing 6502 core IP for embedded stuff.
             | There's probably a 6502 buried and unrecognized in all
             | sorts of low-cost control applications lying around you.
             | 
             |  _RC5 on an 8085_
             | 
             | Oof. Well played.
        
               | PaulHoule wrote:
               | I dunno. The 6502 has been a $2 part for a long time but
               | needs RAM and some glue logic, for a similar price you
               | can get an AVR-8 [1] or ESP-32 [2] and get some RAM and
               | GPIO.
               | 
               | [1] faster, more registers than the IBM 360, << 64k RAM
               | 
               | [2] much faster, 32bit, >> 64k RAM
        
               | rzzzt wrote:
               | There are uC versions like the W65C134S:
               | https://www.westerndesigncenter.com/wdc/w65c134s-chip.php
        
               | kjs3 wrote:
               | _I dunno._
               | 
               | You don't know what, exactly? You can go to the web site
               | and see what they are selling.
               | 
               |  _The 6502 has been a $2 part for a long time_
               | 
               | I doubt that for an IP license at any volume such a thing
               | would make sense.
               | 
               |  _but needs RAM and some glue logic_
               | 
               | Sure? Embedded in whatever you're building.
               | 
               |  _for a similar price you can get..._
               | 
               | Oh, sorry...my bad. You were doing it the HN way: "Don't
               | actually read what was written for comprehension...just
               | take your first knee jerk and tell them how you would
               | obviously do it better.".
        
           | ASalazarMX wrote:
           | Reading that cryptography was that advanced at the time, I'm
           | even more surprised that the venerable Norton Utilities for
           | MS-DOS required a password that was simply XORed with some
           | constant and embedded in the executables. If the reserved
           | space was zeroes, it considered it a fresh install and
           | demanded a new password.
           | 
           | If it had been properly encrypted my young cracker self would
           | have had no opportunity.
        
         | mmphosis wrote:
         | Try to keep the value 0 in the Y register.
         | 
         |     echo tya|asm|mondump -r|6502
         |     A=AA X=00 Y=00 S=00 P=22 PC=0300  0       0300- 98        TYA
         |     A=00 X=00 Y=00 S=00 P=22 PC=0301  2
        
         | favorited wrote:
         | "Prefer `xor a` instead of `ld a, 0`" is basically the first
         | optimization that you learn when doing SM83 assembly.
         | 
         | https://github.com/pret/pokecrystal/wiki/Optimizing-assembly...
        
       | bitwize wrote:
       | Because mov eax, 0 requires fetching a constant and prolongs
       | instruction fetching/execution. XOR A was a trick I learned back
       | in the Z80 days.
        
       | dintech wrote:
       | My brain read this as "Why not ear wax?"
        
         | kragen wrote:
         |     xor wax, wax    ; clear wax
         |     xor sax, sax    ; clear sax
         |     xor fax, fax    ; tru tru
        
       | jabedude wrote:
       | similarly IIRC, on (some generations of) x86 chips, NOP is sugar
       | around `XCHG EAX, EAX` which is effectively a do-nothing
       | operation
        
         | bitwize wrote:
         | This is pretty much all x86 chips as far as I'm aware: opcode
         | 0x90 which is equivalent to XCHG AX,AX.
         | 
         | The 8080 and Z80's NOP was at opcode 0. Which was neat because
         | you could make a "NOP slide" simply by zeroing out memory.
        
         | kccqzy wrote:
         | There are multiple variants of nop mainly because you sometimes
         | need the nop instruction to take up a certain number of bytes
         | for alignment purposes. You have the 1-byte nop, but there is
         | also the 9-byte nop.
        
       | sixthDot wrote:
       | I've written a lot of `xor al,al` in my youth.
        
       | flohofwoe wrote:
       | The actually surprising part to me is that such an important
       | instruction uses a two byte encoding instead of one byte :)
        
         | kccqzy wrote:
         | Even supporting just 8 registers that would take up
         | 8/256=0.03125 of the instruction encoding space.
        
       | BiraIgnacio wrote:
       | Also cool that this got to be the top item on the HN front page
        
       | HackerThemAll wrote:
       | > Interestingly, when zeroing the "extended" numbered registers
       | (like r8), GCC still uses the d (double width, ie 32-bit)
       | variant.
       | 
       | Of course. I might have some data stored in the higher dword of
       | that register.
        
         | rfl890 wrote:
         | Which will still be zeroed.
        
         | Tuna-Fish wrote:
         | Clearing r8d also clears the upper half.
         | 
         | Partial register updates are kryptonite to OoO engines. For
         | people used to low-level programming on weak machines, it seems
         | natural to just update part of a register, but the way every
         | modern OoO CPU works that is literally not a possible
         | operation. Registers are written to exactly once, and this
         | operation also frees every subsequent instruction waiting for
         | that register to be executed. Dirty registers don't get written
         | to again, they are garbage collected and reset for next
         | renaming.
         | 
         | The only way to implement partial register updates is to add
         | 3-operand instructions, and have the old register state be
         | the third input. This is also more expensive than it sounds
         | like, and on many modern CPUs you can execute only one
         | 3-operand integer instruction per clock, vs 4+ 2-operand ones.
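
A toy model of the register-write semantics described above, to show why narrow writes need the old value as an extra input. It models only architectural results for a 64-bit register (low-byte and low-word writes, not `ah`-style high bytes) and says nothing about how any real renamer is built; all names are illustrative.

```python
MASK64 = (1 << 64) - 1


def write_register(old: int, value: int, width: int) -> int:
    """Return the new 64-bit register contents after a write of `width` bits."""
    if width == 64:
        return value & MASK64
    if width == 32:
        # 32-bit writes zero the upper half: the result ignores `old`.
        return value & 0xFFFFFFFF
    if width in (8, 16):
        # Narrow writes merge with the old contents: `old` is a real input.
        low = (1 << width) - 1
        return (old & ~low & MASK64) | (value & low)
    raise ValueError(f"unsupported width: {width}")


def depends_on_old_value(width: int) -> bool:
    """True if the write has to read the previous register contents."""
    return width in (8, 16)


r8 = 0x1122334455667788
print(hex(write_register(r8, 0, 32)))  # like xor r8d, r8d
print(hex(write_register(r8, 0, 16)))  # like a 16-bit clear
print(hex(write_register(r8, 0, 8)))   # like an 8-bit clear
```

Only the 8-bit and 16-bit cases read `old`, which corresponds to the extra third operand mentioned in the comment above.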
        
       | charles_f wrote:
       | > By using a slightly more obscure instruction, we save three
       | bytes every time we need to set a register to zero
       | 
       | Meanwhile, most "apps" we get nowadays contain half of npmjs
       | neatly bundled in electron. I miss the days when default was
       | native and devs had constraints to how big their output could be.
        
         | Filligree wrote:
         | JS is just easier and takes less code.
         | 
         | Which isn't an excuse anymore. UI coding isn't that hard; if
         | someone can't do it, well, Claude certainly can.
        
           | charles_f wrote:
           | I'm fine with that, but keeping _some_ consideration for
           | optimization should still be a thing, even in environments
           | where constraints are low. The problem is when no-one cares
           | and includes 4 versions of jquery in their app so that they
           | don't have to do const $=document.getElementById, everything
           | grows to weigh 1GB, use 1GB of RAM and 10% of your CPU, and
           | your system is as sluggish nowadays (or even more so) as it
           | was 10y ago, with 10x the RAM and processing power.
        
             | anticrymactic wrote:
             | > so that they don't have to do const
             | $=document.getElementById,
             | 
             | ``` window.$ = (q)=>document.querySelector(q); ```
             | Emulates the behavior much better. This is already set on
             | modern versions of browsers[1]
             | 
             | [1] https://firefox-source-docs.mozilla.org/devtools-
             | user/web_co...
        
           | saagarjha wrote:
           | Claude is pretty bad at coding UIs.
        
           | int_19h wrote:
           | It's not even true that HTML/JS is easier than something
           | like, say, WPF.
        
       | grimgrin wrote:
       | I'd like to learn about the earliest pronunciations of these
       | instructions. Only because watching a video earlier, I heard
       | "MOV" pronounced "MAUV" not "MOVE"
       | 
       | Not sure exactly how I could dig up pronunciations, except
       | finding the oldest recordings
        
       | jmmv wrote:
       | > It gets better though! Since this is a very common operation,
       | x86 CPUs spot this "zeroing idiom" early in the pipeline and can
       | specifically optimise around it: the out-of-order tracking
       | systems knows that the value of "eax" (or whichever register is
       | being zeroed) does not depend on the previous value of eax, so it
       | can allocate a fresh, dependency-free zero register renamer slot.
       | 
       | While this is probably true ("probably" because I haven't checked
       | it myself, but it makes sense), the CPU could do the exact same
       | thing for "mov eax, 0", couldn't it? (Does it?)
        
         | electroly wrote:
         | Sure, lots of longer instructions have this effect. "xor
         | eax,eax" is interesting because it's short. That zero immediate
         | in "mov eax,0" is bigger than the entire "xor eax,eax"
         | instruction.
        
         | addaon wrote:
         | Yes, "mov r, imm" also breaks dependencies -- but the immediate
         | needs to be encoded, so the instruction is longer.
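         | 
         | A rough sketch of the size difference, assuming the usual
         | prefix-free 32-bit encodings (31 C0 for the xor, B8 plus a 4-byte
         | immediate for the mov). The constant and function names are made
         | up for illustration:

```python
# Byte encodings written out by hand; assumed to be the standard
# prefix-free forms, not produced by an assembler.
XOR_EAX_EAX = bytes([0x31, 0xC0])                   # xor eax, eax
MOV_EAX_0 = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])   # mov eax, 0 (opcode + imm32)


def immediate_size(mov_encoding: bytes) -> int:
    """Bytes that follow the one-byte opcode of a `mov r32, imm32`."""
    return len(mov_encoding) - 1


if __name__ == "__main__":
    print("xor eax, eax:", len(XOR_EAX_EAX), "bytes")
    print("mov eax, 0  :", len(MOV_EAX_0), "bytes")
    print("immediate   :", immediate_size(MOV_EAX_0), "bytes")
    print("saved       :", len(MOV_EAX_0) - len(XOR_EAX_EAX), "bytes")
```

         | The immediate alone comes out larger than the whole xor, and the
         | difference between the two encodings is 3 bytes.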
        
         | MobiusHorizons wrote:
         | I believe it does in some newer CPUs. It takes extra silicon to
         | recognize the pattern though, and compilers emit the xor
         | because the instruction is smaller, so I doubt there is much
         | speed up in real workloads.
        
         | lucozade wrote:
         | > couldn't it? (Does it?)
         | 
         | It could of course. It can do pretty much any pattern matching
         | it likes. But I doubt very much it would because that pattern
         | is way less common.
         | 
         | As the article points out, the XOR saves 3 bytes of
         | instructions for a really, really common pattern (to zero a
         | register, particularly the return register).
         | 
         | So there's very good reason to perform the XOR preferentially
         | and hence good reason to optimise that very common idiom.
         | 
         | Other approaches, e.g. adding a new "zero <reg>" instruction, are
         | basically worse as they're not backward compatible and don't
         | really improve anything other than making the assembly a tiny
         | bit more human readable.
        
         | adrian_b wrote:
         | Most Intel/AMD CPUs do the same thing for a few alternative
         | instructions, e.g. "sub rax, rax".
         | 
         | I do not think that anyone bothers to do this for a "mov eax,
         | 0", because neither assembly programmers nor compilers use such
         | an instruction. Either "xor reg,reg" or "sub reg,reg" have been
         | the recommended instructions for clearing registers since 1978,
         | i.e. since the launch of Intel 8086, because Intel 8086 lacked
         | a "clear" instruction, like that of the competing CPUs from DEC
         | or Motorola.
         | 
         | One should remember that what is improperly named "exclusive
         | or" in computer jargon is actually simultaneously addition
         | modulo 2 and subtraction modulo 2 (because these 2 operations
         | are identical; the different methods of carry and borrow
         | generation distinguish addition from subtraction only for
         | moduli greater than 2).
         | 
         | The subtraction of a thing from itself is null, which is why
         | clearing a register is done by subtracting it from itself,
         | either with word subtraction or with bitwise modulo-2
         | subtraction, a.k.a. XOR.
         | 
         | (The true "exclusive or" operation is a logical operation
         | distinct from the addition/subtraction modulo 2. These 2
         | distinct operations are equivalent only for 2 operands. For 3
         | or more operands they are different, but programmers still
         | incorrectly use the term XOR when they mean the addition modulo 2
         | of 3 or more operands. The true "exclusive" or is the function
         | that is true only when exactly one of its operands is true,
         | unlike "inclusive" or, which is true when at least one of its
         | operands is true. To these 2 logical "or" functions correspond
         | the 2 logical quantifiers "There exists a unique ..." and
         | "There exists a ...".)
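         | 
         | A small sketch of the distinction drawn above, using
         | hypothetical helper names (xor_fold for chained bitwise XOR,
         | exactly_one for the "exactly one operand is true" function). It
         | only enumerates single-bit inputs:

```python
from itertools import product


def xor_fold(bits):
    """Chained XOR of the operands, i.e. their sum modulo 2 (parity)."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc


def exactly_one(bits):
    """1 only when exactly one operand is 1."""
    return int(sum(bits) == 1)


# On single bits, XOR, addition mod 2 and subtraction mod 2 coincide.
two_operand_agree = all(
    (a ^ b) == (a + b) % 2 == (a - b) % 2
    for a, b in product((0, 1), repeat=2)
)

# Parity and "exactly one" agree for two operands...
agree_2 = all(xor_fold(p) == exactly_one(p) for p in product((0, 1), repeat=2))

# ...but not for three: collect the inputs where they disagree.
differ_3 = [p for p in product((0, 1), repeat=3) if xor_fold(p) != exactly_one(p)]

if __name__ == "__main__":
    print(two_operand_agree, agree_2, differ_3)
```

         | With three operands the two functions part ways on the
         | all-ones input, where the parity is 1 but more than one operand
         | is true.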
        
       | rhaps0dy wrote:
       | No RSS? I want to subscribe :'(
        
         | sph wrote:
         | "Who cares about RSS, no one uses it any more"
         | 
         | There's dozens of us! By the way, totally unaffiliated, but I
         | have used fetchrss for those websites that have no feed.
        
       | ethin wrote:
       | > In this case, even though rax is needed to hold the full 64-bit
       | long result, by writing to eax, we get a nice effect: Unlike
       | other partial register writes, when writing to an e register like
       | eax, the architecture zeros the top 32 bits for free. So xor eax,
       | eax sets all 64 bits to zero.
       | 
       | I had no idea this happened. Talk about a fascinating bit of X86
       | trivia! Do other architectures do this too? I'd imagine so, but
       | you never know.
        
         | 201984 wrote:
         | AArch64 also zeroes the upper 32 bits of the destination
         | register when you use a 32 bit instruction.
        
           | flykespice wrote:
           | I'm curious, why is that?
           | 
           | I know x86-64 zeroes the upper part of the register for
           | backwards compatibility and to improve instruction cache
           | usage (no need for REX prefix), but AArch64 is unclear to me.
        
             | umanwizard wrote:
             | I don't know either, but why wouldn't backwards
             | compatibility apply to aarch64? It too is based on a pre-
             | existing 32-bit architecture.
        
             | 201984 wrote:
             | It's to break dependencies for register renaming. If you
             | have an instruction like
             | 
             |     mov w5, w6  // move low 32 bits of register 6
             |                 // into low 32 bits of register 5
             | 
             | This instruction only depends on the value of register 6.
             | If instead of zeroing the upper half it left it unchanged,
             | then it would depend on w6 and also the previous value of
             | register 5. That would constrain the renamer and
             | consequently out-of-order execution.
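             | 
             | A toy model of that dependency argument, not a
             | description of any real renamer. write32 and write16 are
             | hypothetical helpers; write16 follows the x86-64 rule
             | that 16-bit writes keep the upper bits:

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
MASK16 = (1 << 16) - 1


def write32(old: int, value: int) -> int:
    """32-bit destination write: upper 32 bits become zero.

    The result never reads `old`, so there is no dependency on the
    previous contents of the destination register.
    """
    return value & MASK32


def write16(old: int, value: int) -> int:
    """16-bit destination write (x86-64 rule): upper 48 bits are kept.

    The result must merge with `old`, so it depends on the previous
    contents of the destination register.
    """
    return (old & ~MASK16 & MASK64) | (value & MASK16)


if __name__ == "__main__":
    for old in (0, MASK64):
        print(hex(write32(old, 0x1234)), hex(write16(old, 0x1234)))
```

             | Whatever the old value was, write32 gives the same
             | result, while write16 gives a different result for each
             | old value. That merge is the extra input the renamer
             | would otherwise have to track.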
        
             | zeuxcg wrote:
             | You really want to avoid a dependency on prior content of
             | the destination register, to allow renaming and maximize
             | out of order scheduling.
        
         | monocasa wrote:
         | A lot of the RISC architectures do something similar (sign
         | extend rather than zero extend) when using 32-bit ops on a 64-bit
         | processor. MIPS and PowerPC come to mind off the top of my
         | head. Being careful about that in the spec basically lets them
         | treat 32-bit mode on a 64-bit processor as just 'mask off the
         | top bits on any memory access'. Some of these processors will
         | even let you use 64bit ops in 32bit mode, and really only just
         | truncate memory addresses.
         | 
         | So the real question is why does x86 zero extend rather than
         | sign extend in these cases, and the answer is probably that by
         | zero extending, with an implementation that treats a 64-bit
         | architectural register as a pair of 32-bit renamed physical
         | registers, you can statically put the architectural upper
         | register back on the free pool by marking it as zero rather
         | than the sign extended result of an op.
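         | 
         | For illustration only, the two extension rules being compared
         | can be sketched like this. The function names are made up, and
         | nothing here is tied to a specific ISA manual:

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1
UPPER32 = MASK64 ^ MASK32  # bits 32..63


def zero_extend32(value: int) -> int:
    """Keep the low 32 bits; the upper half is always zero."""
    return value & MASK32


def sign_extend32(value: int) -> int:
    """Keep the low 32 bits; copy bit 31 into the whole upper half."""
    low = value & MASK32
    if low & 0x8000_0000:
        return low | UPPER32
    return low


if __name__ == "__main__":
    for v in (0x7FFF_FFFF, 0x8000_0000):
        print(hex(zero_extend32(v)), hex(sign_extend32(v)))
```

         | With zero extension the upper half is a constant regardless of
         | the result, which is what makes the "mark it as zero" trick
         | possible. With sign extension it depends on bit 31 of the
         | result.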
        
       | Quitschquat wrote:
       | At some point I could disassemble 8086 (16 bit x86/real mode) as
       | a kid. Byte sequences like 31 C9 or 31 C0 were a sure way to know
       | if a loop of some kind was being initialized. Even simple
       | compilers at the time made the mov xx, 0 -> xor xx, xx
       | optimization.
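       | 
       | A minimal sketch of why those byte pairs stand out, assuming
       | 16-bit operand size and handling only opcode 31 with a
       | register-to-register ModRM byte. decode_xor_rr is a hypothetical
       | helper, not a real disassembler:

```python
# 16-bit register names in ModRM encoding order.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def decode_xor_rr(code: bytes) -> str:
    """Decode `31 /r` (XOR r/m16, r16) when ModRM selects two registers."""
    opcode, modrm = code[0], code[1]
    if opcode != 0x31 or (modrm >> 6) != 0b11:
        raise ValueError("only register-to-register XOR (31 /r, mod=11) is handled")
    reg = (modrm >> 3) & 0b111  # source register field
    rm = modrm & 0b111          # destination register field
    return f"xor {REG16[rm]}, {REG16[reg]}"


if __name__ == "__main__":
    for pair in (b"\x31\xC0", b"\x31\xC9", b"\x31\xD8"):
        print(pair.hex(" "), "->", decode_xor_rr(pair))
```

       | When both register fields are equal the instruction zeroes that
       | register, which is the case for C0 (ax) and C9 (cx).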
        
       | kstrauser wrote:
       | Why wasn't that a standard assembler macro, like ZEROAX or
       | something? It seems to come up enough that it seems like there'd
       | be a common shortcut for it.
       | 
       | (Not suggesting it _should_ be. Maybe that's a terrible idea,
       | but I don't know why.)
        
         | sfink wrote:
         | I don't know, but one reason might be that with 8-bit opcodes
         | you only have 256 instructions to play with, and many of those
         | encode registers, so ZEROAX is burning a meaningful percentage
         | of your total opcode space. And if you're not encoding it into
         | a single byte, then it's pure waste: you already need XOR (and
         | SUB), so you'd just be adding a redundant way of achieving the
         | same thing. (Note that this argument doesn't completely hold
         | up, since eg the 6502 had a fair number of undocumented opcodes
         | largely because they didn't need all of them.)
         | 
         | Though technically you said "assembler macro", not opcode. For
         | that, I suspect the argument is more psychological: we had such
         | limited resources of all sorts back then that being
         | parsimonious with everything was a required mindset. The
         | mindset didn't just mean you made everything as short as
         | possible, it also meant you reused everything you possibly
         | could. So reusing XOR just felt more fitting and natural than
         | carving out a separate assembler instruction name. (Also, there
         | would be the question of what effect ZEROAX should have on the
         | flags, which can be somewhat inferred when reusing an existing
         | instruction.)
        
       | flustercan wrote:
       | As a longtime developer currently pursuing their first computer
       | science degree, it makes me happy that I understood this article.
       | Nearly makes all the trouble seem worth it.
        
       | JuniperMesos wrote:
       | > In this case, even though rax is needed to hold the full 64-bit
       | long result, by writing to eax, we get a nice effect: Unlike
       | other partial register writes, when writing to an e register like
       | eax, the architecture zeros the top 32 bits for free. So xor eax,
       | eax sets all 64 bits to zero.
       | 
       | Huh, news to me. Although the amount of x86-64 assembly
       | programming I've personally done is extremely minimal. Frankly,
       | this is exactly the sort of architecture-specific detail I'm
       | happy to let an ASM-generating library know for me rather than
       | know myself.
        
       | timonoko wrote:
       | I had an Atari Portfolio, sold without software except the basic
       | command.com. I could only copy raw ASCII bytes from the keyboard
       | to a file. When the file is a COM file, it is directly executable.
       | What is the simple program I made from ASCII bytes to read bytes
       | in hexadecimal? I thought about that for a week, but finally
       | solved it.
       | 
       | @grok: This is a classic and brilliant puzzle from the early days
       | of DOS. The challenge is to write a self-displaying hex dumper
       | (.COM file) using only the copy con command, which means every
       | byte of the program must correspond to a standard, typeable ASCII
       | character. This is a form of code-golfing that relies on a deep
       | understanding of 8086 machine code.
       | 
       | The solution is a clever piece of self-modifying code. Since many
       | essential 8086 opcodes (like MOV, INT, RET) don't have printable
       | ASCII equivalents, the program must construct these instructions
       | in memory and then execute them.
       | 
       | Here is a well-known 128-byte solution. I will first show you the
       | ASCII characters you would type, and then explain how it works.
       | 
       | The Program to Type
       | 
       | You would have started by typing copy con hex.com and then
       | carefully typed the following 128 characters (including spaces
       | and case-sensitive characters) and finally pressed Ctrl+Z to save
       | the file.
       | 
       |     1 j!>1!QhH-@0X,jHj.Ph1PhD0Ph 0@h, j,
       |     h.0@h, h 0@h.H-@,jH1X,h.H.@,H1j j X,h.H.@,H1j j
       |     X.H-@H-@,jHj.Ph1PhE0Ph@0@h, j, h.0@h, h          0@h.
       | 
       | How It Works: A High-Level Explanation
       | 
       | This program is a marvel of 8086 trickery. Here is a simplified
       | breakdown of what's happening:
       | 
       | etc.etc
        
         | timonoko wrote:
         | My program was definitely shorter. I think I did not bother
         | with real hexadecimals. Just used the last four bits of characters
         | to make a full byte. Used it as a bootstrap program.
         | 
         | @grok: While your exact code is lost to time, it would have
         | looked something like one of the ultra-small ASCII "dropper"
         | programs that were once passed around. Here is a plausible
         | 32-byte example of what the program you typed might have looked
         | like.
         | 
         | You would have run copy con nibbler.com, typed the following
         | line, and hit Ctrl+Z:
         | 
         |     `j%1!PZYfX0f1Xf1f1AYf1E_j%1!PZ`
         | 
         | This looks like nonsense, but to the 8088/8086 processor, it's a
         | dense set of instructions that does the following:
         | 
         | etc etc.
        
           | timonoko wrote:
           | 97% of these millennials of HN do not understand the problem
           | and its _brilliant_ solution. That is why I was truly
           | astonished @grok grokked it right away.
           | 
           | BTW. It is not beyond possibility that this nibbler or
           | dropper was made by myself and published on Usenet by me
           | myself in 1989. Who else would have such a problem?
           | 
           | It was a bankruptcy sale and the machine was sold as
           | "inactivated".
        
       | kwertyoowiyop wrote:
       | In this thread, we have found all the programmers born before
       | 1975!
        
         | vanderZwan wrote:
         | Hey, some of us are younger and happened to get into
         | programming via making games on their TI-83 graphing calculator
         | in Z80!
        
       | wildlogic wrote:
       | I learned this trick writing shellcode - the shellcode has to be
       | null byte (0x00) free, or it will terminate and not progress past
       | the null byte, since it is the string terminator. of course, when
       | you xor something with itself, the result is zero. the byte code
       | generated by the instruction xor eax, eax doesn't contain null
       | bytes, whereas mov eax, 0 does.
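       | 
       | A quick sketch of that check, assuming the standard prefix-free
       | encodings and modelling the vulnerable string copy as "stop at
       | the first 0x00". has_null_byte and as_c_string are hypothetical
       | helper names:

```python
# Encodings written out by hand (assumed standard 32-bit forms).
XOR_EAX_EAX = bytes([0x31, 0xC0])                   # xor eax, eax
MOV_EAX_0 = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])   # mov eax, 0


def has_null_byte(code: bytes) -> bool:
    """True if the encoding contains a 0x00 byte."""
    return b"\x00" in code


def as_c_string(payload: bytes) -> bytes:
    """What survives a C-string style copy: everything before the first 0x00."""
    return payload.split(b"\x00", 1)[0]


if __name__ == "__main__":
    tail = b"\x90\x90"  # two NOPs standing in for the rest of the payload
    for name, code in (("xor eax, eax", XOR_EAX_EAX), ("mov eax, 0", MOV_EAX_0)):
        survived = as_c_string(code + tail)
        print(name, has_null_byte(code), len(survived), "of", len(code + tail))
```

       | The xor variant is copied whole, while the mov variant is cut off
       | right after its opcode byte.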
        
       | ternaryoperator wrote:
       | The origin AFAIK stems from the mainframe days. When using BAL
       | (the assembly language for the IBM System/360 family and its
       | descendants), xoring was faster than moving 0 to the variable.
       | Many of the early devs who wrote assembly for PCs came from
       | mainframe backgrounds and so the idiom was carried over.
        
       ___________________________________________________________________
       (page generated 2025-12-01 23:00 UTC)