[HN Gopher] Reptar
       ___________________________________________________________________
        
       Reptar
        
       Author : abhi9u
       Score  : 571 points
       Date   : 2023-11-14 17:49 UTC (1 day ago)
        
 (HTM) web link (lock.cmpxchg8b.com)
 (TXT) w3m dump (lock.cmpxchg8b.com)
        
       | saagarjha wrote:
       | See also Intel's advisory, which has a description of impact:
       | https://www.intel.com/content/www/us/en/security-center/advi...
       | 
       | > Sequence of processor instructions leads to unexpected behavior
       | for some Intel(R) Processors may allow an authenticated user to
       | potentially enable escalation of privilege and/or information
       | disclosure and/or denial of service via local access.
        
         | yborg wrote:
         | 'Some' appears to be almost any Intel x86 CPU made in the last
         | 6 years.
        
       | tedunangst wrote:
       | Their diagnosis reminds me of what happened when qemu ran into
       | repz ret. https://repzret.org/p/repzret/
        
       | Lammy wrote:
       | > the processor would begin to report machine check exceptions
       | and halt.
       | 
       | I get it https://www.youtube.com/watch?v=dXekDCcw2FE
        
         | shadowgovt wrote:
         | ... it literally took me all Goddamn day. Well done.
         | 
         | Credit where credit is due: Google has some of the best
         | codenames.
        
       | doublerabbit wrote:
        | Any reason why it's named after the dinosaur from the
        | cartoon Rugrats? Or was that just what was on TV at the
        | time?
       | 
       | Maybe I should start hacking while watching Teenage Mutant Ninja
       | Turtles.
        
         | 2OEH8eoCRo0 wrote:
         | rep is an assembly instruction prefix
        
         | Blackthorn wrote:
         | I think from the memey line "Halt! I am Reptar!" Plus the rep
         | prefix
        
         | AdmiralAsshat wrote:
         | If you discover a major processor vulnerability and wanna name
         | it Shredder/Krang/Bebop/Rocksteady, I feel like you will have
         | earned that right!
        
       | xyst wrote:
       | Reading this makes me realize how little I know of the hardware
       | that runs my software
       | 
       | > Prefixes allow you to change how instructions behave by
       | enabling or disabling features
       | 
        | Why do we need "prefixes" to disable or enable features? Is
        | this for dynamically toggling features so you don't have to
        | go into the BIOS?
        
         | jeffbee wrote:
         | It's just because x86 as an ISA has accreted over the course of
         | 40+ years, and has variable-length instructions. Every time
         | they extend the ISA they carve out part of the opcode space to
         | squeeze in a new prefix. This will only continue, considering
         | that Intel has proposed another new scheme this year.
        
         | shenberg wrote:
         | Prefixes are modifiers to specific instructions executed by the
         | processor, e.g. to control the size of the operands or enable
         | locking for concurrency.
        
         | Tuna-Fish wrote:
         | x86 was designed in 78, basically for the purpose of running a
         | primitive laser printer (or other similar workloads). The big
         | problem with this is that the encoding space for instructions
         | was "efficiently utilized". When new instructions, or worse,
         | additional registers were later added, you had to fit the new
         | instruction variants in somehow, and you did this by tacking on
         | prefixes.
        
           | mschuster91 wrote:
           | Nah, x86 goes even earlier in its heritage - it was,
           | effectively, a bolt-on on Intel's way older designs, as a
           | huge part of the 8086 was being ASM source-compatible with
           | the older 8xxx chips, even as the instruction set itself
           | changed [1]. What utterly amazes me is that the original 8086
           | was mostly designed _by hand_ by a team of not even two dozen
            | people - and today, we've got hundreds if not thousands of
           | people working on designing ASICs...
           | 
           | [1] https://en.wikipedia.org/wiki/Intel_8086#The_first_x86_de
           | sig...
        
             | hulitu wrote:
             | It is because testing plays a bigger part today than back
             | then. The complexity has also increased (people do not
             | design at transistor level anymore).
        
             | irdc wrote:
             | Acckkghtually, if you go back far enough you end up at the
             | Datapoint 2200. If you want to understand where some of the
             | crazier parts of the 8086 originate from, Ken Shirriff has
              | a nice read: http://www.righto.com/2023/08/datapoint-to-8086.html
        
           | thaumasiotes wrote:
           | > x86 was designed in 78, basically for the purpose of
           | running a primitive laser printer
           | 
           | It's interesting that ASCII is transparently just a bunch of
           | control codes for a physical printer/typewriter, combining
           | things like "advance the paper one line", "advance the paper
           | one inch", "reset the carriage position", and "strike an F at
           | the carriage position", all of which are different mechanical
           | actions that you might want a typewriter to do.
           | 
           | But now we have Unicode, which is dedicated to the purpose of
           | assigning ID numbers to visual glyphs, and ASCII has been
           | interpreted as a bunch of glyph references instead of a bunch
           | of machine instructions, and there are the control codes with
           | no visual representation, sitting in Unicode, being
           | inappropriate in every possible way.
           | 
           | It's kind of like if Unicode were to incorporate "start
           | microwave" as part of a set with "1", "2", "3", etc.
        
             | rswail wrote:
             | ASCII was used by teletypes, not typewriters. They were
             | "cylinder" heads, as compared to IBM's golfball
             | typewriters.
             | 
             | The endless CR/LF/CRLF line ending problem would have been
              | solved if the RS (Record Separator) ASCII code had been
              | used instead of the physical CR = carriage return, i.e.
              | move the print head back to the start of the line, and
              | LF = line feed, i.e. rotate the paper up one line.
             | 
             | But Unix decided on LF, Apple used CR, Windows used CRLF,
             | and even today, I had to get a guy to stop setting his
             | system to "Windows" because he was screwing up a git repo
             | with extraneous CRs.
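[As a concrete footnote to the codes contrasted above, here is a minimal Python sketch of the byte values involved; the values are from the ASCII standard, not from the comment itself.]

```python
# The ASCII control codes contrasted above. RS (record separator,
# 0x1E) was designed for delimiting records; CR and LF encode
# physical teletype motions, yet became the line-ending bytes.
CR, LF, RS = 0x0D, 0x0A, 0x1E

print("Unix line ending:", bytes([LF]).hex())      # 0a
print("Classic Mac:     ", bytes([CR]).hex())      # 0d
print("Windows:         ", bytes([CR, LF]).hex())  # 0d0a
print("Record separator:", bytes([RS]).hex())      # 1e
```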
        
         | db48x wrote:
         | Read
         | https://wiki.osdev.org/X86-64_Instruction_Encoding#Legacy_Pr...
         | 
         | The REP prefixes are the most common; they just let you perform
         | the same instruction a variable number of times. It looks in
          | the CX/RCX register for the count. This makes many common loops
         | really, really short, especially for moving objects around in
         | memory. The memcpy function is often inlined as a single REP
         | MOVS instruction, possibly with an instruction to copy the
         | count into CX if it isn't already there.
         | 
          | I suppose the REX prefix (whose W bit selects 64-bit
          | operand size) is pretty common too, since 64-bit programs
          | will want to operate on 64-bit values and addresses pretty
          | frequently.
         | 
         | None of the prefixes toggle things that can be set globally, by
         | the BIOS or otherwise. They all just specify things that the
         | next instruction needs to do.
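[A minimal sketch of why a REP-based memcpy is so compact; the byte values are from the Intel SDM, not from the comment.]

```python
# "rep movsb" -- copy RCX bytes from [RSI] to [RDI] -- is just a
# one-byte REP prefix (0xF3) in front of the one-byte MOVSB opcode
# (0xA4), which is why compilers can inline memcpy as this pair.
REP   = 0xF3  # REP/REPE prefix byte
MOVSB = 0xA4  # MOVS m8, m8 opcode

rep_movsb = bytes([REP, MOVSB])
print(rep_movsb.hex())  # f3a4 -- a two-byte copy loop
```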
        
           | pclmulqdq wrote:
           | The ModR/M and SIB prefixes are probably the most common
           | prefixes in instructions. They are so common that assemblers
           | elide their existence when you read code. REX is in the same
           | boat: so common that it's usually elided. The VEX prefix is
           | also really common (all of the V* AVX instructions, like
           | VMOVDQ), and then the LOCK prefix (all atomics).
           | 
           | After all of those, REP is not that uncommon of a prefix to
           | run into, although many people prefer SIMD memcpy/memset to
           | REP MOVSB/REP STOSB. It is slightly unusual.
        
             | bonzini wrote:
              | ModRM and SIB are not prefixes; they're part of the
              | instruction encoding (the second and third bytes after
              | all the prefixes and the 0Fh/0F38h/0F3Ah opcode map
              | selectors)
        
               | EarlKing wrote:
               | More specifically, they're affixed to _certain_ opcodes
               | that require them. There are a number of byte-sized
               | opcodes that do not require a ModRM or SIB byte (although
               | a number of those got gobbled up to make the REX prefix,
                | but that's another story).
               | 
               | TL;DR Weeee! Intel machine language is crazy!
        
             | EarlKing wrote:
             | There's a good reason for using vector instructions over
             | REP: Until relatively recently that was how you got maximum
             | performance in small, tight loops. REP is making a comeback
             | precisely because of ERMS and FSRM, so unfortunately this
             | will become a bigger problem going forward.
        
             | epcoa wrote:
             | This isn't correct. ModR/M and SIB are _not_ prefixes. They
             | are suffixes and essentially part of the core instruction
             | encoding for certain memory and register access
              | instructions. They are the primary means of encoding the
             | myriad addressing modes of the x86. And their existence is
             | not elided in any meaningful way, their value is explicitly
             | derived from the instruction operands (SIB is scale, index,
             | base), so when you see an instruction like:
             | 
             | mov BYTE PTR [rdi+rbx*4],0x4
             | 
             | SIB is determined by the register indices of rdi, rbx, and
             | 4, all right there in the instruction. Likewise, Mod R/M
             | encodes the addressing mode, which is clear from the
              | operands in the assembler listing. Though x86 is such a
              | mess that there are cases where you can encode the same
              | instruction in either a Mod R/M form or a shorter form,
              | e.g. PUSH/POP.
             | 
             | REX is a prefix, but it is a bit special as it must be the
             | last one, and repeats are undefined. It is not elided
             | because of commonality but because its presence and value
             | is usually implied from the operands, it is therefore
             | redundant to list it.
             | 
             | For instance, PUSH R12 must use a REX prefix (REX.B with
             | the one byte encoding).
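[To make that example concrete, here is a sketch in Python of how the ModRM and SIB bytes for that instruction are derived; bit layouts and register numbers are per the Intel SDM.]

```python
# Encoding  mov BYTE PTR [rdi+rbx*4], 0x4  (opcode C6 /0 ib).
# ModRM packs mod/reg/rm bit fields; SIB packs scale/index/base.
def modrm(mod, reg, rm):
    return (mod << 6) | (reg << 3) | rm

def sib(scale_log2, index, base):
    return (scale_log2 << 6) | (index << 3) | base

RBX, RDI = 3, 7  # register numbers in the encoding

insn = bytes([
    0xC6,                   # opcode: MOV r/m8, imm8
    modrm(0b00, 0, 0b100),  # mod=00, /0, rm=100 -> SIB byte follows
    sib(2, RBX, RDI),       # scale=4 (2**2), index=rbx, base=rdi
    0x04,                   # imm8
])
print(insn.hex())  # c6049f04
```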
        
         | epcoa wrote:
         | That's a very poor summary of what prefixes are. My advice,
         | just skip the original article which isn't very good or
         | interesting and read taviso's blog that is linked in the top
         | comment (it gives a few concrete examples of these prefixes).
         | They are modifiers that are part of the CPU instruction.
        
         | ajross wrote:
         | "Prefixes" in this case mostly expand the instruction encoding
         | space.
         | 
         | So rarely-used addressing modes get a "segment prefix" that
         | causes them to use a segment other than DS. Or x86_64 added a
         | "REX" prefix that added more bits to the register fields
         | allowing for 16 GPRs. Likewise the "LOCK" prefix (though poorly
         | specified originally) causes (some!) memory operations to be
         | atomic with respect to the rest of the system (c.f. "LOCK
         | CMPXCHG" to effect a compare-and-set).
         | 
         | All these things are operations other CPU architectures
         | represent too, though they tend to pack them into the existing
         | instruction space, requiring more bits to represent every
         | instruction.
         | 
         | Notably the "REP" prefix in question turns out to be the one
         | exception. This is a microcoded repeat prefix left over from
         | the ancient days. But it represents operations (c.f.
         | memset/memmove) that are performance-sensitive even today, so
         | it's worthwhile for CPU vendors to continue to optimize them.
         | Which is how the bug in question seems to have happened.
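[For reference, a sketch of the prefix bytes mentioned above; the byte values are from the Intel SDM, with descriptions abbreviated.]

```python
# A few well-known x86 legacy prefix bytes: each is a single byte
# tacked on in front of the instruction it modifies.
PREFIXES = {
    0xF0: "LOCK (make the memory operation atomic)",
    0xF3: "REP/REPE (repeat string op, count in RCX)",
    0xF2: "REPNE (repeat while not equal)",
    0x2E: "CS segment override",
    0x3E: "DS segment override",
    0x65: "GS segment override",
    0x66: "operand-size override",
    0x67: "address-size override",
}
# REX is a whole range, 0x40-0x4F, in 64-bit mode: its W/R/X/B bits
# widen operands to 64 bits and extend register fields to r8-r15.
for byte in sorted(PREFIXES):
    print(f"{byte:#04x}  {PREFIXES[byte]}")
```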
        
         | jasonwatkinspdx wrote:
         | You got some great answers already, but to your first point
         | check out Hennessey and Patterson's books, namely Computer
         | Architecture and Computer Organization and Design.
         | 
          | The latter is probably more suited to you unless you want
          | to go on a deep dive into computer architecture itself.
          | Older editions are available for free (authorized by the
          | authors) on the web.
         | 
          | I first read the 3rd edition of Computer Architecture and,
          | besides being one of the clearest textbooks I've ever read, it
         | vastly improved my understanding of what's going on in there in
         | relation to OoO speculative execution, etc.
        
       | rvba wrote:
       | It looks like Intel was cutting corners to be faster than AMD and
        | now all those things come out. How much slower will all those
       | processors be after multiple errata? 10%? 30%? 50%?
       | 
       | In a duopoly market there seems to be no real competition. And
       | yes I know that some (not all) bugs also happen for AMD.
        
         | mschuster91 wrote:
         | > And yes I know that some (not all) bugs also happen for AMD.
         | 
         | Some of these novel side-channel attacks actually even apply in
         | completely unrelated architectures such as ARM [1] or RISC-V
         | [2].
         | 
         | I think the problem is not (just) a lack of competition
          | (although you're right that the duopoly in desktop/laptop/
          | non-cloud servers for x86 brings its own serious issues,
          | about which I've written and ranted more often than I can
          | count [3]), it rather
         | is that modern CPUs and SoCs have simply become so utterly
         | complex and loaded with decades worth of backwards-
         | compatibility baggage that it is impossible for any single
         | human, even a small team of the best experts you can bring
         | together, to fully grasp every tiny bit of them.
         | 
         | [1] https://www.zdnet.com/article/arm-cpus-impacted-by-rare-
         | side...
         | 
         | [2]
         | https://www.sciencedirect.com/science/article/pii/S004579062...
         | 
         | [3]
         | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
        
           | bobim wrote:
           | So no saving grace from the ISA... humans just lost ground on
           | CPU design, and I suspect the situation will worsen when AI
            | enters the picture.
        
             | mschuster91 wrote:
              | > and I suspect the situation will worsen when AI
              | > enters the picture.
             | 
             | For now, AI lacks the contextual depth - but an AI that can
             | actually _design_ a CPU from scratch (and not just
             | rehashing prior-art VHDL it has ... learned? somehow), if
              | that happens we'll be at a Cambrian Explosion-style event
              | anyway, and all we can do is stand on the sidelines, munch
             | popcorn and remember this tiny quote from Star Wars [1].
             | 
             | [1] https://www.youtube.com/watch?v=Xr9s6-tuppI
        
               | nwmcsween wrote:
               | Once AI can create itself, we will most likely be
               | redundant.
        
           | snvzz wrote:
           | >Some of these novel side-channel attacks actually even apply
           | in completely unrelated architectures such as ARM [1] or
           | RISC-V [2].
           | 
           | Possible? Yes. But far less likely.
           | 
           | Complexity carries over and breeds bugs. RISC-V is an order
           | of magnitude simpler than ARM64, which in turn is an order of
           | magnitude simpler than x86.
           | 
           | And it is so w/o disadvantage[0], positioning itself as the
           | better ISA.
           | 
           | 0. https://news.ycombinator.com/item?id=38272318
        
         | arp242 wrote:
         | It's not clear to me this fix will have any performance impact.
         | I strongly suspect it will be negligible or zero.
         | 
         | This seems like a "simple" bug of the type that people write
         | every day, not deep architectural problems like Spectre and the
         | like, which also affected AMD (in roughly equal measure if I
         | recall correctly).
        
           | kmeisthax wrote:
           | Parent commenter might be thinking of Meltdown, a related
           | architectural bug that only bit Intel and IBM PPC. Everything
           | with speculative execution has Spectre[0], but you only have
           | Meltdown if you speculate across _security boundaries_.
           | 
           | The reason why Meltdown has a more dramatic name than
           | Spectre, despite being the same vulnerability, is that
           | hardware privilege boundaries are the only defensible
           | boundary against timing attacks. We already expect context
           | switches to be expensive, so we're allowed to make them a
            | little _more_ expensive. It'd be prohibitively expensive to
           | avoid leaking timing from, say, one executable library to a
           | block of JIT compiled JavaScript code within the same browser
           | content process.
           | 
           | [0] https://randomascii.wordpress.com/2018/01/07/finding-a-
           | cpu-d...
        
         | akoboldfrying wrote:
         | Not sure what other errata you're referring to, but this looks
         | like an off-by-one in the microcode. I would expect the fix to
         | have zero or minimal penalty.
        
       | varispeed wrote:
       | It's going to be a pain for cloud and shared hosting.
       | 
       | Most likely dedicated resources on demand will be the future.
       | Some companies already offer it.
        
         | kevincox wrote:
         | GCP and AWS both offer non-shared hardware. If people want the
         | extra isolation they just need to pay for it.
        
       | Flow wrote:
        | Would it be possible to describe a modern CPU in something
        | like TLA+ to find all non-electrical problems like these?
        
         | boxfire wrote:
         | There are still bit flipping tricks like rowhammer for RAM, I
         | wouldn't be surprised if there are such vulnerabilities in some
         | CPUs.
        
           | sterlind wrote:
            | Rowhammer is an electrical vulnerability though. The parent
            | poster specified non-electrical vulns.
        
         | sterlind wrote:
         | I've heard Intel does use TLA+ extensively for specifying their
         | designs and verifying their specs. But TLA+ specs are extremely
         | high-level, so they don't capture implementation details that
         | can lead to bugs. And model checking isn't a formal proof, only
         | (tractably small) finite state spaces can be checked with TLC.
         | And even there, you're only checking the invariants you
         | specified.
         | 
         | That said, I'm sure there's some verification framework like
         | SPARK for VHDL, and this feels like exactly the kind of thing
         | it should catch.
        
         | dboreham wrote:
         | Formal methods have been used in CPU design for nearly 40 years
         | [1] but not yet for everything, and the methods tend to not
         | have "round-trip-engineering" properties (e.g. TLA+ is not
         | actually proving validity of the code you will run in
         | production, just your description of its behavior and your idea
         | of exhaustive test cases).
         | 
         | [1] https://www.academia.edu/60937699/The_IMS_T_800_Transputer
        
         | foobiekr wrote:
          | CPU designers are so professional about verification and
          | specification that their efforts _dwarf_ anything in
          | software. There's just no comparison.
        
       | bobim wrote:
        | Is it even possible to design a CPU with out-of-order and
        | speculative execution that would have no security issues?
        | Does the future lead to a swarm of disconnected A55 cores,
        | each running a single application?
        
         | SmoothBrain12 wrote:
         | Yes, but they won't clock as fast because they'll be waiting
         | for RAM.
        
           | bobim wrote:
           | We need to keep programs small so they fit in the cache.
        
             | moffkalast wrote:
             | We need 2 GBs of L1 cache, thus solving the cache miss
             | problem once and for all.
        
               | rep_lodsb wrote:
               | 640K should be enough for anyone ;)
        
         | Tuna-Fish wrote:
         | This vulnerability was not caused by OoO or speculative
         | execution. It was caused by the fact that x86 was designed 45
         | years ago, and has had feature after feature piled on the same
         | base, which has never been adequately rebuilt.
         | 
         | The more proximate cause is that some instructions with
         | multiple redundant prefixes (which is legal, but pointless)
         | have their length miscalculated by some Intel CPUs, which
         | results in wrong outcomes.
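[A sketch of what "redundant prefixes" means at the byte level. The plain encoding is from the Intel SDM; the prefixed variants are illustrative only, and the exact trigger bytes for the bug are in Tavis Ormandy's write-up linked in the top comment.]

```python
# All of these are intended to mean the same "rep movsb"; the extra
# prefixes only change the instruction's *length*, which is what the
# affected CPUs reportedly miscalculate in some cases.
plain      = bytes([0xF3, 0xA4])        # rep movsb
double_rep = bytes([0xF3, 0xF3, 0xA4])  # redundant second REP prefix
rex_rep    = bytes([0xF3, 0x44, 0xA4])  # a REX prefix (0x40-0x4F) that
                                        # has no effect on a byte move

for insn in (plain, double_rep, rex_rep):
    print(len(insn), insn.hex())
```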
        
           | epcoa wrote:
            | Not entirely pointless; redundant prefixes are occasionally
            | a useful method for alignment.
        
             | TheCoreh wrote:
              | IMO a more sensible approach for that use case would be to
              | have well-defined specialized prefixes for padding, instead
             | of relying on the case-by-case behavior of redundant
             | prefixes. (However I understand that there's almost
             | certainly a good historical reason why this was not the way
             | it was done)
        
               | bobim wrote:
               | Are new ISA solving this? Time to move to Risc V?
        
               | epcoa wrote:
               | N/A and No.
        
               | dontlaugh wrote:
               | RISC V is not great at this either, with the compression
               | extension being common and variable length.
               | 
               | ARM 64 gets this right, with fixed length 32 bit
               | instructions.
        
               | snvzz wrote:
               | >ARM 64 gets this right, with fixed length 32 bit
               | instructions.
               | 
               | At the expense of code density, yet RISC-V is easy to
               | decode, with implementations going up to 12-way decode
               | (Veyron V2) despite variable length.
               | 
               | ARM64 hardly "gets it right".
        
               | camel-cdr wrote:
               | I wouldn't say ARM64 gets it wrong either, I think both
               | are viable approaches.
        
               | snvzz wrote:
               | Both approaches are viable, but RISC-V's approach is
               | better, as it provides higher code density without
               | imposing a significant increase in complexity in
               | exchange.
               | 
               | Higher code density is valuable. E.g.:
               | 
               | - The decoders can see more by looking at a window of
                | code of the same size, or we can use a narrower window.
               | 
               | - We can have less cache and save area and power. We can
               | also clock the cache higher, enabled by it being smaller,
               | lowering latency cycles.
               | 
               | - Smaller binaries or rom image.
               | 
               | Soon to be available (2024) large, high performance
               | implementations will demonstrate RISC-V advantages well.
        
               | kccqzy wrote:
               | The easiest way of doing padding is to add a bunch of
               | `nop` instructions which are one byte each.
               | 
               | If you read the manual, Intel encourages minor variations
               | of the `nop` instructions that can be lengthened into
               | different number of bytes (like `nop dword ptr [eax]` or
               | `nop dword ptr [eax + eax*1 + 00000000h]`).
               | 
                | It is never recommended anywhere, to my knowledge, to
                | rely on redundant prefixes of random non-nop
                | instructions.
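[For reference, the recommended multi-byte NOP encodings from the Intel SDM, which is what the `nop dword ptr [eax]` variants mentioned above assemble to; sketched in Python just to show the byte patterns.]

```python
# Intel's recommended 1- to 9-byte NOP encodings: the longer forms
# are the 0F 1F NOP padded out with ModRM/SIB/displacement bytes
# (plus a 66 operand-size prefix for the even lengths).
NOPS = {
    1: "90",                  # nop
    2: "6690",                # 66 nop
    3: "0f1f00",              # nop dword ptr [eax]
    4: "0f1f4000",            # nop dword ptr [eax+00h]
    5: "0f1f440000",          # nop dword ptr [eax+eax*1+00h]
    6: "660f1f440000",        # 66 nop dword ptr [eax+eax*1+00h]
    7: "0f1f8000000000",      # nop dword ptr [eax+00000000h]
    8: "0f1f840000000000",    # nop dword ptr [eax+eax*1+00000000h]
    9: "660f1f840000000000",  # 66 nop word ptr [eax+eax*1+00000000h]
}
for n, hexbytes in NOPS.items():
    assert len(bytes.fromhex(hexbytes)) == n  # sanity: length matches
    print(n, hexbytes)
```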
        
               | epcoa wrote:
               | NOPs are not generally free.
               | 
               | It's a pretty old and well known technique:
               | 
               | https://stackoverflow.com/questions/48046814/what-
               | methods-ca...
               | 
               | Note that this technique is really only legitimate where
               | the used prefix already has defined behavior with the
               | given instruction ("Use of repeat prefixes and/or
               | undefined opcodes with other Intel 64 or IA-32
               | instructions is reserved; such use may cause
               | unpredictable behavior."), and of course the REX prefix
               | has special limitations. The key is redundant, not
               | spurious. It is not a good idea to be doing rep add for
               | example. But otherwise, there is no issue.
        
               | epcoa wrote:
                | The prefixes are _redundant_ so it's not really case-by-
               | case behavior. You're just repeating the prefix you would
               | be using anyway in that location.
               | 
               | Using specialized prefixes wastes encoding space for no
               | real gain. You realize on most common processors NOP
               | itself is a pseudo-instruction? Even the apparently meme-
               | worthy (see sibling comment) RISC-V, it's ADDI x0, x0, 0.
        
               | tedunangst wrote:
               | And then there are CPUs that retcon behavioral changes
               | onto nops.
               | 
               | > Moving a register to itself is functionally a nop, but
               | the processor overloads it to signal information about
               | priority.
               | 
               | https://devblogs.microsoft.com/oldnewthing/20180809-00/?p
               | =99...
        
               | _a_a_a_ wrote:
               | > A program can voluntarily set itself to low priority if
               | it is waiting for a spin lock
               | 
               | What does this even mean? How can a program do this when
                | thread priority is an OS thing? It just seems weird.
        
               | epcoa wrote:
               | Hardware threads as in SMT means thread priority is also
               | a hardware thing.
        
               | tedunangst wrote:
               | It's an SMT CPU that dynamically assigns decode,
               | registers, etc. https://course.ece.cmu.edu/~ece740/f13/li
               | b/exe/fetch.php?med...
        
               | shadowgovt wrote:
               | Usually, the historical reason is that adding the logic
               | to do something well-defined when unexpected prefixes are
               | used is going to cost ten more transistors per chip,
               | which is going to add to cost to handle a corner case
               | that almost nobody will try to be in anyway. Far better
               | to let whatever the implementation does happen as long as
               | what happens doesn't break the system.
               | 
               | The issue here is their verification of possible internal
               | CPU states didn't account for this one.
               | 
               | (There is, perhaps, an argument to be made that the x86
               | architecture has become _so_ complex that the emulator
               | between its embarrassingly stupid PDP-11-style single-
               | thread codeflow and the embarrassingly parallel
               | computation it does under the hood to give the user more
               | performance than a really fast PDP-11 _cannot_ be
               | reliably tested to exhaustion, so perhaps something needs
               | to give on the design or the cost of the chips).
        
             | iforgotpassword wrote:
              | Because they cost fewer (or no) cycles compared to NOPs?
        
               | tedunangst wrote:
               | See http://repzret.org/p/repzret/
        
           | gumby wrote:
           | > It was caused by the fact that x86 was designed 45 years
           | ago, and has had feature after feature piled on the same
           | base, _which has never been adequately rebuilt_.
           | 
           | Itanic would like to object! Unfortunately it can't get
           | through the door.
        
         | nextaccountic wrote:
          | I think formal methods could help in designing such a machine, if
         | you can write a mathematical statement that amounts to "there
         | is no side channel between A and B"
         | 
         | Or at least put a practical bound on how many bits per second
          | at most you can extract from any such side channel (the reasoning
         | being, if you can get at most a bit for each million years, you
         | probably don't have an attack)
         | 
         | Then you verify if a given design meets this constraint
        
           | mgaunard wrote:
           | A program is itself a formal specification of what an
           | algorithm does.
        
           | bobim wrote:
           | What would be the typical size of such a constraint-based
           | problem, and do we have the compute power to translate the
           | rules into an implementation? And what if one forgot a rule
           | somewhere... Deeply interesting subject.
        
             | less_less wrote:
             | I think you'd want it to be a theorem (in Lean, Coq,
             | Isabelle/HOL or whatever) instead of a constraint problem.
             | So it would be more limited by developer effort than by
             | computational power.
             | 
             | Theoretically you can do this from software down to
             | (idealized) gates, but in practice the effort is so great
             | that it's only been done in extremely limited systems.
        
           | tsimionescu wrote:
            | Formal methods are widely used in processor design. The
            | hard part is writing formal specs asserting that bugs we
            | haven't thought about don't exist, at least while also
            | preserving the property of being a Turing machine.
        
             | nextaccountic wrote:
             | I know. I mean applying formal methods to this specific
             | problem of proving side channels don't exist (which seems
             | a very hard thing to do and might even require modifying
             | the whole design to be amenable to this analysis)
        
               | less_less wrote:
               | As a tidbit, this was part of how one of the teams
               | involved in the original Spectre paper found some of the
               | vulnerabilities. Basically the idea was to design a small
               | CPU that could be formally shown to be free of certain
               | timing attacks. In the process they found a bunch of
               | things that would have to change for the analysis to
               | work... maybe in a small system those wouldn't _actually_
               | lead to vulnerabilities, but they couldn 't prove it (or
               | it would require lots of careful analysis). And in big
               | systems, those features do lead to vulnerabilities.
        
               | nextaccountic wrote:
               | That's amazing!
               | 
               | Do you have some link about the designed CPU?
        
         | akoboldfrying wrote:
         | Well, the bug in this specific case (based on the article by
         | Tavis O. linked elsewhere in comments) looks to be the regular
         | kind -- probably an off-by-one in a microcode edge case. That
         | is, here it's _not_ the case that the CPU functions correctly
         | but leaves behind traces of things that should be private in
         | timing side channels, as was the case for Spectre.
        
           | trebligdivad wrote:
           | Yeh just a fun bug rather than anything too fundamental.
           | Still, it is a fun bug.
        
         | JohnBooty wrote:
         | > Is the future a swarm of disconnected A55 cores, each
         | running a single application?
         | 
         | don't you dare tease me like that
        
           | bobim wrote:
           | And programmed in... Forth!
        
         | lmm wrote:
         | > Is it even possible to design a cpu with out-of-order and
         | speculative execution that would have no security issue?
         | 
         | Yes, of course. But we'd have to put actual effort in, and
         | realistically people wouldn't pay enough extra to make it
         | worthwhile.
        
       | tasty_freeze wrote:
       | Benchmarking is always problematic -- what is a good
       | representative workload? All the same, I'd be curious if the
       | ucode update that plugs this bug has affected CPU performance,
       | eg, it diverts the "fast short rep move" path to just use the
       | "bad for short moves but great for long moves" version.
        
         | akoboldfrying wrote:
         | In the article by Tavis O. linked elsewhere in comments, he
         | suggests disabling the FSRM CPU feature _only as an expensive
         | workaround_ to be taken only if the microcode can 't be updated
         | for some reason. That suggests to me that he, at least, expects
         | the update to do better.
        
         | ReactiveJelly wrote:
         | That would be the conservative thing to do. If there's no limit
         | on microcode updates, if I was Intel, I'd consider doing that
         | first and then speeding it up again later. Based on the
         | 5-second guess that people who update everything regularly will
         | care that we did the right thing for security, and people who
         | hate updates won't be happy anyway, so at least the first
         | update will be secure if they never get the next one.
         | 
         | (I think there is a limit on microcode updates; they seem
         | conservative about releasing new ones, but I don't remember
         | the details)
        
         | kevincox wrote:
         | It's a shame that Google didn't publish numbers. They have very
         | good profiling across all of their servers and probably have
         | incredibly high confidence numbers for the real-world impact on
         | this. (Assuming that your world is lots of copying protocol
         | buffers in C++ and Java)
        
       | writeslowly wrote:
       | I noticed the Intel advisory [1] says the following
       | 
       | Intel would like to thank Intel employees:[...] for finding this
       | issue internally.
       | 
       | Intel would like to thank Google Employees: [...] for also
       | reporting this issue.
       | 
       | [1] https://www.intel.com/content/www/us/en/security-
       | center/advi...
        
         | narinxas wrote:
         | I wonder how much sooner than Google the Intel employees
         | found this issue
        
           | narinxas wrote:
           | but what I am really wondering about is how much money (if
           | any) the vulnerability was worth up to the moment when
           | Google also discovered it
        
             | ajross wrote:
             | As described it's just a CPU crash exploit that requires
             | local binary execution. Getting to a vulnerability would
             | require understanding exactly how the corrupted microcode
             | state works, and that seems extremely difficult outside of
             | Intel.
             | 
             | So as described, this isn't a "valuable" bug.
        
               | derefr wrote:
               | This assumes that either 1. partners and interested
               | state actors aren't kept abreast of Intel's microcode
               | backend architecture, or 2. that there hasn't
               | been at least one leak of this information from one of
               | these partners into the hands of interested APT
               | developers. I wouldn't put strong faith in either of
               | these assumptions.
        
               | ajross wrote:
               | It does, but the same is true for virtually any such
               | crash vulnerability. The question was whether this was a
               | "valuable exploit", not whether it might theoretically be
               | worse.
               | 
               | The space of theoretically-very-bad attacks is much
               | larger than practical ones people will pay for, c.f.
               | rowhammer.
        
               | ethbr1 wrote:
               | >> _Getting to a vulnerability would require
               | understanding exactly how the corrupted microcode state
               | works, and that seems extremely difficult outside of
               | Intel._
               | 
               | Intel knows exactly how their ROB works.
               | 
               | Therefore Intel knows the possible consequences of this
               | bug and how to trigger them.
               | 
               |  _If_ there is a privilege execution path from this,
               | Intel knows. And anyone Intel chose to share it with
               | knew.
               | 
               | Thankfully, since it's public now, the value of that
               | decreases and customers can begin to mitigate.
        
               | ajross wrote:
               | > If there is a privilege execution path from this, Intel
               | knows. And anyone Intel chose to share it with knew.
               | 
               | No, or at least not yet. I mean, I've written plenty of
               | bugs. More than I can count. How many of them were
               | genuine security vulnerabilities if properly exploited?
               | Probably not zero. But... I don't know. And I wrote the
               | code!
        
               | saagarjha wrote:
               | Intel said it can be used for escalation if that answers
               | your question.
        
               | lmm wrote:
               | Did they confirm that it can definitely be used for
               | escalation? The description I saw was "may allow an
               | authenticated user to potentially enable escalation of
               | privilege and/or information disclosure and/or denial of
               | service via local access" which sounds like they're
               | covering all their bases and may not actually know what
               | is and isn't possible.
        
               | dgacmu wrote:
               | It's not super-valuable yet, but it would let you mount
               | a really nasty DoS on cloud providers by triggering hard
               | resets of the physical machines. Some people would
               | probably pay for that, though it's obviously more
               | interesting to push on privilege or exfiltration.
               | 
               | Particularly since the MCEs triggered could prevent an
               | automatic reboot. Would depend what the hardware
               | management system did - do machines presenting MCEs get
               | pulled?
        
               | toast0 wrote:
               | If I'm a cloud provider and somebody's workflow is hard
               | resetting lots of my physical machines, I'm going to give
               | them free access to single tenant machines at the very
               | minimum. If they keep crashing the machines that only
               | they run on, I guess that's ok.
        
               | dgacmu wrote:
               | You can exploit this from a single core shared instance.
               | 
               | So you go and find yourself a thousand cheap / free tier
               | accounts, spin up an instance in a few regions each, and
               | boom, you've taken out 10k physical hosts. And run it in
               | a lambda at the same time, and see how well the security
               | mechanisms identify and isolate you.
               | 
               | Causing a near simultaneous reboot of enough hosts is
               | likely to take other parts of the infrastructure down.
        
               | ajross wrote:
               | I'm curious what part of this scheme involves "not ending
               | up in jail"? Needless to say you can't do this without
               | identifying yourself. To make this an exploitable DoS
               | attack you need to be able to run arbitrary binaries on a
               | few thousand cloud hosts _that you didn 't lease
               | yourself_.
        
               | blibble wrote:
               | there exist people outside of your jurisdiction
               | 
               | e.g. the GRU
        
               | TeMPOraL wrote:
               | So Replit, Godbolt, and whatever other cloud-hosted
               | compilers are there?
        
               | mschuster91 wrote:
               | > I'm curious what part of this scheme involves "not
               | ending up in jail"? Needless to say you can't do this
               | without identifying yourself.
               | 
               | Stolen credit cards are a dime a dozen, and nation state
               | actors can just use their domestic banks or agents in the
               | banks of other countries in a pinch to deflect blame or
               | lay false trails.
               | 
               | If I were Russia or China, I'd invest _a lot_ of money
               | into researching all kinds of avenues on how to take out
               | the large three public cloud providers if need be: take
               | out AWS, Google, Microsoft and on the CDN side Cloudflare
               | and Akamai and suddenly the entire Western economy grinds
               | to a halt.
               | 
               | The only ones who will not be affected are the US
               | government cloud services in AWS, as this runs separate
               | from other AWS regions - that is, unless the attacker
               | gets access to credentials that allow them executions on
               | the GovCloud regions...
        
               | vbezhenar wrote:
               | If clouds use shared servers to run their management
               | workloads and if very important companies use shared
               | servers to run their workloads, they would deserve it.
               | 
               | But I don't believe it. People are not that stupid.
        
               | mschuster91 wrote:
               | > If clouds use shared servers to run their management
               | workloads and if very important companies use shared
               | servers to run their workloads, they would deserve it.
               | 
               | Why target the management plane? Fire off payloads to
               | take down the physical VM hosts and suddenly any cloud
               | provider has a serious issue because the entire compute
               | capacity drops.
        
               | ajross wrote:
               | > If I were Russia or China, I'd invest a lot of money
               | into researching all kinds of avenues on how to take out
               | the large three public cloud providers
               | 
               | This subthread started with "is this issue a valuable
               | exploit". Needless to say, if you need to invoke
               | superpower-scale cyber warfare to find an application,
               | the answer is "no". Russia and China have plenty of
               | options to "take out" western infrastructure if they're
               | willing to blow things up[1] at that scale.
               | 
               | [1] Figuratively and literally
        
               | dgacmu wrote:
               | Countries have proven far more reticent to use kinetic
               | options vs. cyberattacks. Or, put differently, we're all
               | hacking each other left and right and the responses have
               | thus far mostly remained in the digital realm.
               | 
               | See, e.g., https://madsciblog.tradoc.army.mil/156-what-
               | is-the-threshold...
               | 
               | > responses are usually proportional to and in the same
               | domain as the provocation
        
               | mschuster91 wrote:
               | > Or, put differently, we're all hacking each other left
               | and right and the responses have thus far mostly remained
               | in the digital realm.
               | 
               | Which is both good and bad at the same time. Cyber
               | warfare has been significantly impacting our economies
               | and our citizens - anything from scam callcenters over
               | ransomware to industrial espionage - to the tune of many
               | dozens of billions of dollars a year. And yet, no Western
               | government has ever held the bad actors publicly
               | accountable, which means that they will continue to be a
               | drain on our resources at best and a threat to national
               | security at worst (e.g. the Chinese F-35 hack).
               | 
               | I mean, I'm not calling for nuking Beijing, that would be
               | disproportionate - but even after all that's happened,
               | Russia and China are still connected to the global
               | Internet, no sanctions, nothing.
        
               | blibble wrote:
               | it's not superpower-scale
               | 
               | some bored kid with a couple of hundred stolen credit
               | cards can bring down a significant chunk of AWS/GCP/...
        
               | dgacmu wrote:
               | I mean, you kinda can. There's a depressingly thriving
               | market for stolen cards and things like compromised
               | accounts. A card is a couple of dollars. There are many
               | jurisdictions that turn a blind eye to hacking US
               | companies. Look at how hard it's been to rein in the
               | ransomware gangs and even 'booter' (ddos-for-rent)
               | services.
               | 
               | DoS isn't as lucrative as other things; I assume that
               | most state actors would far prefer to find a way to turn
               | this into a privilege escalation. But being able to
               | possibly take out a cloud provider for a while is still
               | monetizable.
        
               | sweetjuly wrote:
               | The blogpost describes that unrelated sibling SMT threads
               | can become corrupted and branch erratically. If you can
               | get a hypervisor thread executing as your SMT sibling and
               | you can figure out how to control it (this is not an if
               | so much as a when), that's a VM escape. The Intel
               | advisory acknowledges this too when they say it can lead
               | to privilege escalation. This is hardly a useless bug, in
               | fact it's awfully powerful!
        
       | jefc1111 wrote:
       | This was a lot more fun than the Google puff piece.
        
       | frontalier wrote:
       | The date on the article is for tomorrow?
        
         | bitwize wrote:
         | Cereal Killer: Check this out, it's a memo about how they're
         | gonna deal with those oil spills on the 14th.
         | 
         | Acid Burn: What oil spills?
         | 
         | Lord Nikon: Yo, brain dead, today's the 13th.
         | 
         | Cereal Killer: Whoa, this hasn't happened yet!
        
       | quietpain wrote:
       | ...our validation pipeline produced an interesting assertion...
       | 
       | What is a validation pipeline?
        
         | tonfa wrote:
         | The blog has a link to
         | https://lock.cmpxchg8b.com/zenbleed.html#discovery which
         | presents the concept.
        
         | ForkMeOnTinder wrote:
         | It's described one paragraph earlier.
         | 
         | > I've written previously about a processor validation
         | technique called Oracle Serialization that we've been using.
         | The idea is to generate two forms of the same randomly
         | generated program and verify their final state is identical.
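         | 
         | A toy model of that compare-two-forms idea (this sketch is an
         | assumption about the shape of the technique, not Google's
         | actual silifuzz tooling, which runs real machine code and
         | serializes with real fence instructions):

```python
import random

# Model "Oracle Serialization": run the same randomly generated program
# in two forms, as-is and with a serialization point after every step,
# then assert the final architectural state is identical. On a correct
# CPU the serialized form can only be slower, never different.
def run(program, serialize_each_step=False):
    state = {"a": 0, "b": 0}
    for op, reg, val in program:
        if op == "add":
            state[reg] = (state[reg] + val) % 256
        elif op == "mov":
            state[reg] = val
        if serialize_each_step:
            pass  # on real hardware: emit a serializing instruction here
    return state

def fuzz_once(rng: random.Random) -> bool:
    program = [(rng.choice(["add", "mov"]),
                rng.choice(["a", "b"]),
                rng.randrange(256)) for _ in range(16)]
    return run(program) == run(program, serialize_each_step=True)
```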
        
           | 1f60c wrote:
           | Sounds like the real story should be that Google solved the
           | halting problem. :-P
        
             | kadoban wrote:
             | You're free to solve the halting problem for restricted
             | sets of programs, that doesn't break any rules of the
             | universe.
             | 
             | They could also just be discarding any program where it
             | runs for longer than X time, among other possibilities.
        
               | tgv wrote:
               | They might be generating programs that they know will
               | halt. Like: applications with finite loops and such.
               | There are not enough details.
        
       | mike_d wrote:
       | The most awesome part:
       | 
       | > This bug was independently discovered by multiple research
       | teams within Google, including the silifuzz team and Google
       | Information Security Engineering.
        
       | yodon wrote:
       | Dupe: https://news.ycombinator.com/item?id=38268043
       | 
       | (As of this writing, this post has more votes, the other has more
       | comments)
        
         | dang wrote:
         | We'll merge that one hither. Please stand by!
        
       | blauditore wrote:
       | Can someone give a TL;DR for non-CPU experts? All technical
       | articles seem pretty long and/or complex.
        
         | Arnavion wrote:
         | Some x86 instructions can have prefixes that modify their
         | behavior in a meaningful way. Such a prefix can be applied
         | generally to any instruction, but it's expected to have no
         | effect when applied to an instruction it doesn't make sense
         | with. But it turns out the CPU actually misbehaves in some
         | cases when this is done. Intel released a CPU firmware update
         | to fix it.
        
         | kmeisthax wrote:
         | x86 has a builtin memory copy instruction, provided by the
         | combination of the movsb instruction and a rep _prefix byte_ ,
         | that says you want the instruction to run in a loop until it
         | runs out of data to copy. This is "rep movsb". This
         | instruction is fairly old, meaning a lot of code still uses
         | it, even though there are faster ways to copy memory in x86.
         | 
         | Intel added two features to modern x86 chips that detect rep
         | movsb and accelerate it to be as fast as those other ways.
         | However, those features have a bug. You see, because rep is a
         | prefix byte, you can just keep adding more prefix bytes to
         | the instruction (up to the 15-byte instruction length limit).
         | x86 has other prefix bytes too, such as rex (used to access
         | registers r8-r15), vex, evex, etc. The part of the processor
         | that recognizes a rep movsb does NOT account for these other
         | prefix bytes, which makes the processor get confused in ways
         | that are difficult to understand. The processor can start
         | executing garbage, take the wrong branch in if statements,
         | and so on.
         | 
         | Most disturbingly, when multiple physical cores are executing
         | these "rep rep rep rep movsb" instructions at the same time,
         | they will start generating machine check exceptions, which can
         | at worst force a physical machine reboot. This is very bad for
         | Google because they rent out compute time to different
         | companies and they all need to be able to share the same
         | machine. They don't want some prankster running these
         | instructions and killing someone else's compute jobs. We call
         | this a "Denial of Service" vulnerability because, while I can't
         | read someone else's computations or change them, I _can_ keep
         | them from completing, which is just as bad.
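         | 
         | At the byte level the trigger is easy to picture (0xf3 is the
         | rep prefix and 0xa4 the movsb opcode; per Tavis Ormandy's
         | writeup the bug involves a redundant rex prefix, one of
         | 0x40-0x4f, stacked in front of rep movsb in 64-bit mode):

```python
REP = 0xF3    # rep prefix byte
MOVSB = 0xA4  # movsb opcode byte

def movsb_with_prefixes(*prefixes: int) -> bytes:
    """Encode movsb preceded by an arbitrary run of prefix bytes."""
    return bytes([*prefixes, MOVSB])

plain = movsb_with_prefixes(REP)              # f3 a4: ordinary rep movsb
stacked = movsb_with_prefixes(REP, REP, REP)  # legal, identical meaning
rep_rex = movsb_with_prefixes(REP, 0x44)      # f3 44 a4: rep rex.r movsb
```

         | The rex.r bit means nothing to movsb, which is exactly the
         | kind of redundant-but-legal encoding the fast-path detection
         | mishandled.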
        
           | BlueTemplar wrote:
           | > they all need to be able to share the same machine
           | 
           | Do they? As these issues keep piling up, it just seems that
           | it's not worth the hassle, and they should instead never do
           | sharing like this...
        
             | jrockway wrote:
             | To some extent, anyone with a web browser is sharing their
             | machine with other people. That's Javascript.
             | 
             | If you ever download untrustworthy code and run it in a VM
             | to protect your main set of data, that's another case.
             | 
             | The success of cloud computing is from the idea that
             | multiple people can share the same computer. You only need
             | one core, but CPUs come with 128, but with the cloud you
             | can buy just that one core and share 1/128th of the power
             | supply, rack space, motherboard, ethernet cable, sysadmin
             | time, etc. and that reduces your costs. That assumption is
             | all based on virtualization working, though; nobody wants
             | 1/128th of someone else's computer, they want their own
             | computer that's 1/128th as fast. Bugs like these
             | demonstrate that you're just sharing a computer with
             | someone, which is bad for the business of cloud providers.
        
               | BlueTemplar wrote:
               | My point is that for a sufficiently large user, you can
               | probably use enough of the 128 cores by yourself alone,
               | that it's more worthwhile to do that and turn off these
               | mitigations: both because it removes a whole class of
               | threats, and also because the mitigations tend to have a
               | non-negligible performance impact, especially when first
               | discovered, on chips that haven't been designed to
               | protect against them.
        
               | jrockway wrote:
               | I very much agree with that. The reality is that cloud
               | providers can replace entire machines with only a small
               | latency blip in your application (or at least GCP can),
               | so if you are doing things like buying 2 core VMs 64
               | times to avoid losing more than 1% capacity when a
               | machine dies, you probably don't actually need to do
               | that. You could get a 128 core dedicated machine, and
               | then not share it with anyone, and your availability time
               | in that region/AZ probably wouldn't change much.
               | 
               | That said, machines are really monstrously huge these
               | days, and it can be hard to put them to good use. You
               | also miss out on cost savings like burstable instances,
               | which rely on someone else using the capacity for the 16
               | hours a day when you don't need it. It's a balance, but
               | I'd say "just buy a computer" would be my starting point
               | for most application deployments.
        
             | kevincox wrote:
             | If you don't want to share GCP and AWS both offer ways to
             | rent machines that aren't shared with other users. But for
             | most people the cost isn't worth it because shared machines
             | work well enough and provide much better resource
             | utilization.
        
             | kmeisthax wrote:
             | So your argument is that everyone who wants to run a
             | WordPress blog should be paying $320/mo[0] to rent a whole
             | machine just so we can avoid one _specific_ kind of
             | security problem?
             | 
             | [0] Based on the cost to rent an EC2 Dedicated Host (a1
             | family). See https://aws.amazon.com/ec2/dedicated-
             | hosts/pricing/
        
       | rep_lodsb wrote:
       | The REX prefix is redundant for 'movsb', but not 'movsd'/'movsq'
       | (moving either 32- or 64-bit words, depending on the prefix).
       | That may have something to do with the bug, if there is any
       | shared microcode between those instructions?
        
       | ZoomerCretin wrote:
       | Intel is a known partner of the NSA. If Intel was intentionally
       | creating backdoors at the behest of the NSA, how would they look
       | different from this vulnerability and the many other discovered
       | vulnerabilities before it?
        
         | rep_lodsb wrote:
         | My guess is that it would be something that could be exploited
         | via JavaScript. And no JIT would emit an instruction like the
         | one that causes this bug.
        
         | thelittleone wrote:
         | But so is Google. It would be some very crafty theatrics if
         | it's all coordinated.
        
           | ZoomerCretin wrote:
           | Only the people inserting the backdoor or using it would need
           | to be bound by a National Security Letter's gag order. I
           | doubt anyone at Google (including those subject to NSL gag
           | orders) was made aware of this specific vulnerability.
           | 
           | # Google's commitment to collaboration and hardware security
           | 
           | ## As Reptar, Zenbleed, and Downfall suggest, computing
           | hardware and processors remain susceptible to these types of
           | vulnerabilities. This trend will only continue as hardware
           | becomes increasingly complex. This is why Google continues to
           | invest heavily in CPU and vulnerability research. Work like
           | this, done in close collaboration with our industry partners,
           | allows us to keep users safe and is critical to finding and
           | mitigating vulnerabilities before they can be exploited.
           | 
           | There's a tension between the NSA wanting backdoors and
           | service providers (CPU designers + Cloud hosting) wanting
           | secure platforms. It's possible that by employing CPU and
           | security researchers, Google can tip the scales a bit further
           | in their favor.
        
         | gosub100 wrote:
         | the backdoor would just be an encrypted stream of "random" data
         | flowing right out the RNG. there's some maxim of crypto that
         | encrypted data is indistinguishable from random bytes.
        
         | tedunangst wrote:
         | How would you distinguish this backdoor from one inserted by an
         | unknown partner of the NSA?
        
       | dang wrote:
       | Related: https://cloud.google.com/blog/products/identity-
       | security/goo...
       | 
       | (via https://news.ycombinator.com/item?id=38268043, but we merged
       | the comments hither)
        
       | quotemstr wrote:
       | If the problem really is that the processor is confused about
       | instruction length, I'm impressed that this problem can be fixed
       | in microcode without a huge performance hit: my intuition (which
       | could be totally wrong) is that computing the length of an
       | instruction would be something synthesized directly to logic
       | gates.
       | 
       | Actually, come to think of it, my hunch is that the uOP decoder
       | (presumably in hardware) is actually fine and that the microcoded
       | optimized copy routine is trying to infer things about the uOP
       | stream that just aren't true --- "Oh, this is a rep mov, so of
       | course I need to go backward two uOPs to loop" or something.
       | 
       | I expect Intel's CPU team isn't going to divulge the details
       | though. :-)
        
       | ShadowBanThis01 wrote:
       | Is what? Another useless title.
        
       | eigenform wrote:
       | I wonder which MCEs are being taken when this is triggered?
        
       | malkia wrote:
       | Konrad Magnusson from Paradox Interactive (Victoria 3) team found
       | something related to that and mimalloc ->
       | https://github.com/microsoft/mimalloc/issues/807
       | 
       | Not sure if fully related, but possibly.
        
         | saagarjha wrote:
         | Seems unlikely unless they somehow emitted redundant prefixes
        
           | lights0123 wrote:
           | The article mentions
           | 
           | > This fact is sometimes useful; compilers can use redundant
           | prefixes to pad a single instruction to a desirable alignment
           | boundary.
           | 
           | so I imagine that could happen under the right optimization
           | mode.
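           | 
           | A sketch of what such padding looks like (0x3e is the ds
           | segment-override prefix, a common choice of redundant
           | padding byte; the mov encoding below is illustrative):

```python
DS_OVERRIDE = 0x3E  # segment-override prefix, redundant for most instructions

def pad_with_prefixes(insn: bytes, target_len: int) -> bytes:
    """Pad one instruction to target_len bytes by prepending redundant
    prefix bytes, so the result still decodes as a SINGLE instruction
    (unlike padding with separate nop instructions)."""
    assert target_len >= len(insn)
    return bytes([DS_OVERRIDE] * (target_len - len(insn))) + insn

mov_eax_1 = bytes([0xB8, 0x01, 0x00, 0x00, 0x00])  # mov eax, 1 (5 bytes)
padded = pad_with_prefixes(mov_eax_1, 8)           # 3e 3e 3e b8 01 00 00 00
```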
        
             | ithkuil wrote:
             | Why would a compiler prefer a redundant prefix over a nop
             | for alignment?
        
               | Vecr wrote:
               | It can be faster (at runtime).
        
               | ithkuil wrote:
               | so basically you're saying that the cpu frontend missed
               | the opportunity to ignore the 0x90 because it was an
               | actual instruction which would be converted into an
               | actual nop uop?
               | 
               | Is this still the case, or do modern Intel CPUs
               | optimize out the nop in the frontend decoder?
        
               | Vecr wrote:
               | Some compiler writers thought that was the case, if [0]
               | is related to OP. I don't have a "modern" (after 6th gen)
               | Intel CPU to test it on, but note that most programs are
               | compiled for a relatively generic CPU.
               | 
               | [0]: https://github.com/microsoft/mimalloc/issues/807
        
               | rasz wrote:
               | tedunangst down in the comments linked
               | https://repzret.org/p/repzret/ :
               | 
               | "Looking in the old AMD optimisation guide for the then-
               | current K8 processor microarchitecture (the first
               | implementation of 64bit x86!), there is effectively
               | mention of a "Two-Byte Near-Return ret Instruction".
               | 
               | The text goes on to explain in advice 6.2 that "A two-
               | byte ret has a rep instruction inserted before the ret,
               | which produces the functional equivalent of the single-
               | byte near-return ret instruction".
               | 
               | It says that this form is preferred to the simple ret
               | either when it is the target of any kind of branch,
               | conditional (jne/je/...) or unconditional (jmp/call/...),
               | or when it directly follows a conditional branch.
               | 
               | Basically, when the next instruction after a branch is a
               | ret, whether the branch was taken or not, it should have
               | a rep prefix.
               | 
               | Why? Because "The processor is unable to apply a branch
               | prediction to the single-byte near-return form (opcode
               | C3h) of the ret instruction." Thus, "Use of a two-byte
               | near-return can improve performance", because it is not
               | affected by this shortcoming."
               | 
               | ...
               | 
               | " If a ret is at an odd offset and follows another
               | branch, they will share a branch selector and will
               | therefore be mispredicted (only when the branch was taken
               | at least once, else it would not take up any branch
               | indicator + selector). Otherwise, if it is the target
               | of a branch, and if it is at an even offset but not
               | 16-byte aligned, as all branch indicators are at odd
               | offsets except at byte 0, it will have no branch
               | indicator, thus no branch selector, and will be
               | mispredicted.
               | 
               | Looking back at the gcc mailing list message introducing
               | repz ret, we understand that previously, gcc generated:
               | nop, ret
               | 
               | But decoding two instructions is more expensive than the
               | equivalent repz ret.
               | 
               | The optimization guide for the following AMD CPU
               | generation, the K10, has an interesting modification in
               | the advice 6.2: instead of the two byte repz ret, the
               | three-byte ret 0 is recommended
               | 
               | Continuing in the following generation of AMD CPUs,
               | Bulldozer, we see that any advice regarding ret has
               | disappeared from the optimization guide."
               | 
               | TLDR: Blame AMD K8! First x64 CPU. This GCC optimization
               | is outdated and should only be used when specifically
               | optimizing for K8.
        
       | farhanhubble wrote:
       | This is such an interesting read, right in the league of
       | "Smashing the stack" and "row hammer". As someone with very
       | little knowledge of security I wonder if CPU designers do any
       | kind of formal verification of the microcode architecture?
        
         | saagarjha wrote:
         | Yes.
        
       | asylteltine wrote:
       | Interesting write up. The submission needs a better and more
       | accurate title though
        
       | atesti wrote:
       | I don't understand "ERMS" and "FSRM" and there seems to be
       | nothing good on Google about them.
       | 
       | Are these just CPUID flags that tell you that you can use rep
       | movsb for maximum performance instead of optimized SSE memcpy
       | implementations? Or is there a special encoding/prefix for rep
       | movsb to make it faster? In case of the latter, why would that
       | be necessary? How does one make use of FSRM?
        
         | tommiegannert wrote:
         | Found this [1], which also links to the Intel Optimization
         | Manual [2].
         | 
         | Seems like ERMS was a cheaper replacement for AVX and FSRM was
         | a better version, for shorter blocks.
         | 
         | > Cheapest versions of later processors - Kaby Lake Celeron and
         | Pentium, released in 2017, don't have AVX that could have been
         | used for fast memory copy, but still have the Enhanced REP
         | MOVSB. And some of Intel's mobile and low-power architectures
         | released in 2018 and onwards, which were not based on SkyLake,
         | copy about twice more bytes per CPU cycle with REP MOVSB than
         | previous generations of microarchitectures.
         | 
         | > Enhanced REP MOVSB (ERMSB) before the Ice Lake
         | microarchitecture with Fast Short REP MOV (FSRM) was only
         | faster than AVX copy or general-use register copy if the block
         | size is at least 256 bytes. For the blocks below 64 bytes, it
         | was much slower, because there is a high internal startup in
         | ERMSB - about 35 cycles. The FSRM feature intended blocks
         | below 128 bytes to also be quick.
         | 
         | [1] https://stackoverflow.com/a/43837564
         | 
         | [2]
         | http://www.intel.com/content/dam/www/public/us/en/documents/...
        
         | ithkuil wrote:
         | FSRM is just the name of a CPU optimization that affects
         | existing code.
         | 
         | Choosing optimal instructions and scheduling can be done
         | statically at compile time or dynamically (by choosing one
         | of several library functions at runtime, or by jitting).
         | 
         | In order to be able to detect which is the optimal instruction
         | scheduling at runtime you need to know the actual CPU. You
         | could have a table of all cpu models or you could just ask your
         | OS whether the CPU you run on has that optimization
         | implemented.
         | 
         | Linux had to be patched so that it can _report_ that a CPU does
         | implement that optimization.
         | 
         | https://www.phoronix.com/news/Intel-5.6-FSRM-Memmove
        
         | rwmj wrote:
         | The flags just tell you that, on this CPU, rep movsb is fast so
         | you don't need to use an SSE/AVX-optimized implementation.
        
       | tommiegannert wrote:
       | Nice find. That indeed sounds terrible for anyone executing
       | external code in what they believe to be sandboxes. Good thing it
       | can be patched (and AFAICT, it seems to be a good fix rather
       | than a performance-affecting workaround).
        
       | tazjin wrote:
       | Can we get a better title for this? "Reptar - new CPU
       | vulnerability" or something. I thought it was some random startup
       | ad until I picked up the name somewhere else.
        
         | weinzierl wrote:
         | If it is changed to what you suggested, a question mark
         | would be warranted, because it is not yet clear what can be
         | done with this _"glitch"_ (as the article calls it).
        
           | Thorrez wrote:
           | Intel says
           | 
           | >A potential security vulnerability in some Intel(r)
           | Processors may allow escalation of privilege and/or
           | information disclosure and/or denial of service via local
           | access.
           | 
           | https://www.intel.com/content/www/us/en/security-
           | center/advi...
        
       | Borg3 wrote:
       | Uhm.. why not pad using NOPs? That looks much safer than
       | slapping on random prefixes.
        
         | muricula wrote:
         | On modern Intel CPUs, I am led to believe that issuing nops
         | is actually slower than adding prefixes. I think there is
         | work in the backend updating retired instruction counters
         | and other state which still occurs for nops, whereas
         | decoding prefixes happens entirely in the front end.
         | 
         | When a nop truly is necessary you will see compilers and
         | performance engineers add prefixes to the nop to make it the
         | desired size.
        
       | krylon wrote:
       | This is very well written. I know little about assembly
       | programming and Intel's ISA, let alone their microarchitectures,
       | but I could follow the explanation and feel like I have a rough
       | understanding of what is going on here.
       | 
       | Does anyone know if AMD CPUs are affected?
        
       | purpleidea wrote:
       | In this new Intel microcode bug, Tavis writes:
       | 
       | "We know something strange is happening, but how microcode works
       | in modern systems is a closely guarded secret."
       | 
       | My question: How likely is it that this is an intentional bug
       | door that was added into the microcode by Intel and its
       | government partners?
       | 
       | I don't know enough about microcode and CPU's to be able to
       | answer this myself, so backed-up opinions welcome!
        
         | jsnell wrote:
         | 0%.
         | 
         | This isn't how anyone would backdoor a CPU. An actual backdoor
         | would be done via some instruction sequence that is basically
         | impossible to trigger by accident and hard to detect even when
         | triggered.
        
           | fsflover wrote:
           | Can you give an example of such sequence? Is it really so
           | easy to hide it given that the microcode can be decoded in
           | principle, https://news.ycombinator.com/item?id=32145324? Why
           | is hiding it in a "bug" a worse solution? Why can't you do
           | both?
        
             | jsnell wrote:
             | Here are a couple of plausible ways.
             | 
             | One is to make the condition for the backdoor trigger based
             | on multiple (unlikely) instructions in sequence. This bug
             | was triggered by a single instruction, so it would have
             | been a pretty easy case for fuzzing. If you need a sequence
             | of 10 specific instructions in a specific sequence, with no
             | kind of observable side-effects for getting just the first
             | 9 right so that nobody can do a guided search? That's not
             | going to be found just by random chance. It doesn't
             | matter _what_ those instructions are, as long as they're
             | not something that would get generated by real compilers
             | on real programs.
             | 
             | The other is to make it dependent on the data rather than
             | just the static instructions. Like, what if you had the
             | SHA1 acceleration instructions trigger a backdoor iff the
             | output of the hash is a certain value? You could probably
             | even arrange for the backdoor to get triggered from managed
             | and sandboxed runtimes like Javascript, rather than needing
             | to get the victim to run native code. And somebody
             | triggering this by accident would be equivalent to
             | finding a SHA1 preimage.
        
       ___________________________________________________________________
       (page generated 2023-11-15 23:01 UTC)