[HN Gopher] How I found a bug in Intel Skylake processors (2017)
       ___________________________________________________________________
        
       How I found a bug in Intel Skylake processors (2017)
        
       Author : vinnyglennon
       Score  : 228 points
       Date   : 2021-11-08 16:12 UTC (6 hours ago)
        
 (HTM) web link (gallium.inria.fr)
 (TXT) w3m dump (gallium.inria.fr)
        
       | facorreia wrote:
       | 2017.
        
       | [deleted]
        
       | lordnacho wrote:
       | The problem with bugs deep in the stack is that it is really time
       | consuming to establish that they are in fact as deep as they are.
       | 
       | I wrote a Swift iOS app once, and came across an issue with one
       | of the collection classes.
       | 
       | Of course, nobody thinks that the Swift libs will be wrong as a
       | first guess. So I worked through a number of hypotheses about my
       | own code, slowly stripping out pieces that I thought might
       | contain an error. And then combinations. I also tried reducing
       | the number of entries just to simplify the logs. This worked, but
       | of course you are not going to think that there's a library bug
       | affecting collections with size > 16, and it wasn't actually a
       | theory until I randomly decided to reduce the n. I also
       | discovered that it worked just fine on release but not debug, so
        | I thought maybe I had some race condition.
       | 
        | More and more stripping down occurred, until I eventually gave up
        | on my own project and started a new one just to exercise the
        | collection class. I did it for the sake of being thorough,
       | rather than actually thinking the lib had a bug in its debug
       | implementation. But lo and behold, when I managed to make it
       | reproducible and put it on SO, someone from Apple acknowledged
       | that they could also see it, and they fixed it.
       | 
        | Naturally, if I'd gone straight to testing the lib I'd have saved
        | a huge amount of time, but I guess that's the tradeoff of the
        | most sensible heuristic: test your own code first, the bug is
        | there.
        
         | gh123man wrote:
         | > nobody thinks that the Swift libs will be wrong as a first
         | guess
         | 
         | This is highly dependent on which version of Swift you started
         | with! When Swift introduced the new substring API I hit a bug
         | where certain UTF-8 character sequences caused an index out of
         | bounds error internally. Unfortunately we learned this in
         | production when an entire organization couldn't launch our app
         | due to a string they were feeding through it.
         | 
         | That is how your trust in the standard libs is forever broken.
         | Library and compiler bugs were quite common in the Swift 1-3
         | days.
        
           | jcelerier wrote:
           | Yeah, over the course of my allegedly short career (I'm 29)
           | I've reported dozens of bugs against GCC, Clang, MSVC,
           | binutils, Qt, SDL, glibc, PortAudio, macOS and other
           | foundational stuff... I'm not saying I automatically assume
           | "toolchain bug", but my cutoff for seriously pondering "is it
           | a bug in $underlying_stuff" is around 30 minutes of "I really
           | can't see where in my code things were done wrong" and so far
           | this heuristic has consistently held...
        
         | cesarb wrote:
         | > but I guess that's the tradeoff from the most sensible
         | heuristic: test your own code first, the bug is there.
         | 
         | Also known as "select is not broken" (see for instance
         | https://blog.codinghorror.com/the-first-rule-of-
         | programming-...).
        
         | yjftsjthsd-h wrote:
         | Reminds me of: "It Is Never a Compiler Bug Until It Is"
         | (https://r6.ca/blog/20200929T023701Z.html ,
         | https://news.ycombinator.com/item?id=24636326). The bottom of
          | the modern stack is _really_ reliable, until it isn't ;)
        
           | Smoosh wrote:
            | Not just "the modern stack". I work on mainframes and always
           | felt the IBM-supplied environment (compilers, transaction
           | processing systems, databases) was rock solid.
           | 
           | Then one day I discovered APARs were a thing.
           | 
           | https://www.ibm.com/support/pages/open-apars-ibm-products-
           | av...
        
         | twic wrote:
         | Similar story with a bug in the IBM JDK's implementation of
         | BigDecimal. Surely if anyone is going to get decimals right
         | it's IBM! Took us a long time to stop looking at our code.
         | 
         | (turns out that IBM do get decimals right if you're running on
         | z/Architecture, where the code diverts to some hardware-
         | accelerated fast path; just not on x86-64 machines used by
         | paupers like my project)
        
       | CalChris wrote:
       | Debian announcement
       | 
       | https://lists.debian.org/debian-devel/2017/06/msg00308.html
       | 
       | Ahrefs writeup
       | 
       | https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...
       | 
       | The Intel spec update still labels SKL150 as _No Fix_ but there
       | is a microcode update available. Dunno exactly what to make of
       | that distinction.
       | 
       | https://www.intel.com/content/www/us/en/processors/core/desk...
       | 
       | Can an x86 program detect whether this update has been applied?
       | Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?
        
         | BeeOnRope wrote:
         | It was "fixed" in a microcode update by disabling the _loop
         | stream buffer_ (LSD) which is a special mode of operation for
         | very small loops where the instruction decoders and uop cache
         | in the CPU are shut down and the loop runs directly out of a
         | small cache*. Since the problem arose only when the LSD was
         | being used, in combination with hyperthreading and high byte
         | register use, this effectively avoids the problem.
         | 
         | Of course, disabling the LSD has some costs: CPUs use more
         | power and some loops are slower (though some are faster). These
         | updates are usually applied silently without user consent, so
          | you might be quite surprised to find out that after a reboot your
         | computation kernel suddenly draws more power or has slowed down
         | or sped up.
         | 
         | > Can an x86 program detect whether this update has been
         | applied? Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?
         | 
         | Yes. One way would be to check the microcode version (available
         | in /proc/cpuinfo on Linux, among other places), since the
         | version that introduced this fix is known.
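          | 
          | A minimal sketch of that first check (this assumes Linux and
          | the usual plain-text /proc/cpuinfo layout; the exact revision
          | that carries the SKL150 fix is part-specific and not filled
          | in here):
          | 
          |     /* Print the running microcode revision so it can be
          |      * compared against the one known to carry the fix. */
          |     #include <stdio.h>
          |     #include <string.h>
          | 
          |     int main(void) {
          |         FILE *f = fopen("/proc/cpuinfo", "r");
          |         if (!f) { perror("fopen"); return 1; }
          |         char line[256];
          |         while (fgets(line, sizeof line, f)) {
          |             if (strncmp(line, "microcode", 9) == 0) {
          |                 printf("%s", line);  /* "microcode : 0x..." */
          |                 break;  /* first core is representative */
          |             }
          |         }
          |         fclose(f);
          |         return 0;
          |     }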
         | 
         | Another way would be to run a small loop known to fit in the
         | LSD and then check a performance counter event which counts
         | uops delivered from the LSD, like lsd.uops. This counter is
         | always zero when the LSD is disabled (or realistically you
         | could just run _any_ substantial code and check the counter
          | since you always have some non-negligible portion of the uops
         | coming from the LSD). This is how I check it from the command
         | line in practice.
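          | 
          | A rough sketch of the counter-based check (this assumes Linux
          | perf_event_open and that LSD.UOPS is raw event 0xA8 with
          | umask 0x01 on this microarchitecture -- verify against "perf
          | list" on your machine; something like "perf stat -e lsd.uops
          | <cmd>" is the simpler command-line route):
          | 
          |     /* Count LSD uops over a tiny loop; a count of zero
          |      * suggests the LSD has been disabled by microcode. */
          |     #include <stdio.h>
          |     #include <string.h>
          |     #include <unistd.h>
          |     #include <sys/ioctl.h>
          |     #include <sys/syscall.h>
          |     #include <linux/perf_event.h>
          | 
          |     int main(void) {
          |         struct perf_event_attr a;
          |         memset(&a, 0, sizeof a);
          |         a.type = PERF_TYPE_RAW;
          |         a.size = sizeof a;
          |         a.config = 0x01a8;   /* umask 0x01, event 0xA8 */
          |         a.disabled = 1;
          |         a.exclude_kernel = 1;
          |         int fd = syscall(SYS_perf_event_open,
          |                          &a, 0, -1, -1, 0);
          |         if (fd < 0) { perror("perf_event_open"); return 1; }
          | 
          |         ioctl(fd, PERF_EVENT_IOC_RESET, 0);
          |         ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
          |         volatile unsigned long sink = 0;  /* small loop */
          |         for (unsigned long i = 0; i < 100000000UL; i++)
          |             sink += i;
          |         ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
          | 
          |         long long n = 0;
          |         read(fd, &n, sizeof n);
          |         printf("lsd.uops = %lld (LSD looks %s)\n", n,
          |                n ? "enabled" : "disabled");
          |         close(fd);
          |         return 0;
          |     }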
         | 
         | Finally, if you don't have easy access to the counters, you
         | could create a loop that has a significant performance
         | difference depending on whether it is coming from the LSD or
          | not. For example, a loop that crosses a 32-byte boundary will
          | take 2 or more cycles per iteration when fed by the decoders
          | or the uop cache, but could run in 1 cycle per iteration in
          | the LSD. Timing such a loop would give you a strong
          | indication of whether the LSD is enabled.
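          | 
          | A sketch of that timing approach (GCC/Clang inline asm on
          | x86-64; the nop padding and the use of the TSC as a rough
          | cycle proxy are assumptions here, so treat the printed
          | number as a hint rather than proof):
          | 
          |     /* Time a dec/jnz loop placed so its 5 bytes straddle a
          |      * 32-byte boundary: ~1 tick/iter hints the LSD is on,
          |      * ~2+ hints it is off. */
          |     #include <stdio.h>
          |     #include <stdint.h>
          |     #include <x86intrin.h>   /* __rdtsc */
          | 
          |     int main(void) {
          |         const uint64_t iters = 1ULL << 30;
          |         uint64_t n = iters;
          |         uint64_t t0 = __rdtsc();
          |         __asm__ volatile(
          |             ".p2align 5\n\t"       /* 32-byte align */
          |             ".rept 30\n\tnop\n\t.endr\n\t"
          |             "1:\n\t"
          |             "dec %[n]\n\t"
          |             "jnz 1b\n\t"
          |             : [n] "+r"(n) : : "cc", "memory");
          |         uint64_t t1 = __rdtsc();
          |         printf("%.2f TSC ticks per iteration\n",
          |                (double)(t1 - t0) / iters);
          |         return 0;
          |     }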
         | 
         | ---
         | 
         | * Specifically, the cache used is not a dedicated one, but
         | rather the IDQ (decoded instruction queue) is reused. This
         | queue holds uops and is normally fed by the decoders or the uop
          | cache on one end, and feeds the allocation/rename engine
         | on the other. In LSD mode, this queue stops being a queue and
         | is instead used as a kind of cache with the loop operations
         | "locked down" in the queue and just repeatedly replayed.
        
           | kaladin-jasnah wrote:
           | Dumb question, but why is it abbreviated as LS_D_ when it's
           | spelled loop stream _b_uffer?
        
             | CalChris wrote:
             | It's actually spelled _Loop Stream Detector_ and it dates
             | to the _Core 2_ processor family which is circa 2006. The
             | LSD is described in section 3.4.2.4 of the Intel
             | Optimization Manual, _Optimizing the Loop Stream Detector
             | (LSD)._ AnandTech describes how it works.
             | 
             | https://www.anandtech.com/show/2594/4
        
               | BeeOnRope wrote:
               | Yeah that's right. Not sure where I picked up the term
               | "... buffer" but a search shows I've been using it for a
               | while.
        
       | 13of40 wrote:
       | > More experienced programmers know very well that the bug is
       | generally in their code: occasionally in third-party libraries;
       | very rarely in system libraries
       | 
       | This was the bane of my existence when I worked on testing
       | Windows years ago. New SDETs almost invariably fell into the trap
       | of assuming any automation error was a "test bug" instead of a
       | bug in OS code, even if the OS code in question was written last
       | week.
        
       | 1432132143 wrote:
        | Really, guys, GFY. You know what OEMs do: they disable many
        | features every time some new bug shows up, e.g. undervolting.
        | Now the fan on my ThinkBook laptop is always on (at 30° the fan
        | is on, at 29° the fan is on) and I can't even undervolt my CPU
        | any more. Thanks a lot.
        
       | wging wrote:
       | Previous submission:
       | https://news.ycombinator.com/item?id=14686277
       | 
       | (This is not a complaint; I found the post interesting.)
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _I found a bug in Intel Skylake processors_ -
         | https://news.ycombinator.com/item?id=14686277 - July 2017 (99
         | comments)
        
       | [deleted]
        
       | bjarneh wrote:
       | > Binary search always fails? "The Java compiler is acting funny
       | today!"
       | 
       | :-)
        
       | Decabytes wrote:
       | I'm glad I'm just a pleb programmer, who never has done anything
       | so complicated that it would expose processor errata.
       | 
       | And even if I did, I wouldn't have the expertise to even figure
       | it out.
        
         | brokenmachine wrote:
         | Welcome to the 99.999999999%.
        
         | dfox wrote:
         | The issue there is that the hardware is full of totally absurd
         | bugs. If you target PC-like userspace or one of the two major
         | mobile platforms it is somebody else's job to shield you from
          | that. In general CPU-level bugs are somewhat rare, but every
          | single platform vendor has shipped some kind of silicon that
         | contains peripherals that do not work as documented and only by
         | chance work with the reference driver implementation.
        
       | SavantIdiot wrote:
       | This is a scary place to be: the top-level debug resource for a
       | major project. It took almost two years to resolve, but was
        | already known as SKL150. Without knowledge of SKL150, staring at
        | the clang vs. gcc assembly diff would make this all but
        | impossible to debug. GCC -O1 vs -O2 is a clue, but even with the
        | asm diffs, wth? Again, scary.
        
         | tinus_hn wrote:
         | The world is a scary place; this is basically the same as
          | rowhammer, which is an issue in computers shipped today.
        
           | woodruffw wrote:
           | Unless I'm misunderstanding what you mean, this isn't really
           | like rowhammer at all -- it's a uarch/ucode bug, which is
           | effectively a programming error within the CPU. Rowhammer is
           | a physical flaw in how memory cells in DRAM are laid out, one
           | that can be triggered by memory access patterns independent
           | of CPU architecture and microarchitecture.
           | 
           | (There are also hundreds of errata like this one in every CPU
           | generation. They're _usually_ not easy to exploit, since they
           | cause system instability rather than disclosing secret
           | material or allowing unintended code execution.)
        
             | zsmi wrote:
             | > Rowhammer is a physical flaw in how memory cells in DRAM
             | are laid out
             | 
             | It's not really a flaw, more like a consequence of how
             | memory cells are laid out. I mean most people want lots of
             | bits in their DRAM. Maximizing this parameter necessitates
             | that some will be in close proximity.
        
               | woodruffw wrote:
               | To my (non-EE) mind, the flaw is the electrical leakage
               | between the cells. Tight packing is a consequence of
               | economic forces, but I assume there are also technical
               | solutions that allow for tight packing (but either offset
               | the performance or cost gains). Is that assumption wrong?
               | (Genuinely asking!)
        
               | tlb wrote:
               | DRAM cells also decay over time (~ 60 milliseconds), but
               | memory controllers have some logic to refresh every row
               | on a regular schedule so it's not an issue.
               | 
               | They should also have logic to refresh adjacent rows if
               | some number of consecutive accesses to a small group of
               | rows is detected. This is rare in normal workloads,
               | because those accesses normally come from cache. It's
                | lame of chipmakers to not fix this. The fix would
                | require the DRAM controller (integrated into modern
               | CPUs) to know more about the internals of DRAMs than they
               | currently do.
        
               | zsmi wrote:
               | In theory DDR5/LPDDR5 added a controller command for
               | RowHammer mitigation but I haven't had time to research
               | it yet.
               | 
               | See: https://arxiv.org/pdf/2108.06703.pdf
        
               | zsmi wrote:
               | There was a good paper on it in 2014. [1] They describe
               | the RowHammer attack as: opening and closing (activation
               | and precharge) a DRAM row (aggressor row) at a high
               | enough rate (hammering) such that it can cause bit-flips
               | in physically nearby rows (victim row).
               | 
                | Colloquially, it's basically that a change in voltage
                | in one place can indirectly cause a change in voltage
                | in another place via capacitive coupling. Capacitance
                | increases in proportion to the inverse of the
                | separating distance, so only in recent years have
                | things shrunk to the size that makes it an issue.
                | 
                | Since having fewer bits in DRAM is basically not an
                | option, most mitigation techniques that I know of
                | remove the possibility of hammering: possibilities
                | include changes in the OS, the memory system
                | controller, or the DRAM controller.
               | 
               | [1] https://users.ece.cmu.edu/~yoonguk/papers/kim-
               | isca14.pdf
        
               | woodruffw wrote:
               | Much appreciated, thank you.
        
           | [deleted]
        
       | dimitrios1 wrote:
       | Apologies if this is off topic -- but I am constantly impressed
       | at some of the things I find that come from inria.fr. I first
       | came across them when learning OCaml. Seems to be a top notch
       | university.
        
         | woodruffw wrote:
         | Inria is a research institute, not a university. But they do
         | indeed do excellent work!
        
       | bruce343434 wrote:
        | The link called "6th Generation Intel(r) Processor Family -
        | Specification Update" 404s.
        
       | userbinator wrote:
       | "gcc/clang/icc/msvc won't usually issue the affected opcode
       | pattern and it ends up being rare. SKL150 - Short loops using
       | both the AH/BH/CH/DH registers and the corresponding wide
        | register _may_ result in unpredictable system behavior."
       | 
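        | To make the quoted condition concrete, here is a purely
        | illustrative sketch (GCC/Clang inline asm; it mixes AH and RAX
        | in a tight loop as the erratum describes, but it is not a
        | known reproducer -- the real trigger also involves
        | hyperthreading and specific timing):
        | 
        |     /* Illustration only: a short loop touching both the
        |      * high-byte register AH and the corresponding wide
        |      * register RAX, the mix named by SKL150. */
        |     #include <stdio.h>
        |     #include <stdint.h>
        | 
        |     int main(void) {
        |         uint64_t x = 0;
        |         for (int i = 0; i < 1000000; i++) {
        |             __asm__ volatile(
        |                 "movb $1, %%ah\n\t"   /* write AH  */
        |                 "addq $1, %%rax\n\t"  /* write RAX */
        |                 : "+a"(x) : : "cc");
        |         }
        |         printf("%llu\n", (unsigned long long)x);
        |         return 0;
        |     }
        | 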
       | I think Intel should regression-test its CPUs using the decades
       | of demoscene productions out there, especially those in the
       | extreme-size-optimisation categories; testing with almost
       | exclusively "mainstream" compiler output is IMHO a bad idea and a
       | step down the path to "warranty void if VLC is used"
       | (https://news.ycombinator.com/item?id=7205759 )
        
       ___________________________________________________________________
       (page generated 2021-11-08 23:00 UTC)