[HN Gopher] A bug that doesn't exist on x86: Exploiting an ARM-o...
       ___________________________________________________________________
        
       A bug that doesn't exist on x86: Exploiting an ARM-only race
       condition
        
       Author : stong1
       Score  : 279 points
       Date   : 2021-10-26 04:47 UTC (18 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | agalunar wrote:
       | Great write-up!
       | 
       | There may be a typo in section 3:
       | 
       | > It will happily retire instruction 6 before instruction 5.
       | 
       | If memory serves, although instructions can execute out-of-order,
       | they retire in-order (hence the "re-order buffer").
        
         | stong1 wrote:
         | Nice catch. I fixed it. I should have said "execute" rather
         | than "retire".
        
         | colejohnson66 wrote:
         | You are correct. The retire unit ensures that all micro ops are
          | retired in order.
        
       | vitus wrote:
       | I spent some time trying to figure out why the lock-free
       | read/write implementation is correct under x86, assuming a
       | multiprocessor environment.
       | 
       | My read of the situation was that there's already potential for a
       | double-read / double-write between when the spinlock returns and
       | when the head/tail index is updated.
       | 
       | Turns out that I was missing something: there's only one producer
       | thread, and only one consumer thread. If there were multiple of
       | either, then this code would be more fundamentally broken.
       | 
       | That said: IMO the use of `new` in modern C++ (as is the case in
       | the writer queue) is often a code smell, especially when
       | std::make_unique would work just as well. Using a unique_ptr
       | would obviate the first concern [0] about the copy constructor
       | not being deleted.
       | 
       | (If we used unique_ptr consistently here, we might fix the scary
       | platform-dependent leak in exchange for a likely segfault
       | following a nullptr dereference.)
       | 
       | One other comment: the explanation in [1] is slightly incorrect:
       | 
       | > we receive back Result* pointers from the results queue rq,
       | then wrap them in a std::unique_ptr and jam them into a vector.
       | 
       | We actually receive unique_ptrs from the results queue, then
       | because, um, reasons (probably that we forgot that we made this a
       | unique_ptr), we're wrapping them in another unique_ptr, which
       | works because we're passing a temporary (well, prvalue in C++17)
       | to unique_ptr's constructor -- while that looks like it might
       | invoke the deleted copy-constructor, it's actually an instance of
       | guaranteed copy elision. Also a bit weird to see, but not an
       | issue of correctness.
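        | 
        | Concretely, the pattern is something like this (pop() and
        | Result here are hypothetical stand-ins, not the repo's
        | actual code):
        | 
        |   #include <memory>
        |   #include <vector>
        | 
        |   struct Result {};
        |   std::unique_ptr<Result> pop();  // stand-in for rq's accessor
        | 
        |   void drain(std::vector<std::unique_ptr<Result>>& v) {
        |       // pop() yields a unique_ptr prvalue; wrapping it in
        |       // another unique_ptr invokes no constructor at all under
        |       // C++17 guaranteed copy elision, so the deleted copy
        |       // ctor is never even considered. Redundant, but harmless.
        |       v.push_back(std::unique_ptr<Result>(pop()));
        |   }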
       | 
       | [0] https://github.com/stong/how-to-exploit-a-double-
       | free#0-inte...
       | 
       | [1] https://github.com/stong/how-to-exploit-a-double-
       | free#2-rece...
        
         | aydwi wrote:
         | > IMO the use of `new` in modern C++ (as is the case in the
         | writer queue) is often a code smell
         | 
         | As a naive practitioner of modern C++, I'd love it if you could
         | elaborate on this.
        
           | ddlutz wrote:
           | It's widely accepted you should almost always use some smart
            | pointer like unique_ptr or shared_ptr, and even sometimes
           | the unpopular weak_ptr depending on your use case.
        
           | usefulcat wrote:
           | Whenever you use 'new', you have to decide what is going to
           | 'own' the newly allocated thing. You'll also have to remember
           | to call 'delete' somewhere.
           | 
           | Using unique_ptr/make_unique() or shared_ptr/make_shared()
           | automates lifetime management (obviates the need for a manual
           | 'delete') and makes the ownership policy explicit. They also
            | have appropriately defined copying behavior. For example:
            | 
            |   struct Foo {
            |       // lots of stuff here ...
            |   };
            | 
            |   struct A {
            |       Foo* f = new Foo;
            |       ~A() { delete f; }
            |   };
            | 
            |   struct B {
            |       std::unique_ptr<Foo> f = std::make_unique<Foo>();
            |       // no need to define a dtor; the default dtor is fine
            |   };
           | 
           | For the destructor and the default constructor, compilers
           | will generate basically identical code for both A and B
           | above. If you try to copy a B, the compiler won't let you
           | because unique_ptr isn't copyable. However it won't stop you
           | from copying an A, even though as written (using the default
           | copy ctor) that's almost certainly a mistake and will likely
           | result in a double free in ~A().
        
         | PaulDavisThe1st wrote:
         | > Turns out that I was missing something:
         | 
         | Indeed. It's not safe under x86 either.
        
         | stong1 wrote:
         | Great points. I made some minor edits to address that and
         | clarify some things. Thanks!
        
       | 0xfaded wrote:
       | My first gen threadripper occasionally deadlocks in futex code
       | within libgomp (gnu implementation of omp). Eventually I gave up
       | and concluded it was either a hardware bug or a bug that
       | incorrectly relies on atomic behaviour of intel CPUs. I
       | eventually switched to using clang with its own omp
       | implementation and the problem magically disappeared.
        
       | amelius wrote:
       | Does the race condition exist when emulating x86 on Apple M1?
        
         | mmwelt wrote:
         | Apple added hardware support for x86 memory semantics.
         | 
         | https://news.ycombinator.com/item?id=28731534
         | 
         | https://mobile.twitter.com/ErrataRob/status/1331735383193903...
        
         | saagarjha wrote:
         | No. Rosetta emulates TSO correctly.
        
           | addaon wrote:
           | To draw together the two answers here to the original
           | question.
           | 
           | 1) Emulating an ISA includes emulating its memory model. As
           | saagarjha says, this means that Rosetta 2 must (and does)
           | correctly implement total store ordering.
           | 
           | 2) There are various ways to implement this. For emulators
           | that include a binary translation layer (that is, that
           | translate x86 opcodes into a sequence of ARM opcodes), one
           | route is to generate the appropriate ARM memory barriers as
           | part of the translation. Even with optimization to reduce the
           | number of necessary barriers, though, this is expensive.
           | Instead, as mmwelt mentions, Apple took an unusual route
           | here. The Apple Silicon MMU can be configured on a per-page
           | basis to use either the relaxed ARM memory model or the TSO
           | x86 memory model. There is a performance cost at the hardware
           | level for using TSO, and there is a cost in silicon area for
           | supporting both; but from the point of view of Rosetta 2, all
           | it has to do is mark x86-accessed pages as TSO and the
           | hardware takes care of the details, no software memory
           | barriers needed.
        
             | saagarjha wrote:
              | TSO is a per-thread flag; it's implemented as a write to
              | an MSR on context switch to turn it on or off.
        
       | cookiewill wrote:
       | Is it normal for the .got.plt section to be writable rather than
       | read-only?
        
       | anyfoo wrote:
       | Heh, 10 years ago I gave a presentation about how easy folks used
       | to x86 can trip up when dealing with ARM's weaker memory model.
       | My demonstration then was with a naive implementation of
       | Peterson's algorithm.[1]
       | 
       | I have a feeling that we will see a sharp rise of stories like
       | this, now that ARM finds itself in more places which were
       | previously mostly occupied by x86, and all the subtle race
       | conditions that x86's memory model forgave actually start
       | failing, in equally subtle ways.
       | 
       | [1] The conclusion for this particular audience was: Don't try to
       | avoid synchronization primitives, or even invent your own. They
       | were not system level nor high perf code programmers, so they had
       | that luxury.
        
         | gpderetta wrote:
         | But Peterson's algorithm requires explicit memory barriers even
         | on x86, it doesn't seem the best example to show the
         | difference.
        
           | anyfoo wrote:
           | Here are my slides from back then:
           | https://reinference.net/mp-talk.pdf
           | 
           | You made me wonder, because I definitely remember using
           | Peterson's Algorithm, so I went back to my slides and turns
           | out: I first showed the problem with x86, then indeed added
           | an MFENCE at the right place, and then showed how that was
           | not enough for ARM. So the point back then was to show how
           | weaker memory models can bite you with the example of x86,
           | and then to show how it can still bite you on ARM with its
           | even weaker model (ARMv7 at that time, and C11 atomics aren't
           | mentioned yet either, but their old OS-specific support is).
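            | 
            | For reference, the shape of the algorithm with modern
            | atomics (a sketch; the talk predates C11/C++11 atomics):
            | 
            |   #include <atomic>
            | 
            |   std::atomic<bool> flag[2];
            |   std::atomic<int> turn;
            | 
            |   void lock(int me) {
            |       int other = 1 - me;
            |       // Defaults are seq_cst: the store/load pair below is
            |       // exactly where naive plain-variable code breaks,
            |       // on x86 included.
            |       flag[me].store(true);
            |       turn.store(other);
            |       while (flag[other].load() && turn.load() == other) {
            |           // spin
            |       }
            |   }
            | 
            |   void unlock(int me) {
            |       flag[me].store(false, std::memory_order_release);
            |   }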
        
             | gpderetta wrote:
             | Oh, right, yes, ARM additionally needs a release barrier on
             | the unlock path.
        
         | stjohnswarts wrote:
          | I doubt that. If we count processors in operation today
          | rather than historically, ARM already far outnumbers x86.
          | These stories will become more common, but they certainly
          | won't see a "sharp increase".
        
           | codeflo wrote:
           | This sort of bug only happens when running a multithreaded
           | program (with shared memory) on a multicore processor.
           | 
           | You do need both for the problem to happen: Without shared
           | memory, there's nothing to exploit. And with a single core
           | only, you get time-sliced multithreading, which orders all
           | operations.
           | 
           | My point is, that combination was a lot rarer in ARM land
           | before people started doing serious server or desktop
           | computing with those chips.
        
           | retrac wrote:
           | Of course. Any such flaws in the Linux kernel or any library
           | used by Android should have been found by now, for example.
           | But the number of ARM processors running
           | developer/server/desktop stacks has been tiny until recently.
           | In my experience, quite a lot of Linux on desktop software
           | fails to even build on non x86_64 machines.
        
             | stjohnswarts wrote:
             | Are you kidding? Arm computers are by far the most common
             | over the past 10 years. Computers are everywhere and
             | servers and home computers account for at most 10% of the
             | market for cpus and microcontrollers.
        
               | masklinn wrote:
                | The vast majority of those are not running multithreaded
               | workloads written by complete randos.
        
           | SavantIdiot wrote:
            | The dominant Arm cores in the world are the Cortex-M (or
            | Cortex-R), which are single-core. They are 99% of the time
            | on a die with less than 512K of SRAM, and run an RTOS or
            | bare metal.
            | 
            | These outnumber x86+Cortex-A by probably a factor of 1,000.
        
             | rsynnott wrote:
             | There are almost certainly more multi-core ARM chips than
             | x86 chips around, too, tho.
        
               | SavantIdiot wrote:
               | That's a good question. I guess we would need to compare
               | all the arm multi-core in smartphones & some laptops, vs
               | all the intel laptops/desktops/servers. Hmmm... tough
               | one.
        
         | mwcampbell wrote:
         | > Don't try to avoid synchronization primitives, or even invent
         | your own.
         | 
         | Makes me wonder if it's really a good idea in most cases to
         | use, for example, the Rust parking_lot crate, which
         | reimplements mutexes and RW locks. Besides the speed boost for
         | uncontended RW locks, particularly on x64 Linux, what I really
         | like about parking_lot is that a write lock can be downgraded
         | to a read lock without letting go of the lock. But maybe I'd be
         | better off sticking with the tried-and-true OS-provided lock
         | implementations and finding another way to work around the lack
         | of a downgrade option.
        
           | cesarb wrote:
           | Unless you're the maintainer of the parking_lot crate, you're
           | not "inventing your own". And since parking_lot is AFAIK the
           | second most popular implementation of mutexes and RW locks in
           | Rust (the most popular one being obviously the one in the
           | Rust standard library, which wraps the OS-provided lock
           | implementations), you can assume it's well tested.
        
             | codeflo wrote:
             | Everything about which people tell you to "not invent your
             | own" must be invented by _someone_.
        
               | twodai wrote:
               | I guess it's more about priorities, if you want to build
               | a web app don't reinvent the underlying protocols. If you
               | want to build an embedded system for monitoring room
               | temperature don't invent your own locks.
               | 
               | But if you want to make a lock by all means make a lock,
               | just don't go and reinvent the chip architecture...
        
               | JulianMorrison wrote:
               | It's not "don't invent it".
               | 
               | It's "be competent before you invent it, because it's
               | hard". And if you aren't, then let someone who is do the
               | inventing.
        
               | convolvatron wrote:
               | the best way to _get_ competent is to try
               | 
               | so yes - invent your own synchronization primitives.
               | please.
               | 
               | just dont believe they are correct without being serious
               | about trying to prove they are. and dont hold up your
               | whole project for self-enrichment.
               | 
               | but try to layer as much in as you can.
               | 
               | developers these days are so productive, until they fall
               | down and cant get up. and then they are completely
               | useless.
        
               | api wrote:
               | In that case you should never run any software. Go live
               | as a hunter-gatherer in the wilderness.
        
               | oriki wrote:
               | I mean, that's too literal of an interpretation.
               | 
               | It's more like "if you don't need to, don't invent your
               | own [x]." People who like to invent [x] are usually smart
               | enough to understand why that warning is there to begin
               | with, and don't tend to argue with it.
        
               | pornel wrote:
               | You're taking a saying too literally.
               | 
                | People who take on the task of writing such a library, and
               | develop it to the point it's a language-wide standard,
               | usually know what they're doing (or learn on the job :)
               | 
                | The popularity of such a library helps test it thoroughly on
               | many platforms in various conditions, so there's a high
               | chance the bugs will be spotted and fixed.
        
       | half-kh-hacker wrote:
       | this slaps. I always see perfect blue a few places above us!
        
         | nyanpasu64 wrote:
         | Context for downvoters: "perfect blue" is the CTF group writing
         | this article, and "a few places" means CTF team rankings in
         | competitions.
        
       | beebmam wrote:
       | Like quantum physics, memory ordering is deeply unintuitive (on
       | platforms like ARM). Unlike quantum physics, which is an
       | unfortunate immutable fact of the universe, we got ourselves into
       | this mess and we have no one to blame but ourselves for it.
       | 
       | I'm only somewhat joking. People need to understand these memory
       | models if they intend on writing atomic operations in their
       | software, even if they aren't currently targeting ARM platforms.
        | In this era, it's absurdly easy to switch an LLVM-based compiler to
        | target aarch64, and it will happen for plenty of software that
       | was written without ever considering the differences in atomic
       | behavior on this platform.
        
         | pclmulqdq wrote:
         | One piece of friction that hurts here is that the C++/Rust and
         | ARM memory models aren't the same, and the consequences of this
         | are unintuitive - compilers and CPUs can both screw with
         | execution ordering.
         | 
         | People who write in C++ should technically _only_ be concerned
         | with the C++ memory model, but x86 has let them be very lax and
         | undisciplined with std::memory_order_relaxed. ARM has some
         | alluring constructs that don't quite fit with the C++ model,
         | which can tempt you to mix and match memory models for
         | performance. All of this means trouble with atomics.
        
           | [deleted]
        
         | newpavlov wrote:
         | Memory ordering gets somewhat easier after you understand that
         | flat memory shared by execution units is a leaky abstraction
          | desperately patched over decades by layers and layers of
         | hardware and software. Memory ordering is one way to represent
         | message passing and synchronization between different cores and
          | RAM. This is why I think that "lock-free algorithms" is a
         | misnomer, you still have synchronization, but you simply rely
         | on hardware for it.
        
           | uuidgen wrote:
           | > "lock-free algorithms" is a misnomer, you still have
           | synchronization,
           | 
           | Lock-free doesn't mean that there is no synchronization. It
           | is a way to synchronize memory access between threads from
           | the start. It means that there is no additional locking to
           | protect access to the shared resource - all read access is
            | valid, and of any number of simultaneous write accesses at
           | least one succeeds (which is not true for some other
           | algorithms like network exponential backoff).
           | 
           | Even on x86 the most common instruction you use is LOCK
           | cmpxchg.
        
           | gpderetta wrote:
           | That's actually a common misconception. Memory ordering, on
           | the majority of common cpus, has nothing to do with
           | interprocessor communication or processor-ram communication.
           | Common memory coherency protocols (I.e. MESI and derivatives)
           | guarantee that all caches have a consistent view of memory.
           | 
            | Usually memory reordering is purely an artifact of the way
            | CPUs access their private L1 cache.
        
             | yvdriess wrote:
             | For the record, this is false. It is conflating memory
             | coherency with consistency.
             | 
             | Nearly everything in a modern processor is a source of
             | reordering, from branch prediction to basically everything
             | in the OoO backend. Any time you leave the core, there's
             | reordering happening in the network. And yes, that includes
             | caches, which involve a heavy amount of inter-core
             | communication. When you issue two successive loads to
             | different cache lines, which one is going to return first?
             | 
             | The OoO backend itself manages hazards and ensures that
             | ld/st instructions are retired in the correct order to
             | maintain the processor's memory consistency model. Software
             | can build on top of that, e.g. with fences, to impose
             | stricter consistency models.
        
               | gpderetta wrote:
               | Sorry, what is false? You seem to be agreeing with me?
               | 
               | edit: to clarify, I claim that the coherency layer (i.e
               | the L1 cache and beyond), does not introduce any
               | reordering issues, at least for common cpus using
               | variants of MESI.
        
           | dooglius wrote:
           | Hardware synchronization has better properties than software
           | locks: it can't deadlock, is reentrant, won't get screwed up
           | by a process holding a lock dying, and is (supposedly)
           | guaranteed to complete in bounded time. I don't think it's
           | unreasonable that the definition of lock-free ("guaranteed
           | system-wide progress") focuses on the high-level behavior of
           | typical software locks even if it ends up calling things that
           | are still locks in some sense "lock-free".
        
             | hnfong wrote:
             | That's the benefit you get from having code that defaults
             | to a race condition instead of defaulting to deadlock.
             | 
             | (Which is better? I don't know.)
        
           | rendaw wrote:
           | Is there anything out there that exposes a better or tighter
           | abstraction? Something not flat?
        
             | yvdriess wrote:
             | A Dataflow architecture ISA would. It's been tried before.
             | But, working out the entire software stack from scratch is
             | a moonshot.
        
               | formerly_proven wrote:
               | High-performance processors are data flow processors,
               | which infer the data flow graph from the instruction
               | stream using Tomasulo's algorithm.
        
             | devit wrote:
             | There are systems like the PS3 SPEs and DSPs where you only
             | have normal access to local on-chip memory and have to
             | explicitly initiate DMA to access external memory.
             | 
              | But that's just bad for running general-purpose software
              | that can require more memory than is available locally,
              | since it means you have to do cache management in
              | software, which is going to be much slower than letting
              | the hardware do it.
        
             | formerly_proven wrote:
             | In practice you _want_ memory reordering to be a thing
              | because that's what allows you to reorder instructions
             | that touch memory (both at compile time, and also at
             | runtime by the processor), which is what enables a large
             | part of the latency hiding that's going on.
        
       | drcongo wrote:
       | Nice try Intel.
        
       | pcwalton wrote:
       | Lock-free programming is really tough. There are really only a
       | few patterns that work (e.g. Treiber stack). Trying to invent a
       | new lock-free algorithm, as this vulnerable code demonstrates,
       | almost always ends in tears.
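        | 
        | Even the Treiber stack's push is deceptively small (a sketch
        | that ignores ABA and reclamation, which are exactly the
        | parts that tend to end in tears):
        | 
        |   #include <atomic>
        | 
        |   struct Node { int value; Node* next; };
        |   std::atomic<Node*> top{nullptr};
        | 
        |   void push(Node* n) {
        |       n->next = top.load(std::memory_order_relaxed);
        |       // The CAS publishes the node with release ordering; on
        |       // failure it refreshes n->next with the current top.
        |       while (!top.compare_exchange_weak(
        |                  n->next, n,
        |                  std::memory_order_release,
        |                  std::memory_order_relaxed)) {
        |       }
        |   }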
        
         | platinumrad wrote:
         | There's no new invention in here. Just an (intentional) misuse
         | of "volatile".
        
         | nyanpasu64 wrote:
         | IMO lock-free MP or MC algorithms are harder to get right than
         | SPSC structures (atomics for shared memory, queues for
         | messaging, triple buffers for tear-free shared memory). But
         | even SPSC algorithms can be tricky; I've found the same
         | (theoretical) ordering error in _three_ separate Rust
         | implementations of triple buffering (one of them mine), written
          | by people who've already learned the ordering rules (which I
         | caught with Loom). And initially learning to reason about
         | memory ordering is a major upfront challenge too.
        
           | ohazi wrote:
           | I particularly like lock-free (wait-free?) SPSC queues
           | because they're (relatively) easy to get right, and are
           | _extremely_ useful for buffering in embedded systems. I end
           | up with something like this on almost every project:
           | 
           | One side of the queue is a peripheral like a serial port that
           | needs to be fed/drained like clockwork to avoid losing data
           | or glitching (e.g. via interrupts or DMA), and the other side
           | is usually software running on the main thread, that wants to
           | be able to work at its own pace and also go to sleep
           | sometimes.
           | 
           | An SPSC queue fits this use-case nicely. James Munns has a
           | fancy one written in Rust [1], and I have a ~100 line C
           | template [2].
           | 
           | [1] https://github.com/jamesmunns/bbqueue
           | 
           | [2] https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638
           | d70...
        
           | reitzensteinm wrote:
           | I'd be interested in knowing the details of the error!
        
             | nyanpasu64 wrote:
             | https://github.com/HadrienG2/triple-buffer/issues/14
        
         | xxs wrote:
          | There are tons of lock-free algorithms, both node-based and
          | array-backed. Lock-free is notoriously easier on
          | garbage-collected setups, of course.
        
         | PaulDavisThe1st wrote:
          | This isn't a new lock-free algorithm. Single-reader, single-
          | writer FIFOs are one of the oldest approaches around.
          | 
          | They have to be tweaked (by using memory barriers) when
          | execution order isn't guaranteed. TFA is about an exploit
          | based on code that hasn't added the required memory barriers.
        
       | sydthrowaway wrote:
       | Any good references on low level details on ARMv8+?
        
       | gpderetta wrote:
       | The best part is that the original code is not safe even on x86
       | as the compiler can still reorder non-volatile accesses to the
        | backing_buf around the volatile accesses to head and tail.
       | Compiler barriers before the volatile stores and after volatile
       | reads are required [1]. It would still be very questionable code,
       | but it would at least have a chance to work on its intended
       | target.
       | 
       | tl;dr: just use std::atomic.
       | 
       | [1] it is of course possible they are actually present in the
       | original code and just omitted from the explanation for brevity
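        | 
        | For illustration, the std::atomic version of such a queue is
        | short (a sketch, not the challenge's code; N must be a power
        | of two so the free-running indices wrap cleanly):
        | 
        |   #include <atomic>
        |   #include <cstddef>
        | 
        |   template <typename T, size_t N>
        |   struct SpscQueue {
        |       T buf[N];
        |       std::atomic<size_t> head{0}, tail{0};
        | 
        |       bool push(const T& v) {  // producer thread only
        |           size_t t = tail.load(std::memory_order_relaxed);
        |           if (t - head.load(std::memory_order_acquire) == N)
        |               return false;    // full
        |           buf[t % N] = v;
        |           // release: the payload store above may not be
        |           // reordered below the index store
        |           tail.store(t + 1, std::memory_order_release);
        |           return true;
        |       }
        | 
        |       bool pop(T& v) {         // consumer thread only
        |           size_t h = head.load(std::memory_order_relaxed);
        |           if (h == tail.load(std::memory_order_acquire))
        |               return false;    // empty
        |           v = buf[h % N];
        |           head.store(h + 1, std::memory_order_release);
        |           return true;
        |       }
        |   };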
        
       | reitzensteinm wrote:
       | For those interested in memory ordering, I have a few posts on my
       | blog where I build a simulator capable of understanding
       | reorderings and analyze examples with it:
       | 
       | https://www.reitzen.com/post/temporal-fuzzing-01/
       | https://www.reitzen.com/post/temporal-fuzzing-02/
       | 
        | Next up are some lock-free queues, although I haven't gotten
        | around to publishing them!
        
       | secondcoming wrote:
       | There is a proposal (possibly accepted) to deprecate 'volatile'
       | in C++.
       | 
       | http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...
        
         | tialaramex wrote:
          | Yes, P1152 was adopted for C++20.
         | 
         | The purpose of abolishing volatile isn't so much to reinforce
         | that it's not intended for this sort of threading nonsense
         | (indeed on Windows the MSVC guarantees mean it almost _is_
         | intended for this sort of nonsense) but to make it explicit
         | that  "volatile variables" were never really a thing anyway by
         | abolishing the volatile qualifier on variables.
         | 
         | The thing your hardware can actually do is almost exactly:
         | https://doc.rust-lang.org/core/ptr/fn.read_volatile.html and
         | https://doc.rust-lang.org/core/ptr/fn.write_volatile.html
         | 
         | And sure enough that's equivalent to what is proposed for C++
         | although not in just this one paper.
         | 
         | With "volatile variables" you can use compound assignment
         | operators on the variable. What does that even mean? Nothing.
         | It means nothing, it's gibberish, but you can do it and people
         | do. They presumably _thought_ it meant something and since it
          | doesn't, they were wrong. So, deprecate this and maybe they'll
         | go read up on the subject.
         | 
         | You can also in C++ declare things that clearly aren't in the
         | least bit volatile, as volatile anyway. C++ has volatile member
         | variables, volatile member functions, volatile parameters...
         | Any code that seems to rely on this probably doesn't do what
          | the people who wrote it thought it does; run away.
        
           | mhh__ wrote:
            | Volatile primitives are how D does volatile as well.
           | 
           | I do sort of miss having a basic volatile (although I can
           | write my own type somewhat effectively) just for
           | benchmarking's sake sometimes.
        
           | agent327 wrote:
           | >With "volatile variables" you can use compound assignment
           | operators on the variable. What does that even mean? Nothing.
           | 
           | It means exactly the same thing as on a normal variable, and
            | it boggles the mind that people somehow fail to understand that.
           | Given 'volatile int i', 'i++' means the exact same thing as
           | 'i = i + 1'. Does that also not make any sense to you? If it
           | does, can you explain why you believe they are different?
           | 
           | Volatile member functions and parameters make no sense, but
           | volatile member variables most certainly do. And there is
           | considerable pushback in the C++ community because this is a
           | significant loss of compatibility with various C-headers used
           | frequently in embedded applications. I wouldn't be surprised
           | if the deprecated features will be reinstated in the language
           | in the end.
        
             | gpderetta wrote:
             | Yes, it will likely be reverted [1] for bitwise ops, while
             | keeping += and ++ deprecated (to save face someone would
             | say).
             | 
             | [1] http://www.open-
             | std.org/jtc1/sc22/wg21/docs/papers/2021/p232...
        
               | tialaramex wrote:
               | It's true that the relevant committee is now likely to
               | propose this be changed in C++ 23 (or before as an
               | errata)
               | 
               | But the face saving is on the part of embedded C++
               | proponents who'd rather _never fix the C++ language_ than
               | risk that the mundane C code cheap vendors actually ship
               | can 't be counted as "C++ support" because it's no longer
               | valid C++.
               | 
               | That's the core of the argument in P2327. Not that this
               | isn't a bad idea (it's obviously a bad idea) or that we
               | shouldn't do better (alternatives to C++ already do
               | better) but that this is the status quo and C++ can't
               | expect to improve upon that. A sad state of affairs.
               | 
               | P2327's dismissal of the problem caused by UART1->UCSR0B
                | |= (1 << UCSZ01); comes down to the usual Real Programmer
               | stance, surely every single embedded programmer will
               | arrange "by construction" that the problem can't happen.
               | No actual examples were examined to see if embedded
               | developers reliably do that, which seems unlikely - we're
               | just invited to imagine that they're all inerrant.
        
             | vlovich123 wrote:
             | Not sure what will actually happen, but you could easily
              | allow volatile variables if they have extern "C"
             | linkage for compatibility with C headers, while deprecating
             | it elsewhere.
        
               | agent327 wrote:
               | That's not really what extern "C" does though. It doesn't
               | change the language rules or the parsing or anything, it
               | only changes how symbols are represented in the object
               | file. Extern "C" means they are mangled using C rules
               | (that mostly involves adding an underscore); otherwise
               | it's gonna be C++ rules (and loads more information needs
               | to be encoded).
               | 
               | Some people argue for a similar mechanism, something like
               | 'language "C++20" { .. }', that would allow a program to
               | opt in to changes that would otherwise be breaking
               | changes; mostly new keywords. However, changing actual
               | language rules in such a block would be tricky, to say
               | the least.
        
             | tialaramex wrote:
             | > It means exactly the same thing as on a normal variable,
              | and it boggles the mind that people somehow fail to
              | understand that.
             | 
              | If you just want normal variables, write a normal variable;
              | JF Bastien's change doesn't have any consequences for
             | your normal variables.
             | 
             | > Given 'volatile int i', 'i++' means the exact same thing
             | as 'i = i + 1'
             | 
             | No. That's the post-increment operator so i++ has the same
             | value as i did before the increment, whereas i = i + 1 has
             | the value of i _after_ the increment.
             | 
             | And you might say "Oh, but I never use those values". Good.
              | JF Bastien's change forbids that for volatiles too; we're
             | on the same page.
             | 
             | What you apparently want is to do two volatile operations,
             | one read and one write, sandwiching an arbitrary integer
             | operation. You can do that explicitly of course, Bastien's
             | change doesn't alter that - but when you write it out,
             | doubt appears on the horizon. If these are two operations,
             | can't I race with somebody else such that they modify this
             | thing before I write my modified version back and then I
             | overwrite their changes?
             | 
             | And that doubt should be there. But what we know from
             | asking people is that the compound operators _falsely_
              | reassure them. That's why they were deprecated and why
             | it's sad that there's enthusiasm to bring them back.
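              | 
              | Spelled out, that compound assignment is two volatile
              | accesses with an unguarded gap in between (a sketch):
              | 
              |   volatile int v;
              | 
              |   void bump() {
              |       int tmp = v;    // one volatile read
              |       tmp = tmp + 1;  // plain arithmetic; anything can
              |                       // touch v here and its update
              |                       // will be silently overwritten
              |       v = tmp;        // one volatile write
              |   }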
        
       | PaulDavisThe1st wrote:
       | Either I'm not understanding something that I thought I
        | understood very well, or TFA's authors don't understand
       | something that they think they understand very well.
       | 
       | Their code is unsafe even on x86. You cannot write a single-
       | writer, single-reader FIFO on modern processors without the use
       | of memory barriers.
       | 
       | Their attempt to use "volatile" instead of memory barriers is not
       | appropriate. It could easily cause problems on x86 platforms in
       | just the same way that it could on ARM. "volatile" does not mean
       | what you think it means; if you're using it for anything other
       | than interacting with hardware registers in a device driver,
       | you're almost certainly using it incorrectly.
       | 
       | You must use the correct memory barriers to protect the
       | read/write of what they call "head" and "tail". Without them, the
       | code is just wrong, no matter what the platform.
        
         | stong1 wrote:
         | Right. The challenge is written incorrectly on purpose,
         | otherwise the code isn't vulnerable. The use of volatile is a
         | bit of a misdirection for the CTF players, since you're right
         | that it's a common misconception that volatile acts like a
         | barrier.
         | 
         | > You cannot write a single-writer, single-reader FIFO on
         | modern processors without the use of memory barriers.
         | 
         | I am not sure about this. From my understanding, on x86, given
         | the absence of compiler reordering, processor reordering should
         | not cause a problem for a single-reader-single-writer FIFO.
         | Normally I just use atomics but I think in this specific
         | instance it should still be okay anyways. Obviously it will not
         | work on ARM.
         | 
         | From my testing if you compile the code on x86 with clang or
         | gcc, the resulting binary is not vulnerable.
        
           | gpderetta wrote:
           | Without compiler fences in the right place [1] GCC and clang
           | can miscompile the code even on x86. Doesn't mean they will
           | of course.
           | 
           | [1] see the linux kernel implementation of load acquire and
           | store release on x86 for example.
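            | 
            | Transliterated to C++, those x86 definitions are roughly
            | the following (a sketch from memory, not the kernel's
            | actual code):
            | 
            |   inline void compiler_barrier() {
            |       asm volatile("" ::: "memory");  // emits no instruction
            |   }
            | 
            |   template <typename T>
            |   void store_release(volatile T* p, T v) {
            |       compiler_barrier();  // prior accesses can't sink below
            |       *p = v;              // plain store; x86 TSO does the rest
            |   }
            | 
            |   template <typename T>
            |   T load_acquire(const volatile T* p) {
            |       T v = *p;            // plain load; x86 TSO does the rest
            |       compiler_barrier();  // later accesses can't hoist above
            |       return v;
            |   }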
        
         | hvdijk wrote:
         | The point is that the bug is unexploitable on x86 because
         | although the source code may have a bug, on x86 it gets
         | compiled to machine code that does not. That's the thing with
         | undefined behaviour, sometimes it does work exactly as you
         | expect, which can make it so tricky to nail down.
        
         | kloch wrote:
         | > "volatile" does not mean what you think it means; if you're
         | using it for anything other than interacting with hardware
         | registers in a device driver, you're almost certainly using it
         | incorrectly.
         | 
         | Another "correct" use of volatile is a hack to prevent
         | compilers from optimizing away certain code. It's pretty rare
         | to need that and often you can just use a lower optimization
         | level (like the usual -O2) but sometimes you need -O3 / -Ofast
          | or something, and a strategically placed volatile to keep
          | everything working.
         | 
          | A classic example is the Kahan summation algorithm. At -O2 it's
         | fine. At -O3 or higher it silently defeats the algorithm while
         | appearing to work (you get a sum but without the error
         | compensation). Defining the working vars as volatile makes it
         | work again. This is noted in the wikipedia pseudocode with the
         | comment "// Algebraically, c should always be zero. Beware
         | overly-aggressive optimizing compilers!"
         | 
         | https://en.wikipedia.org/wiki/Kahan_summation_algorithm
         | 
         | Of course -O3 might not be any faster anyway but that's another
         | topic.
        
           | vlovich123 wrote:
           | I can't imagine it's an O2 vs O3 thing unless a compiler
           | enables "fast-math" optimization to allow associativity.
           | Neither clang nor GCC do this (neither does MSVC I think) -
           | optimization levels never silently turn off IEEE754 floating
           | point. I don't know about ICC but it sounds like they
           | stupidly enable fast math by default to try to win at
           | benchmarks.
           | 
           | Do you have anything to actually support this statement or
           | did you just assume "overly aggressive optimizing compilers"
           | and "O3" are somehow linked?
           | 
           | Generally optimization levels may find more opportunities to
           | exploit UB, but they do not change the semantics of the
           | language, and all languages I'm familiar with define floating
           | point as a non-associative operation because it's not when
           | you're working with finite precision.
           | 
           | TLDR: Don't use volatile unless you really know what you're
           | doing, and unless you know C/C++ really well, you probably do
           | not. If anyone tells you to throw in a volatile to "make
           | things work", it's most likely cargo curling bad advice (not
           | always, but probably).
        
             | gpderetta wrote:
             | there is some amount of truth on what the parent is saying.
             | Ages ago, when x86 only had x87 FP, gcc would program the
             | FPU to use 80 bit precision even when dealing with doubles.
             | The excess precision meant that GCC could not implement
             | IEEE math correctly even without fast-math. Forcing the
             | storing of intermediate values into memory via volatile
             | variables was a partial solution to this problem.
             | 
              | MSVC configures the FPU to use 64-bit precision, which
              | means that double works fine, but it has no 80-bit long
              | double and float still suffers from excess precision.
              | 
              | SSE avoids all these problems.
        
               | vlovich123 wrote:
               | Kind of, but that still shouldn't have impacted Kahan
               | summation, which only cares about associativity, and
               | extended precision doesn't impact that. They would just
               | end up getting more numerically accurate results on x87.
        
             | kloch wrote:
             | I did tests on Kahan summation recently on my macbook pro
             | and -O3 defeated the algorithm while -O2 did not. Declaring
             | the below variables as volatile restored error compensation
             | with -O3.
             | 
             | The relevant code is:
              |   kahan_y = g_sample_z - kahan_c;
              |   kahan_t = g_sample_z_sum + kahan_y;
              |   kahan_c = (kahan_t - g_sample_z_sum) - kahan_y;
              |   g_sample_z_sum = kahan_t;
             | 
             | (this is in an inner loop where a new g_sample_z is
             | calculated and then added to a running g_sample_z_sum with
             | this snippet)
        
               | vlovich123 wrote:
               | Sounds like a compiler bug to me. Can you file a bug to
               | clang with a reduced standalone test (or I can do it for
               | you if you share the standalone test).
        
               | [deleted]
        
               | [deleted]
        
               | kloch wrote:
               | Here is a complete simplified Kahan summation test and
               | indeed it works with -O3 but fails with -Ofast. There
               | must have been something else going on in my real program
               | at -O3. However the original point that 'volatile' can be
               | a workaround for some optimization problems is still
               | valid (you may want the rest of your program to benefit
               | from -Ofast without breaking certain parts).
               | 
               | Changing the three kahan_* variables to volatile makes
                | this work (slowly) with -Ofast.
                | 
                |   #include <stdio.h>
                | 
                |   int main(int argc, char **argv) {
                |       int i;
                |       double sample, sum;
                |       double kahan_y, kahan_t, kahan_c;
                | 
                |       // initial values
                |       sum = 0.0;
                |       sample = 1.0; // start with "large" value
                | 
                |       // add 1 large value plus 1 billion small values
                |       for (i = 0; i <= 1000000000; i++) {
                |           // Kahan summation algorithm
                |           kahan_y = sample - kahan_c;
                |           kahan_t = sum + kahan_y;
                |           kahan_c = (kahan_t - sum) - kahan_y;
                |           sum = kahan_t;
                | 
                |           // pre-load next small value
                |           sample = 1.0E-20;
                |       }
                |       printf("sum: %.15f\n", sum);
                |   }
        
               | hermitdev wrote:
               | I hope this isn't the actual "real" code, because you've
               | got undefined behavior before you even have to worry
               | about the associativity optimizations. There's an
               | uninitialized read of 'kahan_c' on the first loop
               | iteration.
        
               | vlovich123 wrote:
                | Correct. `-Ofast`'s claim to fame is that it enables
                | `-ffast-math`, which is why it has huge warning signs
                | around it in
               | the documentation. `-ffast-math` turns on associativity
               | which is problematic for Kahan summation. Rather than
               | sprinkling in volatiles which pessimizes the compiler to
               | no end, I would recommend annotating the problematic
               | function to turn off associativity [1][2].
               | 
                | Something like:
                | 
                |   [[gnu::optimize("no-associative-math")]]
                |   double kahanSummation() {
                |       ...
                |   }
               | 
               | That way the compiler applies all the optimizations it
               | can but only turns off associative math. This should work
               | on Clang & GCC & be net faster in all cases.
               | 
               | This is what I mean by "If you're sprinkling volatile
               | around, you probably aren't doing what you want" and are
               | just cargo culting bad advice.
               | 
               | [1] https://stackoverflow.com/questions/26266820/in-
               | clang-how-do... [2]
               | https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-
               | Attrib...
        
         | leeter wrote:
         | Well their use is definitely UB as it creates data races.
         | Godbolt to the rescue... https://godbolt.org/z/3rsK6n31z
        
         | gpderetta wrote:
         | As noted elsethread, the code is indeed wrong even on x86,
         | although you only need compiler fences there.
        
         | [deleted]
        
         | dataangel wrote:
         | I think their point is you only need compiler barriers not
         | actual barrier instructions on x86. volatile in practice has
         | been the de facto way to get the effect of a compiler memory
         | barrier for a long time even though it's not the best way to do
         | it nowadays. The original purpose of it is literally preventing
         | the compiler from getting rid of loads and stores and
         | reordering them which is exactly what is needed when
         | implementing a lockless FIFO. As long as all the stores and
         | loads (including the actual FIFO payload) are volatile, it will
         | work (volatile loads and stores are guaranteed to not be
         | reordered with each other). After that the x86 guarantees about
         | not reordering are very strong. Really the best argument
         | against volatile for this kind of thing is actually the
         | opposite of your point, _volatile is too strong_. It prevents
          | more reordering than you actually want. Acquire/release
          | semantics are less strong and give the compiler more
          | flexibility.
        
         | scatters wrote:
         | Another place it's meaningful to use `volatile` is in
         | benchmarking and testing: to either ensure that a block of code
         | is run despite not having any side effects, or to ensure that a
         | block of code that should _not_ be run is still compiled and
         | emitted to binary.
         | 
         | But yes, `volatile` for what should be atomics is a clear code
         | smell. I made quite a loud noise when I read "the code quality
         | looks excellent" in the article.
        
       | silisili wrote:
       | > Nowadays, high-performance processors, like those found in
       | desktops, servers, and phones, are massively out-of-order to
       | exploit instruction-level parallelism as much as possible. They
       | perform all sorts of tricks to improve performance.
       | 
       | Relevant quote from Jim Keller: You run this program a hundred
       | times, it never runs the same way twice. Ever.
        
         | krylon wrote:
         | Heraclitus, mumbling into his beard: "Told you so!"
         | 
         | SCNR
        
         | vlovich123 wrote:
         | A hundred times is not that much except for really cold code
         | paths. It's probably in the billions if not more and I have to
         | imagine that software level effects typically swamp HW-level
         | effects here. That's why you see software typically having a
         | performance deviation no greater than ~5-10% unless you're
         | running microbenchmarks.
        
       | Azsy wrote:
       | Have i told you about our lord and savior Rust?
       | 
       | Anyways, https://github.com/tokio-rs/loom is used by any serious
       | library doing atomic ops/synchronization and it blew me away with
       | how fast it can catch most bugs like this.
        
         | nyanpasu64 wrote:
         | Rust doesn't catch memory ordering errors, which can result in
         | behavioral bugs in safe Rust and data races and memory unsafety
         | in unsafe Rust. But Loom is an excellent tool for catching
         | ordering errors, though its UnsafeCell API differs from std's
         | (and worse yet, some people report Loom returns false
         | positives/negatives in some cases: https://github.com/tokio-
         | rs/loom/issues/180, possibly https://github.com/tokio-
         | rs/loom/issues/166).
        
           | tialaramex wrote:
           | > which can result in behavioral bugs in safe Rust
           | 
           | For example, Rust doesn't have any way to know that your
           | chosen lock-free algorithm relies on Acquire-release
           | semantics to perform as intended, and so if you write safe
           | Rust to implement it with Relaxed ordering, it will compile,
           | and run, and on x86-64 it will even _work_ just fine because
           | the cheap behaviour on x86-64 _has_ Acquire-release semantics
            | anyway. But on ARM your program doesn't work because ARM
           | really does have a Relaxed mode and without Acquire-release
           | what you've got is not the clever lock-free algorithm you
           | intended after all.
           | 
           | However, if you don't even understand what Ordering is, and
           | just try to implement the naive algorithm in Rust without
           | Atomic operations that take an Ordering, Rust won't compile
           | your program at all because it could race. So this way you
           | are at least confronted with the fact that it's time to learn
           | about Ordering if you want to implement this algorithm and if
           | you pick Relaxed you can keep the resulting (safe) mess you
           | made.
        
           | CodesInChaos wrote:
           | It doesn't catch all of them. But data-races on plain memory
            | access are impossible in safe Rust.
           | 
           | And atomics force you to specify an ordering on every access,
           | which helps both the writer (forced to think about which
           | ordering they need) and reviewer (by communicating intent).
        
           | Fiahil wrote:
           | I think it's fixable, the main reactor is what matters. You
           | can add or remove as many synchronisation primitive as you
           | like.
           | 
           | Other tooling, like Jepsen, will interact with your program
           | at a higher level.
        
       | im3w1l wrote:
       | And arm-windows will (does already?) run x86 binaries with weaker
       | memory ordering than they were written for. So this could be a
       | real thing soon.
        
         | belter wrote:
         | Now I am worried. Do you have a reference please?
        
           | im3w1l wrote:
           | Best I could find. It's not a great reference because it
           | doesn't give any details but it does prove that it's a thing.
           | https://docs.microsoft.com/en-us/windows/uwp/porting/apps-
           | on...
        
         | kevingadd wrote:
         | Are you sure the translators don't insert code necessary to
         | maintain ordering? I would be shocked if most threaded code
         | works when you throw out the x86 memory model. Managed runtimes
         | like .NET definitely generate code for each target designed to
         | maintain the correct memory model.
        
           | pmuderoc wrote:
           | They better do, but then, how would an automatic translator
           | know that this is a "release semantics" atomic store
           | operation?
           | 
           | Because on x86 it is, no special barriers or instructions
           | necessary.
           | 
           | mov [shared_data], 1
           | 
           | mov [release_flag], 1
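            | 
            | In C++ terms the pair above is roughly this (a sketch);
            | the release ordering is invisible in the x86 binary
            | because a plain mov already provides it:
            | 
            |   #include <atomic>
            | 
            |   int shared_data;
            |   std::atomic<int> release_flag;
            | 
            |   void publish() {
            |       shared_data = 1;
            |       // Compiles to a plain mov on x86 (TSO gives release
            |       // ordering for free), but must become stlr or
            |       // dmb+str on ARM; the x86 binary carries no hint of
            |       // which stores need that treatment.
            |       release_flag.store(1, std::memory_order_release);
            |   }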
        
             | my123 wrote:
              | It's pessimistic and converts a lot of memory accesses
              | over to RCpc or atomics.
             | 
             | (on ARMv8.0 where you don't have those, barriers are used
             | more)
             | 
             | TSO pessimization is the only way to make the thing work at
             | a translation time cost that isn't too high.
        
               | gpderetta wrote:
               | Or you support TSO directly on your cpu like Apple does
               | on M1.
        
               | tsimionescu wrote:
               | Sure, but Windows on ARM has to run on many ARM
               | processors, not a specific one designed by MS. They could
               | detect if the processor has non-standard TSO support and
               | use that when running an x86 app, but they still have to
               | do something to run the x86 app on a standard ARM
               | processor.
        
           | my123 wrote:
           | Maintaining the memory model guarantees is what causes the
           | steep cost in performance when using x86 apps on Windows on
           | Arm.
           | 
           | That said, heuristics are used to speed it up. I would
           | recommend not sharing values in the stack between threads for
           | synchronisation for example.
        
           | nyanpasu64 wrote:
           | https://docs.microsoft.com/en-us/windows/uwp/porting/apps-
           | on...
           | 
           | > You can also select multi-core settings, as shown here...
           | These settings change the number of memory barriers used to
           | synchronize memory accesses between cores in apps during
           | emulation. Fast is the default mode, but the strict and very
           | strict options will increase the number of barriers. This
           | slows down the app, but reduces the risk of app errors. The
           | single-core option removes all barriers but forces all app
           | threads to run on a single core.
           | 
           | https://news.ycombinator.com/item?id=28732273
           | 
            | zamadatix interprets this as Microsoft saying that by
            | default, Windows on ARM runs x86 apps _without_ x86 TSO, and
            | turns on extra memory barriers using per-app compatibility
            | settings. But if an app needs TSO but isn't in Windows's
            | database, it will crash or silently corrupt data.
        
         | xxs wrote:
         | Normally the code should have all the needed memory fences as
          | if running on DEC Alpha (e.g. Linux does that), and the
         | compilers omit the unneeded ones.
        
           | monocasa wrote:
           | And since the compiler omitted it on x86, an x86 emulator
           | doesn't have access to where they're required as seen by the
           | compiler.
        
             | xxs wrote:
              | An emulator would have zero issues; a direct translation
              | of the assembly (not an emulator) would need either
              | hardware support (e.g. Apple's chips) or memory barriers.
             | 
              | The differences between ARM and x86 have been known for
              | 15+ years; there is nothing new about it. Also
              | concurrency support is
             | one of the major benefits of languages with proper memory
             | model - java started it with JMM[0]
             | 
             | [0]: https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCh
             | eckedL...
        
               | monocasa wrote:
               | The differences are very known, it's just still an open
               | problem how to address running code for a stronger memory
                | model on systems with a weaker one at as high
                | performance as possible without explicit hardware support
               | (like Apple's choice of a TSO config bit). Your compiled
               | binary for TSO has erased any LoadLoad, LoadStore, and
               | StoreStore barriers, and the emulator has to divine them.
               | The heuristics there are still fraught with peril.
               | 
               | The JVM absolutely did some great work walking this path,
               | both in defining a memory model in the first place, and
               | supporting that model on weak and strong hardware memory
               | models, but the JMM was specifically designed to be able
               | to run cleanly on WMO platforms to begin with (early
               | Sparc), so they don't face a lot of the same problems
               | discussed here.
        
               | gpderetta wrote:
               | Any emulator that wants to be remotely performance
                | competitive will do dynamic translation (i.e. JIT). In
               | fact ahead-of-time translation is not really feasible.
               | 
               | Memory models and JVM are not really relevant when
               | discussing running binaries for a different architecture.
        
               | secondcoming wrote:
               | Doesn't the JVM define its own memory model?
        
               | gpderetta wrote:
               | Sure, but how's that relevant when discussing running x86
               | binaries on ARM?
        
               | xxs wrote:
                | Ah well, back in the day, prior to the JMM, there was
                | no notion of memory models at all, and no language had
                | one; hence, I referenced the original paper that
                | started it all. The point was that it happened a long
                | time ago, and there is nothing new about the current case.
        
               | secondcoming wrote:
               | Sorry, I thought you were talking about running JVM
               | binaries on different architectures
        
               | [deleted]
        
               | xxs wrote:
                | The memory models are relevant as the
                | translation/JIT/whatever has to take them into
                | consideration. TSO is well known, and it's also well
                | known how ARM's weaker memory model needs memory
                | barriers to emulate TSO.
               | 
                | If there is a JIT I'd expect to be able to add read
                | barriers on memory locations allocated by another thread
                | - incl. allocating bits in the pointers and masking them
               | off on each dereference. If any block appears to be
               | shared - the code that allocated it would need to be
               | recompiled with memory store-store barriers; the reading
               | part would need load-load and so on. There are quite a
               | few ways to deal with the case, aside the obvious - make
               | the hardware compatible.
               | 
                | Even if, in the end, it's not an easy feat to make up
                | for the stronger memory model, correctness should
                | still be a prime goal of an 'emulator'.
        
               | stonemetal12 wrote:
               | It is relevant as an example of how to write a JIT for a
               | specific memory model that will run on a different
               | architecture with a different memory model. AKA it is a
               | known issue that has been successfully dealt with for
               | quite a while now.
        
       ___________________________________________________________________
       (page generated 2021-10-26 23:02 UTC)