[HN Gopher] A bug that doesn't exist on x86: Exploiting an ARM-o...
___________________________________________________________________
A bug that doesn't exist on x86: Exploiting an ARM-only race
condition
Author : stong1
Score : 279 points
Date : 2021-10-26 04:47 UTC (18 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| agalunar wrote:
| Great write-up!
|
| There may be a typo in section 3:
|
| > It will happily retire instruction 6 before instruction 5.
|
| If memory serves, although instructions can execute out-of-order,
| they retire in-order (hence the "re-order buffer").
| stong1 wrote:
| Nice catch. I fixed it. I should have said "execute" rather
| than "retire".
| colejohnson66 wrote:
| You are correct. The retire unit ensures that all micro-ops are
| retired in order.
| vitus wrote:
| I spent some time trying to figure out why the lock-free
| read/write implementation is correct under x86, assuming a
| multiprocessor environment.
|
| My read of the situation was that there's already potential for a
| double-read / double-write between when the spinlock returns and
| when the head/tail index is updated.
|
| Turns out that I was missing something: there's only one producer
| thread, and only one consumer thread. If there were multiple of
| either, then this code would be more fundamentally broken.
|
| That said: IMO the use of `new` in modern C++ (as is the case in
| the writer queue) is often a code smell, especially when
| std::make_unique would work just as well. Using a unique_ptr
| would obviate the first concern [0] about the copy constructor
| not being deleted.
|
| (If we used unique_ptr consistently here, we might fix the scary
| platform-dependent leak in exchange for a likely segfault
| following a nullptr dereference.)
|
| One other comment: the explanation in [1] is slightly incorrect:
|
| > we receive back Result* pointers from the results queue rq,
| then wrap them in a std::unique_ptr and jam them into a vector.
|
| We actually receive unique_ptrs from the results queue, then
| because, um, reasons (probably that we forgot that we made this a
| unique_ptr), we're wrapping them in another unique_ptr, which
| works because we're passing a temporary (well, prvalue in C++17)
| to unique_ptr's constructor -- while that looks like it might
| invoke the deleted copy-constructor, it's actually an instance of
| guaranteed copy elision. Also a bit weird to see, but not an
| issue of correctness.
|
| [0] https://github.com/stong/how-to-exploit-a-double-
| free#0-inte...
|
| [1] https://github.com/stong/how-to-exploit-a-double-
| free#2-rece...
| aydwi wrote:
| > IMO the use of `new` in modern C++ (as is the case in the
| writer queue) is often a code smell
|
| As a naive practitioner of modern C++, I'd love it if you could
| elaborate on this.
| ddlutz wrote:
| It's widely accepted that you should almost always use a smart
| pointer like unique_ptr or shared_ptr, and sometimes even the
| unpopular weak_ptr, depending on your use case.
| usefulcat wrote:
| Whenever you use 'new', you have to decide what is going to
| 'own' the newly allocated thing. You'll also have to remember
| to call 'delete' somewhere.
|
| Using unique_ptr/make_unique() or shared_ptr/make_shared()
| automates lifetime management (obviates the need for a manual
| 'delete') and makes the ownership policy explicit. They also
| have appropriately defined copying behavior. For example:
| struct Foo { /* lots of stuff here ... */ };
| struct A {
|     Foo* f = new Foo;
|     ~A() { delete f; }
| };
| struct B {
|     std::unique_ptr<Foo> f = std::make_unique<Foo>();
|     // no need to define a dtor; the default dtor is fine
| };
|
| For the destructor and the default constructor, compilers
| will generate basically identical code for both A and B
| above. If you try to copy a B, the compiler won't let you
| because unique_ptr isn't copyable. However it won't stop you
| from copying an A, even though as written (using the default
| copy ctor) that's almost certainly a mistake and will likely
| result in a double free in ~A().
| PaulDavisThe1st wrote:
| > Turns out that I was missing something:
|
| Indeed. It's not safe under x86 either.
| stong1 wrote:
| Great points. I made some minor edits to address that and
| clarify some things. Thanks!
| 0xfaded wrote:
| My first gen threadripper occasionally deadlocks in futex code
| within libgomp (gnu implementation of omp). Eventually I gave up
| and concluded it was either a hardware bug or a bug that
| incorrectly relies on atomic behaviour of intel CPUs. I
| eventually switched to using clang with its own omp
| implementation and the problem magically disappeared.
| amelius wrote:
| Does the race condition exist when emulating x86 on Apple M1?
| mmwelt wrote:
| Apple added hardware support for x86 memory semantics.
|
| https://news.ycombinator.com/item?id=28731534
|
| https://mobile.twitter.com/ErrataRob/status/1331735383193903...
| saagarjha wrote:
| No. Rosetta emulates TSO correctly.
| addaon wrote:
| To draw together the two answers here to the original
| question.
|
| 1) Emulating an ISA includes emulating its memory model. As
| saagarjha says, this means that Rosetta 2 must (and does)
| correctly implement total store ordering.
|
| 2) There are various ways to implement this. For emulators
| that include a binary translation layer (that is, that
| translate x86 opcodes into a sequence of ARM opcodes), one
| route is to generate the appropriate ARM memory barriers as
| part of the translation. Even with optimization to reduce the
| number of necessary barriers, though, this is expensive.
| Instead, as mmwelt mentions, Apple took an unusual route
| here. The Apple Silicon MMU can be configured on a per-page
| basis to use either the relaxed ARM memory model or the TSO
| x86 memory model. There is a performance cost at the hardware
| level for using TSO, and there is a cost in silicon area for
| supporting both; but from the point of view of Rosetta 2, all
| it has to do is mark x86-accessed pages as TSO and the
| hardware takes care of the details, no software memory
| barriers needed.
| saagarjha wrote:
| TSO is a per-thread flag; it's implemented as a write to an
| MSR on context switch to turn it on or off.
| cookiewill wrote:
| Is it normal for the .got.plt section to be writable rather than
| read-only?
| anyfoo wrote:
| Heh, 10 years ago I gave a presentation about how easy folks used
| to x86 can trip up when dealing with ARM's weaker memory model.
| My demonstration then was with a naive implementation of
| Peterson's algorithm.[1]
|
| I have a feeling that we will see a sharp rise of stories like
| this, now that ARM finds itself in more places which were
| previously mostly occupied by x86, and all the subtle race
| conditions that x86's memory model forgave actually start
| failing, in equally subtle ways.
|
| [1] The conclusion for this particular audience was: Don't try to
| avoid synchronization primitives, or even invent your own. They
| were not system level nor high perf code programmers, so they had
| that luxury.
| gpderetta wrote:
| But Peterson's algorithm requires explicit memory barriers even
| on x86, it doesn't seem the best example to show the
| difference.
| anyfoo wrote:
| Here are my slides from back then:
| https://reinference.net/mp-talk.pdf
|
| You made me wonder, because I definitely remember using
| Peterson's Algorithm, so I went back to my slides and turns
| out: I first showed the problem with x86, then indeed added
| an MFENCE at the right place, and then showed how that was
| not enough for ARM. So the point back then was to show how
| weaker memory models can bite you with the example of x86,
| and then to show how it can still bite you on ARM with its
| even weaker model (ARMv7 at that time, and C11 atomics aren't
| mentioned yet either, but their old OS-specific support is).
| gpderetta wrote:
| Oh, right, yes, ARM additionally needs a release barrier on
| the unlock path.
| stjohnswarts wrote:
| I doubt that. Counting processors in operation rather than
| historically, ARM already far outnumbers x86, and while these
| stories will become more common, we certainly won't see a
| "sharp increase".
| codeflo wrote:
| This sort of bug only happens when running a multithreaded
| program (with shared memory) on a multicore processor.
|
| You do need both for the problem to happen: Without shared
| memory, there's nothing to exploit. And with a single core
| only, you get time-sliced multithreading, which orders all
| operations.
|
| My point is, that combination was a lot rarer in ARM land
| before people started doing serious server or desktop
| computing with those chips.
| retrac wrote:
| Of course. Any such flaws in the Linux kernel or any library
| used by Android should have been found by now, for example.
| But the number of ARM processors running
| developer/server/desktop stacks has been tiny until recently.
| In my experience, quite a lot of Linux on desktop software
| fails to even build on non x86_64 machines.
| stjohnswarts wrote:
| Are you kidding? Arm computers are by far the most common
| over the past 10 years. Computers are everywhere and
| servers and home computers account for at most 10% of the
| market for cpus and microcontrollers.
| masklinn wrote:
| The vast majority of those are not running multitheaded
| workloads written by complete randos.
| SavantIdiot wrote:
| The dominant Arm core in the world is a Cortex-M (or
| Cortex-R), which is single-core. 99% of the time it sits on a
| die with less than 512K of SRAM, and runs an RTOS or bare
| metal.
|
| These outnumber x86+Cortex-A by probably a factor of 1,000.
| rsynnott wrote:
| There are almost certainly more multi-core ARM chips than
| x86 chips around, too, tho.
| SavantIdiot wrote:
| That's a good question. I guess we would need to compare
| all the arm multi-core in smartphones & some laptops, vs
| all the intel laptops/desktops/servers. Hmmm... tough
| one.
| mwcampbell wrote:
| > Don't try to avoid synchronization primitives, or even invent
| your own.
|
| Makes me wonder if it's really a good idea in most cases to
| use, for example, the Rust parking_lot crate, which
| reimplements mutexes and RW locks. Besides the speed boost for
| uncontended RW locks, particularly on x64 Linux, what I really
| like about parking_lot is that a write lock can be downgraded
| to a read lock without letting go of the lock. But maybe I'd be
| better off sticking with the tried-and-true OS-provided lock
| implementations and finding another way to work around the lack
| of a downgrade option.
| cesarb wrote:
| Unless you're the maintainer of the parking_lot crate, you're
| not "inventing your own". And since parking_lot is AFAIK the
| second most popular implementation of mutexes and RW locks in
| Rust (the most popular one being obviously the one in the
| Rust standard library, which wraps the OS-provided lock
| implementations), you can assume it's well tested.
| codeflo wrote:
| Everything about which people tell you to "not invent your
| own" must be invented by _someone_.
| twodai wrote:
| I guess it's more about priorities, if you want to build
| a web app don't reinvent the underlying protocols. If you
| want to build an embedded system for monitoring room
| temperature don't invent your own locks.
|
| But if you want to make a lock by all means make a lock,
| just don't go and reinvent the chip architecture...
| JulianMorrison wrote:
| It's not "don't invent it".
|
| It's "be competent before you invent it, because it's
| hard". And if you aren't, then let someone who is do the
| inventing.
| convolvatron wrote:
| the best way to _get_ competent is to try
|
| so yes - invent your own synchronization primitives.
| please.
|
just don't believe they are correct without being serious
about trying to prove they are. and don't hold up your
whole project for self-enrichment.
|
| but try to layer as much in as you can.
|
developers these days are so productive, until they fall
down and can't get up. and then they are completely
useless.
| api wrote:
| In that case you should never run any software. Go live
| as a hunter-gatherer in the wilderness.
| oriki wrote:
| I mean, that's too literal of an interpretation.
|
| It's more like "if you don't need to, don't invent your
| own [x]." People who like to invent [x] are usually smart
| enough to understand why that warning is there to begin
| with, and don't tend to argue with it.
| pornel wrote:
| You're taking a saying too literally.
|
People who take on the task of writing such a library, and
develop it to the point that it's a language-wide standard,
usually know what they're doing (or learn on the job :))

The popularity of such a library helps test it thoroughly on
many platforms in various conditions, so there's a high
chance the bugs will be spotted and fixed.
| half-kh-hacker wrote:
| this slaps. I always see perfect blue a few places above us!
| nyanpasu64 wrote:
| Context for downvoters: "perfect blue" is the CTF group writing
| this article, and "a few places" means CTF team rankings in
| competitions.
| beebmam wrote:
| Like quantum physics, memory ordering is deeply unintuitive (on
| platforms like ARM). Unlike quantum physics, which is an
| unfortunate immutable fact of the universe, we got ourselves into
| this mess and we have no one to blame but ourselves for it.
|
| I'm only somewhat joking. People need to understand these memory
| models if they intend on writing atomic operations in their
| software, even if they aren't currently targeting ARM platforms.
| In this era, it's absurdly easy to change an LLVM compiler to
| target aarch64, and it will happen for plenty of software that
| was written without ever considering the differences in atomic
| behavior on this platform.
| pclmulqdq wrote:
| One piece of friction that hurts here is that the C++/Rust and
| ARM memory models aren't the same, and the consequences of this
| are unintuitive - compilers and CPUs can both screw with
| execution ordering.
|
| People who write in C++ should technically _only_ be concerned
| with the C++ memory model, but x86 has let them be very lax and
| undisciplined with std::memory_order_relaxed. ARM has some
| alluring constructs that don't quite fit with the C++ model,
| which can tempt you to mix and match memory models for
| performance. All of this means trouble with atomics.
| [deleted]
| newpavlov wrote:
| Memory ordering gets somewhat easier once you understand that
| flat memory shared by execution units is a leaky abstraction,
| desperately patched over decades by layers and layers of
| hardware and software. Memory ordering is one way to represent
| message passing and synchronization between different cores and
| RAM. This is why I think "lock-free algorithms" is a misnomer:
| you still have synchronization, you simply rely on the hardware
| for it.
| uuidgen wrote:
| > "lock-free algorithms" is a misnomer, you still have
| synchronization,
|
Lock-free doesn't mean that there is no synchronization; it
is a way of synchronizing memory access between threads from
the start. It means that there is no additional locking to
protect access to the shared resource - all read access is
valid, and of any number of simultaneous write accesses at
least one succeeds (which is not true for some other
schemes, like network exponential backoff).
|
| Even on x86 the most common instruction you use is LOCK
| cmpxchg.
| gpderetta wrote:
| That's actually a common misconception. Memory ordering, on
| the majority of common cpus, has nothing to do with
| interprocessor communication or processor-ram communication.
| Common memory coherency protocols (I.e. MESI and derivatives)
| guarantee that all caches have a consistent view of memory.
|
Usually memory reordering is purely an artifact of the way
CPUs access their private L1 cache.
| yvdriess wrote:
| For the record, this is false. It is conflating memory
| coherency with consistency.
|
| Nearly everything in a modern processor is a source of
| reordering, from branch prediction to basically everything
| in the OoO backend. Any time you leave the core, there's
| reordering happening in the network. And yes, that includes
| caches, which involve a heavy amount of inter-core
| communication. When you issue two successive loads to
| different cache lines, which one is going to return first?
|
| The OoO backend itself manages hazards and ensures that
| ld/st instructions are retired in the correct order to
| maintain the processor's memory consistency model. Software
| can build on top of that, e.g. with fences, to impose
| stricter consistency models.
| gpderetta wrote:
| Sorry, what is false? You seem to be agreeing with me?
|
| edit: to clarify, I claim that the coherency layer (i.e
| the L1 cache and beyond), does not introduce any
| reordering issues, at least for common cpus using
| variants of MESI.
| dooglius wrote:
| Hardware synchronization has better properties than software
| locks: it can't deadlock, is reentrant, won't get screwed up
| by a process holding a lock dying, and is (supposedly)
| guaranteed to complete in bounded time. I don't think it's
| unreasonable that the definition of lock-free ("guaranteed
| system-wide progress") focuses on the high-level behavior of
| typical software locks even if it ends up calling things that
| are still locks in some sense "lock-free".
| hnfong wrote:
| That's the benefit you get from having code that defaults
| to a race condition instead of defaulting to deadlock.
|
| (Which is better? I don't know.)
| rendaw wrote:
| Is there anything out there that exposes a better or tighter
| abstraction? Something not flat?
| yvdriess wrote:
A dataflow architecture ISA would. It's been tried before,
but working out the entire software stack from scratch is a
moonshot.
| formerly_proven wrote:
| High-performance processors are data flow processors,
| which infer the data flow graph from the instruction
| stream using Tomasulo's algorithm.
| devit wrote:
| There are systems like the PS3 SPEs and DSPs where you only
| have normal access to local on-chip memory and have to
| explicitly initiate DMA to access external memory.
|
| But that's just bad for running general purpose software
| that can require more memory than available locally since
| it means you have to do memory cache management in software
| which is going to be much slower than letting the hardware
| do it.
| formerly_proven wrote:
In practice you _want_ memory reordering to be a thing,
because that's what allows reordering of instructions that
touch memory (both at compile time, and at runtime by the
processor), which enables a large part of the latency hiding
that's going on.
| drcongo wrote:
| Nice try Intel.
| pcwalton wrote:
| Lock-free programming is really tough. There are really only a
| few patterns that work (e.g. Treiber stack). Trying to invent a
| new lock-free algorithm, as this vulnerable code demonstrates,
| almost always ends in tears.
| platinumrad wrote:
| There's no new invention in here. Just an (intentional) misuse
| of "volatile".
| nyanpasu64 wrote:
| IMO lock-free MP or MC algorithms are harder to get right than
| SPSC structures (atomics for shared memory, queues for
| messaging, triple buffers for tear-free shared memory). But
| even SPSC algorithms can be tricky; I've found the same
| (theoretical) ordering error in _three_ separate Rust
| implementations of triple buffering (one of them mine), written
by people who've already learned the ordering rules (which I
| caught with Loom). And initially learning to reason about
| memory ordering is a major upfront challenge too.
| ohazi wrote:
| I particularly like lock-free (wait-free?) SPSC queues
| because they're (relatively) easy to get right, and are
| _extremely_ useful for buffering in embedded systems. I end
| up with something like this on almost every project:
|
| One side of the queue is a peripheral like a serial port that
| needs to be fed/drained like clockwork to avoid losing data
| or glitching (e.g. via interrupts or DMA), and the other side
| is usually software running on the main thread, that wants to
| be able to work at its own pace and also go to sleep
| sometimes.
|
| An SPSC queue fits this use-case nicely. James Munns has a
| fancy one written in Rust [1], and I have a ~100 line C
| template [2].
|
| [1] https://github.com/jamesmunns/bbqueue
|
| [2] https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638
| d70...
| reitzensteinm wrote:
| I'd be interested in knowing the details of the error!
| nyanpasu64 wrote:
| https://github.com/HadrienG2/triple-buffer/issues/14
| xxs wrote:
There are tons of lock-free algorithms, both node-based and
array-backed. Lock-free is notoriously easier on garbage-
collected setups, of course.
| PaulDavisThe1st wrote:
This isn't a new lock-free algorithm. Single-reader, single-
writer FIFOs are one of the oldest approaches around.

They have to be tweaked (by using memory barriers) when
execution order isn't guaranteed. TFA is about an exploit
based on code that hasn't added the required memory barriers.
| sydthrowaway wrote:
| Any good references on low level details on ARMv8+?
| gpderetta wrote:
The best part is that the original code is not safe even on x86,
as the compiler can still reorder non-volatile accesses to the
backing_buf around the volatile accesses to head and tail.
| Compiler barriers before the volatile stores and after volatile
| reads are required [1]. It would still be very questionable code,
| but it would at least have a chance to work on its intended
| target.
|
| tl;dr: just use std::atomic.
|
| [1] it is of course possible they are actually present in the
| original code and just omitted from the explanation for brevity
| reitzensteinm wrote:
| For those interested in memory ordering, I have a few posts on my
| blog where I build a simulator capable of understanding
| reorderings and analyze examples with it:
|
| https://www.reitzen.com/post/temporal-fuzzing-01/
| https://www.reitzen.com/post/temporal-fuzzing-02/
|
Next up are some lock-free queues, although I haven't gotten
around to publishing them!
| secondcoming wrote:
| There is a proposal (possibly accepted) to deprecate 'volatile'
| in C++.
|
| http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...
| tialaramex wrote:
Yes, P1152 was adopted for C++20.

The purpose of abolishing volatile isn't so much to reinforce
that it's not intended for this sort of threading nonsense
(indeed, on Windows MSVC's guarantees mean it almost _is_
intended for this sort of nonsense) but to make it explicit
that "volatile variables" were never really a thing anyway by
abolishing the volatile qualifier on variables.
|
| The thing your hardware can actually do is almost exactly:
| https://doc.rust-lang.org/core/ptr/fn.read_volatile.html and
| https://doc.rust-lang.org/core/ptr/fn.write_volatile.html
|
| And sure enough that's equivalent to what is proposed for C++
| although not in just this one paper.
|
With "volatile variables" you can use compound assignment
operators on the variable. What does that even mean? Nothing.
It means nothing, it's gibberish, but you can do it and people
do. They presumably _thought_ it meant something, and since it
doesn't they were wrong. So, deprecate this and maybe they'll
go read up on the subject.
|
| You can also in C++ declare things that clearly aren't in the
| least bit volatile, as volatile anyway. C++ has volatile member
| variables, volatile member functions, volatile parameters...
| Any code that seems to rely on this probably doesn't do what
| the people who wrote it thought it does, run away.
| mhh__ wrote:
Volatile primitives are how D does volatile as well.
|
| I do sort of miss having a basic volatile (although I can
| write my own type somewhat effectively) just for
| benchmarking's sake sometimes.
| agent327 wrote:
| >With "volatile variables" you can use compound assignment
| operators on the variable. What does that even mean? Nothing.
|
It means exactly the same thing as on a normal variable, and
it boggles the mind that people somehow do not understand
that. Given 'volatile int i', 'i++' means the exact same thing
as 'i = i + 1'. Does that also not make any sense to you? If
it does, can you explain why you believe they are different?
|
| Volatile member functions and parameters make no sense, but
| volatile member variables most certainly do. And there is
| considerable pushback in the C++ community because this is a
| significant loss of compatibility with various C-headers used
| frequently in embedded applications. I wouldn't be surprised
| if the deprecated features will be reinstated in the language
| in the end.
| gpderetta wrote:
Yes, it will likely be reverted [1] for bitwise ops, while
keeping += and ++ deprecated (to save face, someone would
say).
|
| [1] http://www.open-
| std.org/jtc1/sc22/wg21/docs/papers/2021/p232...
| tialaramex wrote:
| It's true that the relevant committee is now likely to
| propose this be changed in C++ 23 (or before as an
| errata)
|
But the face-saving is on the part of embedded C++
proponents who'd rather _never fix the C++ language_ than
risk that the mundane C code cheap vendors actually ship
can't be counted as "C++ support" because it's no longer
valid C++.
|
| That's the core of the argument in P2327. Not that this
| isn't a bad idea (it's obviously a bad idea) or that we
| shouldn't do better (alternatives to C++ already do
| better) but that this is the status quo and C++ can't
| expect to improve upon that. A sad state of affairs.
|
P2327's dismissal of the problem caused by UART1->UCSR0B
|= (1<<UCSZ01); comes down to the usual Real Programmer
stance: surely every single embedded programmer will
arrange "by construction" that the problem can't happen.
No actual examples were examined to see if embedded
developers reliably do that, which seems unlikely - we're
just invited to imagine that they're all inerrant.
| vlovich123 wrote:
| Not sure what will actually happen, but you could easily
| allow volatile variables with if they have extern "C"
| linkage for compatibility with C headers, while deprecating
| it elsewhere.
| agent327 wrote:
| That's not really what extern "C" does though. It doesn't
| change the language rules or the parsing or anything, it
| only changes how symbols are represented in the object
| file. Extern "C" means they are mangled using C rules
| (that mostly involves adding an underscore); otherwise
| it's gonna be C++ rules (and loads more information needs
| to be encoded).
|
| Some people argue for a similar mechanism, something like
| 'language "C++20" { .. }', that would allow a program to
| opt in to changes that would otherwise be breaking
| changes; mostly new keywords. However, changing actual
| language rules in such a block would be tricky, to say
| the least.
| tialaramex wrote:
| > It means exactly the same thing as on a normal variable,
| and it boggles the mind that people somehow not understand
| that.
|
If you just want normal variables, write a normal variable;
J.F. Bastien's change doesn't have any consequences for
your normal variables.
|
| > Given 'volatile int i', 'i++' means the exact same thing
| as 'i = i + 1'
|
| No. That's the post-increment operator so i++ has the same
| value as i did before the increment, whereas i = i + 1 has
| the value of i _after_ the increment.
|
| And you might say "Oh, but I never use those values". Good.
| J.F. Bastien's change forbids that for volatiles too, we're
| on the same page.
|
| What you apparently want is to do two volatile operations,
| one read and one write, sandwiching an arbitrary integer
| operation. You can do that explicitly of course, Bastien's
| change doesn't alter that - but when you write it out,
| doubt appears on the horizon. If these are two operations,
| can't I race with somebody else such that they modify this
| thing before I write my modified version back and then I
| overwrite their changes?
|
And that doubt should be there. But what we know from
asking people is that the compound operators _falsely_
reassure them. That's why they were deprecated, and why
it's sad that there's enthusiasm to bring them back.
| PaulDavisThe1st wrote:
Either I'm not understanding something that I thought I
understood very well, or TFA's authors don't understand
something that they think they understand very well.
|
| Their code is unsafe even on x86. You cannot write a single-
| writer, single-reader FIFO on modern processors without the use
| of memory barriers.
|
| Their attempt to use "volatile" instead of memory barriers is not
| appropriate. It could easily cause problems on x86 platforms in
| just the same way that it could on ARM. "volatile" does not mean
| what you think it means; if you're using it for anything other
| than interacting with hardware registers in a device driver,
| you're almost certainly using it incorrectly.
|
| You must use the correct memory barriers to protect the
| read/write of what they call "head" and "tail". Without them, the
| code is just wrong, no matter what the platform.
| stong1 wrote:
| Right. The challenge is written incorrectly on purpose,
| otherwise the code isn't vulnerable. The use of volatile is a
| bit of a misdirection for the CTF players, since you're right
| that it's a common misconception that volatile acts like a
| barrier.
|
| > You cannot write a single-writer, single-reader FIFO on
| modern processors without the use of memory barriers.
|
| I am not sure about this. From my understanding, on x86, given
| the absence of compiler reordering, processor reordering should
| not cause a problem for a single-reader-single-writer FIFO.
| Normally I just use atomics but I think in this specific
| instance it should still be okay anyways. Obviously it will not
| work on ARM.
|
| From my testing if you compile the code on x86 with clang or
| gcc, the resulting binary is not vulnerable.
| gpderetta wrote:
| Without compiler fences in the right place [1] GCC and clang
| can miscompile the code even on x86. Doesn't mean they will
| of course.
|
| [1] see the linux kernel implementation of load acquire and
| store release on x86 for example.
| hvdijk wrote:
| The point is that the bug is unexploitable on x86 because
| although the source code may have a bug, on x86 it gets
| compiled to machine code that does not. That's the thing with
| undefined behaviour, sometimes it does work exactly as you
| expect, which can make it so tricky to nail down.
| kloch wrote:
| > "volatile" does not mean what you think it means; if you're
| using it for anything other than interacting with hardware
| registers in a device driver, you're almost certainly using it
| incorrectly.
|
Another "correct" use of volatile is as a hack to prevent
compilers from optimizing away certain code. It's pretty rare
to need that, and often you can just use a lower optimization
level (like the usual -O2), but sometimes you need -O3 /
-Ofast or something, and a strategic volatile declaration
keeps everything working.
|
A classic example is the Kahan summation algorithm. At -O2 it's
fine. At -O3 or higher it silently defeats the algorithm while
| appearing to work (you get a sum but without the error
| compensation). Defining the working vars as volatile makes it
| work again. This is noted in the wikipedia pseudocode with the
| comment "// Algebraically, c should always be zero. Beware
| overly-aggressive optimizing compilers!"
|
| https://en.wikipedia.org/wiki/Kahan_summation_algorithm
|
| Of course -O3 might not be any faster anyway but that's another
| topic.
| vlovich123 wrote:
| I can't imagine it's an O2 vs O3 thing unless a compiler
| enables "fast-math" optimization to allow associativity.
| Neither clang nor GCC do this (neither does MSVC I think) -
| optimization levels never silently turn off IEEE754 floating
| point. I don't know about ICC but it sounds like they
| stupidly enable fast math by default to try to win at
| benchmarks.
|
| Do you have anything to actually support this statement or
| did you just assume "overly aggressive optimizing compilers"
| and "O3" are somehow linked?
|
| Generally optimization levels may find more opportunities to
| exploit UB, but they do not change the semantics of the
| language, and all languages I'm familiar with define floating
| point as a non-associative operation because it's not when
| you're working with finite precision.
|
| TLDR: Don't use volatile unless you really know what you're
| doing, and unless you know C/C++ really well, you probably do
| not. If anyone tells you to throw in a volatile to "make
| things work", it's most likely cargo culting bad advice (not
| always, but probably).
| gpderetta wrote:
| There is some amount of truth in what the parent is saying.
| Ages ago, when x86 only had x87 FP, gcc would program the
| FPU to use 80 bit precision even when dealing with doubles.
| The excess precision meant that GCC could not implement
| IEEE math correctly even without fast-math. Forcing the
| storing of intermediate values into memory via volatile
| variables was a partial solution to this problem.
|
| MSVC configures the FPU to use 64 bit precision, which means
| that doubles work fine, but it has no 80 bit long double,
| and floats still suffer from excess precision.
|
| SSE avoids all these problems.
| vlovich123 wrote:
| Kind of, but that still shouldn't have impacted Kahan
| summation, which only cares about associativity, and
| extended precision doesn't impact that. They would just
| end up getting more numerically accurate results on x87.
| kloch wrote:
| I did tests on Kahan summation recently on my macbook pro
| and -O3 defeated the algorithm while -O2 did not. Declaring
| the below variables as volatile restored error compensation
| with -O3.
|
| The relevant code is:
|
|     kahan_y = g_sample_z - kahan_c;
|     kahan_t = g_sample_z_sum + kahan_y;
|     kahan_c = (kahan_t - g_sample_z_sum) - kahan_y;
|     g_sample_z_sum = kahan_t;
|
| (this is in an inner loop where a new g_sample_z is
| calculated and then added to a running g_sample_z_sum with
| this snippet)
| vlovich123 wrote:
| Sounds like a compiler bug to me. Can you file a bug to
| clang with a reduced standalone test (or I can do it for
| you if you share the standalone test).
| [deleted]
| [deleted]
| kloch wrote:
| Here is a complete simplified Kahan summation test and
| indeed it works with -O3 but fails with -Ofast. There
| must have been something else going on in my real program
| at -O3. However the original point that 'volatile' can be
| a workaround for some optimization problems is still
| valid (you may want the rest of your program to benefit
| from -Ofast without breaking certain parts).
|
| Changing the three kahan_* variables to volatile makes
| this work (slowly) with -Ofast.
|
|     #include <stdio.h>
|
|     int main(int argc, char **argv) {
|         int i;
|         double sample, sum;
|         double kahan_y, kahan_t, kahan_c;
|         // initial values
|         sum = 0.0;
|         sample = 1.0; // start with "large" value
|         // add 1 large value plus 1 billion small values
|         for (i = 0; i <= 1000000000; i++) {
|             // Kahan summation algorithm
|             kahan_y = sample - kahan_c;
|             kahan_t = sum + kahan_y;
|             kahan_c = (kahan_t - sum) - kahan_y;
|             sum = kahan_t;
|             // pre-load next small value
|             sample = 1.0E-20;
|         }
|         printf("sum: %.15f\n", sum);
|     }
| hermitdev wrote:
| I hope this isn't the actual "real" code, because you've
| got undefined behavior before you even have to worry
| about the associativity optimizations. There's an
| uninitialized read of 'kahan_c' on the first loop
| iteration.
| vlovich123 wrote:
| Correct. `-Ofast`'s claim to fame is that it enables
| `-ffast-math`, which is why it has huge warning signs around
| it in the documentation. `-ffast-math` turns on associativity,
| which is problematic for Kahan summation. Rather than
| sprinkling in volatiles, which pessimizes the compiler to
| no end, I would recommend annotating the problematic
| function to turn off associative math [1][2].
|
| Something like:
|
|     [[gnu::optimize("no-associative-math")]]
|     double kahanSummation() {
|         ...
|     }
|
| That way the compiler applies all the optimizations it
| can but only turns off associative math. This should work
| on Clang & GCC & be net faster in all cases.
|
| This is what I mean by "If you're sprinkling volatile
| around, you probably aren't doing what you want" and are
| just cargo culting bad advice.
|
| [1] https://stackoverflow.com/questions/26266820/in-
| clang-how-do... [2]
| https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-
| Attrib...
| leeter wrote:
| Well their use is definitely UB as it creates data races.
| Godbolt to the rescue... https://godbolt.org/z/3rsK6n31z
| gpderetta wrote:
| As noted elsethread, the code is indeed wrong even on x86,
| although you only need compiler fences there.
| [deleted]
| dataangel wrote:
| I think their point is you only need compiler barriers not
| actual barrier instructions on x86. volatile in practice has
| been the de facto way to get the effect of a compiler memory
| barrier for a long time even though it's not the best way to do
| it nowadays. The original purpose of it is literally preventing
| the compiler from getting rid of loads and stores and
| reordering them which is exactly what is needed when
| implementing a lockless FIFO. As long as all the stores and
| loads (including the actual FIFO payload) are volatile, it will
| work (volatile loads and stores are guaranteed to not be
| reordered with each other). After that the x86 guarantees about
| not reordering are very strong. Really the best argument
| against volatile for this kind of thing is actually the
| opposite of your point, _volatile is too strong_. It prevents
| more reordering than you actually want. Acquire/release
| semantics are less strong and give the compiler more
| flexibility.
| scatters wrote:
| Another place it's meaningful to use `volatile` is in
| benchmarking and testing: to either ensure that a block of code
| is run despite not having any side effects, or to ensure that a
| block of code that should _not_ be run is still compiled and
| emitted to binary.
|
| But yes, `volatile` for what should be atomics is a clear code
| smell. I made quite a loud noise when I read "the code quality
| looks excellent" in the article.
| silisili wrote:
| > Nowadays, high-performance processors, like those found in
| desktops, servers, and phones, are massively out-of-order to
| exploit instruction-level parallelism as much as possible. They
| perform all sorts of tricks to improve performance.
|
| Relevant quote from Jim Keller: You run this program a hundred
| times, it never runs the same way twice. Ever.
| krylon wrote:
| Heraclitus, mumbling into his beard: "Told you so!"
|
| SCNR
| vlovich123 wrote:
| A hundred times is not that much except for really cold code
| paths. It's probably in the billions if not more and I have to
| imagine that software level effects typically swamp HW-level
| effects here. That's why you see software typically having a
| performance deviation no greater than ~5-10% unless you're
| running microbenchmarks.
| Azsy wrote:
| Have I told you about our lord and savior Rust?
|
| Anyways, https://github.com/tokio-rs/loom is used by any serious
| library doing atomic ops/synchronization and it blew me away with
| how fast it can catch most bugs like this.
| nyanpasu64 wrote:
| Rust doesn't catch memory ordering errors, which can result in
| behavioral bugs in safe Rust and data races and memory unsafety
| in unsafe Rust. But Loom is an excellent tool for catching
| ordering errors, though its UnsafeCell API differs from std's
| (and worse yet, some people report Loom returns false
| positives/negatives in some cases: https://github.com/tokio-
| rs/loom/issues/180, possibly https://github.com/tokio-
| rs/loom/issues/166).
| tialaramex wrote:
| > which can result in behavioral bugs in safe Rust
|
| For example, Rust doesn't have any way to know that your
| chosen lock-free algorithm relies on Acquire-release
| semantics to perform as intended, and so if you write safe
| Rust to implement it with Relaxed ordering, it will compile,
| and run, and on x86-64 it will even _work_ just fine because
| the cheap behaviour on x86-64 _has_ Acquire-release semantics
| anyway. But on ARM your program doesn't work because ARM
| really does have a Relaxed mode and without Acquire-release
| what you've got is not the clever lock-free algorithm you
| intended after all.
|
| However, if you don't even understand what Ordering is, and
| just try to implement the naive algorithm in Rust without
| Atomic operations that take an Ordering, Rust won't compile
| your program at all because it could race. So this way you
| are at least confronted with the fact that it's time to learn
| about Ordering if you want to implement this algorithm and if
| you pick Relaxed you can keep the resulting (safe) mess you
| made.
| CodesInChaos wrote:
| It doesn't catch all of them. But data races on plain memory
| accesses are impossible in safe Rust.
|
| And atomics force you to specify an ordering on every access,
| which helps both the writer (forced to think about which
| ordering they need) and reviewer (by communicating intent).
| Fiahil wrote:
| I think it's fixable, the main reactor is what matters. You
| can add or remove as many synchronisation primitive as you
| like.
|
| Other tooling, like Jepsen, will interact with your program
| at a higher level.
| im3w1l wrote:
| And Windows on ARM will (does it already?) run x86 binaries
| with weaker memory ordering than they were written for. So this
| could be a real thing soon.
| belter wrote:
| Now I am worried. Do you have a reference please?
| im3w1l wrote:
| Best I could find. It's not a great reference because it
| doesn't give any details but it does prove that it's a thing.
| https://docs.microsoft.com/en-us/windows/uwp/porting/apps-
| on...
| kevingadd wrote:
| Are you sure the translators don't insert code necessary to
| maintain ordering? I would be shocked if most threaded code
| works when you throw out the x86 memory model. Managed runtimes
| like .NET definitely generate code for each target designed to
| maintain the correct memory model.
| pmuderoc wrote:
| They'd better, but then, how would an automatic translator
| know that this is a "release semantics" atomic store
| operation?
|
| Because on x86 it is, no special barriers or instructions
| necessary.
|
| mov [shared_data], 1
|
| mov [release_flag], 1
| my123 wrote:
| It's pessimistic and converts over a lot of memory accesses
| to RCpc or atomics.
|
| (on ARMv8.0 where you don't have those, barriers are used
| more)
|
| TSO pessimization is the only way to make the thing work at
| a translation time cost that isn't too high.
| gpderetta wrote:
| Or you support TSO directly on your cpu like Apple does
| on M1.
| tsimionescu wrote:
| Sure, but Windows on ARM has to run on many ARM
| processors, not a specific one designed by MS. They could
| detect if the processor has non-standard TSO support and
| use that when running an x86 app, but they still have to
| do something to run the x86 app on a standard ARM
| processor.
| my123 wrote:
| Maintaining the memory model guarantees is what causes the
| steep cost in performance when using x86 apps on Windows on
| Arm.
|
| That said, heuristics are used to speed it up. I would
| recommend not sharing values on the stack between threads for
| synchronisation, for example.
| nyanpasu64 wrote:
| https://docs.microsoft.com/en-us/windows/uwp/porting/apps-
| on...
|
| > You can also select multi-core settings, as shown here...
| These settings change the number of memory barriers used to
| synchronize memory accesses between cores in apps during
| emulation. Fast is the default mode, but the strict and very
| strict options will increase the number of barriers. This
| slows down the app, but reduces the risk of app errors. The
| single-core option removes all barriers but forces all app
| threads to run on a single core.
|
| https://news.ycombinator.com/item?id=28732273
|
| zamadatix interprets this as Microsoft saying that by
| default, Windows on ARM runs x86 apps _without_ x86 TSO, and
| turns on extra memory barriers using per-app compatibility
| settings. But if an app needs TSO but isn't in Windows's
| database, it will crash or silently corrupt data.
| xxs wrote:
| Normally the code should have all the needed memory fences as
| if running on DEC Alpha, e.g. Linux does that, and the
| compilers omit the unneeded ones.
| monocasa wrote:
| And since the compiler omitted it on x86, an x86 emulator
| doesn't have access to where they're required as seen by the
| compiler.
| xxs wrote:
| An emulator would have zero issues; a direct translation of
| the assembly (not emulation) would need either hardware
| support - e.g. Apple's chips - or memory barriers.
|
| The differences between ARM and x86 have been known for 15+
| years; there is nothing new about it. Also, concurrency support
| is one of the major benefits of languages with a proper memory
| model - Java started it with the JMM [0]
|
| [0]: https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCh
| eckedL...
| monocasa wrote:
| The differences are very known, it's just still an open
| problem how to address running code for a stronger memory
| model on systems with a weaker one at as high performance as
| possible without explicit hardware support
| (like Apple's choice of a TSO config bit). Your compiled
| binary for TSO has erased any LoadLoad, LoadStore, and
| StoreStore barriers, and the emulator has to divine them.
| The heuristics there are still fraught with peril.
|
| The JVM absolutely did some great work walking this path,
| both in defining a memory model in the first place, and
| supporting that model on weak and strong hardware memory
| models, but the JMM was specifically designed to be able
| to run cleanly on WMO platforms to begin with (early
| Sparc), so they don't face a lot of the same problems
| discussed here.
| gpderetta wrote:
| Any emulator that wants to be remotely performance
| competitive will do dynamic translation (i.e JIT). In
| fact ahead-of-time translation is not really feasible.
|
| Memory models and JVM are not really relevant when
| discussing running binaries for a different architecture.
| secondcoming wrote:
| Doesn't the JVM define its own memory model?
| gpderetta wrote:
| Sure, but how's that relevant when discussing running x86
| binaries on ARM?
| xxs wrote:
| Ah well, back in the day, prior to the JMM, there was no notion
| of memory models at all, and no language had one; hence
| I referenced the original paper that started it all. The
| point was that it happened a long time ago, and there is
| nothing new about the current case.
| secondcoming wrote:
| Sorry, I thought you were talking about running JVM
| binaries on different architectures
| [deleted]
| xxs wrote:
| The memory models are relevant as the
| translation/JIT/whatever has to take them in
| consideration. TSO is well known and it's also well known
| how arm weaker memory model needs memory barriers to
| emulate TSO.
|
| If there is a JIT I'd expect it to be able to add read
| barriers on memory locations allocated by another thread
| - incl. allocating bits in the pointers and masking them
| off on each dereference. If any block appears to be
| shared, the code that allocated it would need to be
| recompiled with store-store memory barriers; the reading
| side would need load-load barriers, and so on. There are quite a
| few ways to deal with the case, aside from the obvious - make
| the hardware compatible.
|
| In the end it's not an easy feat to make up for the
| stronger memory model, yet correctness should
| be a prime goal of an 'emulator'.
| stonemetal12 wrote:
| It is relevant as an example of how to write a JIT for a
| specific memory model that will run on a different
| architecture with a different memory model. AKA it is a
| known issue that has been successfully dealt with for
| quite a while now.
___________________________________________________________________
(page generated 2021-10-26 23:02 UTC)