Subj : Re: Memory Barriers, Compiler Optimizations, etc.
To   : comp.programming.threads
From : Joseph Seigh
Date : Thu Feb 03 2005 01:47 pm

On Thu, 3 Feb 2005 09:01:24 -0800, Scott Meyers wrote:

> On Wed, 02 Feb 2005 07:43:15 -0500, Joseph Seigh wrote:
>> On Tue, 1 Feb 2005 20:38:08 -0800, Scott Meyers wrote:
>>> Assuming that x, y, a, and b are all distinct locations, is it
>>> reasonable to assume that no compiler will move the assignment to
>>> y above the barrier, or is it necessary to declare x and y
>>> volatile to prevent such code motion?
>>
>> Hypothetically, yes.  Volatile wouldn't help, as it has no meaning
>> for threads.  If the variables are known only to the local scope,
>> i.e. they're not external and haven't had their address taken, then
>> the compiler can move them wherever it wants, since no other thread
>> can see them.
>
> My concern wrt volatile was that treatments of memory issues refer
> to "program order" as if it's the same as "source code order," but
> with compilers moving stuff around prior to code generation, "source
> code order" may be quite different from "program order."  At least
> in C++, if I want to ensure that the relative order of these reads
> is preserved,
>
>     x = a;   // I want x to be read before y
>     y = b;
>
> declaring x and y volatile will do it.  Compilers can still move the
> reads around wrt reads and writes of non-volatile data, but to
> remain compliant with the C++ standard, x must be read before y in
> the generated code, i.e., in program order.

I guess.  I'm not that familiar with volatile, since it's not that
useful in threading.  If the end of every full expression is a
sequence point, then that should make the end of every statement a
sequence point as well.

> However, if compilers recognize and respect the semantics of
> membars, the need for volatile goes away, because I can just stick a
> membar between the reads (which I need anyway), and the problem is
> solved.

AFAIK they don't, so we have to use the ad hoc solutions that we use
now (there's a sketch of one near the end of this post).

> Incidentally, I understand how compiler intrinsics like Microsoft's
> _ReadWriteBarrier are recognized by compilers, but from what I've
> read in this group, there seems to be the assumption that calling an
> externally defined function containing assembler will prevent code
> motion across calls to the function, because compilers must
> pessimistically assume that calls to the function affect all memory
> locations.  With increasingly aggressive cross-module inlining
> technology available, this seems like a bet that gets worse and
> worse with time.  It's not hard to imagine a build system that can
> see that a called function doesn't affect the value of a global
> variable and thus move a read or write of that variable across the
> call.  Is there a reason this can't happen, or are we just lucky
> that our tools are, for the time being, both conservative and kind
> of dumb?

The latter.  We're just lucky for now.  There seems to be extreme
antipathy towards threading issues, in the C community at least.  Ask
any thread-specific question in the C newsgroups and you get a "C has
nothing to do with threads" response.  There's less of that in the
C++ newsgroups now, since Herb Sutter, Andrei Alexandrescu, and
yourself maybe, have picked up on and started promoting threading.

For example, I never got any authoritative response as to why Linux
assumes int loads and stores are atomic on ia32.  Apparently it's
either some undocumented communication somewhere or, more likely,
someone is just assuming that since gcc does atomic loads and stores
of int in every case they've observed, it must do so in all cases.

It's sort of the same for separately compiled external functions.
You assume that the compiler has to drop optimization for any
variable that has had its address taken by, or passed to, an external
routine, or that has the extern attribute.  That could break at some
point, and we'd have to start writing all the synchronization
functions as external assembler routines.  That would make memory
barriers even more expensive than they already are.

It's not just C and C++ you have to worry about.  Hardware architects
have even less of a clue about multi-threading than compiler writers.
Their sophistication ends at using a test-and-set to implement a
lock.  They have no notion of how people are actually doing
concurrent programming.

With the use of RCU (Read Copy Update) in the Linux kernel, the
kernel developers have adopted dependent load memory barriers to
avoid the more expensive load fence memory barriers.  The dependent
load memory barriers aren't part of any architected memory model, so
hardware architects definitely are not aware that they're being used.
It's a distinct possibility that some hardware vendor will break it,
much to their detriment in the marketplace.  There's a pseudo-op in
Linux for this so they can put in a real memory barrier if needed.
Currently, Alpha processors don't support dependent load memory
ordering.  There was a discussion of this on the Linux kernel mailing
list back during the implementation of RCU in Linux, but there's no
explicit documentation that will carry forward.
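Here's roughly what the dependent load pattern looks like, in the
style of the kernel's RCU publish/read idiom.  This is a sketch, not
actual kernel code: the struct and function names are mine, and the
barrier definitions are shown as they might expand for gcc on ia32
(the kernel provides them per architecture).

    /* Illustrative expansions for gcc on ia32.  ia32 doesn't reorder
     * stores and honors load dependencies, so a compiler barrier (or
     * nothing at all) suffices here; other architectures need real
     * instructions. */
    #define smp_wmb()                  __asm__ __volatile__("" ::: "memory")
    #define smp_read_barrier_depends() do { } while (0)

    struct foo {
        int a;
    };

    struct foo *global_ptr;

    /* Writer: initialize the object, then publish the pointer.  The
     * write barrier keeps the initializing store from sinking below
     * the publishing store. */
    void publish(struct foo *p)
    {
        p->a = 42;
        smp_wmb();
        global_ptr = p;
    }

    /* Reader: the second load depends on the value of the first, so
     * on most processors no fence is needed between them.  The
     * pseudo-op expands to a real read barrier only on Alpha, which
     * doesn't honor dependent load ordering. */
    int reader(void)
    {
        struct foo *p = global_ptr;
        smp_read_barrier_depends();
        return p ? p->a : -1;
    }

The whole point of the pseudo-op is that it costs nothing on the
processors that get dependency ordering for free, which is all of
them except Alpha, as far as anyone knows.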
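And to make the ad hoc solutions I mentioned above concrete: for
Scott's two reads, with gcc on ia32, it's something like this.  The
macro names are mine; only the asm idioms matter.

    /* Compiler-only barrier: the "memory" clobber tells gcc it may
     * not cache memory values across the statement or move loads and
     * stores past it.  It emits no instructions, so it constrains
     * only the compiler, not the processor. */
    #define compiler_barrier() __asm__ __volatile__("" ::: "memory")

    /* Full hardware barrier on ia32: a locked read-modify-write of
     * the top of the stack (mfence on newer processors would also
     * do).  The clobber makes it a compiler barrier as well. */
    #define memory_barrier() \
        __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory")

    extern int a, b;    /* shared with other threads */
    int x, y;

    void ordered_reads(void)
    {
        x = a;
        compiler_barrier(); /* gcc can't move the read of b above
                               this; ia32 won't reorder the two loads
                               itself, so this is enough here.  Where
                               the processor could reorder, you'd use
                               memory_barrier() instead. */
        y = b;
    }

Note that nothing in the language standards blesses any of this; it
works because gcc documents the "memory" clobber and because we know
what the processor does.  That's exactly why it's ad hoc.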
-- 
Joe Seigh