Subj : Re: Memory visibility and MS Interlocked instructions
To   : comp.programming.threads
From : Seongbae Park
Date : Thu Sep 01 2005 07:09 pm

Peter Dimov <pdimov@gmail.com> wrote:
> Sean Kelly wrote:
>> Alexander Terekhov wrote:
>> > Joe Seigh wrote:
>> > [...]
>> > > I'm finding it impossible to argue with a moving target.  If I subtract
>> > > everything you say out, it pretty much sounds like ia32 loads are not
>> > > always guaranteed to be "in-order".
>> >
>> > They are always "in-order" with respect to other loads and subsequent
>> > stores. What you can't grok is that ia32 *stores* (being PC stores;
>> > i.e. RCpc release stores) are not constrained to ensure "remote write
>> > atomicity" (in IA64 formal memory model speak).
>>
>> Okay now you've got me confused because it sounds like you're arguing
>> Joe's case.  If IA-32 stores do not "become remotely visible to all
>> processors in the same order" then the assertion that all stores have
>> release semantics is only true at a processor level, which would imply
>> that a membar is required to globally order writes.  Thus, msync.acq
>> and msync.rel operations (to use atomic<> semantics) would both need
>> the LOCK prefix.  Is this correct?
> 
> ld.acq and st.rel do not guarantee total store ordering. If you have
> 
> X = 0, Y = 0
> 
> CPU1:
> 
> st.rel X 1
> ld.acq Y
> 
> CPU2:
> 
> st.rel Y 1
> ld.acq X
> 
> It is possible for CPU1 and CPU2 to both load 0. This can't happen in a
> TSO model (*), because one of the two stores must execute first.
> --
> (*) I don't know for sure whether SPARC-TSO is really TSO, though.

You'd better define what TSO is then.
I'm not aware of any memory model called TSO other than what's defined in SPARC.
There's no such thing as "release" and "acquire" in SPARC's TSO.

Consider the following example under TSO (essentially yours,
with rel/acq stripped, since they don't exist in TSO):

Initially X=Y=0
P1: st 1,X; ld Y,reg1
P2: st 1,Y; ld X,reg2

Under TSO, reg1==0 AND reg2==0 is still possible,
because each load can be performed before the preceding store
(to a different address) becomes globally visible.
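
To make that concrete, here is a minimal sketch that enumerates every
interleaving of the two-processor example on a toy machine. It is not
SPARC's formal definition of TSO. It assumes the usual informal picture:
one FIFO store buffer per processor, loads that check the local buffer
first, and a single shared memory. The names litmus_outcomes and
PROGRAMS are made up for this illustration.

```python
def litmus_outcomes(programs, buffered):
    """Enumerate every reachable final register tuple.

    programs: one instruction list per processor, each instruction being
        ("st", addr, value) or ("ld", addr, reg).
    buffered: True gives each processor a FIFO store buffer (TSO-like toy
        model); False sends stores straight to memory (sequentially
        consistent toy model).
    """
    n = len(programs)

    def put(tup, i, value):
        # Copy of tuple `tup` with element i replaced.
        return tup[:i] + (value,) + tup[i + 1:]

    def assign(items, key, value):
        # Copy of a sorted (key, value) tuple with key set to value.
        d = dict(items)
        d[key] = value
        return tuple(sorted(d.items()))

    # State: program counters, store buffers, memory, registers.
    start = ((0,) * n, ((),) * n, (), ())
    seen, stack, results = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        finished = True
        for p in range(n):
            if bufs[p]:
                # Drain the oldest buffered store of p to memory.
                finished = False
                addr, val = bufs[p][0]
                stack.append((pcs, put(bufs, p, bufs[p][1:]),
                              assign(mem, addr, val), regs))
            if pcs[p] < len(programs[p]):
                # Execute the next instruction of p in program order.
                finished = False
                op, addr, arg = programs[p][pcs[p]]
                pcs2 = put(pcs, p, pcs[p] + 1)
                if op == "st" and buffered:
                    stack.append((pcs2, put(bufs, p, bufs[p] + ((addr, arg),)),
                                  mem, regs))
                elif op == "st":
                    stack.append((pcs2, bufs, assign(mem, addr, arg), regs))
                else:
                    # Load: newest matching entry in own buffer, else memory.
                    val = dict(mem).get(addr, 0)
                    for a, v in bufs[p]:
                        if a == addr:
                            val = v
                    stack.append((pcs2, bufs, mem, assign(regs, arg, val)))
        if finished:
            # Registers are reported in sorted name order: (reg1, reg2).
            results.add(tuple(v for _, v in regs))
    return results


# P1: st 1,X; ld Y,reg1      P2: st 1,Y; ld X,reg2
PROGRAMS = [
    [("st", "X", 1), ("ld", "Y", "reg1")],
    [("st", "Y", 1), ("ld", "X", "reg2")],
]

print(sorted(litmus_outcomes(PROGRAMS, buffered=True)))
print(sorted(litmus_outcomes(PROGRAMS, buffered=False)))
```

With buffered=True the result set includes (0, 0), since both loads can
run while both stores still sit in their buffers. With buffered=False
only (0, 1), (1, 0) and (1, 1) remain.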

But I think your general idea is correct: store atomicity
(a.k.a. global ordering of stores) is the difference between PC and TSO.
It's just that your example, as written, does not demonstrate
the store atomicity problem, because the outcome it relies on
is allowed for a different reason (loads passing earlier stores).
If you search through this thread, one of my previous postings has
an example that can distinguish between PC (hence the special accesses
of RCpc) and TSO, or in other words, one that can tell
whether the memory model guarantees store atomicity or not.
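
For readers without that posting at hand, one commonly used test of
store atomicity is "independent reads of independent writes" (IRIW).
It is not necessarily the example from the earlier posting. The sketch
below is a toy model under stated assumptions: each processor keeps its
own view of memory, a store is visible to its issuer at once, loads run
in program order, and stores from one writer reach any given processor
in issue order. It ignores coherence between several writers to one
location, which IRIW does not need. The names propagate_outcomes and
IRIW are made up for this illustration.

```python
def propagate_outcomes(programs, atomic_stores):
    """Enumerate reachable final register tuples in a toy propagation model.

    atomic_stores=True: a store becomes visible to all other processors
        in a single step (store atomicity).
    atomic_stores=False: a store reaches each other processor in its own
        step, so observers may see independent stores in different orders.
    """
    n = len(programs)

    def put(tup, i, value):
        # Copy of tuple `tup` with element i replaced.
        return tup[:i] + (value,) + tup[i + 1:]

    def assign(items, key, value):
        # Copy of a sorted (key, value) tuple with key set to value.
        d = dict(items)
        d[key] = value
        return tuple(sorted(d.items()))

    start = ((0,) * n,         # program counters
             ((),) * n,        # issued[w]: stores issued by writer w, in order
             ((0,) * n,) * n,  # sent[w][d]: how many of w's stores d has seen
             ((),) * n,        # views[d]: memory as seen by processor d
             ())               # registers
    seen, stack, results = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, issued, sent, views, regs = state
        finished = True
        for w in range(n):
            pending = [d for d in range(n)
                       if d != w and sent[w][d] < len(issued[w])]
            if pending:
                finished = False
                # Atomic: deliver to every other processor in one step.
                # Non-atomic: deliver to one destination per step.
                groups = [pending] if atomic_stores else [[d] for d in pending]
                for group in groups:
                    sent2, views2 = sent, views
                    for d in group:
                        addr, val = issued[w][sent2[w][d]]
                        views2 = put(views2, d, assign(views2[d], addr, val))
                        sent2 = put(sent2, w, put(sent2[w], d, sent2[w][d] + 1))
                    stack.append((pcs, issued, sent2, views2, regs))
            if pcs[w] < len(programs[w]):
                finished = False
                op, addr, arg = programs[w][pcs[w]]
                pcs2 = put(pcs, w, pcs[w] + 1)
                if op == "st":
                    stack.append((pcs2,
                                  put(issued, w, issued[w] + ((addr, arg),)),
                                  sent,
                                  put(views, w, assign(views[w], addr, arg)),
                                  regs))
                else:
                    val = dict(views[w]).get(addr, 0)
                    stack.append((pcs2, issued, sent, views,
                                  assign(regs, arg, val)))
        if finished:
            # Registers are reported in sorted name order: (r1, r2, r3, r4).
            results.add(tuple(v for _, v in regs))
    return results


# P1: st 1,X    P2: st 1,Y    P3: ld X,r1; ld Y,r2    P4: ld Y,r3; ld X,r4
IRIW = [
    [("st", "X", 1)],
    [("st", "Y", 1)],
    [("ld", "X", "r1"), ("ld", "Y", "r2")],
    [("ld", "Y", "r3"), ("ld", "X", "r4")],
]

print((1, 0, 1, 0) in propagate_outcomes(IRIW, atomic_stores=True))
print((1, 0, 1, 0) in propagate_outcomes(IRIW, atomic_stores=False))
```

The outcome r1==1, r2==0, r3==1, r4==0 means P3 saw the store to X
before the store to Y while P4 saw them in the opposite order. In this
toy model it is unreachable with atomic stores and reachable without.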
-- 
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
