Subj : Re: Memory visibility and MS Interlocked instructions
To   : comp.programming.threads
From : David Hopwood
Date : Sun Aug 28 2005 08:32 pm

Alexander Terekhov wrote:
> David Hopwood wrote:
> [...]
> 
>>then what's the point of the lfence instruction? 
> 
> SSE* fences are meant to control out-of-order SSE* writes of strings 
> (sfence/mfence)

That much makes sense. For simplicity, let's exclude string writes and
anything that changes cache behaviour from the usual defaults. Let's also
focus exclusively on what is visible to programs and not what happens on
the system bus.

> and disable speculation (to cache stuff in order) of loads (lfence/mfence).

Then I'm still confused.

Stores by each processor occur in program order, that's clear. You're saying
that stores made by processor 1 can nevertheless be loaded by processor 2
out of program order. I see that there could be memory models and
implementations for which this is possible, e.g. due to speculation.
But how is it consistent with saying that all loads have acquire semantics?

Example. Start with x == y == 0.

Processor 1:
   a) x := 1
   b) y := 1

Processor 2:
   c) i := y
   d) j := x

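For a processor ordering model in which loads have acquire semantics,
{i == 1, j == 0} is not possible: if c) sees the store from b), then d),
which is ordered after c), must also see the earlier store from a). If
instead the effects of speculation can be visible and need to be inhibited
by an lfence between c) and d), then this outcome is possible whenever the
lfence is omitted. Which is it for IA-32?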

(And if anyone knows, which for AMD64 and EM64T?)
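
To make the two alternatives concrete, here is a toy enumeration in Python.
It is only a sketch of the abstract question, not a description of what any
real IA-32 implementation does: it assumes each of a) to d) takes effect
atomically at a single point in one global order, that stores always take
effect in program order, and it uses a hypothetical flag `loads_in_order`
to say whether load c) must be satisfied before load d).

```python
from itertools import permutations

def reachable_outcomes(loads_in_order):
    """Return the set of (i, j) pairs reachable in the toy model.

    Events use the labels from the example:
      a) x := 1, b) y := 1   (processor 1)
      c) i := y, d) j := x   (processor 2)
    """
    results = set()
    for order in permutations("abcd"):
        pos = {ev: k for k, ev in enumerate(order)}
        if pos["a"] > pos["b"]:
            continue  # assumption: stores take effect in program order
        if loads_in_order and pos["c"] > pos["d"]:
            continue  # assumption: c) is satisfied before d)
        mem = {"x": 0, "y": 0}  # start with x == y == 0
        regs = {}
        for ev in order:
            if ev == "a":
                mem["x"] = 1
            elif ev == "b":
                mem["y"] = 1
            elif ev == "c":
                regs["i"] = mem["y"]
            else:  # ev == "d"
                regs["j"] = mem["x"]
        results.add((regs["i"], regs["j"]))
    return results

if __name__ == "__main__":
    print("loads in order: ", sorted(reachable_outcomes(True)))
    print("loads reordered:", sorted(reachable_outcomes(False)))
```

With `loads_in_order=True` the pair (1, 0) never appears; with
`loads_in_order=False` it does, via the order d, a, b, c. The question
above is which of these two models a program on IA-32 can actually observe.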

Some Googling turned up this description of the PPro (caveat: from 1997)
by Mike Haertel of Intel:
<http://mail-index.netbsd.org/tech-kern/1997/05/06/0000.html>

# The Pentium Pro's memory ordering model is called "processor ordering"
# and is a formalization of the 486's semantics.  The 486 had
# a write-through cache with write queue to memory which was
# not snooped by loads on other processors.
#
# Loosely speaking, this means the ordering of events originating
# from any one processor in the system, as observed by other processors,
# is always the same.  However, different observers are allowed
# to disagree on the interleaving of events from two or more processors.
#
# The PPro does speculative and out-of-order loads.  However,
# it has a mechanism called the "memory order buffer" to ensure
# that the above memory ordering model is not violated.  Load
# and store instructions do not get retired until the processor
# can prove there are no memory ordering violations in the actual
# order of execution that was used.  Stores do not get sent to
# memory until they are ready to be retired.  If the processor
# detects a memory ordering violation, it discards all unretired
# operations (including the offending memory operation) and
# restarts execution at the oldest unretired instruction.
#
# i.e. when a violation is detected the MOB whacks the machine ... :-)

Which is all fine, but if speculative loads have no effect on the memory
model, I still have no idea what the point of lfence is. Unless it is only
needed when the memory ordering is weakened using MTRRs etc.?

> SSE* stuff and ordering observable on "system 
> bus" aside for a moment, the x86 memory model (processor consistency) 
> didn't change since 486. See also Intel Itanium Architecture Software 
> Developer's Manual 6.3.4: "IA-32 instructions are mapped into the 
> Itanium memory ordering model as follows...".

OK, but I'm skeptical about relying on that, because it documents an
implementation of IA-32 that *could* have stronger ordering guarantees
than IA-32 itself.

-- 
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>
