Subj : Re: Memory visibility and the C compiler... ( a bit long )
To : comp.programming.threads
From : SenderX
Date : Wed Jan 12 2005 06:52 pm

Fu@#$%in outlook inserted a bunch of double-spaces making the document hard to read. Let me try posting it one more time. Crossing-fingers... :O

Excerpt from Section 2 ( rough-draft )

****************************************************

2 Memory Visibility and the C Compiler
===========

How does the C language relate to threads? The simple and correct answer is: they are not related in any way, shape, or form. If you try to implement most lock-free algorithms in C, you're exposing yourself to a number of issues.

Virtually all lock-free algorithms rely on their instructions being executed in a precise order. If the ordering of "critical instructions" is altered by the compiler in any way, get ready to combat weird and very hard-to-find race-conditions. It must also be noted that these critical instructions can not only be reordered by the compiler; they can be affected in adverse ways by the processor as well. Luckily, the processor's instruction set provides us with what we need to ensure the ordering of instructions and the effects they have on memory. These instructions are commonly called "memory barriers".

How does the AppCore library attempt to shield its critical instructions from compiler and/or processor reordering?

2.1 Assembly Language
===========

Externally compiled functions written in 100% assembly language can help alleviate the problems associated with the compiler and/or processor reordering "critical instructions". Since an assembler will not reorder our instructions, we can be assured that everything will be as expected. The ordering of instructions will be correct, and the memory barriers will stay exactly where we put them.

For simplicity, this paper uses the well-known i686 instruction-set and assumes the basic calling convention used by virtually all existing 80x86 C compilers.
Function parameters are pushed on the stack in reverse order. Then the call instruction is executed, which pushes the address of the next instruction on the stack and jumps to the function's entry point. The registers esi, ebx, and edi shall be preserved, and all push instructions shall have an equal number of pop instructions before the function can safely return. A function call is returned from by using the ret instruction, which pops the address of the next instruction off the stack and jumps to it. The return value shall be located in eax.

All C-like pseudo-code in this paper is assumed to be executed in the exact order shown, and to use the calling convention described above. This paper also contains assembly language snippets. These snippets conform to the described calling convention. The ordering of their instructions is enforced by memory barrier instructions and the assembler.

2.2 Memory Barriers
===========

These special instructions are used to synchronize access to shared data. Weak memory models coupled with multiple processors mean that one processor's modification to shared memory may or may not be fully visible to another processor. One processor might need to wait until its effects on shared memory are fully applied before it executes its next instruction; a memory barrier would be used to do the waiting. Another processor might need to ensure it's OK to read shared data by synchronizing its view of the memory with the processor that updated it. Again, a memory barrier would be used to wait. As you can see, a sort of producer/consumer relationship is exposed. Memory barriers are used on both the producer side and the consumer side.

Some processors offer a number of different memory barriers. Some affect only loads. Others may only affect stores. Some impose ordering on both loads and stores. You need to choose the correct barriers for a particular algorithm. A standard mutex needs barriers that affect both loads and stores.
A lock-free single reader/writer queue needs a barrier that only affects stores for producers and loads for consumers. It basically comes down to performance: the less restrictive the barriers are, the faster an algorithm will perform. The AppCore library attempts to provide a fairly fine-grained set of barriers that can lead to optimal results on certain processors by exposing the following externally assembled routines that implement its Memory Barrier API:

- ac_cpu_mb_producer - producer side barrier
- ac_cpu_mb_consumer - consumer side barrier
- ac_cpu_mb_consumer_dep - consumer side barrier /w data-dependency
- ac_cpu_mb_full - fenced barrier
- ac_cpu_mb_acq - acquire side barrier
- ac_cpu_mb_acq_dep - acquire side barrier /w data-dependency
- ac_cpu_mb_rel - release side barrier

/* i686 producer barrier */
extern void ac_cpu_i686_mb_producer( void );

/* i686 consumer barrier */
extern void ac_cpu_i686_mb_consumer( void );

/* i686 full barrier */
extern void ac_cpu_i686_mb_full( void );

/* define compatible i686 memory barrier API */
#define ac_cpu_mb_producer ac_cpu_i686_mb_producer
#define ac_cpu_mb_consumer ac_cpu_i686_mb_consumer
#define ac_cpu_mb_consumer_dep
#define ac_cpu_mb_full ac_cpu_i686_mb_full
#define ac_cpu_mb_acq ac_cpu_mb_full
#define ac_cpu_mb_acq_dep ac_cpu_mb_full
#define ac_cpu_mb_rel ac_cpu_mb_full

/* i686 producer barrier implementation */
.globl ac_cpu_i686_mb_producer
ac_cpu_i686_mb_producer:
sfence
ret

/* i686 consumer barrier implementation */
.globl ac_cpu_i686_mb_consumer
ac_cpu_i686_mb_consumer:
lfence
ret

/* i686 full barrier implementation */
.globl ac_cpu_i686_mb_full
ac_cpu_i686_mb_full:
mfence
ret

I will now try to briefly explain the semantics of two well-known memory visibility models:

2.2.1 Producer/Consumer Model
===========

Imagine that a processor "A" prepares to "produce" an object.
After it creates the object, it then points an initially null shared pointer to it, indicating that the object is completed and ready to be "consumed". Another processor "B" reads the shared pointer, notices that it's non-null, and begins to access the pointed-to object:

/* a simple example of broken pseudo-code */
static volatile object_t *shared = 0;

/* processor A */
object_t *local = create_object( . );
shared = local;

/* processor B */
object_t *local = shared;
if ( local ) { use_object( local ); }

Does processor B have a clear view of the object? The answer should always be: no. Processor A naively assumed that the object it's producing will be fully visible to a consumer. Imagine if the stores to the shared object were not applied until after processor B consumed the data. This would be a massive race-condition that could corrupt the entire application. This type of behavior is simply unacceptable. How do we resolve this issue?

The first part of the solution is to have processor A wait for the stores to be fully completed "before" it sets the shared pointer. We have solved the "producer" side; how do we solve the consumer side? After processor B reads the shared pointer, it needs to be sure that it's synchronized with processor A's producer barrier:

/* the fixed example pseudo-code */
static volatile object_t *shared = 0;

/* processor A */
object_t *local = create_object( . );
ac_cpu_mb_producer();
shared = local;

/* processor B */
object_t *local = shared;
ac_cpu_mb_consumer_dep();
if ( local ) { use_object( local ); }

Now processor B is guaranteed to see a complete object. It's all about the consumer waiting for the producer to finish after it's notified, and the producer finishing up before it notifies a consumer. This basic producer/consumer relationship between processors is an essential part of memory visibility.

2.2.2 Acquire/Release Model
===========

This is the extremely strict memory model that everybody's used to.
Anyone who has used mutexes has been conforming to acquire/release memory visibility semantics all along. It's a more stringent version of the producer/consumer model, where the unlocking of a mutex produces data and the locking consumes it. However, acquire/release needs to affect both loads and stores on the consumer and producer sides. Basically, the model goes something like this:

The act of locking a mutex executes an acquire barrier "after" the lock's state is atomically set to a value indicating to other threads that it's locked. This means that no stores from after the barrier will reach visibility, and no loads from after the barrier will be applied, "before" the lock acquisition reaches global visibility.

The act of unlocking a mutex executes a release barrier "before" the lock's state is atomically set to a value indicating to other threads that it's unlocked. This means that all loads and stores before the barrier will be applied before the subsequent lock release reaches global visibility.

2.3 Atomic Operations
===========

Sometimes a thread needs to execute a read-modify-write cycle without being interrupted. Most processors provide special instructions that take whatever measures are necessary to provide this functionality. They are commonly called atomic operations. The word atomic implies indivisibility and irreducibility, so it fits the application well.

Since this paper targets i686, all of the atomic operations listed in this section use the lock prefix. You prefix lock to an instruction that needs to be executed atomically with regard to multiple processors. This ensures the instruction is executed atomically, with a full memory barrier, before the next instruction is executed. Intel's atomic operations make it so a programmer doesn't have to add the memory barriers by hand. This can limit their flexibility and performance, but it's the way things are done on x86.
All of AppCore's algorithms can be implemented with simple atomic operations, all of which can be implemented with normal compare-and-swap ( CAS ) or load-linked/store-conditional ( LL/SC ) instructions. The library can also be configured to make certain algorithms use a double-width compare-and-swap ( DWCAS ) instruction. This gives us the ability to atomically modify both a pointer and an adjacent word. DWCAS can be used to implement much more complex and flexible lock-free algorithms. The AppCore library exposes the following types and externally assembled routines that implement its Atomic API:

Types
===========

- ac_atomic_t - signed word-sized variable
- ac_uatomic_t - unsigned word-sized variable

Fence
===========

- ac_atomic_cas_full - fenced CAS
- np_ac_atomic_dwcas_full - fenced DWCAS
- ac_atomic_cas_full_naked - fenced CAS /w naked failure
- np_ac_atomic_dwcas_full_naked - fenced DWCAS /w naked failure
- ac_atomic_xchg_full - fenced exchange
- ac_atomic_xadd_full - fenced exchange-and-add
- ac_atomic_inc_full - fenced increment
- ac_atomic_dec_full - fenced decrement

Acquire
===========

- ac_atomic_cas_acq - acquired CAS
- np_ac_atomic_dwcas_acq - acquired DWCAS
- ac_atomic_cas_acq_naked - acquired CAS /w naked failure
- np_ac_atomic_dwcas_acq_naked - acquired DWCAS /w naked failure
- ac_atomic_xchg_acq - acquired exchange
- ac_atomic_xadd_acq - acquired exchange-and-add
- ac_atomic_inc_acq - acquired increment
- ac_atomic_dec_acq - acquired decrement

Release
===========

- ac_atomic_cas_rel - released CAS
- np_ac_atomic_dwcas_rel - released DWCAS
- ac_atomic_cas_rel_naked - released CAS /w naked failure
- np_ac_atomic_dwcas_rel_naked - released DWCAS /w naked failure
- ac_atomic_xchg_rel - released exchange
- ac_atomic_xadd_rel - released exchange-and-add
- ac_atomic_inc_rel - released increment
- ac_atomic_dec_rel - released decrement

2.3.1 Compare-and-Swap ( CAS )
===========

This function takes three parameters: a destination pointer to a word-sized variable, a value to compare it against, and a new value to replace it with. The function atomically checks whether the pointed-to destination variable is equal to the comparand. If they are equal, the pointed-to variable is replaced with the new value. If they are not equal, there is no effect. The function always returns the original value of the pointed-to variable.

/* i686 fenced CAS */
extern ac_atomic_t ac_cpu_i686_atomic_cas_full
( volatile ac_atomic_t *dest,
  ac_atomic_t cmp,
  ac_atomic_t xchg );

/* define a compatible i686 CAS API */
#define ac_atomic_cas_full ac_cpu_i686_atomic_cas_full
#define ac_atomic_cas_acq ac_atomic_cas_full
#define ac_atomic_cas_rel ac_atomic_cas_full
#define ac_atomic_cas_full_naked ac_atomic_cas_full
#define ac_atomic_cas_acq_naked ac_atomic_cas_full
#define ac_atomic_cas_rel_naked ac_atomic_cas_full

/* i686 fenced CAS implementation */
.globl ac_cpu_i686_atomic_cas_full
ac_cpu_i686_atomic_cas_full:
movl 4(%esp), %ecx
movl 8(%esp), %eax
movl 12(%esp), %edx
lock cmpxchgl %edx, (%ecx)
ret

2.3.2 Double-Width Compare-and-Swap ( DWCAS )
===========

This function takes three parameters: a pointer to a destination doubleword-sized variable, a pointer to a value to compare it against, and a pointer to a new value. The function atomically checks whether the destination is equal to the comparand. If they are equal, the destination is replaced with the new value. If they are not equal, the comparand is updated with the destination's current value. The function returns non-zero if the operation failed.
/* i686 fenced DWCAS */
extern int np_ac_cpu_i686_atomic_dwcas_full
( volatile void *dest,
  void *cmp,
  const void *xchg );

/* define a compatible i686 DWCAS API */
#define np_ac_atomic_dwcas_full np_ac_cpu_i686_atomic_dwcas_full
#define np_ac_atomic_dwcas_acq np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_rel np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_full_naked np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_acq_naked np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_rel_naked np_ac_atomic_dwcas_full

/* i686 fenced DWCAS implementation */
.globl np_ac_cpu_i686_atomic_dwcas_full
np_ac_cpu_i686_atomic_dwcas_full:
pushl %esi
pushl %ebx
movl 16(%esp), %esi
movl (%esi), %eax
movl 4(%esi), %edx
movl 20(%esp), %esi
movl (%esi), %ebx
movl 4(%esi), %ecx
movl 12(%esp), %esi
lock cmpxchg8b (%esi)
movl $0, %ebx
je np_ac_cpu_i686_atomic_dwcas_full_success
movl 16(%esp), %esi
movl %eax, (%esi)
movl %edx, 4(%esi)
movl $1, %ebx
np_ac_cpu_i686_atomic_dwcas_full_success:
movl %ebx, %eax
popl %ebx
popl %esi
ret

2.3.3 Exchange-and-Add ( XADD )
===========

This function takes two parameters: a pointer to a word-sized variable, and a value to increment it by. It then atomically increments the pointed-to variable by the passed value. The function always returns the original value of the pointed-to variable.

/* i686 fenced XADD */
extern ac_atomic_t ac_cpu_i686_atomic_xadd_full
( volatile ac_atomic_t *dest,
  ac_atomic_t xchg );

/* define a compatible i686 XADD API */
#define ac_atomic_xadd_full ac_cpu_i686_atomic_xadd_full
#define ac_atomic_xadd_acq ac_atomic_xadd_full
#define ac_atomic_xadd_rel ac_atomic_xadd_full

/* i686 fenced XADD implementation */
.globl ac_cpu_i686_atomic_xadd_full
ac_cpu_i686_atomic_xadd_full:
movl 4(%esp), %ecx
movl 8(%esp), %eax
lock xaddl %eax, (%ecx)
ret

2.3.4 Exchange ( XCHG )
===========

This function takes two parameters: a pointer to a word-sized variable, and a value to exchange it with.
The operation then atomically replaces the pointed-to variable with the passed value. The function always returns the original value of the pointed-to variable. ( Note: xchgl asserts the lock signal implicitly when it has a memory operand, so no explicit lock prefix is needed. )

/* i686 fenced XCHG */
extern ac_atomic_t ac_cpu_i686_atomic_xchg_full
( volatile ac_atomic_t *dest,
  ac_atomic_t xchg );

/* define a compatible i686 XCHG API */
#define ac_atomic_xchg_full ac_cpu_i686_atomic_xchg_full
#define ac_atomic_xchg_acq ac_atomic_xchg_full
#define ac_atomic_xchg_rel ac_atomic_xchg_full

/* i686 fenced XCHG implementation */
.globl ac_cpu_i686_atomic_xchg_full
ac_cpu_i686_atomic_xchg_full:
movl 4(%esp), %ecx
movl 8(%esp), %eax
xchgl %eax, (%ecx)
ret

2.3.5 Increment and Decrement
===========

These functions take a single parameter: a pointer to a word-sized variable. The pointed-to variable is atomically incremented or decremented by one, and the new value is returned.

/* i686 fenced increment */
extern ac_atomic_t ac_cpu_i686_atomic_inc_full
( volatile ac_atomic_t *dest );

/* i686 fenced decrement */
extern ac_atomic_t ac_cpu_i686_atomic_dec_full
( volatile ac_atomic_t *dest );

/* define a compatible i686 increment/decrement API */
#define ac_atomic_inc_full ac_cpu_i686_atomic_inc_full
#define ac_atomic_inc_acq ac_atomic_inc_full
#define ac_atomic_inc_rel ac_atomic_inc_full
#define ac_atomic_dec_full ac_cpu_i686_atomic_dec_full
#define ac_atomic_dec_acq ac_atomic_dec_full
#define ac_atomic_dec_rel ac_atomic_dec_full

/* i686 fenced increment implementation */
.globl ac_cpu_i686_atomic_inc_full
ac_cpu_i686_atomic_inc_full:
movl 4(%esp), %ecx
movl $1, %eax
lock xaddl %eax, (%ecx)
incl %eax
ret

/* i686 fenced decrement implementation */
.globl ac_cpu_i686_atomic_dec_full
ac_cpu_i686_atomic_dec_full:
movl 4(%esp), %ecx
movl $-1, %eax
lock xaddl %eax, (%ecx)
decl %eax
ret