Subj : Re: Consistency of a byte shared variable
To   : comp.programming.threads
From : Gianni Mariani
Date : Mon Jan 31 2005 12:37 am

theepan wrote:
> Hi all,
> Is a shared byte variable consistent in its state (without any
> sort of locking) among threads?

On some CPUs yes, on most modern CPUs, no.

e.g.

  int value = 1;   // global value

  thread1: read value  (sees 1)
  thread1: write value (value + 1)
  thread2: read value  (sees either 1 or 2 depending on timing)
  thread2: write value (either 2 or 3 depending on the earlier read)

bad...

  thread1: read value  (sees 1)
  thread1: write value (value + 1 (2))
  thread1: "release memory barrier"
  thread2: "acquire memory barrier"
  thread2: read value  (sees 2)
  thread2: write value (writes value + 1 (3))

better.

x86, amd64 and MIPS (I think) CPUs "currently" seem to be cache
coherent, where the release and acquire are managed in hardware.
Sparc, PowerPC and Itanium seem to require explicit barrier
management.  I'm sure someone will correct me and fill in the
details here.

Most newer CPUs have multiple execution units, register renaming,
and some have "speculative" execution with "write queues" and very
deep pipelines (P4).  This all means that the order in which
instructions are executed and values are written to memory is
determined at run time and can change dynamically.  The CPU is
designed to "appear" as if the instructions were executed in
sequence.  Hence, data is written in almost random order.  Some
CPUs do not take into account the conflicts this re-ordering
creates when multiple CPUs share the same memory, hence the whole
idea of "memory barriers".

Barriers are instructions that provide a "sequence point": the CPU
ensures that the memory operations that occur before the sequence
point are "visible" before any memory operations that occur after
it.  These are usually flavoured as "read" barriers and "write"
barriers.  The nomenclature here gets fuzzy, and there seem to be
slight differences in semantics between the different systems, but
this is a list of some of the terminology I've seen so far.

  write barriers: release, producer, sfence, sink, store
  read barriers:  acquire, consumer, lfence, hoist, load

Specifically, write barriers ensure that all writes to memory
before the barrier instruction are made visible to other CPUs
before any writes after the barrier.  Similarly, read barriers
ensure that any reads from memory before the barrier occur before
any reads after it.

The "cost" of a barrier instruction is system dependent, and
somewhat code dependent as well.  No doubt as CPUs become even
more complex, the relative cost (in number of potentially lost
instructions) could go up.  I have no data directly related to the
cost of barrier instructions; however, the latency from CPU to CPU
(making a memory change visible from one CPU to another) can vary
from the hundreds to the thousands of clock cycles.  For example,
my dual AMD 2400 Athlon MP has a latency of around 900 cycles.  A
dual Opteron 248 clocks in at around 450 cycles.  I suspect that
each CPU architecture has different limitations and trade-offs, so
this can vary significantly from CPU to CPU even within the same
family.

Also, compilers will re-order operations, which usually means you
need to be careful to use "volatile" or other system-dependent
tools to stop the "optimizer" re-ordering your carefully crafted
code.
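To make the release/acquire pairing above concrete, here is a minimal
sketch in C of one thread handing a value to another through a flag.
The write_sync()/read_sync() names are placeholders for whatever
barrier primitive the platform actually provides (they are not
standard functions), and thread creation is left out.

  /* Hypothetical barrier primitives - stand-ins for the platform's
     real store/load barrier instructions (sfence/lfence, membar, ...) */
  void write_sync(void);      /* release / store / sink barrier  */
  void read_sync(void);       /* acquire / load / hoist barrier  */

  static int data;            /* payload handed from thread1 to thread2 */
  static volatile int ready;  /* "data is ready" flag                    */

  /* thread1 (producer) */
  void produce(void)
  {
      data = 42;        /* write the payload                          */
      write_sync();     /* make the payload visible before the flag   */
      ready = 1;
  }

  /* thread2 (consumer) */
  int consume(void)
  {
      while (ready == 0)   /* spin until thread1 raises the flag      */
          ;
      read_sync();         /* don't read the payload before the flag  */
      return data;         /* sees 42, not a stale value              */
  }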
If you use a mutex to synchronize access to memory, a memory barrier
is done implicitly (otherwise the mutex would probably not work very
well), hence there is no need to perform memory barriers outside of
the regular mutex lock/unlock.  However, this does become important
when using the DCL, or "Double Checked Lock".

*(Broken DCL)*

  static volatile int y;
  static volatile int v;

  if ( v == 0 )
  {
      lock( mutex );
      if ( v == 0 )
      {
          y = boo();
          v = 1;
      }
      unlock( mutex );
  }

  // y might not be the return value from
  // boo() in a different thread than the caller of boo().
  do_something( y );

There are a number of ways to "fix" this.  One is to simply use the
mutex all the time:

*(Eliminate DCL)*

  static volatile int y;
  static volatile int v;

  lock( mutex );
  if ( v == 0 )
  {
      y = boo();
      v = 1;
  }
  unlock( mutex );

  do_something( y );

Or introduce memory barriers.

*(DCL with memory barriers)*

  static volatile int y;
  static volatile int v;

  int local_v = v;
  read_sync();

  if ( local_v == 0 )
  {
      lock( mutex );
      if ( v == 0 )
      {
          y = boo();
          write_sync();
          v = 1;
      }
      unlock( mutex );
  }

  do_something( y );

I think the technique below is the "fastest" mechanism, and it is
probably only supported by some compilers.  The "__thread" keyword is
supported by some compilers as an easy way to access storage that is
local to a thread.  In other words, this is "thread private" storage,
also known as TLS (Thread Local Storage), TSS (Thread Specific
Storage), TLD (Thread Local Data) or TSD (Thread Specific Data) -
pick one...

*(TSS DCL)*

  static volatile int y;
  static volatile int v;
  static __thread int private_v;

  if ( private_v == 0 )
  {
      lock( mutex );
      if ( v == 0 )
      {
          y = boo();
          write_sync();
          v = 1;
      }
      unlock( mutex );
      private_v = v;
  }

  do_something( y );

This is only fast if the compiler generates simple read instructions
for private_v == 0.  I was able to confirm this using gcc on x86 and
amd64 targets.  If your compiler does not have fast thread local
data, then I suspect the DCL with memory barriers might be a better
option.
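For what it's worth, here is one possible spelling of the
write_sync()/read_sync() placeholders used above, for x86 with GCC
inline assembly.  This is an assumption for illustration, not
something from a particular library; other architectures need their
own instructions instead (membar on Sparc, sync on PowerPC, mf on
Itanium).

  /* One possible x86/GCC implementation of the barrier placeholders.
     The "memory" clobber also stops the compiler from re-ordering
     across the barrier, which matters just as much as the fence
     instruction itself. */

  static inline void write_sync(void)
  {
      __asm__ __volatile__ ("sfence" : : : "memory");  /* store barrier */
  }

  static inline void read_sync(void)
  {
      __asm__ __volatile__ ("lfence" : : : "memory");  /* load barrier */
  }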