Subj : Re: Consistency of a byte shared variable
To   : comp.programming.threads
From : Gianni Mariani
Date : Mon Jan 31 2005 12:37 am

theepan wrote:
> Hi all,
> Is a shared byte variable consistent in its state (without any
> sort of locking) among threads?

On some CPUs yes, on most modern CPUs, no.

e.g.

  int value = 1;   // global value

  thread1: read value  (sees 1)
  thread1: write value (value + 1)
  thread2: read value  (sees either 1 or 2 depending on timing)
  thread2: write value (either 2 or 3 depending on the earlier read)

bad...

  thread1: read value  (sees 1)
  thread1: write value (value + 1 (2))
  thread1: "release memory barrier"
  thread2: "acquire memory barrier"
  thread2: read value  (sees 2)
  thread2: write value (writes value + 1 (3))

better.

x86, amd64 and MIPS (I think) CPUs "currently" seem to be cache
coherent, where the release and acquire are managed in hardware.
Sparc, PowerPC and Itanium seem to require explicit barrier
management.  I'm sure someone will correct me and fill in the
details here.

Most newer CPUs have multiple execution units, register renaming,
and some have "speculative" execution with "write queues" and very
deep pipelines (P4).  This all means that the order in which
instructions are executed and values are written to memory is
determined at run time and can change dynamically.  The CPU is
designed to "appear" as if the instructions were executed in
sequence.  Hence, data is written in almost random order.  Some
CPUs do not take into account the conflicts this re-ordering
creates when multiple CPUs share the same memory, hence the whole
idea of "memory barriers".

Barriers are instructions that provide a "sequence point": the CPU
ensures that the memory operations that occur before the sequence
point are "visible" before any memory operations that occur after
it.  These are usually flavoured as "read" barriers and "write"
barriers.  The nomenclature here gets fuzzy, and there seem to be
slight differences in semantics between the different systems, but
this is a list of some of the terminology I've seen so far.

  write barriers: release, producer, sfence, sink, store
  read barriers:  acquire, consumer, lfence, hoist, load

Specifically, write barriers ensure that all writes to memory
before the barrier instruction are made visible to other CPUs
before any writes after the barrier.  Similarly, read barriers
ensure that any reads from memory before the barrier occur before
any reads after it.

The "cost" of a barrier instruction is system dependent, and
somewhat code dependent as well.  No doubt as CPUs become even
more complex, the relative cost (in number of potentially lost
instructions) could go up.  I have no data directly related to the
cost of barrier instructions; however, the latency from CPU to CPU
(making a memory change visible from one CPU to another) can vary
from the hundreds to the thousands of clock cycles.  For example,
my dual AMD 2400 Athlon MP has a latency of around 900 cycles.  A
dual Opteron 248 clocks in at around 450 cycles.  I suspect that
each CPU architecture has different limitations and trade-offs, so
this can vary significantly from CPU to CPU even within the same
family.

Also, compilers will re-order operations, which usually means you
need to be careful to use "volatile" or other system-dependent
tools to stop the "optimizer" re-ordering your carefully crafted
code.
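To make the release/acquire pairing above concrete, here is a minimal
sketch in C of one thread handing a value to another through a flag.
The write_sync()/read_sync() names are placeholders for whatever
barrier primitive the platform actually provides (they are not
standard functions), and thread creation is left out.

  /* Hypothetical barrier primitives - stand-ins for the platform's
     real store/load barrier instructions (sfence/lfence, membar, ...) */
  void write_sync(void);      /* release / store / sink barrier  */
  void read_sync(void);       /* acquire / load / hoist barrier  */

  static int data;            /* payload handed from thread1 to thread2 */
  static volatile int ready;  /* "data is ready" flag                    */

  /* thread1 (producer) */
  void produce(void)
  {
      data = 42;        /* write the payload                          */
      write_sync();     /* make the payload visible before the flag   */
      ready = 1;
  }

  /* thread2 (consumer) */
  int consume(void)
  {
      while (ready == 0)   /* spin until thread1 raises the flag      */
          ;
      read_sync();         /* don't read the payload before the flag  */
      return data;         /* sees 42, not a stale value              */
  }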
If you use a mutex to synchronize access to memory, a memory barrier
is done implicitly (otherwise the mutex would probably not work very
well), hence there is no need to perform memory barriers outside of
the regular mutex lock/unlock.  However, this does become important
when using the DCL, or "Double Checked Lock".

*(Broken DCL)*

  static volatile int y;
  static volatile int v;

  if ( v == 0 )
  {
      lock( mutex );
      if ( v == 0 )
      {
          y = boo();
          v = 1;
      }
      unlock( mutex );
  }

  // y might not be the return value from
  // boo() in a different thread than the caller of boo().
  do_something( y );

There are a number of ways to "fix" this.  One is to simply use the
mutex all the time:

*(Eliminate DCL)*

  static volatile int y;
  static volatile int v;

  lock( mutex );
  if ( v == 0 )
  {
      y = boo();
      v = 1;
  }
  unlock( mutex );

  do_something( y );

Or introduce memory barriers.

*(DCL with memory barriers)*

  static volatile int y;
  static volatile int v;

  int local_v = v;
  read_sync();

  if ( local_v == 0 )
  {
      lock( mutex );
      if ( v == 0 )
      {
          y = boo();
          write_sync();
          v = 1;
      }
      unlock( mutex );
  }

  do_something( y );

I think the technique below is the "fastest" mechanism, and it is
probably only supported by some compilers.  The "__thread" keyword is
supported by some compilers as an easy way to access storage that is
local to a thread.  In other words, this is "thread private" storage,
also known as TLS (Thread Local Storage), TSS (Thread Specific
Storage), TLD (Thread Local Data) or TSD (Thread Specific Data) -
pick one...

*(TSS DCL)*

  static volatile int y;
  static volatile int v;
  static __thread int private_v;

  if ( private_v == 0 )
  {
      lock( mutex );
      if ( v == 0 )
      {
          y = boo();
          write_sync();
          v = 1;
      }
      unlock( mutex );
      private_v = v;
  }

  do_something( y );

This is only fast if the compiler generates simple read instructions
for private_v == 0.  I was able to confirm this using gcc on x86 and
amd64 targets.  If your compiler does not have fast thread local
data, then I suspect the DCL with memory barriers might be a better
option.
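For what it's worth, here is one possible spelling of the
write_sync()/read_sync() placeholders used above, for x86 with GCC
inline assembly.  This is an assumption for illustration, not
something from a particular library; other architectures need their
own instructions instead (membar on Sparc, sync on PowerPC, mf on
Itanium).

  /* One possible x86/GCC implementation of the barrier placeholders.
     The "memory" clobber also stops the compiler from re-ordering
     across the barrier, which matters just as much as the fence
     instruction itself. */

  static inline void write_sync(void)
  {
      __asm__ __volatile__ ("sfence" : : : "memory");  /* store barrier */
  }

  static inline void read_sync(void)
  {
      __asm__ __volatile__ ("lfence" : : : "memory");  /* load barrier */
  }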