Subj : Re: Memory visibility and the C compiler... ( a bit long )
To : comp.programming.threads
From : SenderX
Date : Wed Jan 12 2005 06:52 pm

Fu@#$%in outlook inserted a bunch of double-spaces making the document hard to read. Let me try posting it one more time. Crossing-fingers... :O

Excerpt from Section 2 ( rough-draft )

****************************************************

2 Memory Visibility and the C Compiler
===========

How does the C language relate to threads? The simple and correct answer is: they are not related in any way, shape, or form. If you try to implement most lock-free algorithms in C, you're exposing yourself to a number of issues.

Virtually all lock-free algorithms rely on their instructions being executed in a precise order. If the ordering of "critical instructions" is altered by the compiler in any way, get ready to combat weird and very hard-to-find race-conditions. It must also be noted that these critical instructions can not only be reordered by the compiler; they can be affected in adverse ways by the processor as well. Luckily, the processor's instruction set provides us with what we need to ensure the ordering of instructions and the effects they have on memory. These instructions are commonly called "memory barriers".

How does the AppCore library attempt to shield its critical instructions from compiler and/or processor reordering?

2.1 Assembly Language
===========

Externally compiled functions written in 100% assembly language can help alleviate the problems associated with the compiler and/or processor reordering "critical instructions". Since an assembler will not reorder our instructions, we can be assured that everything will be as expected. The ordering of instructions will be correct, and the memory barriers will stay exactly where we put them.

For simplicity, this paper uses the well-known i686 instruction-set and assumes the basic calling convention used by virtually all existing 80x86 C compilers.
Function parameters are pushed on the stack in reverse order. Then the call instruction is executed, which pushes the address of the next instruction on the stack and jumps to the function's entry point. The registers esi, ebx, and edi shall be preserved, and all push instructions shall have an equal number of pop instructions before the function can safely return. A function call is returned from by using the ret instruction, which pops the address of the next instruction off the stack and jumps to it. The return value shall be located in eax.

All C-like pseudo-code in this paper is assumed to be executed in the exact order shown, and to use the calling convention described above. This paper also contains assembly language snippets. These snippets conform to the described calling convention. The ordering of their instructions is enforced by memory barrier instructions and the assembler.

2.2 Memory Barriers
===========

These special instructions are used to synchronize access to shared data. Weak memory models coupled with multiple processors mean that one processor's modification to shared memory may or may not be fully visible to another processor. One processor might need to wait until its effects on shared memory are fully applied before it executes its next instruction; a memory barrier would be used to do the waiting. Another processor might need to ensure it's OK to read shared data by synchronizing its view of the memory with the processor that updated it. Again, a memory barrier would be used to wait. As you can see, a sort of producer/consumer relationship is exposed. Memory barriers are used on both the producer side and the consumer side.

Some processors offer a number of different memory barriers. Some affect only loads. Others may only affect stores. Some impose ordering on both loads and stores. You need to choose the correct barriers for a particular algorithm. A standard mutex needs barriers that affect both loads and stores.
A lock-free single reader/writer queue needs a barrier that only affects stores for producers and loads for consumers. It basically comes down to performance: the less restrictive the barriers are, the faster an algorithm will perform. The AppCore library attempts to provide a fairly fine-grained set of barriers that can lead to optimal results on certain processors by exposing the following externally assembled routines that implement its Memory Barrier API:

- ac_cpu_mb_producer - producer side barrier
- ac_cpu_mb_consumer - consumer side barrier
- ac_cpu_mb_consumer_dep - consumer side barrier /w data-dependency
- ac_cpu_mb_full - fenced barrier
- ac_cpu_mb_acq - acquire side barrier
- ac_cpu_mb_acq_dep - acquire side barrier /w data-dependency
- ac_cpu_mb_rel - release side barrier

/* i686 producer barrier */
extern void ac_cpu_i686_mb_producer( void );

/* i686 consumer barrier */
extern void ac_cpu_i686_mb_consumer( void );

/* i686 full barrier */
extern void ac_cpu_i686_mb_full( void );

/* define compatible i686 memory barrier API */
#define ac_cpu_mb_producer ac_cpu_i686_mb_producer
#define ac_cpu_mb_consumer ac_cpu_i686_mb_consumer
#define ac_cpu_mb_consumer_dep
#define ac_cpu_mb_full ac_cpu_i686_mb_full
#define ac_cpu_mb_acq ac_cpu_mb_full
#define ac_cpu_mb_acq_dep ac_cpu_mb_full
#define ac_cpu_mb_rel ac_cpu_mb_full

/* i686 producer barrier implementation */
.globl ac_cpu_i686_mb_producer
ac_cpu_i686_mb_producer:
sfence
ret

/* i686 consumer barrier implementation */
.globl ac_cpu_i686_mb_consumer
ac_cpu_i686_mb_consumer:
lfence
ret

/* i686 full barrier implementation */
.globl ac_cpu_i686_mb_full
ac_cpu_i686_mb_full:
mfence
ret

I will now try to briefly explain the semantics of two well-known memory visibility models:

2.2.1 Producer/Consumer Model
===========

Imagine that a processor "A" prepares to "produce" an object.
After it creates the object, it then points an initially null shared pointer to it, indicating that the object is completed and ready to be "consumed". Another processor "B" reads the shared pointer, notices that it's non-null, and begins to access the pointed-to object:

/* a simple example of broken pseudo-code */
static volatile object_t *shared = 0;

/* processor A */
object_t *local = create_object( . );
shared = local;

/* processor B */
object_t *local = shared;
if ( local ) { use_object( local ); }

Does processor B have a clear view of the object? The answer should always be: no. Processor A naively assumed that the object it's producing will be fully visible to a consumer. Imagine if the stores to the shared object were not applied until after processor B consumed the data. This would be a massive race-condition that could corrupt the entire application. This type of behavior is simply unacceptable. How do we resolve this issue?

The first part of the solution is to have processor A wait for the stores to be fully completed "before" it sets the shared pointer. We have solved the "producer" side; how do we solve the consumer side? After processor B reads the shared pointer, it needs to be sure that it's synchronized with processor A's producer barrier:

/* the fixed example pseudo-code */
static volatile object_t *shared = 0;

/* processor A */
object_t *local = create_object( . );
ac_cpu_mb_producer();
shared = local;

/* processor B */
object_t *local = shared;
ac_cpu_mb_consumer_dep();
if ( local ) { use_object( local ); }

Now processor B is guaranteed to see a complete object. It's all about the consumer waiting for the producer to finish after it's notified, and the producer finishing up before it notifies a consumer. This basic producer/consumer relationship between processors is an essential part of memory visibility.

2.2.2 Acquire/Release Model
===========

This is the extremely strict memory model that everybody's used to.
Anyone who has used mutexes has been conforming to acquire/release memory visibility semantics all along. It's a more stringent version of the producer/consumer model, where the unlocking of a mutex produces data and the locking consumes it. However, acquire/release needs to affect both loads and stores on the consumer and producer sides. Basically, the model goes something like this:

The act of locking a mutex executes an acquire barrier "after" the lock's state is atomically set to a value indicating to other threads that it's locked. This means that no stores from after the barrier will reach visibility, and no loads from after the barrier will be applied, "before" the lock acquisition reaches global visibility.

The act of unlocking a mutex executes a release barrier "before" the lock's state is atomically set to a value indicating to other threads that it's unlocked. This means that all loads and stores before the barrier will be applied before the subsequent lock release reaches global visibility.

2.3 Atomic Operations
===========

Sometimes a thread needs to execute a read-modify-write cycle without being interrupted. Most processors provide special instructions that take whatever measures are necessary to provide this functionality. They are commonly called atomic operations. The word atomic implies indivisibility and irreducibility, so it fits the application well.

Since this paper targets i686, all of the atomic operations listed in this section use the lock prefix. You prefix lock to an instruction that needs to be executed atomically with regard to multiple processors. This ensures the instruction is executed atomically, with a full memory barrier, before the next instruction is executed. Intel's atomic operations make it so a programmer doesn't have to add the memory barriers by hand. This can limit their flexibility and performance, but it's the way things are done on x86.
All of AppCore's algorithms can be implemented with simple atomic operations, all of which can be implemented with normal compare-and-swap ( CAS ) or load-linked/store-conditional ( LL/SC ) instructions. The library can also be configured to make certain algorithms use a double-width compare-and-swap ( DWCAS ) instruction. This gives us the ability to atomically modify both a pointer and an adjacent word. DWCAS can be used to implement much more complex and flexible lock-free algorithms. The AppCore library exposes the following types and externally assembled routines that implement its Atomic API:

Types
===========

- ac_atomic_t - signed word-sized variable
- ac_uatomic_t - unsigned word-sized variable

Fence
===========

- ac_atomic_cas_full - fenced CAS
- np_ac_atomic_dwcas_full - fenced DWCAS
- ac_atomic_cas_full_naked - fenced CAS /w naked failure
- np_ac_atomic_dwcas_full_naked - fenced DWCAS /w naked failure
- ac_atomic_xchg_full - fenced exchange
- ac_atomic_xadd_full - fenced exchange-and-add
- ac_atomic_inc_full - fenced increment
- ac_atomic_dec_full - fenced decrement

Acquire
===========

- ac_atomic_cas_acq - acquired CAS
- np_ac_atomic_dwcas_acq - acquired DWCAS
- ac_atomic_cas_acq_naked - acquired CAS /w naked failure
- np_ac_atomic_dwcas_acq_naked - acquired DWCAS /w naked failure
- ac_atomic_xchg_acq - acquired exchange
- ac_atomic_xadd_acq - acquired exchange-and-add
- ac_atomic_inc_acq - acquired increment
- ac_atomic_dec_acq - acquired decrement

Release
===========

- ac_atomic_cas_rel - released CAS
- np_ac_atomic_dwcas_rel - released DWCAS
- ac_atomic_cas_rel_naked - released CAS /w naked failure
- np_ac_atomic_dwcas_rel_naked - released DWCAS /w naked failure
- ac_atomic_xchg_rel - released exchange
- ac_atomic_xadd_rel - released exchange-and-add
- ac_atomic_inc_rel - released increment
- ac_atomic_dec_rel - released decrement

2.3.1 Compare-and-Swap ( CAS )
===========

This function takes three parameters: a destination pointer to a word-sized variable, a value to compare it against, and a new value to replace it with. The function atomically checks whether the pointed-to destination variable is equal to the comparand. If they are equal, the pointed-to variable is replaced with the new value. If they are not equal, there is no effect. The function always returns the original value of the pointed-to variable.

/* i686 fenced CAS */
extern ac_atomic_t ac_cpu_i686_atomic_cas_full
( volatile ac_atomic_t *dest,
  ac_atomic_t cmp,
  ac_atomic_t xchg );

/* define a compatible i686 CAS API */
#define ac_atomic_cas_full ac_cpu_i686_atomic_cas_full
#define ac_atomic_cas_acq ac_atomic_cas_full
#define ac_atomic_cas_rel ac_atomic_cas_full
#define ac_atomic_cas_full_naked ac_atomic_cas_full
#define ac_atomic_cas_acq_naked ac_atomic_cas_full
#define ac_atomic_cas_rel_naked ac_atomic_cas_full

/* i686 fenced CAS implementation */
.globl ac_cpu_i686_atomic_cas_full
ac_cpu_i686_atomic_cas_full:
movl 4(%esp), %ecx
movl 8(%esp), %eax
movl 12(%esp), %edx
lock cmpxchgl %edx, (%ecx)
ret

2.3.2 Double-Width Compare-and-Swap ( DWCAS )
===========

This function takes three parameters: a pointer to a destination doubleword-sized variable, a pointer to a value to compare it against, and a pointer to a new value. The function atomically checks whether the destination is equal to the comparand. If they are equal, the destination is replaced with the new value. If they are not equal, the comparand is updated with the destination's current value. The function returns non-zero if the operation failed.
/* i686 fenced DWCAS */
extern int np_ac_cpu_i686_atomic_dwcas_full
( volatile void *dest,
  void *cmp,
  const void *xchg );

/* define a compatible i686 DWCAS API */
#define np_ac_atomic_dwcas_full np_ac_cpu_i686_atomic_dwcas_full
#define np_ac_atomic_dwcas_acq np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_rel np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_full_naked np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_acq_naked np_ac_atomic_dwcas_full
#define np_ac_atomic_dwcas_rel_naked np_ac_atomic_dwcas_full

/* i686 fenced DWCAS implementation */
.globl np_ac_cpu_i686_atomic_dwcas_full
np_ac_cpu_i686_atomic_dwcas_full:
pushl %esi
pushl %ebx
movl 16(%esp), %esi
movl (%esi), %eax
movl 4(%esi), %edx
movl 20(%esp), %esi
movl (%esi), %ebx
movl 4(%esi), %ecx
movl 12(%esp), %esi
lock cmpxchg8b (%esi)
movl $0, %ebx
je np_ac_cpu_i686_atomic_dwcas_full_success
movl 16(%esp), %esi
movl %eax, (%esi)
movl %edx, 4(%esi)
movl $1, %ebx
np_ac_cpu_i686_atomic_dwcas_full_success:
movl %ebx, %eax
popl %ebx
popl %esi
ret

2.3.3 Exchange-and-Add ( XADD )
===========

This function takes two parameters: a pointer to a word-sized variable, and a value to increment it by. It then atomically increments the pointed-to variable by the passed value. The function always returns the original value of the pointed-to variable.

/* i686 fenced XADD */
extern ac_atomic_t ac_cpu_i686_atomic_xadd_full
( volatile ac_atomic_t *dest,
  ac_atomic_t xchg );

/* define a compatible i686 XADD API */
#define ac_atomic_xadd_full ac_cpu_i686_atomic_xadd_full
#define ac_atomic_xadd_acq ac_atomic_xadd_full
#define ac_atomic_xadd_rel ac_atomic_xadd_full

/* i686 fenced XADD implementation */
.globl ac_cpu_i686_atomic_xadd_full
ac_cpu_i686_atomic_xadd_full:
movl 4(%esp), %ecx
movl 8(%esp), %eax
lock xaddl %eax, (%ecx)
ret

2.3.4 Exchange ( XCHG )
===========

This function takes two parameters: a pointer to a word-sized variable, and a value to exchange it with.
The operation then atomically replaces the pointed-to variable with the passed value. The function always returns the original value of the pointed-to variable. ( Note: xchgl asserts the lock signal implicitly when it has a memory operand, so no explicit lock prefix is needed. )

/* i686 fenced XCHG */
extern ac_atomic_t ac_cpu_i686_atomic_xchg_full
( volatile ac_atomic_t *dest,
  ac_atomic_t xchg );

/* define a compatible i686 XCHG API */
#define ac_atomic_xchg_full ac_cpu_i686_atomic_xchg_full
#define ac_atomic_xchg_acq ac_atomic_xchg_full
#define ac_atomic_xchg_rel ac_atomic_xchg_full

/* i686 fenced XCHG implementation */
.globl ac_cpu_i686_atomic_xchg_full
ac_cpu_i686_atomic_xchg_full:
movl 4(%esp), %ecx
movl 8(%esp), %eax
xchgl %eax, (%ecx)
ret

2.3.5 Increment and Decrement
===========

These functions take a single parameter: a pointer to a word-sized variable. The pointed-to variable is atomically incremented or decremented by one, and the new value is returned.

/* i686 fenced increment */
extern ac_atomic_t ac_cpu_i686_atomic_inc_full
( volatile ac_atomic_t *dest );

/* i686 fenced decrement */
extern ac_atomic_t ac_cpu_i686_atomic_dec_full
( volatile ac_atomic_t *dest );

/* define a compatible i686 increment/decrement API */
#define ac_atomic_inc_full ac_cpu_i686_atomic_inc_full
#define ac_atomic_inc_acq ac_atomic_inc_full
#define ac_atomic_inc_rel ac_atomic_inc_full
#define ac_atomic_dec_full ac_cpu_i686_atomic_dec_full
#define ac_atomic_dec_acq ac_atomic_dec_full
#define ac_atomic_dec_rel ac_atomic_dec_full

/* i686 fenced increment implementation */
.globl ac_cpu_i686_atomic_inc_full
ac_cpu_i686_atomic_inc_full:
movl 4(%esp), %ecx
movl $1, %eax
lock xaddl %eax, (%ecx)
incl %eax
ret

/* i686 fenced decrement implementation */
.globl ac_cpu_i686_atomic_dec_full
ac_cpu_i686_atomic_dec_full:
movl 4(%esp), %ecx
movl $-1, %eax
lock xaddl %eax, (%ecx)
decl %eax
ret