Subj : Re: CMPXCHG timing To : comp.programming.threads From : Michael Pryhodko Date : Thu Mar 31 2005 05:38 pm Well, I have decided to show my child :) It somewhat raw but readable: const P = 123; // unique thread id old_val CAS(addr, cmp_val, new_val) // this is one processor instruction (i.e. OS can't interrupt in the middle) { // instruction microcode // CAS.1 - range label, used in speculations below FETCH CACHE_LINE(addr) // unpredictable timing (could it take 0 time?) // CAS.2 - range label, used in speculations below // timing of this part is predictable (fixed?) DEST := *addr IF eax = DEST THEN ZF <- 1 DEST = new_val ELSE ZF <- 0 EAX = DEST FI; } A = 0; // lock flag, aligned for CAS operation lock() { // drain store buffers of every thread upon entry // could help with estimations in 'wait' step below // use mfence? is it possible for 'read' part of CAS to pass this sfence? sfence ENTRY: // if OS interrupts here store buffers will be empty upon return anyway // OS can't interrupt in the middle of CAS, since it is one instruction old_val = CAS(&A, 0, P); // if OS interrupts here this will drain store buffers // (i.e. it will do the same as 'sfence' call below) // flush store buffers to make new value visible to other threads sfence // check if we could participate at all // put it before 'sfence'? or processor could do it itself? if (old_val != 0) { pause // could help with P6 branch prediction jmp ENTRY } // now we need to wait enough time for every thread (which entered his CAS.2 // before we finished our 'sfence') to finish his [CAS.2, sfence] // note: thread will 100% lose if it will enter [CAS.2, sfence] AFTER this thread // finishes 'sfence' instruction // how to wait this time? // 1. we could estimate maximum number of clocks for [CAS.2, sfence] to finish and wait this time // 2. emulate this time with another [CAS.2, sfence] call to dummy variable B? I wonder if writing to memory will take the same time regardless of address (considering the same memory type) // 3. use some kind of barrier? //!! how? // maybe like this: B = 0 // B is thread-local variable (e.g. on stack) CAS(&B, 0, 1) // CAS(&B, B, ~B)? -- in this case no need in 'B = 0' (if optimization won't kill us, hopefully '&B' and '~B' instructions will negate this) // sfence // with 'B=0' this operation could take even more than other threads CAS 'sfence' // this operation should cover 'part of CAS'+'sfence' timing of other processors // (considering that: // 1. 'write to memory' time is the same regardless of address and processor) // 2. existing optimization technique did not killed us // prevent 'if (A != P)' from passing 'wait' step above // lfence // hmmm 'lfence' coud pass 'sfence' -> 'read A' could pass sfence -> bad, change 'sfence + lfence' -> 'mfence' mfence // check for winner if (A != P) { pause // could help with P6 branch prediction jmp ENTRY } } unlock() { // sanity check ASSERT(A == P); // mark flag 'free' A = 0; // if OS will interrupt here this will drain store buffers // (i.e. it will do the same as 'sfence' call below) // make it visible to other threads // we could skip this instruction but it will speed up lock release // thus increasing overall performance in high-contention case sfence } .