Subj : Re: CMPXCHG timing
To   : comp.programming.threads
From : Michael Pryhodko
Date : Thu Mar 31 2005 05:38 pm

Well, I have decided to show my child :) It somewhat raw but readable:


const P = 123; // unique thread id

old_val CAS(addr, cmp_val, new_val) // this is one processor
instruction (i.e. OS can't interrupt in the middle)
{
    // instruction microcode

// CAS.1 - range label, used in speculations below
    FETCH CACHE_LINE(addr)   // unpredictable timing (could it take 0
time?)

// CAS.2 - range label, used in speculations below
    // timing of this part is predictable (fixed?)
    DEST := *addr
    IF eax = DEST
        THEN
            ZF <- 1
            DEST = new_val
        ELSE
            ZF <- 0
            EAX = DEST
    FI;
}

A = 0;  // lock flag, aligned for CAS operation

lock()
{
    // drain store buffers of every thread upon entry
    // could help with estimations in 'wait' step below
    // use mfence? is it possible for 'read' part of CAS to pass this
sfence?
    sfence

ENTRY:
    // if OS interrupts here store buffers will be empty upon return
anyway

    // OS can't interrupt in the middle of CAS, since it is one
instruction
    old_val = CAS(&A, 0, P);

    // if OS interrupts here this will drain store buffers
    // (i.e. it will do the same as 'sfence' call below)

    // flush store buffers to make new value visible to other threads
    sfence

    // check if we could participate at all
    // put it before 'sfence'? or processor could do it itself?
    if (old_val != 0)
    {
        pause   // could help with P6 branch prediction
        jmp ENTRY
    }

    // now we need to wait enough time for every thread (which entered
his CAS.2
    // before we finished our 'sfence') to finish his [CAS.2, sfence]
    // note: thread will 100% lose if it will enter [CAS.2, sfence]
AFTER this thread
    // finishes 'sfence' instruction

    // how to wait this time?
    // 1. we could estimate maximum number of clocks for [CAS.2,
sfence] to finish and wait this time
    // 2. emulate this time with another [CAS.2, sfence] call to dummy
variable B? I wonder if writing to memory will take the same time
regardless of address (considering the same memory type)
    // 3. use some kind of barrier?

    //!! how?

    // maybe like this:
    B = 0 // B is thread-local variable (e.g. on stack)
    CAS(&B, 0, 1) // CAS(&B, B, ~B)? -- in this case no need in 'B = 0'
(if optimization won't kill us, hopefully '&B' and '~B' instructions
will negate this)
    // sfence // with 'B=0' this operation could take even more than
other threads CAS 'sfence'
    // this operation should cover 'part of CAS'+'sfence' timing of
other processors
    // (considering that:
    // 1. 'write to memory' time is the same regardless of address and
processor)
    // 2. existing optimization technique did not killed us

    // prevent 'if (A != P)' from passing 'wait' step above
    // lfence

    // hmmm 'lfence' coud pass 'sfence' -> 'read A' could pass sfence
-> bad, change 'sfence + lfence' -> 'mfence'
    mfence

    // check for winner
    if (A != P)
    {
        pause   // could help with P6 branch prediction
        jmp ENTRY
    }
}


unlock()
{
    // sanity check
    ASSERT(A == P);

    // mark flag 'free'
    A = 0;

    // if OS will interrupt here this will drain store buffers
    // (i.e. it will do the same as 'sfence' call below)

    // make it visible to other threads
    // we could skip this instruction but it will speed up lock release
    // thus increasing overall performance in high-contention case
    sfence
}

.