Subj : Re: CMPXCHG timing
To   : comp.programming.threads
From : David Schwartz
Date : Mon Apr 04 2005 08:31 pm

"Michael Pryhodko" wrote in message
news:1112667067.475111.301190@f14g2000cwb.googlegroups.com...

> "For the Intel486 and *!*Pentium*!* processors, the LOCK# signal is
> always asserted on the bus during a LOCK operation, even if the area of
> memory being locked is cached in the processor.
> For the Pentium 4, Intel Xeon, and P6 family processors, if the area of
> memory being locked during a LOCK operation is *!*cached*!* in the
> processor that is performing the LOCK operation as write-back memory
> and is *!*completely contained in a cache line*!*, the processor *!*may
> not*!* assert the LOCK# signal on the bus. Instead, it will modify the
> memory location internally and allow its cache coherency mechanism
> to ensure that the operation is carried out atomically. This operation
> is called "cache locking." The cache coherency mechanism
> automatically prevents two or more processors that have cached the same
> area of memory from simultaneously modifying data in that area."

If you're going to talk about how your algorithm is efficient on massively
parallel systems, those aren't going to be based on Pentiums. They'll be
based on much newer processors.

> i.e.:
> 1. it is not only the 486 that always locks the bus :)
> 2. the operand must be cached and must not be split across cache lines
> 3. and even if all these conditions hold, the processor MAY not lock
> the bus (i.e. Intel does not give a guarantee)

Against this, your algorithm results in more cache ping-ponging because
the cache line is never locked.

>>> 2. I do not see any connection with pipeline depth; AFAIK 'sfence'
>>> and 'LOCK' do not invalidate the pipeline.

>> They do. The cost of a fence or LOCK is controlled by the pipeline
>> depth. For example, a store fence requires stores to be classified as
>> either "before" or "after" the fence.
>> This requires the fence to take effect at a specific time, not a
>> different time in each of various pipelines.

> Hmm... Wait a second. I thought sfence is placed in the pipeline just
> like any other instruction, and when it is retired it simply flushes
> the store buffers (plus maybe something to do with the cache coherency
> mechanism). In that case, if anything is in the pipeline behind the
> sfence, it will stay there; nobody will remove it. Or maybe I am wrong
> and it is processed in a completely different way, for example:
> whenever an sfence is fetched from memory, the pipeline is flushed, the
> store buffers are flushed, the sfence is immediately retired, and
> execution continues as usual.

Think about what you're saying. What happens if an instruction after the
sfence does a store and that store has already been put in the buffer?
Flushing the buffer will flush the wrong stores. (Modern x86 CPUs get
much of their speed from out-of-order execution.)

>> Whether or not you lock the compare/exchange, the processor must
>> acquire the cache line before it can do anything. And whether or not
>> you lock it, the bus will not be locked, only the cache line might be.
>> Assuming the locked variable is in its own cache line (which is the
>> only sensible way to do it), the cost of the LOCK prefix is due to
>> pipeline issues, same as for the fence.

> I agree that from the point of view of ONE GIVEN processor the cost of
> LOCK could be similar to the cost of a fence, but for the whole system
> I think the fence is cheaper. Run the test app I posted in response to
> Chris. I was surprised by the results :).

I don't consider your results meaningful because of issues with your
testing methodology. For one thing, testing in a tight loop is
unrealistic. Also, I don't know if you put the lock variable in its own
cache line.

DS