Subj : Re: CMPXCHG timing
To   : comp.programming.threads
From : Michael Pryhodko
Date : Mon Apr 04 2005 09:04 pm

> If you're going to talk about how your algorithm is efficient on
> massively parallel systems, those aren't going to be based on Pentiums.
> They'll be based on much newer processors.

Wait a second. I found an interesting algorithm, formalized it, and tried
to implement it on x86. I found that:

- I need a guarantee about the 'wait' section of this algorithm
- it is impossible to implement on x86 due to cmpxchg behavior

So I complained :) to the newsgroup and asked for help/advice on these
topics. Since most of the conversation is about the x86 implementation of
the algorithm, it does not matter that it is useless on future huge MP
systems. Maybe an architect is already reading these postings and
wondering whether his future system could provide the guarantees
necessary for this algorithm to work?

> > i.e.:
> > 1. not only 486 PC always locks bus :)
> > 2. operand should be cached and should not be split across cache lines
> > 3. and even if all these conditions are true, processor MAY not lock
> > bus (i.e. Intel does not give guarantee)
>
> Against this, your algorithm results in more cache ping-ponging because
> the cache line is never locked.

This poor cache line is ping-ponged because cmpxchg is designed the way
it is designed. If you removed the redundant store from cmpxchg, the line
would be "ping-ponged" only once on 'lock' and once on 'unlock'.
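That is roughly the effect the usual test-and-test-and-set trick fakes in
software. A rough sketch (assuming GCC on x86; the names are mine, this
is not code from my test app):

    /* test-and-test-and-set spinlock -- a sketch, not a reference
       implementation */
    typedef volatile int lock_t;

    /* LOCK CMPXCHG: try to change *l from 0 to 1, return the old value */
    static int try_lock(lock_t *l)
    {
        int old = 0;
        __asm__ __volatile__("lock; cmpxchgl %2, %1"
                             : "+a"(old), "+m"(*l)
                             : "r"(1)
                             : "memory", "cc");
        return old;                     /* 0 => lock acquired */
    }

    static void lock(lock_t *l)
    {
        for (;;) {
            if (try_lock(l) == 0)
                return;                 /* got it */
            /* spin on a plain read: no store, so the owner's copy of
               the line is not invalidated while we wait */
            while (*l != 0)
                __asm__ __volatile__("rep; nop");   /* 'pause' */
        }
    }

    static void unlock(lock_t *l)
    {
        *l = 0;    /* plain store -- stores have release semantics on x86 */
    }

While the waiters sit in the read loop the line stays Shared in their
caches; only the one successful LOCK CMPXCHG and the releasing store
force it to migrate.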
> > Hmm... Wait a second, I thought that sfence is placed on the pipeline
> > just like any other instruction, and when it is retired -- it simply
> > flushes store buffers (plus maybe something to do with the cache
> > coherency mechanism). In that case if anything lies on the pipeline
> > behind sfence -- it will stay there, nobody will remove it. Or maybe I
> > am wrong and it is processed in a completely different way, for
> > example: whenever sfence is fetched from memory, the pipeline is
> > flushed, store buffers are flushed, sfence is immediately retired and
> > execution continues as usual.
>
> Think about what you're saying. What happens if an instruction after the
> sfence does a store and that store has already been put in the buffer?
> Flushing the buffer will flush the wrong stores. (Modern x86 CPUs get much
> of their speed from out-of-order execution.)

This is impossible. Stores on x86 have 'release' semantics; sfences and
stores are ordered. From the IA-32 Manual, Vol. 1, 2.2.2.3:

"The retirement unit receives the results of the executed µops from the
out-of-order execution core and processes the results so that the
architectural state updates according to the *!*original program
order*!*. When a µop completes and writes its result, it is retired. Up
to three µops may be retired per cycle. The Reorder Buffer (ROB) is the
unit in the processor which buffers completed µops, updates the
architectural state *!*in order*!*, and manages the ordering of
exceptions."

So basically I do not see anything that could prevent me from
implementing an 'sfence' which does not flush the pipeline (then again,
maybe I am wrong -- I am not an Intel hardware architect). Anyway,
arguing about this point is meaningless; I think my test app already
proved that sfence is cheaper than LOCK. Why? I do not care, that is a
problem for the hardware designers.

> > I agree that from the point of view of ONE GIVEN processor the cost
> > of LOCK could be similar to the cost of a fence, but for the whole
> > system -- I think a fence is cheaper. Run the test app I posted in
> > response to Chris. I was surprised by the results :).
>
> I don't consider your results meaningful because of issues with your
> testing methodology. For one thing, testing in a tight loop is
> unrealistic. Also, I don't know if you put the lock variable in its own
> cache line.

Check the test app: it is small. Modify it, make it realistic... it is up
to you. What specific part of this test do you not like?

Bye.
Sincerely yours, Michael.
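P.S. About "its own cache line": something like this would isolate the
lock variable (a sketch, assuming GCC; 64 bytes matches the P4 line size,
and this is not the exact declaration from my test app):

    /* keep the lock alone in a 64-byte cache line so the measurement
       sees lock traffic, not false sharing with its neighbours */
    struct padded_lock {
        volatile int value;
        char         pad[64 - sizeof(int)];
    } __attribute__((aligned(64)));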