Subj : Re: CMPXCHG timing
To   : comp.programming.threads
From : Michael Pryhodko
Date : Fri Apr 01 2005 10:06 pm

> > 2. But since we could have a number of other processors currently in
> > the process of executing their 'cmpxchg+sfence' instructions, we need
> > to wait for them to finish. After we have finished waiting (whether it
> > is Sleep(100) or a store to a dummy variable) we can simply check the
> > value of the lock to find which processor "wins". Because after that
> > point every processor is either trying to enter the lock or has
> > already finished the 'store' part of cmpxchg and made its changes
> > visible. The winner is the last processor to write to the lock
> > variable. The others will start the process from the beginning.
>
> I don't see how the redundant store hurts this. If you didn't get the
> lock, you have to loop and try again anyway, right?

Right, it does not hurt here. It makes the 'unlock' operation impossible.

> > Run the test app and you'll see -- the redundant store makes it
> > impossible to implement the lock. Read my corresponding posting
> > about that.
>
> I have read it but I don't understand it. If you don't get the lock
> (whether due to the redundant store or due to another processor
> grabbing it), you just loop.

Read more carefully :)). It is impossible to UNLOCK. This algorithm is
based on the assumption that the value Pi can be written to 'm_lock'
only by processor 'i'. But the redundant store in cmpxchg violates that.
And I cannot replace 'cmpxchg' with 'if (lock == 0) lock = Pi', because
the OS could interrupt between 'if (lock == 0)' and 'lock = Pi',
spoiling the fun.

> > Because LOCK is not so cheap on x86.
>
> The fence is about as expensive as the lock.

LOCK can be much more expensive than a fence. LOCK can affect the whole
system by locking the system bus (if mem_addr is not in the cache, or is
split across cache lines, or you have a 486 PC), while a fence affects
only the given processor. Thus a fence becomes more and more desirable
as the number of processors grows.
(Assuming the cache coherency mechanism does not make 'sfence' take
longer when there are more processors -- I do not know the details of
cache coherency mechanism implementations.)

> > Because this is a very interesting task.
>
> Yes, just not useful because it relies upon so many fragile
> assumptions. :)

This algorithm is quite interesting by itself. Maybe the way it is
implemented for x86 is fragile, but there are other platforms that could
provide the necessary guarantees and benefit from this algorithm.
Consider my work as research.

> > :(. Basically only two problems left:
> > 1. implement the waiting process (or prove that my variant is ok).
>
> That is a key problem. You must wait the pipeline depth, which should
> cost as much as a LOCK.

No, I do not need to wait for the whole pipeline to retire. I do not
think this is the key problem. Something tells me that it is this:

> > 2. get around the redundant store in cmpxchg on x86.
>
> I don't see how that hurts you. After all the threads get through,
> either some thread owns the lock or no thread does; shouldn't they
> all agree on which?

The impossibility of implementing the 'unlock' operation.

Bye.
Sincerely yours, Michael.