Subj : Re: CMPXCHG timing
To   : comp.programming.threads
From : Michael Pryhodko
Date : Mon Apr 04 2005 07:12 pm

>>> I don't see how the redundant store hurts this. If you didn't get the
>>> lock, you have to loop and try again anyway, right?
>
>> Right, it does not hurt here. It makes the 'unlock' operation impossible.
>
> Not impossible. It just means that after you 'unlock', you have to go
> back and make sure you don't still have the lock, just as you did with
> the 'lock' operation.

I did that (see the augmented 'unlock' code I posted immediately after the
original 'unlock' code). But it works only if you have fewer than 3
processors.

First -- it could loop forever, i.e. the unlocking thread makes its '0'
visible exactly at the moment a locking thread is doing its redundant
write; then the redundant change becomes visible again ( jmp $ :) ).

Hmmm... I can't find the failure case for 3 processors right now, but I
remember I found some problems. The main idea:
- you have 3 processors waiting in a loop (i.e. constantly failing
  cmpxchg and making the redundant write)
- a 4th processor tries to unlock by writing 0
- the 1st processor makes a successful check and writes his P1
- the 2nd processor "picks it up" (i.e. will soon write it back)
- the 3rd processor just makes his redundant write of P4 visible

I do not remember the details of this case, but I think the 'infinite
loop' feature is enough to throw this 'unlock' implementation away.

>> LOCK could be much more expensive than a fence. LOCK could influence
>> the whole system by locking the system bus (if mem_addr is not in the
>> cache, or is split across cache lines, or you have a 486 PC), while a
>> fence influences only the given processor. Thus a fence becomes more
>> and more desirable as the number of processors increases. (Assuming
>> that the cache coherency mechanism does not make 'sfence' execute
>> longer when you have more processors -- I do not know the details of
>> cache coherency mechanism implementations.)
>
> First of all, you would always put the lock in its own cache line. A
> processor would have to own the cache line in order to perform the
> unlocked compare and exchange anyway. Your argument about a 486 PC
> isn't convincing because the pipeline depth is so much shorter that the
> LOCK prefix hurts each CPU less. Plus, there aren't massively parallel
> SMP 486 systems.

1. If the m_lock variable is not cached, the processor could (or will?)
   lock the system bus.
2. I do not see any connection with pipeline depth; AFAIK 'sfence' and
   'LOCK' do not invalidate the pipeline.

What do you mean by "owning" the cache line? If you mean LOCK'ing it (in
order to avoid a bus lock) -- that is not true, because only XCHG locks
implicitly, not CMPXCHG.

>> :) This algorithm is quite interesting by itself. Maybe the way it is
>> implemented for x86 is fragile, but there are other platforms that
>> could provide the necessary guarantees and benefit from this
>> algorithm. Consider my work as research.
>
> That may be true, but isn't it obvious by now that x86 is not the
> platform to test/develop this on?

1. I have mostly theoretical knowledge of other platforms (well, once I
   read a manual for SPARC assembler, but that was ~7 years ago :) )
2. I have access to x86 only

By the way, especially for Chris Thomasson I made a small test program
that measures the performance of this lock on x86. I suggest you have a
look -- it will appear in 5 minutes in a parallel branch of this
discussion.

Bye.
Sincerely yours,
Michael.
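
P.S. To make the failure scenario above easier to follow, here is a rough
sketch of the shape of the lock/unlock pair being discussed. It is
simplified and only illustrative -- the real code (and its exact fence
placement) is in the earlier post; 'my_id' is assumed to be a non-zero
value unique to each processor.

/* CMPXCHG *without* the LOCK prefix.  Even when the comparison fails,
   the instruction still performs a write cycle that stores the old
   value back -- the "redundant write" discussed above. */
static inline unsigned long
cmpxchg_nolock(volatile unsigned long *p,
               unsigned long expected, unsigned long desired)
{
    unsigned long prev;
    __asm__ __volatile__("cmpxchg %2, %1"        /* no LOCK prefix */
                         : "=a"(prev), "+m"(*p)
                         : "r"(desired), "0"(expected)
                         : "memory", "cc");
    return prev;
}

static void lock(volatile unsigned long *m_lock, unsigned long my_id)
{
    /* Spin until we observe 0 and install my_id.  Every failed attempt
       performs the redundant write-back of the old value. */
    while (cmpxchg_nolock(m_lock, 0, my_id) != 0)
        ;
}

static void unlock(volatile unsigned long *m_lock, unsigned long my_id)
{
    /* Augmented unlock: a spinner's redundant write may make my_id
       visible again right after our store of 0, so re-check and retry.
       With 3+ spinning processors this is the loop that can fail to
       terminate. */
    do {
        *m_lock = 0;
        /* fence placement here is only illustrative */
        __asm__ __volatile__("sfence" ::: "memory");
    } while (*m_lock == my_id);
}

The point is simply that every failed, un-LOCKed CMPXCHG still performs a
store, which is why the plain store of 0 in 'unlock' can be overwritten
and has to be re-checked.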