Subj: Re: CMPXCHG timing
To:   comp.programming.threads
From: David Schwartz
Date: Mon Apr 04 2005 08:24 pm

"Michael Pryhodko" wrote in message
news:1112665429.416689.146180@o13g2000cwo.googlegroups.com...

> I have already expressed my thoughts about comparing this algorithm
> with a simple LOCK-based lock:
> 1. LOCK could lock the system bus -- this can have a significant
> effect on the whole system: the more devices (e.g. processors) are
> using the bus, the worse the effect. sfence does not have this
> disadvantage (AFAIK).

This is not true on any modern processor I know of.

> 2. AFAIK sfence performance depends on the size of the store buffer,
> which means that in:
>
> sfence
> mov mem_addr, smth
> sfence
>
> the second sfence will be much faster than the first.

No. The cost of 'sfence' comes from sequencing. For an 'sfence' to do
its job, some stores have to complete before the fence and some have to
complete after it. That imposes a heavy cost on CPUs that get much of
their performance from out-of-order execution.

> Now about the vast improvement in speed:
> I created a small test program (see below); it counts the number of
> lock/unlock pairs completed in 3 million processor clocks.
>
> NOTE: I replaced cmpxchg with 'if (lock == 0) lock = Pi', since I
> think it is impossible to implement this algorithm on x86 due to the
> redundant store in cmpxchg. The probability that the OS will
> interrupt right between the 'compare' and the 'store' is very low
> (but it happened once or twice).
>
> RESULTS:
> I did 5 runs of each case on my single-processor 2.8 GHz HT machine
> (i.e. two logical processors):
>
> LOCK-based -- 1493 on average
> my lock -- 2039 on average
> my lock with cmpxchg, with 'unlock' augmented to work on a
> 2-processor PC -- 1045 on average

Benchmarking these kinds of things in tight loops gives unrealistic
results.

DS