Subj : Re: CMPXCHG timing
To   : comp.programming.threads
From : Michael Pryhodko
Date : Fri Apr 01 2005 10:06 pm

> > 2. But since we could have a number of other processors currently in
> > the process of executing their 'cmpxchg+sfence' instructions, we need
> > to wait for them to finish. After we have finished waiting (whether it
> > is Sleep(100) or a store to a dummy variable) we can simply check the
> > value of the lock to find which processor "wins". Because after that
> > point every processor is either trying to enter the lock or has
> > already finished the 'store' part of cmpxchg and made its changes
> > visible. The winner is the last processor to write to the lock
> > variable. The others will start the process from the beginning.
>
> I don't see how the redundant store hurts this. If you didn't get the
> lock, you have to loop and try again anyway, right?

Right, it does not hurt here. It makes the 'unlock' operation impossible.

> > Run the test app and you'll see -- the redundant store makes it
> > impossible to implement the lock. Read my corresponding posting
> > about that.
>
> I have read it but I don't understand it. If you don't get the lock
> (whether due to the redundant store or due to another processor
> grabbing it), you just loop.

Read more carefully :)). It is impossible to UNLOCK. This algorithm is
based on the assumption that the value Pi can be written to 'm_lock'
only by processor 'i'. But the redundant store in cmpxchg violates that.
And I cannot replace 'cmpxchg' with 'if (lock == 0) lock = Pi', because
the OS could interrupt between 'if (lock == 0)' and 'lock = Pi',
spoiling the fun.

> > Because LOCK is not so cheap on x86.
>
> The fence is about as expensive as the lock.

LOCK can be much more expensive than a fence. LOCK can affect the whole
system by locking the system bus (if mem_addr is not in the cache, or is
split across cache lines, or you have a 486 PC), while a fence affects
only the given processor. Thus a fence becomes more and more desirable
as the number of processors grows.
(Assuming the cache coherency mechanism does not make 'sfence' take
longer when there are more processors -- I do not know the details of
cache coherency mechanism implementations.)

> > Because this is a very interesting task.
>
> Yes, just not useful because it relies upon so many fragile
> assumptions. :)

This algorithm is quite interesting by itself. Maybe the way it is
implemented for x86 is fragile, but there are other platforms that could
provide the necessary guarantees and benefit from this algorithm.
Consider my work as research.

> > :(. Basically only two problems left:
> > 1. implement the waiting process (or prove that my variant is ok).
>
> That is a key problem. You must wait the pipeline depth, which should
> cost as much as a LOCK.

No, I do not need to wait for the whole pipeline to retire. I do not
think this is the key problem. Something tells me that it is this:

> > 2. get around the redundant store in cmpxchg on x86.
>
> I don't see how that hurts you. After all the threads get through,
> either some thread owns the lock or no thread does; shouldn't they
> all agree on which?

The impossibility of implementing the 'unlock' operation.

Bye.
Sincerely yours, Michael.