Subj : Re: CMPXCHG timing
To   : comp.programming.threads
From : Michael Pryhodko
Date : Mon Apr 04 2005 09:04 pm

> If you're going to talk about how your algorithm is efficient on
> massively parallel systems, those aren't going to be based on Pentiums.
> They'll be based on much newer processors.

Wait a second. I found an interesting algorithm, formalized it, and tried
to implement it on x86. I found that:

- I need a guarantee about the 'wait' section of this algorithm
- it is impossible to implement on x86 due to cmpxchg behavior

So I complained :) to the newsgroup and asked for help/advice on these
topics. Since most of the conversation is about the x86 implementation of
the algorithm, it does not matter that it is useless on future huge MP
systems. Maybe an architect is already reading these postings and
wondering whether his future system could provide the guarantees
necessary for this algorithm to work?

> > i.e.:
> > 1. not only 486 PC always locks bus :)
> > 2. operand should be cached and should not be split across cache lines
> > 3. and even if all these conditions are true, processor MAY not lock
> > bus (i.e. Intel does not give guarantee)
>
> Against this, your algorithm results in more cache ping-ponging because
> the cache line is never locked.

This poor cache line is ping-ponged because cmpxchg is designed the way
it is designed. If you removed the redundant store from cmpxchg, the line
would be "ping-ponged" only once on 'lock' and once on 'unlock'.
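That is roughly the effect the usual test-and-test-and-set trick fakes in
software. A rough sketch (assuming GCC on x86; the names are mine, this
is not code from my test app):

    /* test-and-test-and-set spinlock -- a sketch, not a reference
       implementation */
    typedef volatile int lock_t;

    /* LOCK CMPXCHG: try to change *l from 0 to 1, return the old value */
    static int try_lock(lock_t *l)
    {
        int old = 0;
        __asm__ __volatile__("lock; cmpxchgl %2, %1"
                             : "+a"(old), "+m"(*l)
                             : "r"(1)
                             : "memory", "cc");
        return old;                     /* 0 => lock acquired */
    }

    static void lock(lock_t *l)
    {
        for (;;) {
            if (try_lock(l) == 0)
                return;                 /* got it */
            /* spin on a plain read: no store, so the owner's copy of
               the line is not invalidated while we wait */
            while (*l != 0)
                __asm__ __volatile__("rep; nop");   /* 'pause' */
        }
    }

    static void unlock(lock_t *l)
    {
        *l = 0;    /* plain store -- stores have release semantics on x86 */
    }

While the waiters sit in the read loop the line stays Shared in their
caches; only the one successful LOCK CMPXCHG and the releasing store
force it to migrate.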
> > Hmm... Wait a second, I thought that sfence is placed on the pipeline
> > just like any other instruction, and when it is retired -- it simply
> > flushes store buffers (plus maybe something to do with the cache
> > coherency mechanism). In that case if anything lies on the pipeline
> > behind sfence -- it will stay there, nobody will remove it. Or maybe I
> > am wrong and it is processed in a completely different way, for
> > example: whenever sfence is fetched from memory, the pipeline is
> > flushed, store buffers are flushed, sfence is immediately retired and
> > execution continues as usual.
>
> Think about what you're saying. What happens if an instruction after the
> sfence does a store and that store has already been put in the buffer?
> Flushing the buffer will flush the wrong stores. (Modern x86 CPUs get much
> of their speed from out-of-order execution.)

This is impossible. Stores on x86 have 'release' semantics; sfences and
stores are ordered. From the IA-32 Manual, Vol. 1, 2.2.2.3:

"The retirement unit receives the results of the executed µops from the
out-of-order execution core and processes the results so that the
architectural state updates according to the *!*original program
order*!*. When a µop completes and writes its result, it is retired. Up
to three µops may be retired per cycle. The Reorder Buffer (ROB) is the
unit in the processor which buffers completed µops, updates the
architectural state *!*in order*!*, and manages the ordering of
exceptions."

So basically I do not see anything that could prevent me from
implementing an 'sfence' which does not flush the pipeline (then again,
maybe I am wrong -- I am not an Intel hardware architect). Anyway,
arguing about this point is meaningless; I think my test app already
proved that sfence is cheaper than LOCK. Why? I do not care, that is a
problem for the hardware designers.

> > I agree that from the point of view of ONE GIVEN processor the cost
> > of LOCK could be similar to the cost of a fence, but for the whole
> > system -- I think a fence is cheaper. Run the test app I posted in
> > response to Chris. I was surprised by the results :).
>
> I don't consider your results meaningful because of issues with your
> testing methodology. For one thing, testing in a tight loop is
> unrealistic. Also, I don't know if you put the lock variable in its own
> cache line.

Check the test app: it is small. Modify it, make it realistic... it is up
to you. What specific part of this test do you not like?

Bye.
Sincerely yours, Michael.
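P.S. About "its own cache line": something like this would isolate the
lock variable (a sketch, assuming GCC; 64 bytes matches the P4 line size,
and this is not the exact declaration from my test app):

    /* keep the lock alone in a 64-byte cache line so the measurement
       sees lock traffic, not false sharing with its neighbours */
    struct padded_lock {
        volatile int value;
        char         pad[64 - sizeof(int)];
    } __attribute__((aligned(64)));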