Subj : Re: lockless low-overhead 'pipes' (w/semaphores)
To   : comp.programming.threads
From : lindahlb
Date : Wed Apr 20 2005 02:48 pm

> Cache is transparent. You can't even tell it's there except for
> performance effects. And you cannot write a multi-threaded
> program whose correctness depends on the presence or not of cache
> (except for Alpha which is no longer signficant). It's as
> relevant to the issue of correctness as the color of your computer
> case. And if you or anybody starts talking about the color of
> your computer case, the rest of us aren't going to take you or
> them too seriously.

Right, that's what I had understood, until David Schwartz used stale data in the cache as an example that would break my code - which I interpreted incorrectly. Moving on...

> If you are going to program at that level you should become more
> familiar with what memory ordering mechanism the architecture
> provides.

Most definitely, but to make it generic one needs to know the high-level requirements and encapsulate the low-level details in a generic API. I agree I need to read up more on the exact syntax of the barriers on different architectures, especially considering the lack of a standard - though hopefully we'll see something similar to Alexander's msync::* sometime soon; I very much like that abstraction (although it's not an end-all for portable performance, since one algorithm may perform better than another under different memory models).

> It's hard to tell what you're doing wrong since you're using a non
> standard api. (snipped...)

Sorry for the confusion; I am indeed aware of the memory ordering requirements. For the sake of discussion, let's narrow the scope and describe the API.

1) Assume the x86 model (up to i686), as it's my *main* target platform.
2) Assume no compiler reordering (can be avoided via assembly as needed).
3) Assume the semaphore objects are provided by the target OS and have the proper acquire and release ordering.
4) Assume the 'atomic_get_add' and 'atomic_get_sub' APIs are wrappers for xadd instructions with the lock prefix.
5) I doubt there was confusion about 'do_something_with_data'; all it does is read data from 'buffer+i' to 'buffer+i+n', handling wrap-around as necessary.

First, note the behavior of the semaphore - it acts like a gate/barrier which is explicitly opened and closes after a context passes through. This abstraction makes the implementation easier to understand. Now to deal with SMP issues...

Isolation (or word-tearing). Apply proper alignment and padding for the cache line (64 bytes, I believe?). Non-portable but very simple.

Memory ordering. What we're concerned with is the order of the producer writing to the buffer and then writing to 'len' - thus a write-before-write ordering ("lock" prefix, or "mfence" before the xadd instruction). The consumer is concerned with the order of reading 'len' and then reading from the buffer - thus a read-before-read ordering ("lfence" - missing in my original implementation). The consumer also subtracts 'n' from 'len' to notify the producer, which must be ordered after the consumer finishes with the data in the buffer - thus a read-before-write ordering ("lock" prefix, or "mfence" before the xadd instruction). That is the only ordering we need(?), so my original implementation is correct with respect to ordering, except for the missing "lfence" instruction, or read-before-read ordering, in the consumer. "lfence" is an SSE2 extension to x86, so for older x86's a "lock" on a dummy instruction would be a suitable substitute.

Next subject is atomicity. The atomicity required is that the reads and writes of data in the buffer are atomic with respect to the other process/thread; this is true because we have proper memory barriers after both operations. Atomicity is also required for the 'len' variable, since both processes (or threads) will be writing to it.
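For concreteness, here is what the 'atomic_get_add' / 'atomic_get_sub' wrappers from assumption 4 might look like - a sketch in GCC inline-assembly style for x86/x86-64, not necessarily the exact wrappers in the real code:

```c
/* Sketch of the atomic wrappers described in assumption 4.
 * x86/x86-64 only, GCC-style inline assembly. */
#include <stdint.h>

/* Atomically add 'n' to '*p' and return the value of '*p' from
 * BEFORE the add. The "lock" prefix makes the xadd atomic and, on
 * x86, also acts as a full memory barrier, which is why it can
 * stand in for "mfence" in the ordering discussion. */
static inline int32_t atomic_get_add(volatile int32_t *p, int32_t n)
{
    int32_t old = n;
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r"(old), "+m"(*p)
                         :
                         : "memory"); /* also a compiler barrier */
    return old;
}

static inline int32_t atomic_get_sub(volatile int32_t *p, int32_t n)
{
    return atomic_get_add(p, -n);
}

/* The read-before-read barrier for the consumer; requires SSE2,
 * so not available on older x86's. */
static inline void read_barrier(void)
{
    __asm__ __volatile__("lfence" ::: "memory");
}
```

The "memory" clobber keeps the compiler from reordering surrounding loads and stores across the wrappers, covering assumption 2 at these points.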
Atomicity is guaranteed for the xadd instruction by adding the "lock" prefix - problem solved.

Next is forward progress. We need to make sure we don't erroneously block on the semaphores. This can happen in two places, but only in one way. The producer is about to block on the 'consume' semaphore because 'len == maxlen'. The consumer will ALWAYS see 'len' as 'maxlen' in this situation, because we atomically retrieve the original value of 'len' and update 'len' to the new value; thus we never see anything other than 'maxlen' as the original value. The same holds true in reverse, when the consumer is about to block on 'len == 0'.

On a side note, the worst that can happen is that we release the semaphore when we don't need to - when 'len == 0' and 'len' is updated by the producer before the consumer has a chance to enter the while loop. If the producer were then to enter the while loop the next time 'len == maxlen', it would busy-wait once, but it would not pass through, because of the while loop - not a big deal. The same holds true in reverse for the consumer waiting with 'len == 0'.

So... did I miss anything, and do I have a correct implementation now under the new assumptions? I really appreciate the help here. I realize that I am new to this, but I feel I have a pretty decent understanding for a novice - thanks to plenty of links and discussions between guys like you, Alexander, and Chris (formerly SenderX). I'm sure there are many more lurkers here who are following the discussions about lockless algorithms (sorry if I missed any important contributors).
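To tie the pieces together, here is a minimal runnable sketch of the single-producer/single-consumer pipe described above, with C11 atomics standing in for the raw instructions: a seq_cst atomic_fetch_add/atomic_fetch_sub plays the role of the "lock"-prefixed xadd, and the seq_cst load of 'len' covers the read-before-read ("lfence") case. All of the names here (pipe_t, pipe_put, pipe_get) are illustrative, not taken from the original code:

```c
/* Sketch of the gate/semaphore pipe discussed in the thread.
 * Single producer, single consumer; byte-at-a-time for brevity. */
#include <stdatomic.h>
#include <semaphore.h>

#define MAXLEN 16

typedef struct {
    unsigned char buf[MAXLEN];
    atomic_int len;        /* bytes currently in the buffer        */
    unsigned head, tail;   /* producer / consumer cursors          */
    sem_t space, data;     /* "gates" opened by the other side     */
} pipe_t;

static void pipe_init(pipe_t *p)
{
    atomic_init(&p->len, 0);
    p->head = p->tail = 0;
    sem_init(&p->space, 0, 0);
    sem_init(&p->data, 0, 0);
}

static void pipe_put(pipe_t *p, unsigned char c)
{
    while (atomic_load(&p->len) == MAXLEN)  /* full: wait at the gate */
        sem_wait(&p->space);
    p->buf[p->head] = c;
    p->head = (p->head + 1) % MAXLEN;
    /* seq_cst fetch_add orders the buffer write before the 'len'
     * update - the write-before-write case ("lock" xadd). */
    if (atomic_fetch_add(&p->len, 1) == 0)
        sem_post(&p->data);                 /* consumer may be blocked */
}

static unsigned char pipe_get(pipe_t *p)
{
    while (atomic_load(&p->len) == 0)       /* empty: wait at the gate */
        sem_wait(&p->data);
    /* the seq_cst load of 'len' above orders this buffer read after
     * it - the read-before-read case ("lfence"). */
    unsigned char c = p->buf[p->tail];
    p->tail = (p->tail + 1) % MAXLEN;
    /* fetch_sub orders the buffer read before the 'len' update -
     * the read-before-write case. */
    if (atomic_fetch_sub(&p->len, 1) == MAXLEN)
        sem_post(&p->space);                /* producer may be blocked */
    return c;
}
```

Note how a spurious sem_post only costs one extra trip around the while loop - the "busy wait once" case from the forward-progress argument - while the atomic exchange of the old 'len' value guarantees the wakeup post is never lost.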