Subj : Re: lockless low-overhead 'pipes' (w/semaphores)
To   : comp.programming.threads
From : lindahlb
Date : Wed Apr 20 2005 02:48 pm

> Cache is transparent. You can't even tell it's there except for
> performance effects. And you cannot write a multi-threaded
> program whose correctness depends on the presence or not of cache
> (except for Alpha which is no longer signficant). It's as
> relevant to the issue of correctness as the color of your computer
> case. And if you or anybody starts talking about the color of
> your computer case, the rest of us aren't going to take you or
> them too seriously.

Right, that's what I had understood, until David Schwartz used stale data in the cache as an example that would break my code - which I interpreted incorrectly. Moving on...

> If you are going to program at that level you should become more
> familiar with what memory ordering mechanism the architecture
> provides.

Most definitely, but to make it generic one needs to know the high-level requirements and encapsulate the low-level details in a generic API. I agree I need to read up more on the exact syntax of the barriers on different architectures, especially considering the lack of a standard - though hopefully we'll see something similar to Alexander's msync::* sometime soon; I very much like that abstraction (although it's not an end-all for portable performance, since one algorithm may perform better than another under different memory models).

> It's hard to tell what you're doing wrong since you're using a non
> standard api. (snipped...)

Sorry for the confusion; I am indeed aware of the memory ordering requirements. For the sake of discussion, let's narrow the scope and describe the API.

1) Assume the x86 model (up to i686), as it's my *main* target platform.
2) Assume no compiler reordering (can be avoided via assembly as needed).
3) Assume the semaphore objects are provided by the target OS and have the proper acquire and release ordering.
4) Assume the 'atomic_get_add' and 'atomic_get_sub' APIs are wrappers for xadd instructions with the lock prefix.
5) I doubt there was confusion about 'do_something_with_data'; all it does is read data from 'buffer+i' to 'buffer+i+n', handling wrap-around as necessary.

First, note the behavior of the semaphore - it acts like a gate/barrier which is explicitly opened and closes after a context passes through. This abstraction makes the implementation easier to understand. Now to deal with SMP issues...

Isolation (or word-tearing). Apply proper alignment and padding for the cache line (64 bytes, I believe?). Non-portable but very simple.

Memory ordering. What we're concerned with is the order of the producer writing to the buffer and then writing to 'len' - thus a write-before-write ordering ("lock" prefix, or "mfence" before the xadd instruction). The consumer is concerned with the order of reading 'len' and then reading from the buffer - thus a read-before-read ordering ("lfence" - missing in my original implementation). The consumer also subtracts 'n' from 'len' to notify the producer, which must be ordered after the consumer finishes with the data in the buffer - thus a read-before-write ordering ("lock" prefix, or "mfence" before the xadd instruction). That is the only ordering we need(?), so my original implementation is correct with respect to ordering, except for the missing "lfence" instruction, or read-before-read ordering, in the consumer. "lfence" is an SSE2 extension to x86, so for older x86's a "lock" on a dummy instruction would be a suitable substitute.

Next subject is atomicity. The atomicity required is that the reads and writes of data in the buffer are atomic with respect to the other process/thread; this is true because we have proper memory barriers after both operations. Atomicity is also required for the 'len' variable, since both processes (or threads) will be writing to it.
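For concreteness, here is what the 'atomic_get_add' / 'atomic_get_sub' wrappers from assumption 4 might look like - a sketch in GCC inline-assembly style for x86/x86-64, not necessarily the exact wrappers in the real code:

```c
/* Sketch of the atomic wrappers described in assumption 4.
 * x86/x86-64 only, GCC-style inline assembly. */
#include <stdint.h>

/* Atomically add 'n' to '*p' and return the value of '*p' from
 * BEFORE the add. The "lock" prefix makes the xadd atomic and, on
 * x86, also acts as a full memory barrier, which is why it can
 * stand in for "mfence" in the ordering discussion. */
static inline int32_t atomic_get_add(volatile int32_t *p, int32_t n)
{
    int32_t old = n;
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r"(old), "+m"(*p)
                         :
                         : "memory"); /* also a compiler barrier */
    return old;
}

static inline int32_t atomic_get_sub(volatile int32_t *p, int32_t n)
{
    return atomic_get_add(p, -n);
}

/* The read-before-read barrier for the consumer; requires SSE2,
 * so not available on older x86's. */
static inline void read_barrier(void)
{
    __asm__ __volatile__("lfence" ::: "memory");
}
```

The "memory" clobber keeps the compiler from reordering surrounding loads and stores across the wrappers, covering assumption 2 at these points.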
Atomicity is guaranteed for the xadd instruction by adding the "lock" prefix - problem solved.

Next is forward progress. We need to make sure we don't erroneously block on the semaphores. This can happen in two places, but only in one way. The producer is about to block on the 'consume' semaphore because 'len == maxlen'. The consumer will ALWAYS see 'len' as 'maxlen' in this situation, because we atomically retrieve the original value of 'len' and update 'len' to the new value; thus we never see anything other than 'maxlen' as the original value. The same holds true in reverse, when the consumer is about to block on 'len == 0'.

On a side note, the worst that can happen is that we release the semaphore when we don't need to - when 'len == 0' and 'len' is updated by the producer before the consumer has a chance to enter the while loop. If the producer were then to enter the while loop the next time 'len == maxlen', it would busy-wait once, but it would not pass through, because of the while loop - not a big deal. The same holds true in reverse for the consumer waiting with 'len == 0'.

So... did I miss anything, and do I have a correct implementation now under the new assumptions? I really appreciate the help here. I realize that I am new to this, but I feel I have a pretty decent understanding for a novice - thanks to plenty of links and discussions between guys like you, Alexander, and Chris (formerly SenderX). I'm sure there are many more lurkers here who are following the discussions about lockless algorithms (sorry if I missed any important contributors).
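To tie the pieces together, here is a minimal runnable sketch of the single-producer/single-consumer pipe described above, with C11 atomics standing in for the raw instructions: a seq_cst atomic_fetch_add/atomic_fetch_sub plays the role of the "lock"-prefixed xadd, and the seq_cst load of 'len' covers the read-before-read ("lfence") case. All of the names here (pipe_t, pipe_put, pipe_get) are illustrative, not taken from the original code:

```c
/* Sketch of the gate/semaphore pipe discussed in the thread.
 * Single producer, single consumer; byte-at-a-time for brevity. */
#include <stdatomic.h>
#include <semaphore.h>

#define MAXLEN 16

typedef struct {
    unsigned char buf[MAXLEN];
    atomic_int len;        /* bytes currently in the buffer        */
    unsigned head, tail;   /* producer / consumer cursors          */
    sem_t space, data;     /* "gates" opened by the other side     */
} pipe_t;

static void pipe_init(pipe_t *p)
{
    atomic_init(&p->len, 0);
    p->head = p->tail = 0;
    sem_init(&p->space, 0, 0);
    sem_init(&p->data, 0, 0);
}

static void pipe_put(pipe_t *p, unsigned char c)
{
    while (atomic_load(&p->len) == MAXLEN)  /* full: wait at the gate */
        sem_wait(&p->space);
    p->buf[p->head] = c;
    p->head = (p->head + 1) % MAXLEN;
    /* seq_cst fetch_add orders the buffer write before the 'len'
     * update - the write-before-write case ("lock" xadd). */
    if (atomic_fetch_add(&p->len, 1) == 0)
        sem_post(&p->data);                 /* consumer may be blocked */
}

static unsigned char pipe_get(pipe_t *p)
{
    while (atomic_load(&p->len) == 0)       /* empty: wait at the gate */
        sem_wait(&p->data);
    /* the seq_cst load of 'len' above orders this buffer read after
     * it - the read-before-read case ("lfence"). */
    unsigned char c = p->buf[p->tail];
    p->tail = (p->tail + 1) % MAXLEN;
    /* fetch_sub orders the buffer read before the 'len' update -
     * the read-before-write case. */
    if (atomic_fetch_sub(&p->len, 1) == MAXLEN)
        sem_post(&p->space);                /* producer may be blocked */
    return c;
}
```

Note how a spurious sem_post only costs one extra trip around the while loop - the "busy wait once" case from the forward-progress argument - while the atomic exchange of the old 'len' value guarantees the wakeup post is never lost.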