Subj: Re: What is the real cost of LOCK on x86 multiprocessor machine?
To:   comp.programming.threads
From: Joe Seigh
Date: Sun Jul 31 2005 10:56 am

chris noonan wrote:
> Joe Seigh wrote:
>
> Now that memory chips have millions of transistors, a few could be
> spared for a primitive ALU. Then add some extra transaction types to
> the memory bus. One such transaction would implement waiting on a
> semaphore. The memory controller performs a read cycle to get the
> value of the specified memory word, decrements it in its ALU (unless
> already zero), performs a write cycle to put the new value back in
> memory, then returns the old value of the word across the bus to the
> requesting processor. This sequence would be atomic with respect to
> other processors, trivially.

This is called fetch-and-op, op being inc(rement), dec(rement), etc.
It's an atomic read-modify-write instruction. Itanium has it with the
fetchadd instruction, though it only works that way with special
uncached memory, so you can't really use it with normal shared memory.

> From the programmer's or compiler writer's perspective, a critical
> region would be achieved by waiting on a binary semaphore (using the
> machine instruction described in the previous paragraph), accessing
> the protected data freely, then signalling the semaphore via the
> memory controller with another special bus transaction. The processor
> data cache (or parts of it) would have to be invalidated after waiting
> on the semaphore and flushed back to RAM before signalling it.
>
> As an elaboration, extra logic at the memory controller could
> interrupt a processor when it needed to "sleep" (i.e. reschedule its
> threads when waiting on a semaphore already zero) or "wake up" (upon
> signalling of a waited-for semaphore), queueing the processors to
> ensure fairness.
> It is likely that such a scheme would require considerably less
> silicon than used by the Pentium processor, with its MESI, snooping,
> bus-locking etc. logic. Does anyone know if it has been attempted?

Hardware scheduling of threads? Not unless you count hyperthreading
with MONITOR/MWAIT for waiting on a lock.

I'm thinking of more cache-friendly algorithms like RCU, which doesn't
care if stale data is being accessed (subject to load-dependent
ordering). So if you relaxed the MESI protocol somewhat, you could
reduce the cache traffic. This could be a *major* performance factor in
something like ccNUMA, where accessing global memory is expensive and
not updating stale (invalid) cache lines when you don't have to would
help a lot.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.