Subj : Re: What is the real costs of LOCK on x86 multiprocesor machine?
To   : comp.programming.threads
From : chris noonan
Date : Sun Jul 31 2005 06:31 am

Joe Seigh wrote:

> You'll have to figure out the contention factor which would be dependent on
> the number of cpus. If it's a problem there are more cache friendly algorithms
> out there though it would be much better if the cache coherency protocols weren't
> stuck back in the 70's and 80's as far as threaded programming was concerned.

In light of the disappointing performance of multiprocessor PCs, perhaps the MESI cache coherency approach is not the best one.

The physical point of interaction between threads on different processors communicating via shared memory should be as far away from the processors as possible, at the memory (RAM) itself. Now that memory chips have millions of transistors, a few could be spared for a primitive ALU. Then add some extra transaction types to the memory bus. One such transaction would implement waiting on a semaphore. The memory controller performs a read cycle to get the value of the specified memory word, decrements it in its ALU (unless already zero), performs a write cycle to put the new value back in memory, then returns the old value of the word across the bus to the requesting processor. This sequence would be atomic with respect to other processors, trivially.

From the programmer's or compiler writer's perspective, a critical region would be achieved by waiting on a binary semaphore (using the machine instruction described in the previous paragraph), accessing the protected data freely, then signalling the semaphore via the memory controller with another special bus transaction. The processor data cache (or parts of it) would have to be invalidated after waiting on the semaphore and flushed back to RAM before signalling it.

As an elaboration, extra logic at the memory controller could interrupt a processor when it needed to "sleep" (i.e. reschedule its threads when waiting on a semaphore already zero) or "wake up" (upon signalling of a waited-for semaphore), queueing the processors to ensure fairness.

It is likely that such a scheme would require considerably less silicon than used by the Pentium processor, with its MESI, snooping, bus-locking etc. logic. Does anyone know if it has been attempted?

Chris
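
For illustration only, the semaphore transaction and critical region described above can be modelled in software. This is a toy sketch, not real hardware and not anything the post specifies: the class MemoryController, the methods sem_wait and sem_signal, and the word addresses SEM and COUNTER are all made-up names, and a Python lock stands in for the memory bus handling one transaction at a time. The cache invalidate/flush step and the sleep/wake interrupts are not modelled.

```python
import threading
import time


class MemoryController:
    """Toy model of RAM with a small ALU behind the bus (hypothetical)."""

    def __init__(self, size):
        self._words = [0] * size
        # Stands in for the bus serialising transactions, one at a time.
        self._bus = threading.Lock()

    def read(self, addr):
        with self._bus:
            return self._words[addr]

    def write(self, addr, value):
        with self._bus:
            self._words[addr] = value

    def sem_wait(self, addr):
        """Read the word, decrement it unless already zero, write it
        back, and return the OLD value to the requesting processor."""
        with self._bus:
            old = self._words[addr]
            if old > 0:
                self._words[addr] = old - 1
            return old

    def sem_signal(self, addr):
        """Increment the word and return the old value (assumed form of
        the 'signal' transaction; the post does not spell it out)."""
        with self._bus:
            old = self._words[addr]
            self._words[addr] = old + 1
            return old


SEM, COUNTER = 0, 1  # hypothetical word addresses


def worker(mem, iterations):
    for _ in range(iterations):
        # An old value of zero means the semaphore was taken: retry.
        while mem.sem_wait(SEM) == 0:
            time.sleep(0)  # yield instead of the proposed interrupt
        # Critical region: a deliberately non-atomic read-modify-write.
        mem.write(COUNTER, mem.read(COUNTER) + 1)
        mem.sem_signal(SEM)


def run(n_threads=4, iterations=1000):
    mem = MemoryController(2)
    mem.write(SEM, 1)  # binary semaphore, initially free
    threads = [threading.Thread(target=worker, args=(mem, iterations))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.read(COUNTER)
```

Calling run() starts several threads that each bump a shared counter inside the critical region; because only one thread at a time sees a non-zero old value from sem_wait, the final count should equal n_threads * iterations.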