Subj : Re: Windows Suspend/ResumeThread API equivalent for Linux or POSIX?
To   : comp.programming.threads
From : Chris Thomasson
Date : Thu May 12 2005 12:48 am

>> Here is a paper describing the original SMR algorithm:
>>
>> http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf
>
>    I can tell you from experience with similar algorithms that these 
> algorithms do not work well in user-space. To get the performance boost, 
> you need more control over your environment and the low-level scheduling 
> behavior than is possible to get in that environment. (Though they may get 
> you real benefits in kernel space.)

Actually, both SMR and RCU can work well in user-space. Here are two working 
full-blown examples:


http://appcore.home.comcast.net/
( User-Space SMR )


http://atomic-ptr-plus.sourceforge.net/
( User-Space RCU. Joe Seigh has a nice lock-free read test setup. )




>    I can also tell your from experience that it is often the case that 
> when you think you need this type of alogorithm, what you really need to 
> do is rearchitect or make other subtle changes. Avoiding contention in the 
> first place is better than algorithms that make contention less expensive.

Well, these algorithms can be used to provide scaleable solutions to the 
reader/writer problem. In particular, the iteration of read-mostly 
data-structures. I can't seem to design a lock-based solution that can get 
equivalent read-side performance numbers for this special case. If you can 
present a lock-based solution for iterating read-mostly data structures that 
can beat these types of algorithms I will gladly post it to my site. My site 
is about mixing lock-free with lock-based, so a high-speed lock-based 
solution to the reader-writer problem would be greatly welcome.




>    Of course, I'm not familiar with this exact algorithm or your exact 
> application. So you may have the 2% case where the usual guidelines break. 
> Are you *sure* portable algorithms with mutexes won't get you sufficient 
> performance? (Or is this a research or proof of concept kind of thing?)

This is pure research. I also provide a user-space demo library as a proof 
of concept.




>    I would also warn you against making design changes based upon 
> benchmarks on unrealistic workloads. Many lock-free algorithms, for 
> example, perform better than locking algorithms on unrealistic workloads. 
> On realistic workloads, the descheduling of a conflicting thread allows 
> both threads to run with no further conflicts. This is much more efficient 
> than cache ping-ponging as both threads operate on the same data even with 
> no locking cost.

Contended locks can suffer from similar problems, in certain situations...

Take a global read-write lock protecting a mostly-read data-structure. Under 
heavy read contention, forward progress can be stalled waiting for the 
updates to the internal lock state to be shuttled around the CPU's caches. 
If the size of the read-side critical section is moderate, the cost of 
moving lock updates from one CPU's cache to another cal rival that of the 
critical section itself; This can prevent readers from truly executing in 
parallel.




>    Descheduling a thread that's conflicting with another thread is *good* 
> and running it later when it won't conflict with anything is *great*. 
> Running both threads slowly with a lock-free algorithm is bad because work 
> that must be done slowly is done instead of work that can be done quickly. 
> (Of course, you don't see this on unrealistic workloads because there's 
> nothing else to do.)

Lock-free is not good for everything; An efficient marriage between 
lock-free and lock-based can be better than using one or the other. Take a 
look at this:

http://groups.google.ca/group/comp.programming.threads/msg/7e7834ca10f2613a?hl=en
( read the HP paper describing the performance differences of lock-free with 
lock-based, and how a mix of both worlds can help. )

.