Subj : Re: Threads and processes on Linux
To   : comp.programming.threads,comp.os.linux.misc,comp.sources.d
From : doug
Date : Sat Apr 02 2005 10:35 am

"Steve Watt" wrote in message news:d2l9rq$1go3$1@wattres.Watt.COM...
> In article ,
> doug wrote:
>>
>>"Steve Watt" wrote in message
>>news:d2i3ub$16n8$1@wattres.Watt.COM...
>>> In article ,
>>> doug wrote:
>>> [ sneck! ]
>>>>I'm going to be cheeky and ask you a question that I posted below
>>>>about a multithreaded bit of code I'm writing, just in case you
>>>>didn't see it :)  I've got a 2.4 kernel with linuxthreads, and an app
>>>>with about 300 threads doing network I/O in 20 millisecond bursts.
>>>>On a single-CPU box, this is fine.  On an SMP box, performance drops
>>>>through the floor.  There are no data pages shared between threads.
>>>>vmstat, etc. show the processors 60% idle.
>>>
>>> Profiling shows ... what?
>>>
>>>>My theory is that each thread is being repeatedly scheduled on a
>>>>different CPU, so a lot of time is being spent loading the memory
>>>>accessed by the thread into the CPU cache, and then (once it's
>>>>dirtied) invalidating the cache entries on the last processor to
>>>>host it.  Am I in the right ballpark?  Even playing the right sport?
>>>
>>> I would look more closely at synchronization, implicit or explicit.
>>> If you're doing a lot of network I/O, it's possible that the overhead
>>> is going into the network stack (I don't know if the Linux network
>>> stack is multi-threaded), so you're possibly winding up waiting for
>>> that a fair amount.
> [ snip ]
>>Thanks for the reply, Steve.
>>
>>Profiling didn't show much.  gprof numbers were all low.  I even used
>>the Intel VTune app to measure performance.  It showed 60% of the
>>machine's clock ticks being consumed by the idle process (pid 0).
>>There were no hotspots anywhere, either in user code or in the kernel.
>>VTune may have been able to show me the problem, but it's got thousands
>>of metrics and I'm not sure which combination would show it.
>>
>>I understand what you're saying about mutex contention and the like,
>>but there is no synchronisation in the 20ms thread cycle.  Your idea
>>about synchronisation in the network stack is a good one, though.
>>I'll look into that.
>>
>>The cache-line ping-ponging is what I think I'm hitting, since if I
>>bind all threads to the same CPU I get much better performance.  The
>>interrupt thing is a guess, but it's all I can come up with: since all
>>my code is running on a single CPU and it's still not as good as
>>uniprocessor, *something* on behalf of my code has to be running on
>>the other CPUs, no?
>>
>>I'll check out that network stack.
>
> If you were losing to cache-line or other hardware effects, you
> wouldn't see "idle" time.  The profile timer can't tell the difference
> between a thread that's stalled waiting for a cache line and a thread
> that's in the middle of a loop of computations -- that level of
> introspection doesn't exist in most processors.
>
> Therefore, I would even more strongly point to the network stack --
> threads that are waiting on some lock in the stack _will_ be "idle"
> in the OS's view.
> --

Ahh -- thanks for that, I didn't realise.  It does indeed point to the
network stack, eh?  OK, I'll look into that in closer detail.

Just out of interest, would you be able to hazard a guess as to why my
code speeds up a lot when all threads are bound to a single CPU (but
still not as good as uniprocessor)?

Ta,
Doug
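
P.S.  In case it helps anyone following along reproduce this, here's
roughly how I'm doing the binding.  It's a minimal sketch rather than my
actual app code, and it assumes a kernel and glibc that expose
sched_setaffinity(2) with the cpu_set_t interface (stock 2.4 needs the
affinity backport; older glibcs took a raw unsigned long mask instead):

#define _GNU_SOURCE           /* for cpu_set_t, CPU_ZERO/CPU_SET */
#include <sched.h>
#include <stdio.h>

/* Bind the calling thread to a single CPU.  Under linuxthreads every
 * thread is its own kernel process, so a pid of 0 ("the caller")
 * affects just this thread, not the whole app. */
static int bind_self_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

/* Each of the ~300 I/O threads calls this first thing: */
static void *io_thread(void *arg)
{
    bind_self_to_cpu(0);      /* pin everything to CPU 0 for the test */
    /* ... 20ms bursts of network I/O ... */
    return NULL;
}

Calling it at the top of each thread function is the important bit:
because linuxthreads gives each thread its own kernel pid, setting
affinity from main() after the threads are already running would only
move the initial thread.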