Subj : Re: Threads and processes on Linux
To   : comp.programming.threads,comp.os.linux.misc,comp.sources.d
From : steve
Date : Sat Apr 02 2005 05:16 am

In article , doug wrote:
>
>"Steve Watt" wrote in message
>news:d2i3ub$16n8$1@wattres.Watt.COM...
>> In article ,
>> doug wrote:
>> [ sneck! ]
>>>I'm going to be cheeky and ask you a question that I posted below about a
>>>multithreaded bit of code I'm writing. Just in case you didn't see it :)
>>>I've got a 2.4 kernel with linuxthreads, and an app with about 300 threads
>>>doing network I/O in 20 millisecond bursts. On a single CPU box, this is
>>>fine. On an SMP box, performance drops through the floor. There are no
>>>data pages shared between threads. vmstat, etc. show the processors 60%
>>>idle.
>>
>> Profiling shows ... what?
>>
>>>My theory is that each thread is being repeatedly scheduled on a different
>>>CPU, and so a lot of time is being spent loading the memory accessed by the
>>>thread into the CPU cache, and then (once it's dirtied) invalidating the
>>>cache entries on the last processor to host it. Am I in the right ballpark?
>>>Even playing the right sport?
>>
>> I would look more closely at synchronization, implicit or explicit. If
>> you're doing a lot of network I/O, it's possible that the overhead is
>> going into the network stack (I don't know if the Linux network stack is
>> multi-threaded), so you're possibly winding up waiting for that a fair
>> amount.

[ snip ]

>Thanks for the reply Steve.
>
>Profiling didn't show much. gprof numbers were all low. I even used the
>Intel VTune app to measure perf. It showed 60% of the machine clock ticks
>being consumed by the idle process (pid 0). There were no hotspots
>anywhere - either user code or kernel. VTune may have been able to show me
>the problem, but it's got thousands of metrics and I'm not sure which
>combination would show it.
>
>I understand what you're saying about mutex, etc. contention, but there is
>no synchronisation in the 20ms thread cycle. Your idea about
>synchronisation in the network stack is a good one, though. I'll look into
>that.
>
>The cache line pinging is what I think I'm hitting - since if I bind all
>processors to the same CPU I get much better performance. The interrupt
>thing is a guess, but it's all I can come up with - since all my code is
>running on a single CPU and it's still not as good as uniprocessor,
>*something* of my code has to be running on the other CPUs, no?
>
>I'll check out that network stack.

If you were losing to cache line or other hardware effects, you wouldn't
see "idle" time. The profile timer can't tell the difference between a
thread that's stalled waiting for a cache line and a thread that's in the
middle of a loop of computations -- that level of introspection doesn't
exist in most processors.

Therefore, I would even more strongly point to the network stack --
threads that are waiting on some lock in the stack _will_ be "idle" in
the OS's view.
-- 
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.8" / 37N 20' 14.9"
Internet: steve @ Watt.COM Whois: SW32
Free time? There's no such thing. It just comes in varying prices...
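
To make the "idle in the OS's view" point concrete, here is a minimal
sketch in Python (not the poster's code, and not specific to the 2.4
kernel or linuxthreads). It compares wall-clock time against process CPU
time in two cases: a worker thread blocked on a lock that another thread
holds, and a thread that is busy computing. The function names and the
0.3 second duration are made up for illustration. The busy loop only
stands in for "the CPU is occupied"; it does not reproduce a cache-line
stall, which the OS would likewise account as busy time.

```python
import threading
import time

HOLD_SECONDS = 0.3  # arbitrary demo duration, not taken from the thread


def blocked_on_lock():
    """Return (wall, cpu) seconds while a worker waits on a held lock."""
    lock = threading.Lock()

    def worker():
        with lock:          # blocks until the main thread releases the lock
            pass

    lock.acquire()          # main thread holds the lock first
    thread = threading.Thread(target=worker)
    wall0, cpu0 = time.perf_counter(), time.process_time()
    thread.start()
    time.sleep(HOLD_SECONDS)  # both threads are now asleep in the kernel
    lock.release()
    thread.join()
    return time.perf_counter() - wall0, time.process_time() - cpu0


def busy_computing():
    """Return (wall, cpu) seconds while the thread spins on arithmetic."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    deadline = wall0 + HOLD_SECONDS
    counter = 0
    while time.perf_counter() < deadline:
        counter += 1        # pointless work that keeps the CPU occupied
    return time.perf_counter() - wall0, time.process_time() - cpu0


if __name__ == "__main__":
    for name, fn in (("blocked on lock", blocked_on_lock),
                     ("busy computing", busy_computing)):
        wall, cpu = fn()
        print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

In the lock case the wall-clock time passes while almost no CPU time is
charged to the process, which is what a profiler or vmstat reports as
idle. In the busy case the CPU time tracks the wall-clock time much more
closely.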