Subj : Re: Threads and processes on Linux
To   : comp.programming.threads,comp.os.linux.misc,comp.sources.d
From : doug
Date : Fri Apr 01 2005 09:32 am

"Steve Watt" wrote in message news:d2i3ub$16n8$1@wattres.Watt.COM...
> In article , doug wrote:
> [ sneck! ]
>>I'm going to be cheeky and ask you a question that I posted below about a
>>multithreaded bit of code I'm writing.  Just in case you didn't see it :)
>>I've got a 2.4 kernel with linuxthreads, and an app with about 300 threads
>>doing network I/O in 20 millisecond bursts.  On a single CPU box, this is
>>fine.  On an SMP box, performance drops through the floor.  There are no
>>data pages shared between threads.  vmstat, etc. show the processors 60%
>>idle.
>
> Profiling shows ... what?
>
>>My theory is that each thread is being repeatedly scheduled on a different
>>CPU, and so a lot of time is being spent loading the memory accessed by
>>the thread into the CPU cache, and then (once it's dirtied) invalidating
>>the cache entries on the last processor to host it.  Am I in the right
>>ballpark?  Even playing the right sport?
>
> I would look more closely at synchronization, implicit or explicit.  If
> you're doing a lot of network I/O, it's possible that the overhead is
> going into the network stack (I don't know if the Linux network stack is
> multi-threaded), so you're possibly winding up waiting for that a fair
> amount.
>
> The first 90% of the time, the problem you are seeing happens when there
> is heavy mutex/other sync primitive contention.  The reason it runs
> better on a uniprocessor is that due to the nature of the time slices,
> threads wind up blocking in places (like I/Os) where they don't have
> things locked.  When the box has multiple processors, suddenly the real
> contention rates go up.
>
> The next 9% of the time, the problem is cache line thrashing.
>
> The last 1% is various oddball effects, such as you're theorizing with
> interrupt affinity.  It's not impossible that it's the problem, but it's
> also not very likely.
>
> It's been a while since I looked at Linux profiling tools, so I don't
> know how well gprof will show where the time is going, but that should
> be your first line of attack.  There were once companies (AMC, at least)
> that did hardware profiling tools, which made debugging your sort of
> problem somewhat easier, but I don't think they're around any more.
> --
> Steve Watt  KD6GGD  PP-ASEL-IA  ICBM: 121W 56' 57.8" / 37N 20' 14.9"
> Internet: steve @ Watt.COM  Whois: SW32
> Free time?  There's no such thing.  It just comes in varying prices...

Thanks for the reply, Steve.

Profiling didn't show much.  gprof numbers were all low.  I even used the
Intel VTune app to measure performance.  It showed 60% of the machine clock
ticks being consumed by the idle process (pid 0).  There were no hotspots
anywhere - either in user code or in the kernel.  VTune may have been able
to show me the problem, but it's got thousands of metrics and I'm not sure
which combination would show it.

I understand what you're saying about mutex, etc. contention, but there is
no synchronisation in the 20ms thread cycle.  Your idea about
synchronisation in the network stack is a good one, though.  I'll look into
that.

The cache line pinging is what I think I'm hitting - since if I bind all
the processes to the same CPU I get much better performance.
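For reference, the sort of binding I mean is a sched_setaffinity() call in
each thread.  Rough sketch only (it assumes the glibc cpu_set_t interface;
a stock 2.4 kernel may need an affinity patch or a backport to get the
syscall at all, and the older prototype took a raw unsigned long mask):

/* Pin the calling thread to a single CPU.  Sketch only.
 * Under LinuxThreads every thread is a separate process, so passing
 * pid 0 ("the caller") pins just this thread. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    /* 0 == calling thread/process; prototype varies with glibc/kernel */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        exit(1);
    }
}

Each thread calls pin_to_cpu(0) at the top of its start routine, which is
what gets me the "everything on one CPU" numbers.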
The interrupt thing is a guess, but it's all I can come up with - since all
my code is running on a single CPU and it's still not as good as a
uniprocessor, *something* on behalf of my code has to be running on the
other CPUs, no?

I'll check out that network stack.

Ta again,
Doug
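P.S.  One more thing I'm going to try before pointing fingers at the
scheduler: timing a cycle with wall-clock time vs. CPU time, to see whether
the threads are actually burning cycles somewhere or just sitting
runnable/blocked.  Something along these lines (rough sketch;
do_network_burst() is a stand-in for the real 20ms job, and under
LinuxThreads each thread is its own process, so RUSAGE_SELF is effectively
per-thread):

/* Compare wall-clock vs. CPU time for one work cycle. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

extern void do_network_burst(void);   /* placeholder for the real work */

static double tv_diff(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

void timed_cycle(void)
{
    struct timeval w0, w1;
    struct rusage r0, r1;

    gettimeofday(&w0, NULL);
    getrusage(RUSAGE_SELF, &r0);      /* per-thread under LinuxThreads */

    do_network_burst();

    gettimeofday(&w1, NULL);
    getrusage(RUSAGE_SELF, &r1);

    fprintf(stderr, "wall %.3fms  user %.3fms  sys %.3fms\n",
            tv_diff(w0, w1) * 1e3,
            tv_diff(r0.ru_utime, r1.ru_utime) * 1e3,
            tv_diff(r0.ru_stime, r1.ru_stime) * 1e3);
}

If wall time is a lot bigger than user+sys, the time is going into blocking
or waiting for a CPU rather than into cache misses in my own code.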