Subj : Re: Threads and processes on Linux
To   : comp.programming.threads,comp.os.linux.misc,comp.sources.d
From : doug
Date : Fri Apr 01 2005 09:32 am

"Steve Watt" wrote in message news:d2i3ub$16n8$1@wattres.Watt.COM...
> In article , doug wrote:
> [ sneck! ]
>>I'm going to be cheeky and ask you a question that I posted below about a
>>multithreaded bit of code I'm writing.  Just in case you didn't see it :)
>>I've got a 2.4 kernel with linuxthreads, and an app with about 300 threads
>>doing network I/O in 20 millisecond bursts.  On a single CPU box, this is
>>fine.  On an SMP box, performance drops through the floor.  There are no
>>data pages shared between threads.  vmstat, etc. show the processors 60%
>>idle.
>
> Profiling shows ... what?
>
>>My theory is that each thread is being repeatedly scheduled on a different
>>CPU, and so a lot of time is being spent loading the memory accessed by
>>the thread into the CPU cache, and then (once it's dirtied) invalidating
>>the cache entries on the last processor to host it.  Am I in the right
>>ballpark?  Even playing the right sport?
>
> I would look more closely at synchronization, implicit or explicit.  If
> you're doing a lot of network I/O, it's possible that the overhead is
> going into the network stack (I don't know if the Linux network stack is
> multi-threaded), so you're possibly winding up waiting for that a fair
> amount.
>
> The first 90% of the time, the problem you are seeing happens when there
> is heavy mutex/other sync primitive contention.  The reason it runs
> better on a uniprocessor is that due to the nature of the time slices,
> threads wind up blocking in places (like I/Os) where they don't have
> things locked.  When the box has multiple processors, suddenly the real
> contention rates go up.
>
> The next 9% of the time, the problem is cache line thrashing.
>
> The last 1% is various oddball effects, such as you're theorizing with
> interrupt affinity.  It's not impossible that it's the problem, but it's
> also not very likely.
>
> It's been a while since I looked at Linux profiling tools, so I don't
> know how well gprof will show where the time is going, but that should
> be your first line of attack.  There were once companies (AMC, at least)
> that did hardware profiling tools, which made debugging your sort of
> problem somewhat easier, but I don't think they're around any more.
> --
> Steve Watt  KD6GGD  PP-ASEL-IA  ICBM: 121W 56' 57.8" / 37N 20' 14.9"
> Internet: steve @ Watt.COM  Whois: SW32
> Free time?  There's no such thing.  It just comes in varying prices...

Thanks for the reply, Steve.

Profiling didn't show much.  gprof numbers were all low.  I even used the
Intel VTune app to measure performance.  It showed 60% of the machine clock
ticks being consumed by the idle process (pid 0).  There were no hotspots
anywhere - either in user code or in the kernel.  VTune may have been able
to show me the problem, but it's got thousands of metrics and I'm not sure
which combination would show it.

I understand what you're saying about mutex, etc. contention, but there is
no synchronisation in the 20ms thread cycle.  Your idea about
synchronisation in the network stack is a good one, though.  I'll look into
that.

The cache line pinging is what I think I'm hitting - since if I bind all
the processes to the same CPU I get much better performance.
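For reference, the sort of binding I mean is a sched_setaffinity() call in
each thread.  Rough sketch only (it assumes the glibc cpu_set_t interface;
a stock 2.4 kernel may need an affinity patch or a backport to get the
syscall at all, and the older prototype took a raw unsigned long mask):

/* Pin the calling thread to a single CPU.  Sketch only.
 * Under LinuxThreads every thread is a separate process, so passing
 * pid 0 ("the caller") pins just this thread. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);

    /* 0 == calling thread/process; prototype varies with glibc/kernel */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        exit(1);
    }
}

Each thread calls pin_to_cpu(0) at the top of its start routine, which is
what gets me the "everything on one CPU" numbers.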
The interrupt thing is a guess, but it's all I can come up with - since all
my code is running on a single CPU and it's still not as good as a
uniprocessor, *something* on behalf of my code has to be running on the
other CPUs, no?

I'll check out that network stack.

Ta again,
Doug
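P.S.  One more thing I'm going to try before pointing fingers at the
scheduler: timing a cycle with wall-clock time vs. CPU time, to see whether
the threads are actually burning cycles somewhere or just sitting
runnable/blocked.  Something along these lines (rough sketch;
do_network_burst() is a stand-in for the real 20ms job, and under
LinuxThreads each thread is its own process, so RUSAGE_SELF is effectively
per-thread):

/* Compare wall-clock vs. CPU time for one work cycle. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

extern void do_network_burst(void);   /* placeholder for the real work */

static double tv_diff(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
}

void timed_cycle(void)
{
    struct timeval w0, w1;
    struct rusage r0, r1;

    gettimeofday(&w0, NULL);
    getrusage(RUSAGE_SELF, &r0);      /* per-thread under LinuxThreads */

    do_network_burst();

    gettimeofday(&w1, NULL);
    getrusage(RUSAGE_SELF, &r1);

    fprintf(stderr, "wall %.3fms  user %.3fms  sys %.3fms\n",
            tv_diff(w0, w1) * 1e3,
            tv_diff(r0.ru_utime, r1.ru_utime) * 1e3,
            tv_diff(r0.ru_stime, r1.ru_stime) * 1e3);
}

If wall time is a lot bigger than user+sys, the time is going into blocking
or waiting for a CPU rather than into cache misses in my own code.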