Subj : Optimizing for latency
To   : comp.programming.threads
From : scream29125
Date : Tue Jun 07 2005 09:33 am

I have a problem that is related to threading, and it involves a combination of the CPU and external devices. I'm running Redhat Enterprise 3 (Native POSIX threads) on Pentium 4 3.0GHz CPUs.

This is my problem setup. I am programming a regular PC to control external devices. The CPU in the PC acts as a controller that sends jobs to this external device. The device does the computationally-intensive job and reports the results to the CPU. Each job takes 100-200 milliseconds to complete.

In this setup, the CPU acts in a supporting role. It parses job instructions and converts the parameters into a form the external device understands. It sends these instructions and parameters to the device, along with the data the device needs. While the device is doing the job, the CPU prepares for the next job. At the same time, the device may also request more data/parameters, and the CPU must respond to such requests. Finally, the device tells the CPU that a job is complete, and the CPU fetches the results from the device. The CPU then initializes the device for the next job and starts it. While this sounds like a lot of work for the CPU, it really is not: the CPU only coordinates the jobs, and the external device does all the heavy processing.

In setup #1, I connected 4 CPUs (i.e. 4 PCs) to 4 external devices, like this:

  CPU A <-> device A
  CPU B <-> device B
  CPU C <-> device C
  CPU D <-> device D

Each CPU runs a 2-thread program: one thread services requests from the device (as described above), and the other thread prepares for the next job. With this setup I obtained a level of performance that is satisfactory. So far so good.

I observed that in this setup each CPU is only busy about 10% of the time, so I modified the hardware into setup #2, which connects 1 CPU to all 4 external devices, like this:

  CPU <-> devices A, B, C and D

This single CPU is responsible for scheduling, computing parameters and responding to requests from all 4 devices. I created 4 threads to service these 4 devices, and another thread to do preparation work for new jobs (total = 5 threads). A stripped-down sketch of this thread structure is at the end of this post.

When I run this setup, I find that the CPU is busy about 50% of the time, which is expected. However, overall system performance is poor compared to setup #1: jobs take longer to finish, some as much as 2 times longer. Since the CPU is only 50% busy, why is this happening? This is not a bandwidth problem, because each device is connected to the CPU by its own dedicated data transmission channel.

My guess is this. Suppose device A requests some additional data. While the CPU is servicing this request, device B makes a similar request. At this point device B has to wait (idle) because, no matter how many threads I use, there is only one physical CPU. Similarly, devices C and D would have to wait too if they happen to request something at this point. Even though each wait is short (a few milliseconds), there can be many wait periods in one job, and they add up to cause the slowness I'm seeing. The CPU can be idle for a long time (it is 50% idle overall) and then suddenly receive simultaneous requests from 2 or more devices, causing jobs to be delayed.

What can I do to setup #2 to get (as closely as possible) the performance I was getting in setup #1? Or is there another reason why setup #2 is performing poorly?

Thanks.
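
P.S. In case it helps to see the structure, here is a stripped-down sketch of how my setup #2 program is organized (4 service threads plus 1 preparation thread). The dev_* and prepare_next_job() functions are just placeholders standing in for the real device calls, so this shows only the thread layout, not my actual code, and it won't build into anything useful without the real device API.

/* Sketch of the setup #2 thread layout: one service thread per device plus
 * one preparation thread.  The dev_* functions are placeholders for the real
 * device/driver calls.  Compile with -lpthread. */
#include <pthread.h>

#define NUM_DEVICES 4

/* --- placeholder device API (not a real library) --- */
enum dev_event { DEV_NEEDS_DATA, DEV_JOB_DONE };
extern enum dev_event dev_wait_event(int dev_id);  /* block until device signals   */
extern void dev_send_data(int dev_id);             /* answer a data request        */
extern void dev_fetch_results(int dev_id);         /* read back a finished job     */
extern void dev_start_job(int dev_id);             /* initialize + start next job  */
extern void prepare_next_job(void);                /* CPU-side parameter crunching */

/* One of these runs per device: it reacts to whatever the device asks for. */
static void *service_thread(void *arg)
{
    int dev_id = *(int *)arg;

    for (;;) {
        switch (dev_wait_event(dev_id)) {
        case DEV_NEEDS_DATA:
            dev_send_data(dev_id);       /* device wants more data/parameters */
            break;
        case DEV_JOB_DONE:
            dev_fetch_results(dev_id);   /* collect results ...               */
            dev_start_job(dev_id);       /* ... and kick off the next job     */
            break;
        }
    }
    return NULL;
}

/* The single preparation thread converts incoming job instructions and
 * parameters into device form while the devices are busy. */
static void *prep_thread(void *arg)
{
    (void)arg;
    for (;;)
        prepare_next_job();
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_DEVICES + 1];
    int ids[NUM_DEVICES];
    int i;

    for (i = 0; i < NUM_DEVICES; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, service_thread, &ids[i]);
    }
    pthread_create(&tid[NUM_DEVICES], NULL, prep_thread, NULL);

    for (i = 0; i <= NUM_DEVICES; i++)
        pthread_join(tid[i], NULL);
    return 0;
}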