Subj : Optimizing for latency
To   : comp.programming.threads
From : scream29125
Date : Tue Jun 07 2005 09:33 am

I have a problem that is related to threading, and it involves a combination of the CPU and external devices. I'm running Redhat Enterprise 3 (Native POSIX threads) on Pentium 4 3.0GHz CPUs.

This is my problem setup. I am programming a regular PC to control external devices. The CPU in the PC acts as a controller that sends jobs to this external device. The device does the computationally-intensive job and reports the results to the CPU. Each job takes 100-200 milliseconds to complete.

In this setup, the CPU acts in a supporting role. It parses job instructions and converts the parameters into a form the external device understands. It sends these instructions and parameters to the device, along with the data the device needs. While the device is doing the job, the CPU prepares for the next job. At the same time, the device may also request more data/parameters, and the CPU must respond to such requests. Finally, the device tells the CPU that a job is complete, and the CPU fetches the results from the device. The CPU then initializes the device for the next job and starts it. While this sounds like a lot of work for the CPU, it really is not: the CPU only coordinates the jobs, and the external device does all the heavy processing.

In setup #1, I connected 4 CPUs (i.e. 4 PCs) to 4 external devices, like this:

  CPU A <-> device A
  CPU B <-> device B
  CPU C <-> device C
  CPU D <-> device D

Each CPU runs a 2-thread program: one thread services requests from the device (as described above), and the other thread prepares for the next job. With this setup I obtained a level of performance that is satisfactory. So far so good.

I observed that in this setup each CPU is only busy about 10% of the time, so I modified the hardware into setup #2, which connects 1 CPU to all 4 external devices, like this:

  CPU <-> devices A, B, C and D

This single CPU is responsible for scheduling, computing parameters and responding to requests from all 4 devices. I created 4 threads to service these 4 devices, and another thread to do preparation work for new jobs (total = 5 threads). A stripped-down sketch of this thread structure is at the end of this post.

When I run this setup, I find that the CPU is busy about 50% of the time, which is expected. However, overall system performance is poor compared to setup #1: jobs take longer to finish, some as much as 2 times longer. Since the CPU is only 50% busy, why is this happening? This is not a bandwidth problem, because each device is connected to the CPU by its own dedicated data transmission channel.

My guess is this. Suppose device A requests some additional data. While the CPU is servicing this request, device B makes a similar request. At this point device B has to wait (idle) because, no matter how many threads I use, there is only one physical CPU. Similarly, devices C and D would have to wait too if they happen to request something at this point. Even though each wait is short (a few milliseconds), there can be many wait periods in one job, and they add up to cause the slowness I'm seeing. The CPU can be idle for a long time (it is 50% idle overall) and then suddenly receive simultaneous requests from 2 or more devices, causing jobs to be delayed.

What can I do to setup #2 to get (as closely as possible) the performance I was getting in setup #1? Or is there another reason why setup #2 is performing poorly?

Thanks.
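
P.S. In case it helps to see the structure, here is a stripped-down sketch of how my setup #2 program is organized (4 service threads plus 1 preparation thread). The dev_* and prepare_next_job() functions are just placeholders standing in for the real device calls, so this shows only the thread layout, not my actual code, and it won't build into anything useful without the real device API.

/* Sketch of the setup #2 thread layout: one service thread per device plus
 * one preparation thread.  The dev_* functions are placeholders for the real
 * device/driver calls.  Compile with -lpthread. */
#include <pthread.h>

#define NUM_DEVICES 4

/* --- placeholder device API (not a real library) --- */
enum dev_event { DEV_NEEDS_DATA, DEV_JOB_DONE };
extern enum dev_event dev_wait_event(int dev_id);  /* block until device signals   */
extern void dev_send_data(int dev_id);             /* answer a data request        */
extern void dev_fetch_results(int dev_id);         /* read back a finished job     */
extern void dev_start_job(int dev_id);             /* initialize + start next job  */
extern void prepare_next_job(void);                /* CPU-side parameter crunching */

/* One of these runs per device: it reacts to whatever the device asks for. */
static void *service_thread(void *arg)
{
    int dev_id = *(int *)arg;

    for (;;) {
        switch (dev_wait_event(dev_id)) {
        case DEV_NEEDS_DATA:
            dev_send_data(dev_id);       /* device wants more data/parameters */
            break;
        case DEV_JOB_DONE:
            dev_fetch_results(dev_id);   /* collect results ...               */
            dev_start_job(dev_id);       /* ... and kick off the next job     */
            break;
        }
    }
    return NULL;
}

/* The single preparation thread converts incoming job instructions and
 * parameters into device form while the devices are busy. */
static void *prep_thread(void *arg)
{
    (void)arg;
    for (;;)
        prepare_next_job();
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_DEVICES + 1];
    int ids[NUM_DEVICES];
    int i;

    for (i = 0; i < NUM_DEVICES; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, service_thread, &ids[i]);
    }
    pthread_create(&tid[NUM_DEVICES], NULL, prep_thread, NULL);

    for (i = 0; i <= NUM_DEVICES; i++)
        pthread_join(tid[i], NULL);
    return 0;
}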