Subj : Re: Problem with system() calls in a multithreaded program on HPUX 11
To   : comp.programming.threads,comp.sys.hp.hpux
From : vasanth
Date : Fri Feb 25 2005 07:40 pm

Hello Mahesh,

From the problem description, I understand that the same code works
fine on Solaris, and that you have no kernel modules of your own. Is
that right?

I once came to know of a known issue in HP-UX 11/11i where an
application may hang when it is heavily threaded. I think there is a
patch available for this issue. Please check the patch level of your
HP-UX system and, if it is not at the latest patch level, update it.

Also, please check the following:

1. The patch level of your system against the latest patches announced
   by HP. Ask your system administrator to ensure that it is the
   latest for HP-UX 11.0.

2. When the system is hung, you can take a TOC dump and examine it
   with the q4 utility by generating a report. From the q4 report you
   can check where exactly the threads of your process are waiting.

Since your code is multithreaded and uses system(3C), there is a chance
that threads are waiting in the kernel in a STOP state. The system(3C)
function consists of fork(2) and exec(2). System calls like fork(2),
exit(2), pause(2) etc. put all other threads in that process into the
STOP state. If the stopped threads don't get back to the RUN state,
the process will hang. This is one of the possible scenarios. There are
a few more, such as a race condition during thread exit/cancel, either
in the pthread library or in the kernel, that could lead to a process
hang.

Since you said that it works fine when you remove system(3C) or run on
Solaris, first check your system patch level.

hth
regards,
vasanth.

Mahesh Kumar wrote:
> Hello,
>
> I am porting a multithreaded program to HPUX 11 from Solaris, in
> which threads make calls to system() functions. The program basically
> creates a number of threads and runs them a specified number of times.
> Each thread performs some task and creates a trace file. The threads
> then verify the trace file against standard ones to check whether the
> run was successful or not. The number of threads is variable.
>
> Problem:
> ========
> As I increase the number of threads, the program hangs while trying
> to run some system() call. If I remove the system() calls altogether,
> the program runs fine. Below is an explanation of the problem followed
> by the code. The program uses pthreads and runs fine on Solaris.
>
> Some variable names are used in the explanation. These names
> appear in the code given after that.
>
> Explanation:
> ============
> I increase the value of noOfThreads to, say, 3, 4 and so on. The
> program hangs at around noOfThreads 6 or 7. When the problem
> occurs, two or three defunct processes are created. I ran the
> "ps -f -u" command and the output was something like this (mtreg is
> the name of the above program):
>
> -bash-2.05b$ ps -f -u mkumar
>    UID   PID  PPID  C    STIME TTY     TIME COMMAND
> mkumar  1726  1190  0 00:06:12 pts/ta  0:10 mtreg
> mkumar  1190  1189  0 23:04:02 pts/ta  0:01 -bash
> mkumar  1731  1726  0 00:06:20 pts/ta  0:00 <defunct>
> mkumar  1730  1726  2 00:06:20 pts/ta  0:00 <defunct>
> mkumar  1743     0  0 00:06:20 pts/ta  0:00 mtreg
> mkumar  1741  1726  0 00:06:21 pts/ta  0:00 sh -c perl strip.pl
>     /export/home/configdev/tmp/FAAa01726mod0
> mkumar  1742  1741  0 00:06:21 pts/ta  0:00 perl strip.pl
>     /export/home/configdev/tmp/FAAa01726mod0456a.m
> mkumar  1751  1190  5 00:07:48 pts/ta  0:00 ps -f -u mkumar
>
> Before hanging, the output at the console was:
> ================================================================================
> Running perl strip.pl /export/home/configdev/tmp/EAAa07614mod0456a.myt
> Running perl strip.pl /export/home/configdev/tmp/DAAa07614mod0456a.myt
> Running perl strip.pl /export/home/configdev/tmp/AAAa07614mod0456a.myt
> Running perl strip.pl /export/home/configdev/tmp/CAAa07614mod0456a.myt
> Finished running: perl strip.pl
>     /export/home/configdev/tmp/EAAa07614mod0456a.myt
> Running diff -w mod0456a.trc
>     /export/home/configdev/tmp/EAAa07614mod0456a.myt >
>     /export/home/configdev/tmp/EAAa07614mod0456a.myt.diff
> Finished running diff -w mod0456a.trc
>     /export/home/configdev/tmp/EAAa07614mod0456a.myt >
>     /export/home/configdev/tmp/EAAa07614mod0456a.myt.diff
> Running perl strip.pl /export/home/configdev/tmp/BAAa07614mod0456a.myt
> Finished running: perl strip.pl
>     /export/home/configdev/tmp/DAAa07614mod0456a.myt
> Running diff -w mod0456a.trc
>     /export/home/configdev/tmp/DAAa07614mod0456a.myt >
>     /export/home/configdev/tmp/DAAa07614mod0456a.myt.diff
> Finished running: perl strip.pl
>     /export/home/configdev/tmp/AAAa07614mod0456a.myt
> Running perl strip.pl /export/home/configdev/tmp/FAAa01726mod0456a.myt
> Running diff -w mod0456a.trc
>     /export/home/configdev/tmp/AAAa07614mod0456a.myt >
>     /export/home/configdev/tmp/AAAa07614mod0456a.myt.diff
> Finished running diff -w mod0456a.trc
>     /export/home/configdev/tmp/DAAa07614mod0456a.myt >
>     /export/home/configdev/tmp/DAAa07614mod0456a.myt.diff
> Running diff -w mod0456a.trc
>     /export/home/configdev/tmp/CAAa07614mod0456a.myt >
>     /export/home/configdev/tmp/CAAa07614mod0456a.myt.diff
> ================================================================================
>
> Some things that I observed:
> 1. I started only one mtreg process (PID 1726). But when the program
>    hung, there was one more mtreg process, with PPID 0. It was idle.
> 2. Each time the program hangs, there are one or more defunct
>    processes.
> 3. I am unable to kill the program once it hangs, and the system has
>    to be rebooted.
> 4. The number of threads for which the program hangs is not fixed. It
>    can hang at 5, 6, 7 or 8 threads. It even hung once with only 4
>    threads.
> 5. Although the last statement is Running "diff", it has not yet
>    started.
> 6. I tried an experiment in which I removed all the system() function
>    calls, and instead placed fclose(fopen(diffFileName, "w")), which
>    just creates the file without doing anything else.
>    This time I was able to run the program even with 10 threads each
>    doing 10 iterations. It seems that the program might run fine for
>    any number of threads (I checked up to 15 threads).
>
> ================================================================================
>
> CODE: (The code is representative of the whole code. It may not be
> compilable.)
>
> #include <...>
> #include <...>
> #include <...>
> #include <...>
>
> #define noOfThreads 1
> #define noOfIterations 1
>
> char outFileName[512];
> char standardTraceFile[512];
>
> void * threadStartRoutine(void* p);
>
> int doOneIterationOfThread();
>
> /*
>  * Creates a number of threads and runs them. Waits for their
>  * completion and then exits.
>  */
> void createThreadsAndRun()
> {
>     pthread_t threadList[noOfThreads];
>     int i;
>     for (i = 0; i < noOfThreads; i++)
>     {
>         pthread_attr_t attr;
>         pthread_attr_init(&attr);
>         pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
>         pthread_create(&threadList[i], &attr, threadStartRoutine, nextReq());
>         cerr << "start thread " << i << endl;
>     }
>     for (i = 0; i < noOfThreads; i++)
>     {
>         pthread_join(threadList[i], NULL);
>         cerr << "finish thread " << i << endl;
>     }
> }
>
> int main(int argc, char* argv[])
> {
>     // The arguments are not shown, as the functions are just
>     // representative of the function they are intended to perform.
>     setOutFileName();           // depending on argc and argv, set the value
>                                 // of outFileName; outFileName is the
>                                 // filename for the trace file
>     setStandardTraceFileName(); // obtained from one of the arguments;
>                                 // sets the variable standardTraceFile
>     createThreadsAndRun();
> }
>
> void * threadStartRoutine(void* p)
> {
>     char* prefix = tempnam(NULL, "");
>     sprintf(newTraceFile, "%s%s", prefix, outFileName); // set the tracefile name
>     for (int i = 0; i < noOfIterations; i++)
>     {
>         // do some initializations
>         if (!doOneIterationOfThread())
>         {
>             cerr << "Run failed: " << ... << endl;
>         }
>         else
>         {
>             cerr << "Run successful: " << ... << endl;
>         }
>     }
> }
>
> int doOneIterationOfThread()
> {
>     doCoreWork(); // writes trace into the tracefile with actual values
>     if (verifyTrace(standardTraceFile, newTraceFile) != 0) // to verify
>                                                            // this run of the thread
>     {
>         return false;
>     }
>     else
>     {
>         return true;
>     }
> }
>
> /*
>  * tracefile names are with full path
>  */
> int verifyTrace(char* standardTraceFile, char* newTraceFile)
> {
>     char cmd[512];
>     char diffFileName[512];
>     sprintf(diffFileName, "%s.diff", newTraceFile);
>
>     sprintf(cmd, "perl strip.pl %s", newTraceFile);
>     cerr << "Running " << cmd << endl;
>     system(cmd);
>     cerr << "Finished running: " << cmd << endl;
>
>     sprintf(cmd, "diff -w %s %s > %s", standardTraceFile, newTraceFile,
>             diffFileName);
>     cerr << "Running " << cmd << endl;
>     system(cmd);
>     cerr << "Finished running: " << cmd << endl;
>
>     struct stat buf;
>     stat(diffFileName, &buf);
>
>     unlink(diffFileName);
>
>     if (buf.st_size == 0)
>         return true;
>     else
>         return false;
> }
>
> /*
>  * Note ******************
>  * "perl strip.pl" actually brings the file into a normalized form. It
>  * means that it changes the values that are run dependent in the trace
>  * file, like time stamps and some other info, to a predecided normal
>  * value. (For example, time stamps may be converted to 0x0.) This makes
>  * the new trace file and the standard trace file comparable. strip.pl
>  * (a perl script) performs this task by substituting regular
>  * expressions.
>  * Note Ends *************
>  */
>
> ================================================================================
>
> Can anyone please tell me why system() calls are causing a problem on
> HPUX 11 whereas the same thing runs fine on Solaris?
> It would be really great if you can suggest a possible solution.
>
> Thanks and regards,
>
> Mahesh Kumar
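
For reference, here is a minimal sketch of the fork(2)/exec(2)/wait sequence that system(3C) is described above as wrapping, written out by hand so each step is visible and the exit status of the command can be checked (the posted verifyTrace() ignores the return value of system()). The name runCommand is hypothetical and not part of the posted program. The sketch leaves out the signal handling that a real system() performs, and it still calls fork(2) from a threaded process, so it only illustrates the sequence and should not be read as a fix for the hang.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (an assumption for illustration, not from the
// original program): runs "sh -c cmd" in a child process and waits for it.
// Returns the command's exit status, or -1 if the child could not be
// created, could not be waited for, or did not exit normally.
int runCommand(const char* cmd)
{
    assert(cmd != 0);

    pid_t pid = fork();              // step 1: create the child process
    if (pid < 0)
        return -1;                   // fork failed

    if (pid == 0) {
        // step 2 (child): replace the process image with the shell
        execl("/bin/sh", "sh", "-c", cmd, (char*)0);
        _exit(127);                  // reached only if exec failed
    }

    // step 3 (parent): wait for this specific child, retrying on EINTR
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In verifyTrace() this could be tried in place of the first system(cmd) call, for example `if (runCommand(cmd) != 0) return false;`, so that a failed perl step is reported instead of being silently ignored.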