Weird process hang/loop
807567, Jul 30 2009, edited Jul 31 2009
All -
We are running Solaris 10 on a T5240 (2 CPUs, 8 cores, configured to run 128 virtual CPUs). One of the programs that runs successfully on our Solaris 8 platform is behaving oddly on the new box: intermittently, an execution of the program hangs. The high-level sequence of events is: a shell script calls a COBOL program, and the COBOL program hangs. I ran truss -p against the PID of the COBOL program. Two different hang scenarios produced two different truss outputs.
Output 1:
stat("M$FST345.TMP", 0xFFFFFFFF7FFFD830) Err#2 ENOENT
open("M$FST345.TMP", O_RDWR|O_CREAT, 0666) Err#13 EACCES [file_dac_write]
stat("M$FST346.TMP", 0xFFFFFFFF7FFFD6E8) Err#2 ENOENT
stat("M$FST346.TMP", 0xFFFFFFFF7FFFD830) Err#2 ENOENT
open("M$FST346.TMP", O_RDWR|O_CREAT, 0666) Err#13 EACCES [file_dac_write]
stat("M$FST347.TMP", 0xFFFFFFFF7FFFD6E8) Err#2 ENOENT
stat("M$FST347.TMP", 0xFFFFFFFF7FFFD830) Err#2 ENOENT
open("M$FST347.TMP", O_RDWR|O_CREAT, 0666) Err#13 EACCES [file_dac_write]
This repeats until the "M$FST" counter reaches 999, then it starts over. The output scrolls by very fast, and the process itself is consuming a lot of CPU time.
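One possible reading of this trace, offered as an assumption rather than a diagnosis: the runtime is searching the current directory for an unused temp file name, and it treats the EACCES from open() as "try the next name" instead of as a fatal error. Since a permission failure on create does not depend on the file name, that search would never end. The Python sketch below only illustrates that pattern and one way to stop it. The function name, the starting counter of 000, and the choice of errno values to treat as fatal are all hypothetical and not taken from the COBOL runtime.

```python
import errno
import os


def first_creatable_temp(directory, prefix="M$FST", suffix=".TMP", limit=1000):
    """Make one pass over prefix000..prefix999 (hypothetical naming scheme).

    Returns (path, None) when a file was created, or (None, errno) when
    creation failed. Unlike the loop seen in the trace, this stops on the
    first error that a different file name cannot fix.
    """
    last_err = None
    for n in range(limit):
        path = os.path.join(directory, f"{prefix}{n:03d}{suffix}")
        if os.path.exists(path):  # plays the role of the stat() calls
            continue
        try:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
        except OSError as e:
            last_err = e.errno
            # EACCES, EROFS and ENOENT here describe the directory,
            # not the name, so retrying with the next number is pointless.
            if e.errno in (errno.EACCES, errno.EROFS, errno.ENOENT):
                return None, e.errno
            continue
        os.close(fd)
        return path, None
    return None, last_err
```

If this reading is right, the thing to check is whether the user running the job can create files in the program's working directory on the new box.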
The second hang scenario's truss output is:
nanosleep(0xFFFFFFFF7FFF7670, 0xFFFFFFFF7FFF7660) = 0
lseek(115, 0, SEEK_SET) = 0
read(115, "FE S0202040403FF\0010702".., 241) = 241
fcntl(114, F_SETLK, 0xFFFFFFFF7FFF7130) Err#11 EAGAIN
nanosleep(0xFFFFFFFF7FFF7670, 0xFFFFFFFF7FFF7660) = 0
lseek(115, 0, SEEK_SET) = 0
read(115, "FE S0202040403FF\0010702".., 241) = 241
fcntl(114, F_SETLK, 0xFFFFFFFF7FFF7130) Err#11 EAGAIN
The same sequence repeats, although the output from this truss run scrolls much more slowly.
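For this second trace, fcntl(F_SETLK) returning EAGAIN means a non-blocking record lock request was refused because another process holds a conflicting lock. The loop then sleeps and tries again, which would explain the slower output. The Python sketch below shows the same sleep-and-retry pattern with a bound on the number of attempts, so a lock that is never released surfaces as a failure instead of a hang. The function name and the attempts/delay parameters are hypothetical. Python's fcntl.lockf is used because it issues fcntl-style record locks.

```python
import errno
import fcntl
import time


def lock_with_retry(fd, attempts=5, delay=0.1):
    """Try a non-blocking exclusive record lock on fd (opened for writing).

    Returns True once the lock is held, or False after `attempts`
    refusals. The trace above appears to retry without any such limit.
    """
    for _ in range(attempts):
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError as e:
            # A conflicting lock is reported as EAGAIN or EACCES,
            # depending on the platform; anything else is a real error.
            if e.errno not in (errno.EAGAIN, errno.EACCES):
                raise
            time.sleep(delay)
    return False
```

If this is what is happening, the useful question is which other process holds the lock on the file behind descriptor 114, and why it is not releasing it.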
I apologize for what is almost certainly a lack of detail here - I am a DBA helping out :-) Any advice is appreciated. The processes are still hung at this moment.
Thanks -
Mike