Skip to Main Content

Database Software

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Process in "uninterruptible sleep" State

Linux-RAC-AdminAug 30 2011 — edited Sep 1 2011
We recently migrated a large amount of data from a standalone HPUX 10g database onto our oracle Linux RAC environment 11g ASM database. DBA's are reporting the insert/update batch process they run is taking an abnormal amount of time. In Oracle we see a large abount of 'db_file_sequential_read' waits amoung other wait events for the job including 'gc cr block lost'.

We have investigated this thoroughly from a server perspective. The RAC is three nodes, each with dual-quad cpu, 32GB RAM attached to a Clariion cx3-40 SAN. The interconnect is 1GB ethernet on a private vlan.

When this process is running, I find it spends a lot of time in an "uninterruptible sleep" state -

[root@bkvdw03da system]# ps aux | grep " D"
oracle 13041 4.5 2.9 4934512 988288 ? Ds Aug29 64:00 oraclerptwa3 (LOCAL=NO)

I ran an strace on this pid for a few minutes -

[root@bkvdw03da system]# cat strace-08302011.out
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.28 0.204891 15 13903 pread
0.85 0.001762 1 1395 poll
0.63 0.001317 0 30667 getrusage
0.09 0.000198 0 711 sendmsg
0.09 0.000193 0 8291 times
0.06 0.000119 0 3512 1405 recvmsg
0.00 0.000000 0 1 close
0.00 0.000000 0 1 sched_yield
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 bind
0.00 0.000000 0 1 getsockname
0.00 0.000000 0 1 setsockopt
0.00 0.000000 0 1 getsockopt
0.00 0.000000 0 2 fcntl
0.00 0.000000 0 3 semtimedop
------ ----------- ----------- --------- --------- ----------------
100.00 0.208480 58491 1405 total

I am very concerned about what the 1405 recvmsg errors are!

The raw trace data shows that the sendmsg and recvmsg is the traffic on the private interconnect.

getrusage(RUSAGE_SELF, {ru_utime={3511, 531166}, ru_stime={288, 325167}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={3511, 531166}, ru_stime={288, 325167}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={3511, 531166}, ru_stime={288, 325167}, ...}) = 0
=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 3, 1250) = 1 ([{fd=12, revents=POLLIN|POLLRDNORM}])getrusage(RUSAGE_SELF, {ru_utime={3511, 531166}, ru_stime={288, 325167}, ...}) = 0
pread(13, "\6\302\0\0\336\356\207\6\207\274H\265\r\0\1\4\344#\0\0\1\0\0\0\351\7\t\0\207\274H\265"..., 16384, 368481632256) = 16384
getrusage(RUSAGE_SELF, {ru_utime={3511, 531166}, ru_stime={288, 325167}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={3511, 531166}, ru_stime={288, 325167}, ...}) = 0
pread(21, "\6\302\0\0r\357\207\6\266g;\253\r\0\1\4\214\353\0\0\1\0\0\0\351\7\t\0t7:\253"..., 16384, 368487202816) = 16384
getrusage(RUSAGE_SELF, {ru_utime={3511, 532165}, ru_stime={288, 325167}, ...}) = 0
sendmsg(6, {msg_name(16)={sa_family=AF_INET, sin_port=htons(38472), sin_addr=inet_addr("172.16.10.3")}, msg_iov(3)=[{"\4\3\2\1\210Z\374\4\0\0\0\0MRON\0\2\0\0\0\0\0\0Q\311\260%l\347\227#"..., 76}, {"\2\10\375\t\375\t\0\0\240\1\0\0\240\1\0\0\0\0\0\0\0\0\0\0\1\0\0\0", 28}, {"KSXP\2\0\0\0\3\0\2\0.\0\0\0\6\0\0\0\261\25'\325\r\0\0\0\214mZ\243"..., 416}], msg_controllen=0, msg_flags=0}, 0) = 520
times({tms_utime=351153, tms_stime=28832, tms_cutime=0, tms_cstime=0}) = 558400964
poll([{fd=10, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=6, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=12, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 3, 16) = 1 ([{fd=6, revents=POLLIN|POLLRDNORM}])
recvmsg(6, {msg_name(16)={sa_family=AF_INET, sin_port=htons(38469), sin_addr=inet_addr("172.16.10.3")}, msg_iov(3)=[{"\4\3\2\1\357\17|\30\0\0\0\0MRON\0\2\0\0\0\0\0\0\327\323TB\0\0\0\0"..., 76}, {"\1\0\0\0\254\20\n\3\330\0\0\0\0\0\0\0\0\0\0\0u\334v\35\0\0\0\0", 28}, {"\0\0\0\0\v\0\0\0Q\311\260%\0\0\0\0\336\177\3\0\0\0\0\0=jx\35\0\0\0\0"..., 216}], msg_controllen=0, msg_flags=0}, 0) = 320
recvmsg(6, 0x7fff73c7c020, 0) = -1 EAGAIN (Resource temporarily unavailable)times({tms_utime=351153, tms_stime=28832, tms_cutime=0, tms_cstime=0}) = 558400964
poll([{fd=10, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=6, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=12, events
*recvmsg(12, {msg_name(16)={sa_family=AF_INET, sin_port=htons(38469), sin_addr=inet_addr("172.16.10.3")}, msg_iov(2)=[{"\4\3\2\1=jx\35\0\0\0\0MRON\0\2\0\0\0\0\0\0\327\323TB\0\0\0\0"..., 76}, {"\2\4\0\0\0\0\0\0 \0\0\0\0\0\0\0\0\0\0\0 \0\0\0\1\0J\340", 28}], msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK) = 104*recvmsg(12, {msg_name(16)={sa_family=AF_INET, sin_port=htons(38469), sin_addr=inet_addr("172.16.10.3")}, msg_iov(3)=[{"\4\3\2\1=jx\35\0\0\0\0MRON\0\2\0\0\0\0\0\0\327\323TB\0\0\0\0"..., 76}, {"\2\4\0\0\0\0\0\0 \0\0\0\0\0\0\0\0\0\0\0 \0\0\0\1\0J\340", 28}, {"\6\2\1\0007U\212\6\261\25'\325\r\0000\221\1\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0", 32}], msg_controllen=0, msg_flags=0}, 0) = 136
recvmsg(12, 0x7fff73c7bb80, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
getrusage(RUSAGE_SELF, {ru_utime={3511, 532165}, ru_stime={288, 325167}, ...}) = 0
pread(21, "\6\302\0\0007U\212\6\27\250x\253\r\0\1\4|\r\0\0\2\0-\0\253\376\10\0\27\250x\253"..., 16384, 369013669888) = 16384
times({tms_utime=351153, tms_stime=28832, tms_cutime=0, tms_cstime=0}) = 558400966
times({tms_utime=351153, tms_stime=28832, tms_cutime=0, tms_cstime=0}) = 558400966


The error appears on two file descriptors -

recvmsg(6, 0x7fff73c7c020, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(12, 0x7fff73c7c020, 0) = -1 EAGAIN (Resource temporarily unavailable)


Checking the file descriptor they are assigned to sockets -

[root@bkvdw03da system]# ls -l /proc/13041/fd
...
lrwx------ 1 oracle dba 64 Aug 30 08:53 12 -> socket:[37494163]
...
lrwx------ 1 oracle dba 64 Aug 30 08:53 6 -> socket:[34959745]


This is pretty much where I get over my head. The Oracle DBA's are investigating with an SR, but i am trying to understand what all this information tells me about this particular process. And why the recvmsg errors are occuring. Is there a tuning parameter on the server i need to look at?

I appreciate any insight!
Comments
Locked Post
New comments cannot be posted to this locked post.
Post Details
Locked on Sep 29 2011
Added on Aug 30 2011
1 comment
938 views