Java HotSpot Virtual Machine

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Thread blocked in NIO non-blocking read, possibly related to GC

843829May 25 2010 — edited May 25 2010

Hello,
I've come across a pretty odd (to me) situation that threads performing non-blocking IO operations blocked occasionally, for several seconds.
After some observation I think this might be related to GC behavior of the JVM, though I don't know why. Could anyone experienced with the problem help me, please!
Here's the story.
I'm implementing a server, which using NIO (via modified thrift, to be specified) to handle networking. The selector thread selects, and when something is ready, submits tasks to a ThreadPoolExecutor which in turn deals with network IO and executes application logic.

At first, I found something in my log:

16:59:19.246 [pool-1-thread-5] INFO  o.a.t.server.THsHaServer - slow read schedule:10110 sun.nio.ch.SelectionKeyImpl@48fd918a
16:59:19.246 [pool-1-thread-3] INFO  o.a.t.server.THsHaServer - slow read schedule:10110 sun.nio.ch.SelectionKeyImpl@29f01a29
...

It means that there are 11010 milliseconds between submiting the read task to the ThreadPoolExecutor and completion of the task. However normally such a task is done in 10 ms or even 1 ms.
This happend once every several minutes. Everytime it happens, there are the same amount of slow tasks as the amount of client connections. Looks like all the workers in thread pool paused for a moment.

So I turned on PrintGCDetail, jstat and jstack and found the following fact:
1. Everytime the delay happens, It happens with CMS. But not vice versa. sometimes CMS goes silently, as expected.
2. Everytime the delay happens, the slow log is printed exactly as soon as ASCMS-concurrent-sweep is done. The slow task is started almost exactly at the end of a certain ASParNew before ASCMS-initial-mark -- that is, the ASParNew immediately before initial-mark, or the one before that one. For example, the first slow task above is submited at: 16:59:19.246 - 10.110 = 16:59:9.136
Here is related GC information in stdout:

2010-05-25T16:59:09.136+0800: 3651.412: [GC 3651.412: [ASParNew: 320443K->36252K(344256K), 0.0545040 secs] 1505365K->1226646K(2048192K), 0.0546100 secs] [Times: user=0.20 sys=0.00, real=0.05 secs]
Total time for which application threads were stopped: 0.0549950 seconds
2010-05-25T16:59:09.192+0800: 3651.468: [GC [1 ASCMS-initial-mark: 1190393K(1703936K)] 1227409K(2048304K), 0.0260460 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
Total time for which application threads were stopped: 0.0263550 seconds
2010-05-25T16:59:09.219+0800: 3651.495: [ASCMS-concurrent-mark-start]
2010-05-25T16:59:12.422+0800: 3654.698: [ASCMS-concurrent-mark: 3.203/3.203 secs] [Times: user=3.21 sys=0.09, real=3.20 secs]
2010-05-25T16:59:12.422+0800: 3654.698: [ASCMS-concurrent-preclean-start]
2010-05-25T16:59:12.431+0800: 3654.707: [ASCMS-concurrent-preclean: 0.009/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2010-05-25T16:59:12.431+0800: 3654.708: [ASCMS-concurrent-abortable-preclean-start]
Total time for which application threads were stopped: 1.0597940 seconds
Total time for which application threads were stopped: 0.0002230 seconds
Total time for which application threads were stopped: 0.0000810 seconds
Total time for which application threads were stopped: 1.0611210 seconds
Total time for which application threads were stopped: 0.0001800 seconds
Total time for which application threads were stopped: 0.0000720 seconds
 CMS: abort preclean due to time 2010-05-25T16:59:17.571+0800: 3659.847: [ASCMS-concurrent-abortable-preclean: 1.281/5.139 secs] [Times: user=3.52 sys=0.08, real=5.14 secs]
2010-05-25T16:59:17.571+0800: 3659.847: [GC[YG occupancy: 65494 K (344368 K)]3659.847: [Rescan (parallel) , 0.0180070 secs]3659.865: [weak refs processing, 0.0000240 secs] [1 AS
CMS-remark: 1190393K(1703936K)] 1255887K(2048304K), 0.0181460 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
Total time for which application threads were stopped: 0.0184440 seconds
2010-05-25T16:59:17.590+0800: 3659.866: [ASCMS-concurrent-sweep-start]
2010-05-25T16:59:19.246+0800: 3661.522: [ASCMS-concurrent-sweep: 1.656/1.656 secs] [Times: user=1.72 sys=0.05, real=1.66 secs]
2010-05-25T16:59:19.246+0800: 3661.523: [ASCMS-concurrent-reset-start]
2010-05-25T16:59:19.259+0800: 3661.535: [ASCMS-concurrent-reset: 0.012/0.012 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
Total time for which application threads were stopped: 1.0517150 seconds
Total time for which application threads were stopped: 0.0002160 seconds
Total time for which application threads were stopped: 0.0000620 seconds

And here is the jstat output:

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT   
24704.0 48516.0 24703.8  0.0   295740.0 62679.8  1703936.0  1184922.1  24528.0 14602.7    404   24.809   6      0.229   25.038
24704.0 48516.0 24703.8  0.0   295740.0 202706.9 1703936.0  1184922.1  24528.0 14602.7    404   24.809   6      0.229   25.038
48848.0 48516.0  0.0   36252.8 295852.0  4384.6  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0  7182.7  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 11674.6  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 16039.5  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 20402.4  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 24610.4  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 24771.9  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 25977.4  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 28669.0  1703936.0  1190393.3  24528.0 14602.7    405   24.864   7      0.255   25.118
48848.0 48516.0  0.0   36252.8 295852.0 32718.5  1703936.0  1116044.3  24528.0 14602.7    405   24.864   8      0.273   25.137

3. When the delay happens, all worker threads in pool is jammed in nio:

"pool-1-thread-5" prio=10 tid=0x000000004b48f800 nid=0x3e58 waiting on condition [0x0000000042873000]
   java.lang.Thread.State: RUNNABLE
        at java.nio.Bits.copyToByteArray(Native Method)
        at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224)
        at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191)
        at sun.nio.ch.IOUtil.read(IOUtil.java:209)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        - locked <0x00002aaac66efb88> (a java.lang.Object)
        at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:154)
            ...
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

   Locked ownable synchronizers:
        - <0x00002aaac66c0248> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

All worker threads are in this state.

So what's all these threads are waiting for? Why is it happens at GC and resumes at the end of ASCMS-concurrent-sweep?
I'm totally lost here.. Any clue is appreciated, thanks in advance!

Locked Post

New comments cannot be posted to this locked post.

Locked on Jun 22 2010

Added on May 25 2010

#java-virtual-machine-jvm

4 comments

2,477 views