Questions about the performance, memory usage, and stability of JMX management
I am having difficulty understanding the exact operation of JMX in our cluster, particularly as it pertains to the responsiveness and stability of the management node.
I believe we are following the recommended best practices for managing a large production cluster. That is, we have a bit over 250 storage-only and proxy-only nodes, all running with the switch "-Dtangosol.coherence.management.remote=true".
And we have one management-only node that is running with a command line including:
-Dcom.sun.management.jmxremote
-Dtangosol.coherence.management=all
-Dtangosol.coherence.management.remote=true
-Dtangosol.coherence.management.remote.connectionport=7250
-Dtangosol.coherence.management.remote.host=<whichever host>
-Dtangosol.coherence.management.remote.registryport=7251
com.tangosol.net.management.MBeanConnector
-rmi
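For reference, this is roughly how we connect to that node from a standalone client. The hostname is a placeholder, and the service:jmx:rmi URL with the /server JNDI name is my reading of what the -rmi connector registers, so please correct me if that pattern is wrong:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MgmtNodeProbe {
    public static void main(String[] args) throws Exception {
        String host = "mgmt-host"; // placeholder for <whichever host>
        // connection port 7250 and registry port 7251 match the flags above
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi://" + host + ":7250/jndi/rmi://" + host + ":7251/server");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Quick sanity check: how many MBeans is the management node aggregating?
            int total = mbs.queryNames(new ObjectName("Coherence:*"), null).size();
            System.out.println("MBeans in the Coherence domain: " + total);
        } finally {
            connector.close();
        }
    }
}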
Primarily I am trying to resolve why the management node often becomes very unresponsive to tools that try to connect to it and/or crashes with an OutOfMemoryError. But beyond that, I want to understand exactly what JMX is doing on both the normal cluster nodes and the management node, so that I can make predictions about the performance and memory cost of adding new caches, adding new custom MBeans or MBean attributes, and adding new nodes to the cluster.
1. Am I correct in understanding that our setup follows best practice, and that this means the non-management nodes should see very little impact from any amount of JMX activity, since they are only serving Management service invocations?
2. What are the main memory considerations for the management node? I don't want to simply increase the node's heap size until I understand the major factors that contribute to its memory usage, so that I can be confident the new heap size will be sufficient to run indefinitely. (It is currently at 1 GB and I am still getting the OOM; the heap-dump flags sketched after this list are the only diagnostic I have for it so far.)
3. How much of a factor is the number of caches in the cluster, both for the performance of the management node and for any other parts of Coherence that could cause problems? Are thousands of caches a normal usage pattern, or is that expected to strain performance and stability? (The MBean-counting fragment after this list is how I gauge the scale on our side.)
4. We have observed that sometimes, after the management node becomes unresponsive or terminates with an OOM, all of the storage nodes in the cluster jump to 100% CPU utilization on our Linux servers. Could this be related? Does losing the management node have the potential to cause data or blocked threads to accumulate in the non-management nodes?
5. What is the best practice for failover of the management node? Just restart it if it goes down, or is there a way to have a second management node step in (without incurring the cost of duplicate JMX activity during normal operation)?
6. What kinds of tools are available to instrument the performance of JMX itself, outside of JMX? That is, how can I find out on my own what is causing JMX processing to take too long and use too much memory? (The timing fragment at the end of this list is the best I have come up with so far.)
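Regarding question 2: the only diagnostic I have in place so far is a heap dump on OOM via the standard HotSpot flags below (the dump path is just an example). A dump shows me what was on the heap when the node died, but not what growth to plan for, which is why I am asking about the contributing factors.

-Xmx1g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/tmp/coherence-mgmt-oom.hprof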
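Regarding question 3: this is the fragment I drop into the probe client above to count the cache MBeans the management node is aggregating. The Coherence:type=Cache,* pattern is my understanding of the MBean naming, so again correct me if that is off:

// Inside the try block of the probe client shown earlier
ObjectName cachePattern = new ObjectName("Coherence:type=Cache,*");
int cacheMBeans = mbs.queryNames(cachePattern, null).size();
System.out.println("Cache MBeans on the management node: " + cacheMBeans);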
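And regarding question 6: the closest I have gotten is timing individual attribute fetches from the client side, along the lines of the fragment below (ClusterSize on the Cluster MBean is just the attribute I happen to poll). That tells me when a fetch is slow, but nothing about why, hence the question.

// Also inside the probe client: crude client-side latency measurement
long start = System.nanoTime();
Object size = mbs.getAttribute(new ObjectName("Coherence:type=Cluster"), "ClusterSize");
long elapsedMs = (System.nanoTime() - start) / 1000000L;
System.out.println("ClusterSize=" + size + " fetched in " + elapsedMs + " ms");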
Alternatively, is there a good resource on the performance and stability of Coherence JMX management that I have not yet been able to find in the forums or the Developer's Guide?
Thanks,
Will Hutchinson