How to get per-cache JMX management data on very large clusters
749168 · Jan 31 2011, edited Feb 2 2011

I have a large cluster (>250 nodes, >1000 caches) and am attempting to implement workable JMX management for it. From my experiments, it looks like the usual Coherence JMX setup (-Dtangosol.coherence.management.remote=true on all storage and proxy nodes, plus one management node acting as the MBean server for all of the other nodes) is not going to work for a cluster of this size.
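For reference, by "the usual setup" I mean just the stock system properties (names as of the Coherence 3.x line):

```
# every storage/proxy node: publish its MBeans to the cluster's MBean server
-Dtangosol.coherence.management.remote=true

# the one management node: host an MBean server for all nodes
-Dtangosol.coherence.management=all
```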
It is my understanding that any snapshot of data from the management node results in an individual request for each MBean the management node knows about. In our case that is >500,000 MBeans by default (on the order of 250 nodes x 1,000 caches x 2 per-cache MBean types), and so >500,000 individual requests to invoke and collate. This either kills the management node by running it out of memory, or else requires the management refresh delay to be set so high (minutes, or tens of minutes at least) that the data becomes much less useful.
I have come up with a few solutions that I believe I could implement immediately to work around this problem (a third-party product like RTView would also be worth investigating, but in this thread I want to focus on what can be achieved with Coherence alone). I am interested in feedback from anyone with a good understanding of Coherence JMX, or with experience managing large Coherence clusters:
1. Filter all per-cache MBeans (Cache MBeans and StorageManager MBeans) and all other n^2 MBeans (PointToPoint MBeans) out from ever being registered on the managed nodes, using com.tangosol.net.management.ObjectNameExcludeFilter. This should take care of the memory and responsiveness issues on the management node, but it leaves a new question: what alternative means do we have of getting at the data those MBeans would have given us? Does anyone know a way to get at least the size-in-bytes of caches, and other important per-cache stats, without going through JMX?
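For anyone else reading, the filter is wired up in the operational override file; something like the following sketch (the exclude regexes here are the ones I mean to block, and would need checking against the exact ObjectNames in use):

```xml
<coherence>
  <management-config>
    <mbean-filter>
      <class-name>com.tangosol.net.management.ObjectNameExcludeFilter</class-name>
      <init-params>
        <init-param>
          <param-type>string</param-type>
          <param-value>
            .*type=Cache,.*
            .*type=StorageManager,.*
            .*type=PointToPoint,.*
          </param-value>
        </init-param>
      </init-params>
    </mbean-filter>
  </management-config>
</coherence>
```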
2. Run a local-only MBean server on every storage/proxy node in addition to the management node, and have the management node only ever access non-per-cache MBeans. Each storage/proxy node can locally read its cache MBeans and aggregate them into a single per-node MBean carrying the data for all of the caches on that node. This works in theory, but it concerns me greatly that anyone with access to the management node could kill management in the cluster by accidentally querying the wrong MBeans from it. I would prefer a solution where no standard JMX request to the management node can end up breaking the cluster.
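A rough sketch of the per-node aggregation in #2, using only the plain JMX API. The "Coherence:type=Cache,*" pattern and the Units attribute are the real Coherence names (Units times UnitFactor gives bytes under the binary unit calculator), but the FakeCache stand-in and the summing logic are mine, just to show the shape of it:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.util.Set;

public class NodeCacheSummary {

    // Stand-in for a Coherence Cache MBean; the real ones expose Units
    // among many other attributes.
    public interface FakeCacheMBean { long getUnits(); }
    public static class FakeCache implements FakeCacheMBean {
        private final long units;
        public FakeCache(long units) { this.units = units; }
        public long getUnits() { return units; }
    }

    /** Sum one numeric attribute across every MBean matching the pattern. */
    public static long aggregate(MBeanServer server, String pattern,
                                 String attribute) throws Exception {
        long total = 0;
        Set<ObjectName> names = server.queryNames(new ObjectName(pattern), null);
        for (ObjectName name : names) {
            total += ((Number) server.getAttribute(name, attribute)).longValue();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Two stand-ins for the per-cache MBeans a storage node sees locally.
        server.registerMBean(new FakeCache(1000),
                new ObjectName("Coherence:type=Cache,name=orders,nodeId=1"));
        server.registerMBean(new FakeCache(2500),
                new ObjectName("Coherence:type=Cache,name=trades,nodeId=1"));
        System.out.println("total units on this node: "
                + aggregate(server, "Coherence:type=Cache,*", "Units"));
        // prints: total units on this node: 3500
    }
}
```

The aggregate result would then be exposed as the attributes of one summary MBean per node, so the management node pulls one MBean per node instead of thousands.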
3. Similar to #2: is there any way for certain MBeans to be visible only to local MBean servers on storage/proxy nodes, and never be seen by the remote MBean server on the management node? com.tangosol.net.management.ObjectNameExcludeFilter does not appear to provide this. If you filter out cache MBeans on the management node but do not filter them out on the storage nodes, the management node can still see and access them. Are there other ways to block these that I am not aware of?
4. Run all storage/proxy nodes with their own local-only MBean server and -Dtangosol.coherence.management.remote=false, and run no management node at all. Aggregate the information at the per-node level locally on each node, then forward the aggregated data by some other means, most likely by storing it in a Coherence cache and reading it over a Coherence*Extend connection. I am leaning most toward this solution for several reasons: it makes it easy to capture any piece of the default per-cache MBean data we find we need, now or in the future; it makes it easy to monitor how much performance impact management is having, since it is just another set of normally-measured Coherence operations; and the management connections into the cluster are protected, load-balanced, and fail over in the same way all of our other connections into the cluster do. The data load of this information really does not appear to be that high, and anyway, the whole reason we have Coherence is that it is good at filing away huge amounts of data like this and providing a lot of connections into that data.
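A sketch of the snapshot half of #4, again in plain JMX so it stands alone: each node reads its own local cache MBeans into a small serializable map keyed by cache name. The "management-stats" cache name, the member-id key, and the commented-out publish call are my assumptions about how it would be wired into Coherence; DemoCache is just a stand-in for the real Cache MBeans:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StatsSnapshot {

    // Stand-in for a Coherence Cache MBean (the real ones expose Units etc.).
    public interface DemoCacheMBean { long getUnits(); }
    public static class DemoCache implements DemoCacheMBean {
        private final long units;
        public DemoCache(long units) { this.units = units; }
        public long getUnits() { return units; }
    }

    /**
     * Read one numeric attribute from every local cache MBean and key the
     * result by cache name: a small, serializable per-node snapshot.
     */
    public static Map<String, Long> snapshot(MBeanServer server,
                                             String attribute) throws Exception {
        Map<String, Long> out = new HashMap<>();
        Set<ObjectName> names =
                server.queryNames(new ObjectName("Coherence:type=Cache,*"), null);
        for (ObjectName name : names) {
            out.merge(name.getKeyProperty("name"),
                    ((Number) server.getAttribute(name, attribute)).longValue(),
                    Long::sum);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(new DemoCache(1000),
                new ObjectName("Coherence:type=Cache,name=orders,nodeId=1"));
        Map<String, Long> stats = snapshot(server, "Units");
        // Publish under this node's member id so extend clients can read the
        // stats cache like any other data (Coherence API, not compiled here):
        //   CacheFactory.getCache("management-stats").put(localMemberId, stats);
        System.out.println(stats); // prints: {orders=1000}
    }
}
```

Running that snapshot-and-put on a timer on every node would give monitoring clients one map per node per interval, which Coherence should file away without breaking a sweat.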
Or is this totally crazy?