lock lossage on distributed cache
I have a cache as follows in a 3.5.2 Coherence environment.
<cache-mapping>
    <cache-name>PartitionMappingEventQueues</cache-name>
    <scheme-name>DistributedNearCacheStringKeyMultiCacheBackingStoreFrontScheme</scheme-name>
</cache-mapping>

<near-scheme>
    <scheme-name>DistributedNearCacheStringKeyMultiCacheBackingStoreFrontScheme</scheme-name>
    <invalidation-strategy>auto</invalidation-strategy>
    <front-scheme>
        <local-scheme>
            <scheme-ref>unlimited-backing-map</scheme-ref>
        </local-scheme>
    </front-scheme>
    <back-scheme>
        <distributed-scheme>
            <scheme-ref>DistributedNearCacheStringKeyMultiCacheBackingStoreBackScheme</scheme-ref>
        </distributed-scheme>
    </back-scheme>
    <autostart>false</autostart>
</near-scheme>

<distributed-scheme>
    <scheme-name>DistributedNearCacheStringKeyMultiCacheBackingStoreBackScheme</scheme-name>
    <service-name>DistributedNearCacheStringKeyMultiCacheBackingStoreService</service-name>
    <backing-map-scheme>
        <read-write-backing-map-scheme>
            <scheme-ref>CacheStoreDatabaseStringKeyMultiCacheScheme</scheme-ref>
        </read-write-backing-map-scheme>
    </backing-map-scheme>
    <autostart>false</autostart>
</distributed-scheme>

<local-scheme>
    <scheme-name>unlimited-backing-map</scheme-name>
</local-scheme>
Hopefully that's enough detail without also providing the cachestore definition.
I'm using Coherence locks on the PartitionMappingEventQueues cache to implement
a role-identity mechanism for servers. Whoever holds the lock at any given time
is the "primary agent" for a role and is empowered to perform certain operations
on the cluster. One particular type of server grabs the lock and intends to keep
it for the duration of its process lifetime; it is normally the de-facto primary
agent while it is up, and it never unlocks the lock. All other cluster members
participating in this scheme attempt to get the lock only when circumstances
require it, via trylock() (a zero-wait lock attempt), and if that fails they
assume the primary agent will do the work.
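To be concrete, here's a minimal sketch of the pattern, assuming the class name
and lock key shown are illustrative rather than my actual ones; it uses the
lock(key, wait)/unlock(key) calls NamedCache inherits from ConcurrentMap, where
a wait of -1 blocks indefinitely and a wait of 0 returns immediately:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    public class PrimaryAgentRole // illustrative name
    {
        private static final String ROLE_KEY = "primary-agent"; // hypothetical key

        // The de-facto primary agent: block until the lock is obtained, then
        // hold it for the lifetime of the process (it is never unlocked).
        public static void becomePrimary()
        {
            NamedCache cache = CacheFactory.getCache("PartitionMappingEventQueues");
            cache.lock(ROLE_KEY, -1); // -1 = wait indefinitely
            // from here on, this JVM acts as the primary agent
        }

        // Any other member: one zero-wait attempt; if it fails, assume the
        // primary agent will do the work.
        public static boolean doWorkIfPrimaryAbsent(Runnable work)
        {
            NamedCache cache = CacheFactory.getCache("PartitionMappingEventQueues");
            if (cache.lock(ROLE_KEY, 0)) // returns immediately if already held elsewhere
            {
                try
                {
                    work.run();
                    return true;
                }
                finally
                {
                    cache.unlock(ROLE_KEY);
                }
            }
            return false; // someone else (presumably the primary agent) holds it
        }
    }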
The problem I'm seeing is that the lock obtained by the de-facto primary agent
appears to get lost when there is sufficient cluster activity. In fact, there's
evidence that multiple locks on this cache are not being honored correctly.
It may be that the lock is being lost while we are adding and removing
storage-enabled members of the distributed-scheme service, though we never
drop to zero storage-enabled members for the service.
The process that intends to keep the lock for as long as it's running is NOT a
storage-enabled member of the service, if that matters.
All lock usage is strictly pessimistic:
lock() ... do stuff under the lock ... unlock()
Have I violated some assumption? How does a process that locks but never unlocks
a key lose the lock, such that some other process can obtain it?
We never see a member-left event for the primary-agent lock-holding process.
We have a component watching for member-left events where MemberEvent.isLocal() is true.
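That component is essentially a MemberListener registered on the cache service.
A minimal sketch of the idea, assuming the class name and logging shown are
illustrative and not my actual code:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.MemberEvent;
    import com.tangosol.net.MemberListener;
    import com.tangosol.net.NamedCache;

    public class DepartureWatcher implements MemberListener // illustrative name
    {
        public void memberJoined(MemberEvent evt)  { }
        public void memberLeaving(MemberEvent evt) { }

        public void memberLeft(MemberEvent evt)
        {
            // Report any departure seen by the service hosting the lock;
            // isLocal() is true when the departed member is this JVM itself.
            System.err.println("member-left: " + evt.getMember()
                + " isLocal=" + evt.isLocal());
        }

        public static void register()
        {
            NamedCache cache = CacheFactory.getCache("PartitionMappingEventQueues");
            cache.getCacheService().addMemberListener(new DepartureWatcher());
        }
    }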
Unfortunately I don't have the Coherence log for this to determine whether the
cluster thinks something is afoot. I'm working on reproducing the problem with
logging in place, but I wanted to check whether there are known problems in
3.5.2 or with the way I'm using the above scheme for locking.
Upgrading from 3.5.2 is not an option for me at the moment.