Guidelines for Health Monitoring for TimesTen
This document provides some guidance on monitoring the health of a TimesTen
datastore. Information is provided on monitoring the health of the
datastore itself, and on monitoring the health of replication.
There are two basic mechanisms for monitoring TimesTen:
1. Reactive - monitor for alerts either via SNMP traps (preferred) or
by scanning the TimesTen daemon log (very difficult) and reacting
to problems as they occur.
2. Proactive - probe TimesTen periodically and react if problems, or
potential problems, are detected.
This document focuses on the second (proactive) approach.
First, some basic recommendations and guidelines relating to monitoring
TimesTen:
1. Monitoring should be implemented as a separate process which maintains
a persistent connection to TimesTen. Monitoring schemes (typically based
on scripts) that open a connection each time they check TimesTen impose
an unnecessary and undesirable load on the system and are discouraged.
2. Many aspects of monitoring are 'stateful'. They require periodic
sampling of some metric maintained by TimesTen and comparing its
value with the previous sample. This is another reason why a separate
process with a persistent connection is desirable.
3. A good monitoring implementation will be configurable since the values
used for some of the checks may depend on, for example, the TimesTen
configuration
in use or the workload being handled.
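As an illustration, the skeleton below shows the general shape of such a
monitor process: one persistent connection, periodic sampling, and state
kept between samples. This is only a sketch; it assumes a Python
environment with the pyodbc module and an ODBC DSN named 'my_ttdb' (the
module, the DSN name and the interval are assumptions to be adapted to
your own environment).

  import time
  import pyodbc   # any ODBC binding will do; pyodbc is only an assumption

  CHECK_INTERVAL = 10     # seconds between probes (configurable)

  def sample_sys_monitor(conn):
      # Read the single row of SYS.MONITOR into a {column: value} dict
      cur = conn.cursor()
      cur.execute("SELECT * FROM SYS.MONITOR")
      row = cur.fetchone()
      cols = [d[0].lower() for d in cur.description]
      cur.close()
      return dict(zip(cols, row))

  def check_deltas(prev, curr):
      # Stateful checks go here; a fuller sketch appears later in this document
      pass

  def run_monitor():
      # One persistent connection for the lifetime of the monitor process;
      # autocommit so the monitor never holds a transaction open
      conn = pyodbc.connect("DSN=my_ttdb", autocommit=True)
      prev = None
      while True:
          curr = sample_sys_monitor(conn)
          if prev is not None:
              check_deltas(prev, curr)
          prev = curr
          time.sleep(CHECK_INTERVAL)

  if __name__ == "__main__":
      run_monitor()
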
MONITORING THE HEALTH OF A DATASTORE
====================================
At the simplest level, this can be achieved by performing a simple SELECT
against one of the system tables. The recommended table to use is the
SYS.MONITOR table. If this SELECT returns within a short time then the
datastore can be considered basically healthy.
If the SELECT does not return within a short time then the datastore is
stuck in a low-level hang situation (incredibly unlikely and very serious).
More likely, the SELECT may return an error such as 994 or 846 indicating
that the datastore has crashed (again very unlikely, but possible).
A slightly more sophisticated version would also include an update to a
row in a dummy table. This would ensure that the datastore is also capable
of performing updates. This is important since if the filesystem holding
the transaction logs becomes full the datastore may start to refuse write
operations while still allowing reads.
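A minimal sketch of such a probe, under the same assumptions as the
earlier skeleton (pyodbc, autocommit connection). MONITOR_DUMMY and its
PROBE_COUNT column are a hypothetical single-row table created purely for
this check.

  import pyodbc

  def probe_datastore(conn):
      # Return True if the datastore answers a read and a write promptly.
      # conn.timeout (where the driver honours it) bounds the wait in seconds.
      try:
          conn.timeout = 10
          cur = conn.cursor()
          cur.execute("SELECT * FROM SYS.MONITOR")    # basic read check
          cur.fetchone()
          # Write check; this fails if, for example, the filesystem holding
          # the transaction logs is full (connection is in autocommit mode)
          cur.execute("UPDATE MONITOR_DUMMY SET PROBE_COUNT = PROBE_COUNT + 1")
          cur.close()
          return True
      except pyodbc.Error as err:
          # Errors such as 994 or 846 indicate the datastore has failed
          print("ALARM: datastore probe failed:", err)
          return False
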
Now, the SYS.MONITOR table contains many useful operational metrics. A more
sophisticated monitoring scheme could sample some of these metrics and
compute the delta between successive samples, raising an alert if the
delta exceeds some (configurable) threshold.
Some examples of metrics that could be handled in this way are:
PERM_IN_USE_SIZE and PERM_IN_USE_HIGH_WATER compared to PERM_ALLOCATED_SIZE
(to detect if the datastore is in danger of becoming full).
TEMP_IN_USE_SIZE and TEMP_IN_USE_HIGH_WATER compared to TEMP_ALLOCATED_SIZE
(ditto for temp area).
XACT_ROLLBACKS - excessive rollbacks are a sign of excessive database
contention or application logic problems.
DEADLOCKS - as for XACT_ROLLBACKS.
LOCK_TIMEOUTS - excessive lock timeouts usually indicate high levels of
contention and/or application logic problems.
CMD_PREPARES & CMD_REPREPARES - it is very important for performance that
applications use parameterised SQL statements that they prepare just once
and then execute many times. If these metrics are continuously increasing
then this points to bad application programming which will hurt
performance.
CMD_TEMP_INDEXES - if this value is increasing then the optimiser is
continually creating temporary indices to process certain queries. This
is usually a serious performance problem and indicates a missing index.
LOG_BUFFER_WAITS - if this value is increasing over time this indicates
inadequate logging capacity. You may need to increase the size of the
datastore log buffer (LogBuffSize) and log file size (LogFileSize). If that
does not alleviate the problem you may need to change your disk layout or
even obtain a higher performance storage subsystem.
LOG_FS_READS - this indicates an inefficiency in 'log snoop' processing as
performed by replication and the XLA/JMS API. To alleviate this you should
try increasing LogBuffSize and LogFileSize.
Checking these metrics is of course optional and not necessary for a basic
healthy/failed decision but if you do check them then you will detect more
subtle problems in advance and be able to take remedial action.
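To illustrate, the sketch below fills in the check_deltas placeholder from
the earlier skeleton for a few of these metrics. The column names are
those of SYS.MONITOR; the threshold values are invented examples and must
be tuned for your own configuration and workload.

  # Per-interval growth limits; the values are illustrative only
  DELTA_THRESHOLDS = {
      "xact_rollbacks":   100,
      "deadlocks":         10,
      "lock_timeouts":    100,
      "cmd_prepares":      50,
      "cmd_reprepares":    50,
      "cmd_temp_indexes":   1,
      "log_buffer_waits":   1,
      "log_fs_reads":       1,
  }

  def raise_alarm(msg):
      # Hook this into your alerting mechanism (SNMP trap, log file, ...)
      print("ALARM:", msg)

  def check_deltas(prev, curr):
      # Absolute checks: is either memory region close to full?
      if curr["perm_in_use_size"] > 0.9 * curr["perm_allocated_size"]:
          raise_alarm("permanent region is over 90% full")
      if curr["temp_in_use_size"] > 0.9 * curr["temp_allocated_size"]:
          raise_alarm("temporary region is over 90% full")
      # Delta checks: how much has each counter grown since the last sample?
      for metric, limit in DELTA_THRESHOLDS.items():
          delta = curr[metric] - prev[metric]
          if delta > limit:
              raise_alarm("%s grew by %d since the last sample"
                          % (metric, delta))
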
MONITORING THE HEALTH OF REPLICATION
====================================
This is a little more complex but is vital to achieve a robust and reliable
system. Ideally, monitoring should be implemented at both datastores, the
active and the standby. There are many more failure modes possible for
a replicated system than for a standalone datastore and it is not possible
to enumerate them all here. However, the information provided here should
be sufficient to form the basis of a robust monitoring scheme.
Monitoring replication at the ACTIVE datastore
----------------------------------------------
1. CALL ttDataStoreStatus() and check result set;
If no connections with type 'replication' exist, conclude that
the replication agents are stopped; restart the agents and skip
the remaining steps.
It is assumed here that the replication start policy is 'norestart'.
An alarm about unstable replication agents should be raised
if this is the Nth restart in M seconds (N and M are configuration parameters).
The alarm can later be cleared once the agents have stayed alive for K
seconds (K is a configuration parameter).
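A sketch of this step in the same style as the earlier examples. The
'type' column name and the exact 'replication' value should be verified
against the ttDataStoreStatus output of your release; the agents are
restarted here with the ttRepStart built-in procedure, and the N/M
bookkeeping is kept in a simple in-process list.

  import time

  restart_times = []      # timestamps of recent replication agent restarts

  def check_repagent(conn, n_limit, m_seconds):
      # Step 1: make sure replication agent connections exist
      cur = conn.cursor()
      cur.execute("CALL ttDataStoreStatus()")
      cols = [d[0].lower() for d in cur.description]
      rows = [dict(zip(cols, r)) for r in cur.fetchall()]
      cur.close()
      if any(str(r.get("type", "")).strip().lower() == "replication"
             for r in rows):
          return True                  # agents are running
      # Agents are down: restart them (start policy assumed to be 'norestart')
      cur = conn.cursor()
      cur.execute("CALL ttRepStart()")
      cur.close()
      restart_times.append(time.time())
      # Alarm if this is the Nth restart within M seconds
      recent = [t for t in restart_times if t > time.time() - m_seconds]
      if len(recent) >= n_limit:
          print("ALARM: replication agents unstable: %d restarts in %d seconds"
                % (len(recent), m_seconds))
      return False
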
2. CALL ttReplicationStatus() and check result set;
This returns a row for every replication peer for this datastore.
If pState is not 'start' for any peer, raise an alarm about paused or
stopped replication and skip the rest of the steps.
It is assumed that the master cannot help the fact that the state is not
'start'. An operator may have stopped or paused the replication, or
TimesTen may have stopped the replication because of the fail threshold
strategy. In the former case the operator will hopefully restart the
replication sooner or later (of course, after that TimesTen may stop it
again because of the fail threshold strategy). In the latter case the
standby side monitor process should recognise the fact and duplicate the
datastore with the setMasterRepStart option, which sets the state back
to 'start'.
If for any peer, lastMsg > MAX (MAX is a configuration parameter), raise
an alarm for potential communication problems.
Note that if replication is idle (nothing to replicate), or there is
very little replication traffic, the value for lastMsg may become as
high as 60 seconds without indicating any problem. The test logic
should cater for this (i.e. MAX must be > 60 seconds).
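A sketch of this step; the pState and lastMsg column names are those
referred to above, and (as noted) the lastMsg limit must be configured to
more than 60 seconds.

  def check_peers(conn, max_last_msg):
      # Step 2: check the replication state and last-message age of each peer
      cur = conn.cursor()
      cur.execute("CALL ttReplicationStatus()")
      cols = [d[0].lower() for d in cur.description]
      peers = [dict(zip(cols, r)) for r in cur.fetchall()]
      cur.close()
      for peer in peers:
          if str(peer["pstate"]).strip().lower() != "start":
              print("ALARM: replication to a peer is paused or stopped")
              return False             # skip the rest of the steps
          # lastMsg can legitimately reach ~60 seconds when replication is
          # idle, so max_last_msg must be larger than 60
          if peer["lastmsg"] > max_last_msg:
              print("ALARM: no message from peer for %s seconds"
                    % peer["lastmsg"])
      return True
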
3. CALL ttBookmark();
Compute the holdLSN delta between the values from this call and the
previous call and, if the delta is greater than the maximum allowed
(a configuration parameter), raise an alarm about a standby
that is too far behind. Continue to the next step.
Note that the maximum allowed delta should be less than FAILTHRESHOLD * logSize.
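A sketch of this step. It assumes the ttBookmark result set exposes the
hold LSN as a log file number and offset pair (shown as holdlsnhigh and
holdlsnlow below; verify the column names on your release) and converts
the delta to bytes so it can be compared against a limit that you keep
below FAILTHRESHOLD * logSize.

  prev_hold = None     # hold LSN (log file number, offset) from the last call

  def check_hold_lsn(conn, max_delta_bytes, log_file_size):
      # Step 3: alarm if the holdLSN delta exceeds the configured maximum
      global prev_hold
      cur = conn.cursor()
      cur.execute("CALL ttBookmark()")
      cols = [d[0].lower() for d in cur.description]
      row = dict(zip(cols, cur.fetchone()))
      cur.close()
      hold = (row["holdlsnhigh"], row["holdlsnlow"])   # column names assumed
      if prev_hold is not None:
          delta = ((hold[0] - prev_hold[0]) * log_file_size
                   + (hold[1] - prev_hold[1]))
          if delta > max_delta_bytes:  # keep this below FAILTHRESHOLD * logSize
              print("ALARM: standby is too far behind (delta %d bytes)" % delta)
      prev_hold = hold
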
4. CALL ttRepSyncSubscriberStatus(datastore, host);
This step is only needed if you are using RETURN RECEIPT or RETURN TWOSAFE
with the optional DISABLE RETURN feature.
If disabled is 1, raise an alarm for a disabled return service.
Continue to the next step. If the RESUME RETURN policy is not enabled we
could, of course, try to enable the return service again (especially when
DURABLE COMMIT is OFF).
However, there should normally be no reason to override TimesTen's own
mechanisms that control the return service, so no other action is taken
for a disabled return service.
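A sketch of this step, passing the standby's datastore and host name to
the built-in procedure; the result set is assumed to consist of the single
'disabled' flag referred to above.

  def check_return_service(conn, standby_store, standby_host):
      # Step 4: detect a return service that has been disabled
      cur = conn.cursor()
      cur.execute("CALL ttRepSyncSubscriberStatus(?, ?)",
                  (standby_store, standby_host))
      disabled = cur.fetchone()[0]
      cur.close()
      if disabled == 1:
          print("ALARM: return service to %s on %s is disabled"
                % (standby_store, standby_host))
          return False
      return True
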
Monitoring replication at the STANDBY datastore
-----------------------------------------------
1. CALL ttDataStoreStatus();
If no connections with type 'replication' exist, conclude that
the replication agents are stopped; restart the agents and skip
the remaining steps.
It is assumed that the replication start policy is 'norestart'.
An alarm about unstable replication agents should be raised
if this is the Nth restart in M seconds (N and M are configuration parameters).
The alarm can later be cleared once the agents have stayed alive for K
seconds (K is a configuration parameter).
2. Call SQLGetInfo(...,TT_REPLICATION_INVALID,...);
If the status is 1, this indicates that the active store has marked this store
as failed because it fell too far behind and exceeded the log FAILTHRESHOLD.
Start recovery actions by destroying the datastore and recreating via a
'duplicate' operation from the active.
3. Check 'timerecv' value for relevant row in TTREP.REPPEERS
If (timerecv - previous timerecv) > MAX (MAX is a configuration parameter),
raise an alarm for potential communication problems.
You can determine the correct row in TTREP.REPPEERS by first getting the
correct TT_STORE_ID value from TTREP.TTSTORES based on the values in
HOST_NAME and TT_STORE_NAME (you want the id corresponding to the active
store) and then using that to query TTREP.REPPEERS (you can use a join if
you like).
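For example, something along these lines; TIMERECV, TT_STORE_ID, HOST_NAME
and TT_STORE_NAME are the columns named above, while the column used to
join TTREP.REPPEERS to the active store's id is assumed here to be
TT_STORE_ID and should be verified against the TTREP schema of your
release.

  prev_timerecv = None

  def check_active_heartbeat(conn, active_host, active_store, max_delta):
      # Standby step 3: watch TIMERECV on the row describing the active peer
      global prev_timerecv
      cur = conn.cursor()
      cur.execute("SELECT p.TIMERECV "
                  "FROM TTREP.TTSTORES s, TTREP.REPPEERS p "
                  "WHERE s.HOST_NAME = ? AND s.TT_STORE_NAME = ? "
                  "AND p.TT_STORE_ID = s.TT_STORE_ID",  # join column assumed
                  (active_host, active_store))
      timerecv = cur.fetchone()[0]
      cur.close()
      if prev_timerecv is not None and (timerecv - prev_timerecv) > max_delta:
          print("ALARM: potential communication problem with the active store")
      prev_timerecv = timerecv
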
The recovery actions that should be taken in the event of a problem with
replication depend on several factors:
1. The application requirements
2. The type of replication configuration
3. The replication mode (asynchronous, return receipt or return twosafe)
that is in use
Consult the TimesTen replication guide for information on detailed recovery
procedures for each combination.
================================ END ==================================