Hi,
Webgate on OHS does not start if the OAM primary server is down.
Let me show some evidence before I share my thoughts on this.
Startup evidence
I get the following error when I try to start OHS:
$ ./opmnctl startall
opmnctl startall: starting opmn and all managed processes...
================================================================================
opmn id=VL-VM-WSA731:6701
Response: 0 of 1 processes started.
ias-instance id=WGLegacy
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
--------------------------------------------------------------------------------
ias-component/process-type/process-set:
WGLegacy1/OHS/OHS/
Error
--> Process (index=1,uid=714737196,pid=11519)
time out while waiting for a managed process to start
Log:
/apps/idm/instances/WGLegacy/diagnostics/logs/OHS/WGLegacy1/console~OHS~1.log
So OHS is telling us that a timeout occurred while waiting for the managed process to start.
Logs
I can see these timeouts in the log files:
The log has a lot of lines saying "child process xxxx still did not exit, sending a SIGKILL". Here is an example:
[2017-08-23T14:15:50.3738-04:00] [OHS] [ERROR:32] [OHS-9999] [core.c] [host_id: VL-VM-WSA731] [host_addr: 10.247.166.231] [pid: 11519] [tid: 1] [user: oracle] [VirtualHost: main] child process 11547 still did not exit, sending a SIGKILL
Then OHS shuts down:
[2017-08-23T14:15:51.3742-04:00] [OHS] [NOTIFICATION:16] [OHS-9999] [core.c] [host_id: VL-VM-WSA731] [host_addr: 10.247.166.231] [pid: 11519] [tid: 1] [user: oracle] [VirtualHost: main] caught SIGTERM, shutting down
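
To see how many child processes were killed during a failed start, the log lines above can be parsed with a small script. This is only a minimal sketch: it assumes every line follows the bracketed layout of the two examples shown here, and the function names are my own, not part of any Oracle tool.

```python
import re

# One bracketed header field, e.g. "[pid: 11519] " or "[OHS] ".
FIELD = re.compile(r"\[([^\]]*)\]\s*")


def parse_ohs_line(line):
    """Split one OHS log line into its bracketed fields and the trailing message.

    Returns None when the line does not look like an OHS log line.
    """
    fields = []
    pos = 0
    while True:
        match = FIELD.match(line, pos)
        if not match:
            break
        fields.append(match.group(1))
        pos = match.end()
    if len(fields) < 3:
        return None
    record = {
        "timestamp": fields[0],
        "component": fields[1],
        "level": fields[2].split(":")[0],  # "ERROR:32" -> "ERROR"
        "message": line[pos:].strip(),
    }
    # Remaining fields such as "pid: 11519" become key/value pairs.
    for field in fields[3:]:
        if ": " in field:
            key, value = field.split(": ", 1)
            record[key] = value
    return record


def killed_children(lines):
    """Return the child PIDs that the parent process had to SIGKILL."""
    pids = []
    for line in lines:
        record = parse_ohs_line(line)
        if record and "sending a SIGKILL" in record["message"]:
            match = re.search(r"child process (\d+)", record["message"])
            if match:
                pids.append(int(match.group(1)))
    return pids
```

Passing the lines of the console log to killed_children() gives the list of child PIDs that never exited on their own.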
snoop (tcpdump)
When I start OHS, I can see Webgate trying to reach the OAM primary server (which is down).
It should detect that the server is down and try to connect to the other primary server (or the secondaries).
Instead, I do not see any packets going to another primary or secondary server.
This shows that Webgate tries to connect ONLY to the first primary server.
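
To make the expected behaviour concrete, here is a rough sketch of what I mean by failover: try each configured server in order with a connect timeout and use the first one that answers. This is only an illustration of the idea, not how Webgate is actually implemented, and the function name is made up.

```python
import socket


def first_reachable(servers, timeout=3.0):
    """Return the first (host, port) that accepts a TCP connection, or None.

    `servers` is an ordered list of (host, port) tuples, primaries first.
    """
    for host, port in servers:
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue  # this server is down, try the next one
    return None
```

For example, first_reachable([("oam1.example.com", 5575), ("oam2.example.com", 5575)]) should return the second entry when the first server is down. The host names and the port are placeholders, so use the values from your own Webgate profile.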
truss
If I check a bit deeper (at the process level), I can see those timeouts as well.
I can see that the process starts a lot of LWPs (about 68) and then waits:
...
...
11537/1: 0.0000 lwp_wait(64, 0xFFFF80FFBFFFE9CC) = 0
11537/1: 0.0001 lwp_wait(65, 0xFFFF80FFBFFFE9CC) = 0
11537/1: 0.0000 lwp_wait(66, 0xFFFF80FFBFFFE9CC) = 0
11537/1: 0.0000 lwp_wait(67, 0xFFFF80FFBFFFE9CC) = 0
11537/1: 0.0000 lwp_wait(68, 0xFFFF80FFBFFFE9CC) = 0
The process receives SIGHUP and SIGTERM after a while. Then several lwp_kill calls occur:
/1: lwp_kill(69, SIGHUP) = 0
/69: Received signal #1, SIGHUP [caught]
/69: siginfo: SIGHUP pid=1727 tid=1 uid=5000 SI_LWP
/1: lwp_kill(69, SIG#0) = 0
...
...
/69: lwp_unpark(53) = 0
/1: Received signal #15, SIGTERM, in pollsys() [caught]
/1: siginfo: SIG#0
...
...
/17: lwp_exit()
/23: lwp_kill(9, SIG#0) Err#3 ESRCH
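
Since the truss output is long, a quick tally of system calls and received signals makes the pattern easier to see. The sketch below assumes the two line layouts shown above (with or without the pid and time-delta columns); the names are my own.

```python
import re
from collections import Counter

# Matches e.g. "11537/1: 0.0000 lwp_wait(64, 0x...) = 0"
# and "/1: lwp_kill(69, SIGHUP) = 0" (pid and time delta are optional).
CALL = re.compile(r"^\s*\d*/\d+:\s+(?:[\d.]+\s+)?([a-z_0-9]+)\(")

# Matches e.g. "/69: Received signal #1, SIGHUP [caught]".
SIGNAL = re.compile(r"Received signal #\d+, (SIG[A-Z0-9]+)")


def summarize_truss(lines):
    """Return (Counter of system call names, list of received signal names)."""
    calls = Counter()
    signals = []
    for line in lines:
        call = CALL.match(line)
        if call:
            calls[call.group(1)] += 1
            continue
        signal = SIGNAL.search(line)
        if signal:
            signals.append(signal.group(1))
    return calls, signals
```

Running summarize_truss() on the saved truss output should show a large lwp_wait count followed by the SIGHUP and SIGTERM deliveries.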
Conclusion
Given all this evidence, I think OHS loads the Webgate module, which then connects to the OAM primary server for one of these reasons:
- to get its configuration file (ObAccessClient.xml), or
- to try to resume the last known connections (from before the OAM primary server went down).
For some reason, Webgate is unable to switch to another primary or secondary server (and it should).
Has anyone else run into this problem?
Thank you.