OAM Webgate - Not starting if OAM primary is down

user3596587, Aug 23 2017 (edited Aug 25 2017)

Hi,

The Webgate on OHS does not start if the primary OAM server is down.

Let me show some evidence before I share my thoughts on this.

Startup evidence

I get the following error when I try to start OHS:

$ ./opmnctl startall

opmnctl startall: starting opmn and all managed processes...

================================================================================

opmn id=VL-VM-WSA731:6701

Response: 0 of 1 processes started.

ias-instance id=WGLegacy

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--------------------------------------------------------------------------------

ias-component/process-type/process-set:

WGLegacy1/OHS/OHS/

Error

--> Process (index=1,uid=714737196,pid=11519)

time out while waiting for a managed process to start

Log:

/apps/idm/instances/WGLegacy/diagnostics/logs/OHS/WGLegacy1/console~OHS~1.log

So OHS is telling us that a timeout occurred.

Logs

I can see these timeouts in the log files:

The log has a lot of lines saying "child process xxxx still did not exit, sending a SIGKILL". Here is an example:

[2017-08-23T14:15:50.3738-04:00] [OHS] [ERROR:32] [OHS-9999] [core.c] [host_id: VL-VM-WSA731] [host_addr: 10.247.166.231] [pid: 11519] [tid: 1] [user: oracle] [VirtualHost: main] child process 11547 still did not exit, sending a SIGKILL

Then OHS shuts down:

[2017-08-23T14:15:51.3742-04:00] [OHS] [NOTIFICATION:16] [OHS-9999] [core.c] [host_id: VL-VM-WSA731] [host_addr: 10.247.166.231] [pid: 11519] [tid: 1] [user: oracle] [VirtualHost: main] caught SIGTERM, shutting down
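To get a rough idea of how many child processes had to be killed during a failed start, the matching log lines can be counted. This is only a minimal sketch: the function name count_sigkills is made up for illustration, and the log file to pass in is whichever OHS log contains the lines shown above.

```shell
# count_sigkills LOGFILE
# Prints the number of lines reporting a child process that did not exit
# and was sent a SIGKILL. The function name is hypothetical.
count_sigkills() {
  # grep -c prints the count of matching lines (0 if there are none)
  grep -c 'still did not exit, sending a SIGKILL' "$1"
}

# Example call (the path is a placeholder, not taken from the post):
#   count_sigkills /path/to/ohs.log
```

A count close to the number of configured child processes would suggest that every child hung the same way rather than just one of them.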

snoop (tcpdump)

When I start OHS, I can see Webgate trying to reach the OAM primary server (which is down).

It should detect that the server is down and try to connect to the other primary server (or to the secondaries).

Instead, the capture shows that Webgate tries to connect ONLY to the primary server that is down.

I do not see any packets going to another primary or secondary server.
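To rule out a plain network problem, each OAM server can be probed from the Webgate host with a simple TCP connect. The following is a sketch under assumptions, not something from my setup: the function name check_oam, the host names and the port in the example call are placeholders, and it assumes bash (for /dev/tcp) and the timeout utility are available on the host.

```shell
# check_oam HOST PORT [TIMEOUT_SECONDS]
# Prints "HOST:PORT up" if a TCP connection can be opened within the
# timeout (default 3 seconds), otherwise "HOST:PORT down".
# Assumes bash and the `timeout` utility are installed.
check_oam() {
  local host="$1" port="$2" wait="${3:-3}"
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT
  if timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "$host:$port up"
  else
    echo "$host:$port down"
  fi
}

# Example calls (hypothetical host names and port):
#   check_oam oam-primary1.example.com 5575
#   check_oam oam-primary2.example.com 5575
```

If the second primary reports "up" here while the capture still shows no packets going to it, the problem is in Webgate's failover behaviour rather than in the network.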

truss

If I dig a bit deeper (at the process level), I can see those timeouts as well.

I can see that the process starts a lot of LWPs (about 68) and then waits on them:

...

11537/1: 0.0000 lwp_wait(64, 0xFFFF80FFBFFFE9CC) = 0

11537/1: 0.0001 lwp_wait(65, 0xFFFF80FFBFFFE9CC) = 0

11537/1: 0.0000 lwp_wait(66, 0xFFFF80FFBFFFE9CC) = 0

11537/1: 0.0000 lwp_wait(67, 0xFFFF80FFBFFFE9CC) = 0

11537/1: 0.0000 lwp_wait(68, 0xFFFF80FFBFFFE9CC) = 0
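Rather than counting by eye, the number of distinct LWPs the process waited on can be pulled out of a saved truss output file. This is a small sketch; the function name count_waited_lwps is hypothetical and it assumes the output lines look like the lwp_wait lines above.

```shell
# count_waited_lwps TRUSSFILE
# Prints how many distinct LWP ids appear as the first argument of lwp_wait().
# The function name is hypothetical.
count_waited_lwps() {
  # Extract the numeric LWP id from each lwp_wait(...) line, de-duplicate, count
  sed -n 's/.*lwp_wait(\([0-9][0-9]*\),.*/\1/p' "$1" | sort -u | wc -l | tr -d ' '
}

# Example call (placeholder path to a saved truss output file):
#   count_waited_lwps /tmp/ohs.truss
```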

The process receives SIGHUP and SIGTERM after a while. Then several lwp_kill calls occur:

/1: lwp_kill(69, SIGHUP) = 0

/69: Received signal #1, SIGHUP [caught]

/69: siginfo: SIGHUP pid=1727 tid=1 uid=5000 SI_LWP

/1: lwp_kill(69, SIG#0) = 0

...

/69: lwp_unpark(53) = 0

/1: Received signal #15, SIGTERM, in pollsys() [caught]

/1: siginfo: SIG#0

...

/17: lwp_exit()

/23: lwp_kill(9, SIG#0) Err#3 ESRCH

Conclusion

Given all this evidence, I think OHS loads the Webgate module, which then connects to the OAM primary server for one of these reasons:

to get its configuration file (ObAccessClient.xml), or
to try to resume the last known connections (from before the OAM primary server went down).

For whatever reason, Webgate is unable to switch to another primary or secondary server, and it should.

Has anyone else run into this problem?

Thank you.


Locked on Sep 22 2017

Added on Aug 23 2017

#identity-management, #identity-manager
