Intermittent TNS-12535 Errors
Starting this past Friday, we began seeing database connection errors in our application logs (we run Ecometry software on application servers, connecting to an Oracle 10g database on Windows 2003). Digging into the sqlnet.log on the app server showed me TNS-12535 errors. Not a ton, but happening frequently enough to break some jobs while they're running. They all pretty much look like this:
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
Windows NT TCP/IP NT Protocol Adapter for 32-bit Windows: Version 10.2.0.1.0 - Production
Time: 01-NOV-2011 16:12:40
Tracing to file: C:\oracle\product\10.2.0\client_1\NETWORK\TRACE\sqlnet_620_3.trc
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 60
nt OS err code: 0
Client address: <unknown>
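To put a number on "frequently enough", the entries in sqlnet.log can be tallied per hour. This is a minimal Python sketch, assuming every entry has a "Time:" line followed by a "TNS-nnnnn:" line as in the sample above. The function name errors_per_hour is made up for illustration, and only the first TNS code after each "Time:" line is counted.

```python
import re
from collections import Counter

# Month abbreviations as they appear in sqlnet.log; mapped by hand to avoid locale issues.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"])}

TIME_RE = re.compile(r"^\s*Time:\s+(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}):\d{2}:\d{2}")
ERR_RE = re.compile(r"^\s*(TNS-\d{5}):")


def errors_per_hour(lines):
    """Return a Counter keyed by (hour bucket, first TNS code of the entry)."""
    counts = Counter()
    bucket = None
    for line in lines:
        m = TIME_RE.match(line)
        if m:
            day, mon, year, hour = m.groups()
            bucket = "%s-%02d-%s %s:00" % (year, MONTHS[mon], day, hour)
            continue
        m = ERR_RE.match(line)
        if m and bucket is not None:
            counts[(bucket, m.group(1))] += 1
            bucket = None  # ignore further TNS codes in the same entry
    return counts


# Example (the path is whatever your client's sqlnet.log location is):
# with open("sqlnet.log") as fh:
#     for key, n in sorted(errors_per_hour(fh).items()):
#         print(key, n)
```

A cluster of errors in particular hours would be one more hint toward a network or load-related cause rather than a client configuration problem.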
Tnsping and telnet to port 1521 both work fine from the app server command line. The database isn't recording anything in the alert.log, and in most cases, nothing in the listener.log either. This application setup has been running without any such issues for over a year. I rebooted the app server the other night in hopes of resolving it, but it's gotten worse. I'm planning to reboot the DB server tonight, although I know you should never need to do that.
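Since a single tnsping or telnet only proves that one attempt worked, a repeated probe may be more telling for an intermittent failure. Below is a rough Python sketch that opens a plain TCP connection to the listener port over and over and records the attempts that fail, along with how long each took. It assumes Python is available on the app server; the host name "dbserver" in the usage comment is a placeholder, and the 30 second timeout is chosen only so that it is longer than the delay seen in the errors.

```python
import socket
import time


def probe(host, port, timeout=30.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:  # includes timeouts and refused connections
        return False, time.monotonic() - start, exc


def probe_loop(host, port, attempts, pause=1.0, timeout=30.0):
    """Run several probes and return only the failures, with wall-clock times."""
    failures = []
    for _ in range(attempts):
        ok, elapsed, err = probe(host, port, timeout)
        if not ok:
            failures.append((time.strftime("%Y-%m-%d %H:%M:%S"), elapsed, err))
        time.sleep(pause)
    return failures


# Example usage (placeholder host name):
# for when, elapsed, err in probe_loop("dbserver", 1521, attempts=3600):
#     print(when, "failed after %.1f s:" % elapsed, err)
```

If this plain socket test also fails now and then with no Oracle code involved, that would point at the TCP/network layer rather than at the listener or the client configuration.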
I turned on SQL*Net client tracing today, and found the following pattern in every case where the error occurred:
[01-NOV-2011 16:12:18:882] nsopen: opening transport...
[01-NOV-2011 16:12:18:882] nttcon: entry
[01-NOV-2011 16:12:18:882] nttcon: toc = 1
[01-NOV-2011 16:12:18:882] nttcnp: entry
[01-NOV-2011 16:12:18:882] nttcnp: exit
[01-NOV-2011 16:12:18:882] nttcni: entry
[01-NOV-2011 16:12:18:882] nttcni: trying to connect to socket 3596.
[01-NOV-2011 16:12:40:100] ntt2err: entry
[01-NOV-2011 16:12:40:100] ntt2err: soc 3596 error - operation=1, ntresnt[0]=505, ntresnt[1]=60, ntresnt[2]=0
[01-NOV-2011 16:12:40:100] ntt2err: exit
[01-NOV-2011 16:12:40:210] nttcni: exit
[01-NOV-2011 16:12:40:210] nttcon: exit
[01-NOV-2011 16:12:40:210] nserror: nsres: id=0, op=65, ns=12535, ns2=12560; nt[0]=505, nt[1]=60, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
[01-NOV-2011 16:12:40:210] nsopen: unable to open transport
[01-NOV-2011 16:12:40:210] nsmfr: 1644 bytes at 0x1bbc460
[01-NOV-2011 16:12:40:210] nsmfr: 736 bytes at 0x1bbc178
[01-NOV-2011 16:12:40:210] nladget: entry
[01-NOV-2011 16:12:40:210] nladget: exit
There's always roughly 20 seconds (21.2 in the example above, from 16:12:18:882 to 16:12:40:100) between the "nttcni: trying to connect to socket" and the "ntt2err" lines, which makes me think some fixed timeout of about that length is kicking in here.
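To check that gap across a whole trace file instead of by eye, something like the following Python sketch could be used. It assumes the trace lines carry the bracketed timestamp format shown above (DD-MON-YYYY HH:MM:SS:mmm); the function names are made up for illustration.

```python
import re
from datetime import datetime

MONTHS = {m: i + 1 for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"])}

# Matches lines like: [01-NOV-2011 16:12:18:882] nttcni: entry
STAMP_RE = re.compile(
    r"^\[(\d{2})-([A-Z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2}):(\d{3})\] (.*)$")


def parse_line(line):
    """Return (timestamp, message) for a trace line, or None if it doesn't match."""
    m = STAMP_RE.match(line.strip())
    if not m:
        return None
    day, mon, year, hh, mm, ss, ms, text = m.groups()
    ts = datetime(int(year), MONTHS[mon], int(day),
                  int(hh), int(mm), int(ss), int(ms) * 1000)
    return ts, text


def connect_gaps(lines):
    """Seconds between each 'trying to connect' line and the next 'ntt2err: entry'."""
    gaps = []
    started = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, text = parsed
        if text.startswith("nttcni: trying to connect"):
            started = ts
        elif text.startswith("ntt2err: entry") and started is not None:
            gaps.append((ts - started).total_seconds())
            started = None
    return gaps
```

For the excerpt above this gives a single gap of about 21.2 seconds. If every failure in the trace shows nearly the same value, that supports the idea of a fixed timeout rather than a variable delay.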
I've made sure my sqlnet.ora files are set to SQLNET.AUTHENTICATION_SERVICES= (NONE) and NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT), and I've added a hosts file entry for my DB server on the app server so there's no DNS involved.
My network admin didn't find anything suspicious on the switches, but I'm not sure how far he looked. I'm thinking that if the DB server reboot doesn't fix it, it's got to be something in the network layer.
Anybody have any similar experiences?