Hello.
I'm hoping someone more experienced can give me some hints on how to investigate a weird issue, ideally someone who has faced something similar and was able to find the cause. We have a multi-threaded application in a production environment that processes fixed-format files and does some simple validations. The validations are all in C#, so on the DB side we basically have only CRUD operations. For the last year and a half the current version worked fine, but recently the client started to experience hangs. I made a full memory dump of the hanging process, looked at it in WinDbg, and found that some threads were stuck in ODP.NET calls. Here are a few examples:
Thread 1 .NET:
0b95eb38 7c8283ac [NDirectMethodFrameGeneric: 0b95eb38] Oracle.DataAccess.Client.OpsSql.ExecuteNonQuery(IntPtr, IntPtr ByRef, IntPtr ByRef, IntPtr ByRef, IntPtr, Int32 ByRef, Int32, Int32, Int64 ByRef, Oracle.DataAccess.Client.OpoSqlValCtx* ByRef, System.String, IntPtr ByRef, IntPtr[], System.String[], Oracle.DataAccess.Client.OpoMetValCtx* ByRef, Int32)
0b95eb80 0597a630 Oracle.DataAccess.Client.OracleCommand.ExecuteNonQuery()
0b95ecf0 07c88615 CIP.Repositories.Oracle.DomainRepository.DomainExists(System.String, Param[])
0b95ed0c 07c89a4a CIP.Repositories.Oracle.DomainRepository.ExistsCountry(System.String)
0b95ed1c 07c899e0 CIP.Control.Validators.DomainValidator+<>c__DisplayClass16.<ExistsCountry>b__15()
Thread 2 .NET:
0c4cf800 7c8283ac [InlinedCallFrame: 0c4cf800] Oracle.DataAccess.Client.OpsTxn.Commit(IntPtr, IntPtr, Oracle.DataAccess.Client.OpoTxnValCtx*)
0c4cf7fc 066930fd Oracle.DataAccess.Client.OracleTransaction.Commit()
0c4cf860 06692fe3 CIP.Repositories.Oracle.Infrastructure.OracleSession.CommitTransaction()
0c4cf868 07a8093c CIP.Control.Services.ServiceBase.WriteLuIntoRepo(CIP.NT4FilesModel.LogicalUnits.LuBase)
0c4cf8b4 08f2c6c3 CIP.BatchImporter.ProcExecGateways.BatchServices.ExecCntCpt(CIP.BatchImporter.ProcessMessages.CntCptInput)
Thread 3 .NET:
0c2cf2e8 7c8283ac [NDirectMethodFrameStandalone: 0c2cf2e8] Oracle.DataAccess.Client.OpsCon.CheckConStatus(IntPtr, IntPtr, Int32, Int32 ByRef, Int32, Int32)
0c2cf308 064f5053 Oracle.DataAccess.Client.ConnectionPool.CheckLifeTimeAndStatus(Oracle.DataAccess.Client.OpoConCtx ByRef, Int32, Boolean ByRef, Int32, Boolean)
0c2cf3a0 064f0718 Oracle.DataAccess.Client.ConnectionPool.PutConnection(Oracle.DataAccess.Client.OpoConCtx ByRef, Boolean, Boolean, Boolean, Int32)
0c2cf3a4 064f4b21 [InlinedCallFrame: 0c2cf3a4]
0c2cf51c 064f47f8 Oracle.DataAccess.Client.OracleConnection.Close()
Thread 4 .NET:
0d42f19c 7c8283ac [InlinedCallFrame: 0d42f19c] Oracle.DataAccess.Client.OpsTxn.Commit(IntPtr, IntPtr, Oracle.DataAccess.Client.OpoTxnValCtx*)
0d42f198 066930fd Oracle.DataAccess.Client.OracleTransaction.Commit()
0d42f1fc 06692fe3 CIP.Repositories.Oracle.Infrastructure.OracleSession.CommitTransaction()
0d42f204 07a85087 CIP.ElabManager.ElaborationStatusManager.WriteElaborationErrors(CIP.ElabManager.ErrorManager, CIP.NT4FilesModel.LogicalUnits.LuBase)
0d42f234 08f2c6dc CIP.BatchImporter.ProcExecGateways.BatchServices.ExecCntCpt(CIP.BatchImporter.ProcessMessages.CntCptInput)
Looking deeper at the full native stack, not just the .NET stack, I found that all of these threads are basically waiting to receive data that never arrives (at least I hope I'm reading it right):
0b958b10 7c827ac9 ntdll!ZwWaitForSingleObject+0xc
0b958b14 71b21af5 mswsock!SockWaitForSingleObject+0x19d, calling ntdll!ZwWaitForSingleObject
0b958b50 71b2c517 mswsock!WSPRecv+0x203, calling mswsock!SockWaitForSingleObject
0b958bc8 71c094e5 ws2_32!WSARecv+0x77
The problem with Thread 1 can be mitigated using CommandTimeout, but I did not find a way to do the same for the other three (Commit and Close), so it is not really a solution.
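For reference, the CommandTimeout mitigation for Thread 1 would look roughly like the sketch below. The helper name is made up for this post and is not part of our code base or of ODP.NET. It is written against IDbCommand, which OracleCommand implements, so it compiles without the Oracle assembly.

```csharp
using System;
using System.Data;

public static class CommandTimeoutSketch
{
    // Hypothetical helper: sets a per-command limit (in seconds) and then
    // runs the statement. An OracleCommand can be passed in directly,
    // because it implements IDbCommand.
    public static int ExecuteNonQueryWithTimeout(IDbCommand command, int timeoutSeconds)
    {
        if (command == null)
            throw new ArgumentNullException("command");
        if (timeoutSeconds <= 0)
            throw new ArgumentOutOfRangeException("timeoutSeconds");

        // A CommandTimeout of 0 means "no limit", so only positive values are accepted above.
        command.CommandTimeout = timeoutSeconds;
        return command.ExecuteNonQuery();
    }
}
```

Called as CommandTimeoutSketch.ExecuteNonQueryWithTimeout(cmd, 30), the call should, as far as I understand, fail with an exception once the limit is hit instead of blocking indefinitely. This only covers commands; it does nothing for OracleTransaction.Commit() or OracleConnection.Close().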
It looks like a network problem to me, but what kind of network problem would cause this without raising an error on either the application side or the database side? (I'm not 100% sure about the database side, because I do not have access to it; only the customer's DBAs do.) And how should I investigate such a complicated issue? It happens rarely, maybe once in 100,000 executions of ODP.NET methods that communicate with the database, but over time more and more threads get stuck. That lowers performance, and at the end of processing the application hangs waiting for the stuck threads to finish, which never happens.
Some basic info about the environment, in case this is a known bug that I missed:
Database:
Oracle 10.1.0.5
Some Linux based OS (never really cared about it while everything was working)
Application Server:
Windows 2003 R2
ODP.NET 11.1.0.7.0
.NET Framework 2.0
Yes, everything totally supported.
Also the application uses connection pooling - Min Pool Size=5;Max Pool Size=120;Connection Timeout=60
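Spelled out, the connection string has roughly the shape below. The Data Source, User Id and Password values are placeholders, not our real ones; the pooling attributes are the ones quoted above. As far as I understand, Connection Timeout only limits how long Open() waits for a free connection from the pool, so it does not bound any of the stuck calls shown in the stacks.

```
Data Source=<tns alias>;User Id=<user>;Password=<password>;Min Pool Size=5;Max Pool Size=120;Connection Timeout=60
```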
I'd appreciate any suggestions.
Thank you.