Let me start out by stating the I am not an Oracle DBA so bear with me (I'm a SAN admin). We have an Oracle database (Oracle 11g R2) running on Windows 2008 R2. The backend storage for the database is a Hitachi VSP storage array. The server hosting the database has 48 GB memory and two 4 gb HBA's (the ports on the SAN switch are 8 GB). The application using this Oracle Database is Kronos (timecard/payroll). Every 2 weeks a job kicks off which totals all employee timecards/payroll, etc. and when that job runs the database grinds to a halt. This performance issue started about a month ago and everyone (DBA, Application team, Storage Team) claims nothing has changed. During the poor performance times the Storage side show fairly low IOPS, never going above 600 IOPsn and the response time for the disks allocated to this database is 4-5ms on average. However, on the database side they are seeing disk latency times of over 20 seconds for read and up to 28 seconds for write. I monitored the HBAs "live" during the slow response times and the IOPs were evenly divided between the 2 HBA's at about 300 IOPs each. On the SAN Switch side I don't see any errors on the ports, on the Array side there are no hardware errors. I don't see how the latency issues are storage related, if the latency was as extreme as 28 seconds you'd expect to see something on the array side but we don't. There is a SQL database on a different server that is attached to the same array as the problem database, using the same disk pool as the problem database, doing 10x the IOPs and experiencing no performance issues. (The array pool performance stats are good too---never about 2 ms average response time). You would think that if the problem was the storage array we'd be seeing performance problems on other database servers that are using the same disk pool, array ports, etc. but we aren't--they are all doing great.
Does anyone have any idea what might cause such high latency even though the IOPs are fairly low? I realize it's probably hard to answer this question without some Oracle database information but I don't have any of that info other than Oracle database version. TempDB too small? Too many database extents? Fragmentation? Database block I/O size too small (or too big?).
The server itself doesn't experience high CPU when the performance is bad and the memory usage is high but not exhausted. I don't know Windows very well so I can only look at the basics. Using Perfmon the disk average response time was 2-3ms.