VERY weird OVM networking problem
First, this may have nothing to do with OVM, but the problem I'm seeing is happening on all FIVE of my VM servers. I can't even guess what's going on. Maybe you have seen this and solved it, or maybe you can suggest what I can do to debug this.
Here are the details:
I have a 3-node cluster, built from brand new IBM x3650 M4 servers. I also have a 2-node cluster, built from old retired IBM xSeries servers that used to be VMware ESX servers. My office is on a different subnet from the data center network that contains all 5 servers. I see the following symptoms occasionally and randomly (I see them at least once a day, and can make them happen multiple times a day if I keep trying):
1. When I SSH to one of these boxes, sometimes the session will freeze. At other times, the session will work perfectly, but after exiting the SSH session, any subsequent attempt to ping or connect via TCP from my desktop to that server will absolutely fail. That is, ping, traceroute, and ssh (and presumably, everything) fail.
2. If I don't try to communicate with that server for several minutes (maybe 5 to 10 minutes, I'm not sure), and then try to ping or SSH to the box, it works again!
3. However, if, after exiting SSH and finding that I can't ping the server, I keep pinging (from Windows, I run "ping -t -w 5000 vmserver"), the pings will fail FOREVER. I have left it pinging for 24 hours, and it never answered a single ping. But then if I stop the pinging, wait ten minutes, and ping again, it works. I swear.
4. During the time that I am unable to ping one of the VM servers from my desktop, if I try to ping that same server from another desktop machine in my office, connected to the SAME little 4-port switch, it works perfectly.
5. This is even more weird: During the time that my desktop 1 cannot ping server1, say, I can go to desktop 2 (right beside desktop 1), open an SSH session to server1, and from there, I can ping desktop1. I swear.
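To put numbers on symptoms 2 and 3 (how long the server stays unreachable, and how much idle time it takes to come back), a small probe script run from desktop 1 might help. The following is only a rough sketch, not something I have run against these servers. The function names, the default interval, and the host name "vmserver" are made up for illustration, and the non-Windows branch assumes the Linux iputils ping, where -W is a timeout in seconds. Also note that Windows ping can exit with 0 on a "Destination host unreachable" reply, so the exit code is only an approximation there.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=5):
    """Send one ICMP echo request; return True if the ping command exits 0."""
    if platform.system() == "Windows":
        # Windows ping: -n is the count, -w is the timeout in milliseconds.
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Assumes Linux iputils ping: -c is the count, -W the timeout in seconds.
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_failure, first_success_after) pairs.

    An outage still in progress at the end of the samples gets None as its end.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def watch(host, interval_s=60, count=30):
    """Probe host `count` times, `interval_s` seconds apart, and return the outages."""
    samples = []
    for _ in range(count):
        samples.append((datetime.now(), ping_once(host)))
        time.sleep(interval_s)
    return outage_windows(samples)


# Example (hypothetical host name): print(watch("vmserver", interval_s=120, count=10))
```

Since continuous pinging seems to keep the failure alive, each probe may itself reset whatever timer is involved. Repeating the run with different interval_s values (say 60, 300, and 600 seconds) would show roughly how much silence is needed before the server answers again.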
Why would I think this is an Oracle VM problem, you ask? Well, I wouldn't, except that I have other servers in that data center that run ESX, AIX, and Windows, and I've tried like hell to make one of them exhibit any of these symptoms, and I cannot do it.
If you have any ideas about what the heck is going on, I'd appreciate hearing them.