Linux version 2.6.39-300.26.1.el5uek
Bond failing since a recent upgrade; it had been working on the previous version. Using MLAG across dual switches. Fast LACP configured on the bond. 10Gb NICs.
Set up a test server (same hardware, same OS, same switches) that emulates my prod environment, where we caught this issue during the upgrade of a switch.
Boot the box. Simple ping from another server in the same subnet. Bond is working.
Bring down the port on the switch (or pull the cable) and the ping hangs. Started a tcpdump on the bond, and just running the tcpdump made the ping start working again, so I can't see what's happening when it fails. Stop the tcpdump, ping hangs again. Open the port on the switch, ping starts again.
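One guess as to why the capture changes things: tcpdump puts the interface into promiscuous mode by default, and that alone may be what brings the traffic back. Below is a minimal sketch for capturing with promiscuous mode off (-p). It assumes the bond is named bond0 and that /tmp is an acceptable place for the capture file; capture_bond is just an illustrative helper name.

```shell
# Capture on the bond WITHOUT enabling promiscuous mode.
# Assumptions: interface name bond0 and the /tmp output path are placeholders.
capture_bond() {
    iface="${1:-bond0}"
    out="${2:-/tmp/${iface}.pcap}"
    # -p: do not put the interface into promiscuous mode
    # -n: do not resolve addresses to names
    # -w: write raw packets to a file for later inspection
    tcpdump -p -n -i "$iface" -w "$out"
}
```

If the ping stays hung while this capture runs, the promiscuous-mode side effect is a likely reason the plain tcpdump appeared to fix it.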
So I figured out that if I run a service network restart after boot, the bond keeps working when I close either port on the switches. It's like the startup script isn't getting the interfaces into full live mode.
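To compare the bond right after boot with the bond after the manual restart, without attaching a capture to it, the bonding driver's status file can be snapshotted and diffed. This is a rough sketch, assuming the bond is bond0 and its status is at /proc/net/bonding/bond0; bond_summary and the /tmp file names are only illustrative.

```shell
# Print the lines of the bonding status file that describe mode, LACP
# rate, link state and aggregator membership.
# Assumption: the bond is bond0; pass another status file as $1 if not.
bond_summary() {
    file="${1:-/proc/net/bonding/bond0}"
    grep -E 'Bonding Mode|LACP rate|Slave Interface|MII Status|Aggregator ID|Number of ports|Partner Mac Address' "$file" \
        | sed 's/^[[:space:]]*//'   # drop the indentation some lines carry
}

# Possible usage (run by hand, not part of the function):
#   bond_summary > /tmp/bond.after_boot
#   service network restart
#   bond_summary > /tmp/bond.after_restart
#   diff /tmp/bond.after_boot /tmp/bond.after_restart
```

Any difference in aggregator IDs or port counts between the two snapshots would point at what the first start leaves unfinished.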
So in rc3.d I put a script that does a restart AFTER the network service starts, and this is how I get my box to boot with the bond doing what it's supposed to do:
S40network -> ../init.d/network
S41network_restart -> ../init.d/network_restart
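For anyone wanting to try the same workaround, something along these lines would do it. This is only a sketch of what such a script could look like, not the exact script in use here; the RESTART_CMD variable is a hypothetical override added so the logic can be dry-run without touching the network.

```shell
#!/bin/sh
# Hypothetical /etc/init.d/network_restart: restart networking once more
# after S40network has run. The real script may differ.
# RESTART_CMD is an illustration-only override for dry runs.
RESTART_CMD="${RESTART_CMD:-/sbin/service network restart}"

network_restart() {
    case "$1" in
        start) $RESTART_CMD ;;   # the second restart that fixes the bond
        stop)  : ;;              # nothing to undo at shutdown
        *)     echo "Usage: network_restart {start|stop}" >&2
               return 2 ;;
    esac
}

# rc passes "start" at boot; only dispatch when an action was given.
if [ $# -gt 0 ]; then
    network_restart "$@"
fi
```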
The logs seem to say that the initial network start is the same as the restart. This smells a bit like a bug to me and I have opened a call with Oracle, but has anyone ever seen anything like this before?