Hello all,
I've been using Oracle VM for some time now, and until the last patch cycle everything was running fine without issues.
We have several OVM pools for different types of servers. When we started our latest quarterly patch cycle, some of our test hypervisors started crashing hard, with no information in the server logs.
We fully patched Oracle VM to release 3.4.6, running kernel 4.1.12-124.30.1.el6uek.x86_64.
When the servers started to crash, I ran the extended hardware test from the ILOM and got some errors on the CPUs.
I procured replacement CPUs for one of the servers and ran the extended hardware test again; this time it finished without errors.
The problem is that the crashes are still happening.
I configured kdump to see if I could get more detail from the OS on the reason for the crashes, but no dump is generated when the system crashes.
I've also found multiple references to the issue described at https://docs.oracle.com/cd/E64076_01/E76173/html/vmrns-bugs-3.3.2-swiotlb-buffer-errors-jumbo-frames.html . Because I'm using iSCSI disks over multiple 10Gb Ethernet links with jumbo frames and link aggregation, I changed the value on the kernel command line, but that did not solve my issue.
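For anyone comparing setups, here is a minimal Python sketch for confirming that a parameter changed on the kernel command line is active on the running kernel. It assumes the parameter in question is `swiotlb` (taken from the title of the linked note) and that it appears in `/proc/cmdline` of the running kernel; the function names are hypothetical and no particular value is implied.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as 'quiet' map to None.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the command line of the running kernel (Linux only)."""
    with open(path) as fh:
        return fh.read().strip()


def describe_param(cmdline, name="swiotlb"):
    """Report whether `name` is present on the given command line."""
    params = parse_cmdline(cmdline)
    if name not in params:
        return f"{name}: not set"
    return f"{name}: {params[name]}"

# Example use on a Linux host:
#   print(describe_param(read_cmdline()))
```

If the parameter shows up as "not set" after a reboot, the bootloader entry that was edited is probably not the one being booted.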
Note that the same storage is used by other pools that were not updated, and those have had no issues so far.
The only lead I have is in the ILOM of the server with the replaced CPUs, which still logs the following messages when the server crashes:
| Wed Sep 4 21:01:29 2019 | IPMI | Log | minor | |
| | ID = 8bb : 09/04/2019 : 21:01:29 : System Firmware Progress : SMI Handler : PCI resource configuration : Asserted |
2274 | Wed Sep 4 21:01:29 2019 | IPMI | Log | minor |
| | ID = 8ba : 09/04/2019 : 21:01:29 : System Firmware Progress : SMI Handler : PCI resource configuration : Asserted |
2273 | Wed Sep 4 21:01:21 2019 | IPMI | Log | minor |
| | ID = 8b9 : 09/04/2019 : 21:01:21 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 3 OVER = 1 UC = 1 EN = 0 MISCV = 0 ADDRV = 0 PCC = 1 S = 0 AR = 0 : Asserted |
2272 | Wed Sep 4 21:01:21 2019 | IPMI | Log | minor |
| | ID = 8b8 : 09/04/2019 : 21:01:21 : System Firmware Progress : SMI Handler : Secondary CPU Initialization : Asserted |
2271 | Wed Sep 4 21:01:21 2019 | IPMI | Log | minor |
| | ID = 8b7 : 09/04/2019 : 21:01:20 : System Firmware Progress : SMI Handler : Primary CPU initialization : Asserted |
2270 | Wed Sep 4 21:00:53 2019 | IPMI | Log | minor |
| | ID = 8b6 : 09/04/2019 : 21:00:53 : System Firmware Progress : SMI Handler : Memory initialization : Asserted |
2269 | Wed Sep 4 21:00:51 2019 | IPMI | Log | minor |
| | ID = 8b5 : 09/04/2019 : 21:00:51 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 17 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 1 AR = 0 : Asserted |
2268 | Wed Sep 4 21:00:50 2019 | IPMI | Log | minor |
| | ID = 8b4 : 09/04/2019 : 21:00:50 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 17 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 1 AR = 0 : Asserted |
2267 | Wed Sep 4 21:00:50 2019 | IPMI | Log | minor |
| | ID = 8b3 : 09/04/2019 : 21:00:50 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 5 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted |
2266 | Wed Sep 4 21:00:49 2019 | IPMI | Log | minor |
| | ID = 8b2 : 09/04/2019 : 21:00:49 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 5 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted |
2265 | Wed Sep 4 21:00:49 2019 | IPMI | Log | minor |
| | ID = 8b1 : 09/04/2019 : 21:00:49 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 20 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted |
2264 | Wed Sep 4 21:00:47 2019 | Fault | Fault | critical |
| | Fault detected at time = Wed Sep 4 21:00:47 2019. The suspect component: /SYS/MB/P0 has fault.cpu.intel.quickpath.home_agent with probability=100. Refer to http://support.oracle.com/msg/SPX86-8003-CR for details. |
2263 | Wed Sep 4 21:00:47 2019 | IPMI | Log | minor |
| | ID = 8b0 : 09/04/2019 : 21:00:47 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 20 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted |
2262 | Wed Sep 4 21:00:44 2019 | IPMI | Log | minor |
| | ID = 8af : 09/04/2019 : 21:00:44 : System Firmware Progress : SMI Handler : Management controller initialization : Asserted |
2261 | Wed Sep 4 21:00:44 2019 | IPMI | Log | minor |
| | ID = 8ae : 09/04/2019 : 21:00:44 : Processor : System Management Software : IERR : Asserted |
2260 | Wed Sep 4 17:16:31 2019 | IPMI | Log | minor |
| | ID = 8ad : 09/04/2019 : 17:16:31 : System Firmware Progress : SMI Handler : System boot initiated : Asserted |
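To make the log above easier to read, here is a minimal Python sketch that pulls out the "Uncorrectable CPU-complex Error" entries and groups them by node and bank. The function names are hypothetical, and the reading of the flags in the comments is an assumption: they look like the machine-check status bits described in the Intel SDM, but the ILOM documentation may define them differently.

```python
import re
from collections import Counter

# Matches the machine-check detail lines of the ILOM event log excerpt.
MCE_RE = re.compile(
    r"ID = (?P<id>[0-9a-f]+) : (?P<date>\S+) : (?P<time>\S+) : .*?"
    r"Uncorrectable CPU-complex Error Node (?P<node>\d+) Bank = (?P<bank>\d+)"
    r"(?P<flags>(?: [A-Z]+ = [01])+)"
)
FLAG_RE = re.compile(r"([A-Z]+) = ([01])")

# Assumed meaning of the flags (machine-check status bits):
#   OVER  = an earlier error was overwritten    UC    = uncorrected error
#   EN    = error reporting enabled             MISCV = MISC register valid
#   ADDRV = ADDR register valid                 PCC   = processor context corrupt
#   S     = signaled via machine check          AR    = recovery action required


def parse_mce_lines(text):
    """Return one dict per machine-check line found in `text`."""
    events = []
    for line in text.splitlines():
        m = MCE_RE.search(line)
        if not m:
            continue  # firmware progress, fault, and IERR lines are skipped
        flags = {name: int(v) for name, v in FLAG_RE.findall(m.group("flags"))}
        events.append({
            "id": m.group("id"),
            "time": m.group("time"),
            "node": int(m.group("node")),
            "bank": int(m.group("bank")),
            "flags": flags,
        })
    return events


def summarize(events):
    """Count machine-check entries per (node, bank) pair."""
    return Counter((e["node"], e["bank"]) for e in events)
```

Run over the excerpt above, this finds seven machine-check entries: Node 0 banks 20 and 5 and Node 1 bank 17 twice each, and Node 1 bank 3 once. Every one of them has UC = 1 and PCC = 1, and both nodes are affected.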
I've also updated the server firmware to the latest available level:
Integrated Lights Out Manager v4.0.4.22.a
Settings |
| 17160200 |
| Legacy |
| Ok |
Has any of you had a similar issue?
I can't open a support ticket with Oracle, as I'm not the original buyer of the servers.
Any help would be really appreciated.
Thank you.
Kind regards,
Jorge