Exadata X3-2 crashing with Oracle VM 3.4.6

schmee_fx, Sep 5 2019 (edited Oct 8 2019)

Hello all,

I've been using Oracle VM for some time now, and until the latest patch everything ran fine without issues.

We have several OVM pools for different types of servers. When we started our latest patching round (we patch quarterly), some of our test hypervisors began crashing hard, with nothing recorded in the server logs.

We fully patched Oracle VM to release 3.4.6, running kernel 4.1.12-124.30.1.el6uek.x86_64.

When the servers started crashing, I ran the extended hardware test from the ILOM and got some errors on the CPUs.

I procured replacement CPUs for one of the servers and ran the extended hardware test again; this time it finished without errors.

The problem is that the crashes are still happening.

I configured kdump to get more detail from the OS about the cause of the crashes, but no dump is generated when the system crashes.

I've also found multiple references to the issue reported at https://docs.oracle.com/cd/E64076_01/E76173/html/vmrns-bugs-3.3.2-swiotlb-buffer-errors-jumbo-frames.html . Because I'm using iSCSI disks over multiple link-aggregated 10Gb Ethernet interfaces with jumbo frames, I changed the value on the kernel command line, but that did not solve my issues.
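To confirm that a changed boot parameter is actually active after a reboot, the booted command line can be inspected. The following is only a minimal sketch: it assumes the value in question is the `swiotlb=` option and that it ends up in the dom0 kernel's `/proc/cmdline`; the function names are hypothetical helpers, not part of any Oracle tool.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_booted_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as fh:
        return parse_cmdline(fh.read())


def report(params, names=("swiotlb",)):
    """Return the requested parameters, marking the ones that are absent."""
    return {name: params.get(name, "<not set>") for name in names}


# Usage on the hypervisor (assumption: run as a user who can read /proc/cmdline):
#   print(report(read_booted_cmdline()))
```

If the parameter shows up as `<not set>`, the edit to the boot loader configuration most likely did not reach the entry that was actually booted.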

Note that the same storage is used by other pools that have not been updated, and those have had no issues so far.

The only lead I have is in the ILOM of the server with the replaced CPUs, which still logs the following messages when the server crashes:

ID    Date/Time                Class  Type   Severity
      Wed Sep 4 21:01:29 2019  IPMI   Log    minor
      ID = 8bb : 09/04/2019 : 21:01:29 : System Firmware Progress : SMI Handler : PCI resource configuration : Asserted

2274  Wed Sep 4 21:01:29 2019  IPMI   Log    minor
      ID = 8ba : 09/04/2019 : 21:01:29 : System Firmware Progress : SMI Handler : PCI resource configuration : Asserted

2273  Wed Sep 4 21:01:21 2019  IPMI   Log    minor
      ID = 8b9 : 09/04/2019 : 21:01:21 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 3 OVER = 1 UC = 1 EN = 0 MISCV = 0 ADDRV = 0 PCC = 1 S = 0 AR = 0 : Asserted

2272  Wed Sep 4 21:01:21 2019  IPMI   Log    minor
      ID = 8b8 : 09/04/2019 : 21:01:21 : System Firmware Progress : SMI Handler : Secondary CPU Initialization : Asserted

2271  Wed Sep 4 21:01:21 2019  IPMI   Log    minor
      ID = 8b7 : 09/04/2019 : 21:01:20 : System Firmware Progress : SMI Handler : Primary CPU initialization : Asserted

2270  Wed Sep 4 21:00:53 2019  IPMI   Log    minor
      ID = 8b6 : 09/04/2019 : 21:00:53 : System Firmware Progress : SMI Handler : Memory initialization : Asserted

2269  Wed Sep 4 21:00:51 2019  IPMI   Log    minor
      ID = 8b5 : 09/04/2019 : 21:00:51 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 17 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 1 AR = 0 : Asserted

2268  Wed Sep 4 21:00:50 2019  IPMI   Log    minor
      ID = 8b4 : 09/04/2019 : 21:00:50 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 1 Bank = 17 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 1 AR = 0 : Asserted

2267  Wed Sep 4 21:00:50 2019  IPMI   Log    minor
      ID = 8b3 : 09/04/2019 : 21:00:50 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 5 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

2266  Wed Sep 4 21:00:49 2019  IPMI   Log    minor
      ID = 8b2 : 09/04/2019 : 21:00:49 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 5 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

2265  Wed Sep 4 21:00:49 2019  IPMI   Log    minor
      ID = 8b1 : 09/04/2019 : 21:00:49 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 20 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

2264  Wed Sep 4 21:00:47 2019  Fault  Fault  critical
      Fault detected at time = Wed Sep 4 21:00:47 2019. The suspect component: /SYS/MB/P0 has fault.cpu.intel.quickpath.home_agent with probability=100. Refer to http://support.oracle.com/msg/SPX86-8003-CR for details.

2263  Wed Sep 4 21:00:47 2019  IPMI   Log    minor
      ID = 8b0 : 09/04/2019 : 21:00:47 : Processor : BIOS : SM BIOS Uncorrectable CPU-complex Error Node 0 Bank = 20 OVER = 0 UC = 1 EN = 1 MISCV = 1 ADDRV = 1 PCC = 1 S = 0 AR = 0 : Asserted

2262  Wed Sep 4 21:00:44 2019  IPMI   Log    minor
      ID = 8af : 09/04/2019 : 21:00:44 : System Firmware Progress : SMI Handler : Management controller initialization : Asserted

2261  Wed Sep 4 21:00:44 2019  IPMI   Log    minor
      ID = 8ae : 09/04/2019 : 21:00:44 : Processor : System Management Software : IERR : Asserted

2260  Wed Sep 4 17:16:31 2019  IPMI   Log    minor
      ID = 8ad : 09/04/2019 : 17:16:31 : System Firmware Progress : SMI Handler : System boot initiated : Asserted
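To see at a glance which CPU nodes and banks report errors, the "Uncorrectable CPU-complex Error" lines can be grouped by node and bank. This is a rough sketch, not an Oracle tool: `parse_mce` and `summarize` are hypothetical helper names, and the regular expression assumes the exact wording shown in the log above. The flag names (OVER, UC, EN, MISCV, ADDRV, PCC, S, AR) appear to correspond to the Intel machine-check status register bits, but that reading is an assumption.

```python
import re
from collections import Counter

# Matches the machine-check text as ILOM prints it in the log above, e.g.
# "... Error Node 1 Bank = 3 OVER = 1 UC = 1 EN = 0 ... AR = 0 : Asserted"
MCE_RE = re.compile(r"Error Node (\d+) Bank = (\d+)((?: [A-Z]+ = [01])+)")


def parse_mce(line):
    """Return (node, bank, flags) for a machine-check line, or None otherwise."""
    match = MCE_RE.search(line)
    if match is None:
        return None
    flags = {name: int(bit)
             for name, bit in re.findall(r"([A-Z]+) = ([01])", match.group(3))}
    return int(match.group(1)), int(match.group(2)), flags


def summarize(lines):
    """Count machine-check events per (node, bank) pair."""
    counts = Counter()
    for line in lines:
        parsed = parse_mce(line)
        if parsed is not None:
            counts[parsed[:2]] += 1
    return counts


# Usage (assumption: the event log was saved to a text file named ilom_events.txt):
#   with open("ilom_events.txt") as fh:
#       for (node, bank), n in sorted(summarize(fh).items()):
#           print(f"node {node} bank {bank}: {n} event(s)")
```

Applied to the excerpt above, this shows that errors are reported on both Node 0 and Node 1, not only on the socket named in the fault message.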

I've also updated the server firmware to the latest available level:

Integrated Lights Out Manager v4.0.4.22.a

Settings

17160200
Legacy
Ok

Has any of you seen a similar issue?

I can't open a support ticket with Oracle, as I'm not the original buyer of the servers.

Any help would be really appreciated.

Thank you.

Kind regards,

Jorge

This post has been answered by schmee_fx on Oct 4 2019