
Solaris 10 panic and coredump PCI(-X) Express Fatal Error

blastwave · Aug 15 2016 (edited Aug 23 2016)

This keeps happening over and over on a few servers now. At first I thought it was due to a fancy new Hewlett Packard 10Gbit network card, but I removed that from both servers: a Sun Fire X4140 (firmware is the latest, 3.0.6.16.b) and an HP ProLiant DL385. Both systems are now so unstable as to be entirely unusable, because they cannot even boot to single-user mode on the serial console. The /var/crash/`uname -n`/ area is full of vmdump files as well as a last vmdump.7.partial file.

There is no easy way to figure out what has gone wrong with either server, and the support.oracle.com site is littered with examples of this "PCI(-X) Express Fatal Error" message across a number of servers and different PCI cards. I am wondering if there has been a critical fault somewhere along the kernel dev process, and whether I should just go back and install the raw DVD with Solaris 10 1/13 onto the Sun Fire X4140. Every attempt to sort this out has failed, and the very latest kernel, 150401-40, panics within a split second of the early boot phase. It is unstable beyond measure, on two servers now.
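In case it helps anyone chasing the same thing: the accumulated compressed vmdump files can be expanded and inspected offline, from another boot environment or another box running the same kernel bits. A rough sketch, assuming the dumps are intact and the crash directory and dump index (0) are as shown; substitute whatever `ls /var/crash/$(uname -n)` actually lists:

```shell
cd /var/crash/$(uname -n)

# savecore -f expands a compressed vmdump.N into a unix.N / vmcore.N pair
savecore -vf vmdump.0 .

# then pull the panic summary and stack out of the expanded pair
mdb unix.0 vmcore.0 <<'EOF'
::status
::panicinfo
$C
EOF
```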

module /platform/i86pc/kernel/amd64/unix: text at [0xfffffffffb800000, 0xfffffffffb91997f] data at 0xfffffffffbc00000

module misc/amd64/krtld: text at [0xfffffffffb919980, 0xfffffffffb94eb1f] data at 0xfffffffffbc73400

module /kernel/amd64/genunix: text at [0xfffffffffb94eb20, 0xfffffffffbb7c577] data at 0xfffffffffbc7e080

SunOS Release 5.10 Version Generic_150401-40 64-bit

Copyright (c) 1983, 2016, Oracle and/or its affiliates. All rights reserved.

x86_feature: lgpg

x86_feature: tsc

x86_feature: msr

x86_feature: mtrr

x86_feature: pge

x86_feature: cmov

x86_feature: mmx

x86_feature: mca

x86_feature: pae

x86_feature: cv8

x86_feature: pat

x86_feature: sse

x86_feature: sse2

x86_feature: asysc

x86_feature: nx

x86_feature: sse3

x86_feature: cx16

x86_feature: cmp

x86_feature: tscp

x86_feature: mwait

x86_feature: sse4a

x86_feature: cpuid

x86_feature: 1gpg

x86_feature: clfsh

mem = 33553624K (0x7fff36000)

SMBIOS v2.5 loaded (3894 bytes)

initialized model-specific module 'cpu_ms.AuthenticAMD' on chip 0 core 0 strand 0

root nexus = i86pc

pseudo0 at root

pseudo0 is /pseudo

scsi_vhci0 at root

scsi_vhci0 is /scsi_vhci

isa0 at root

Reading Intel IOMMU boot options

pseudo-device: ppm0

ppm0 is /pseudo/ppm@0

npe0 at root: space 0 offset 0

npe0 is /pci@0,0

PCI Express-device: pci10de,375@f, pcieb2

pcieb2 is /pci@0,0/pci10de,375@f

/pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt1):

        Rev. 4 LSI, Inc. 1068E found.

/pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt1):

        mpt1 supports power management.

/pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt1):

        mpt1 Firmware version v1.17.5a.0 (IR)

/pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt1):

        mpt1: IOC Operational.

/pci@0,0/pci10de,375@f/pci1000,3150@0 (mpt1):

        NVDATA default version 0x2b21, persistent version 0x2b21

PCIE-device: pci1000,3150@0, mpt1

PCI Express-device: pci1000,3150@0, mpt1

mpt1 is /pci@0,0/pci10de,375@f/pci1000,3150@0

sd25 at mpt1: unit-address 0,0: target 0 lun 0

sd25 is /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@0,0

/pci@0,0/pci10de,375@f/pci1000,3150@0/sd@0,0 (sd25) online

sd26 at mpt1: unit-address 1,0: target 1 lun 0

sd26 is /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@1,0

/pci@0,0/pci10de,375@f/pci1000,3150@0/sd@1,0 (sd26) online

sd27 at mpt1: unit-address 2,0: target 2 lun 0

sd27 is /pci@0,0/pci10de,375@f/pci1000,3150@0/sd@2,0

/pci@0,0/pci10de,375@f/pci1000,3150@0/sd@2,0 (sd27) online

root on pluto_rpool/ROOT/s10x_u11wos_24a fstype zfs

.

.

.

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major

EVENT-TIME: 0x57b139c5.0x199bafa5 (0x37de78d4c3)

PLATFORM: i86pc, CSN: -, HOSTNAME: pluto

SOURCE: SunOS, REV: 5.10 Generic_150401-40

DESC: Errors have been detected that require a reboot to ensure system

integrity.  See http://www.sun.com/msg/SUNOS-8000-0G for more information.

AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry

IMPACT: The system will sync files, save a crash dump if needed, and reboot

REC-ACTION: Save the error summary below in case telemetry cannot be saved

panic[cpu4]/thread=fffffe8000874c40: pcieb-1: PCI(-X) Express Fatal Error. (0x101)

fffffe8000874bd0 pcieb:pcieb_intr_handler+2bb ()

fffffe8000874c20 unix:av_dispatch_autovect+78 ()

fffffe8000874c30 unix:intr_thread+5f ()

syncing file systems... done

ereport.io.pci.fabric ena=37db72173001001 detector=[ version=0 scheme="dev"

device-path="/pci@0,0/pci10de,376@e" ] bdf=70 device_id=376 vendor_id=10de

rev_id=a3 dev_type=40 pcie_off=80 pcix_off=0 aer_off=160 ecc_ver=0 pci_status=

10 pci_command=47 pci_bdg_sec_status=0 pci_bdg_ctrl=3 pcie_status=1

pcie_command=2037 pcie_dev_cap=8001 pcie_link_status=3081 pcie_adv_ctl=1a0

pcie_ue_status=0 pcie_ue_mask=180000 pcie_ue_sev=62011 pcie_ue_hdr0=0

pcie_ue_hdr1=0 pcie_ue_hdr2=0 pcie_ue_hdr3=0 pcie_ce_status=1 pcie_ce_mask=0

pcie_rp_status=0 pcie_rp_control=0 pcie_adv_rp_status=8000003

pcie_adv_rp_command=7 pcie_adv_rp_ce_src_id=70 pcie_adv_rp_ue_src_id=0

remainder=1 severity=1

ereport.io.pci.fabric ena=37db724c8201001 detector=[ version=0 scheme="dev"

device-path="/pci@0,0/pci10de,376@e/pci9005,554@0" ] bdf=300 device_id=28d

vendor_id=9005 rev_id=1 dev_type=0 pcie_off=c0 pcix_off=0 aer_off=100 ecc_ver=

0 pci_status=10 pci_command=47 pcie_status=1 pcie_command=37 pcie_dev_cap=81a2

pcie_link_status=1081 pcie_dev_ctl2=0 pcie_adv_ctl=a0 pcie_ue_status=0

pcie_ue_mask=180000 pcie_ue_sev=62030 pcie_ue_hdr0=0 pcie_ue_hdr1=0

pcie_ue_hdr2=0 pcie_ue_hdr3=0 pcie_ce_status=1000 pcie_ce_mask=0 remainder=0

severity=3

ereport.io.pci.fabric ena=37db72fb4201001 detector=[ version=0 scheme="dev"

device-path="/pci@0,0/pci10de,376@e" ] bdf=70 device_id=376 vendor_id=10de

rev_id=a3 dev_type=40 pcie_off=80 pcix_off=0 aer_off=160 ecc_ver=0 pci_status=

10 pci_command=47 pci_bdg_sec_status=0 pci_bdg_ctrl=3 pcie_status=1

pcie_command=2037 pcie_dev_cap=8001 pcie_link_status=3081 pcie_adv_ctl=1a0

pcie_ue_status=0 pcie_ue_mask=180000 pcie_ue_sev=62011 pcie_ue_hdr0=0

pcie_ue_hdr1=0 pcie_ue_hdr2=0 pcie_ue_hdr3=0 pcie_ce_status=1001 pcie_ce_mask=

0 pcie_rp_status=0 pcie_rp_control=0 pcie_adv_rp_status=8000003

pcie_adv_rp_command=7 pcie_adv_rp_ce_src_id=70 pcie_adv_rp_ue_src_id=0

remainder=1 severity=1

ereport.io.pci.fabric ena=37db7330c601001 detector=[ version=0 scheme="dev"

device-path="/pci@0,0/pci10de,376@e/pci9005,554@0" ] bdf=300 device_id=28d

vendor_id=9005 rev_id=1 dev_type=0 pcie_off=c0 pcix_off=0 aer_off=100 ecc_ver=

0 pci_status=10 pci_command=47 pcie_status=0 pcie_command=37 pcie_dev_cap=81a2

pcie_link_status=1081 pcie_dev_ctl2=0 pcie_adv_ctl=a0 pcie_ue_status=0

pcie_ue_mask=180000 pcie_ue_sev=62030 pcie_ue_hdr0=0 pcie_ue_hdr1=0

pcie_ue_hdr2=0 pcie_ue_hdr3=0 pcie_ce_status=0 pcie_ce_mask=0 remainder=0

severity=1

ereport.io.pci.nr ena=37de782c2901001 detector=[ version=0 scheme="dev"

device-path="/pci@0,0/pci10de,376@e/pci9005,554@0" ] orig_id=28d9005 new_id=

ffffffff

ereport.io.pci.fabric ena=37de7845c801001 detector=[ version=0 scheme="dev"

device-path="/pci@0,0/pci10de,376@e" ] bdf=70 device_id=376 vendor_id=10de

rev_id=a3 dev_type=40 pcie_off=80 pcix_off=0 aer_off=160 ecc_ver=0 pci_status=

10 pci_command=47 pci_bdg_sec_status=0 pci_bdg_ctrl=3 pcie_status=1

pcie_command=2037 pcie_dev_cap=8001 pcie_link_status=3081 pcie_adv_ctl=1a0

pcie_ue_status=0 pcie_ue_mask=180000 pcie_ue_sev=62011 pcie_ue_hdr0=0

pcie_ue_hdr1=0 pcie_ue_hdr2=0 pcie_ue_hdr3=0 pcie_ce_status=1 pcie_ce_mask=0

pcie_rp_status=0 pcie_rp_control=0 pcie_adv_rp_status=8000003

pcie_adv_rp_command=7 pcie_adv_rp_ce_src_id=70 pcie_adv_rp_ue_src_id=0

remainder=1 severity=1

ereport.io.pci.fabric ena=37de787d5501001 detector=[ version=0 scheme="dev"

device-path="/pci@0,0/pci10de,376@e/pci9005,554@0" ] bdf=300 device_id=28d

vendor_id=9005 rev_id=1 dev_type=0 pcie_off=c0 pcix_off=0 aer_off=100 ecc_ver=

0 pci_status=0 pci_command=0 pcie_status=0 pcie_command=0 pcie_dev_cap=0

pcie_link_status=0 pcie_dev_ctl2=0 pcie_adv_ctl=0 pcie_ue_status=0

pcie_ue_mask=0 pcie_ue_sev=0 pcie_ue_hdr0=0 pcie_ue_hdr1=0 pcie_ue_hdr2=0

pcie_ue_hdr3=0 pcie_ce_status=0 pcie_ce_mask=0 remainder=0 severity=100

dumping to /dev/zvol/dsk/pluto_rpool/dump, offset 65536, content: all

1:16  87% done

1:29 100% done

100% done: 8388407 pages dumped, dump succeeded

rebooting...

This then sends either server into a boot loop, and the /var/crash/ area has filled up. The coreadm services never even get a chance to run anymore.
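One partial workaround for the /var/crash problem (a sketch, not a fix for the panic itself) is to tell the dump configuration not to run savecore automatically on reboot, and to prune the old dumps by hand:

```shell
# dumpadm -n: keep the dump device configured, but do not run savecore
# automatically on reboot, so /var/crash stops filling up
dumpadm -n

# prune the accumulated compressed dumps (paths assumed; check first)
ls -lh /var/crash/$(uname -n)/
rm /var/crash/$(uname -n)/vmdump.[0-9]*
```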

In any case, some of the documents on support.oracle.com actually say "not feasible to fix" as part of the bug reports.

So opening a support case may be interesting: Oracle support may well say "oh yes .. that .. we know about that .. it won't be fixed", and maybe there is a critical failure somewhere within the x86 kernel now. Perhaps Solaris 11 would solve the problem? I don't know, and I wonder who else is seeing this behavior.
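Before opening the case, the ereport telemetry above should also be sitting in the persistent FMA error log, which can be read even when the box will not stay up, since fmdump accepts an explicit logfile operand. Assuming a failsafe boot with the root mounted at /a (the failsafe default; adjust to suit):

```shell
# read the persisted fault-management error log from a mounted root,
# e.g. from failsafe or another boot environment
fmdump -eV /a/var/fm/fmd/errlog

# on a system that stays up long enough, the same without the path
fmdump -eV
```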

Locked on Sep 20 2016. 1 comment, 1,016 views.