Hi,
Recently my Solaris 10 box panicked and crashed a couple of times. After the crashes I ran into a strange error on my ZFS pool. Unfortunately, I don't have enough space for a crash dump to be generated. However, here is what I got in messages after boot:
Apr 13 04:31:16 core0 savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=fffffe800087ea20 addr=10000018 occurred in module "zfs" due to an illegal access to a user address
Then I checked my ZFS pool and got:
# zpool status -vx
pool: box5
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2d1    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
box5:<0x0>
I decided to run a scrub to get rid of this message, which didn't make any sense to me. However, the scrub revealed even stranger results:
# zpool status -xv
pool: box5
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: scrub in progress, 1.20% done, 176h37m to go
config:
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0     4
          mirror    ONLINE       0     0     2
            c1d0    ONLINE       0     0     4
            c2d0    ONLINE       0     0     4
          mirror    ONLINE       0     0     2
            c2d1    ONLINE       0     0     4
            c1d1    ONLINE       0     0     4
errors: Permanent errors have been detected in the following files:
box5:<0x0>
As you can see, it's quite a large pool and the scrub will take another week to complete. However, the identical checksum errors on all disks are really confusing.
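To compare the counters between runs while the scrub is in progress, here is a minimal sketch that lists every config row with a non-zero CKSUM value. It only assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above; the function name cksum_report is made up for illustration, and on Solaris 10 a POSIX awk (nawk or /usr/xpg4/bin/awk) may be needed instead of the default awk.

```shell
# cksum_report (hypothetical helper): read `zpool status` text on stdin and
# print "<name> <cksum>" for every config row whose CKSUM counter is > 0.
# Assumes the five-column layout NAME STATE READ WRITE CKSUM.
cksum_report() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $5 == "CKSUM" { in_config = 1; next }
        # The "errors:" line marks the end of the config table.
        $1 == "errors:" { in_config = 0 }
        # Inside the table, report rows with a non-zero fifth column.
        in_config && NF == 5 && ($5 + 0) > 0 { print $1, $5 }
    '
}
```

Usage would look like `zpool status -v box5 | cksum_report`; for the second output above it would list the pool, both mirrors, and all four disks.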
I'm running Solaris 10U4 (Generic_127112-07) with the 127729-07 patch applied, which basically addresses some ZFS/NFS panic problems. I'll post more crash dump analysis as soon as it crashes again (I hope it won't, though).
Any ideas? What is 0x0? Why do I get similar checksum errors on all disks? How do I fix this mess?
--
Rustam