Sebastien DAUBIGNE
2010-09-16 10:03:17 UTC
Dear Vx-addicts,
We encountered a failover issue on this configuration :
- Solaris 9 HW 9/05
- SUN SAN (SFS) 4.4.15
- Emulex with SUN generic driver (emlx)
- VxVM 5.0-2006-05-11a
- storage on HP SAN (XP 24K).
Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
support imposed the Solaris native solution for multipathing :
VxVM ==> VxDMP ==> MPxIO ==> FCP ...
We have 2 paths to the switch, linked to 2 paths to the storage, so the
LUNs have 4 paths, with active/active support.
Failover operation has been tested successfully by offlining each port
successively on the SAN.
We regulary have transient I/O errors (scsi timeout, I/O error retries
with "Unit attention"), due to SAN-side issues. Usually these errors are
transparently managed by MPxIO/VxVM without impact on the applications.
Now for the incident we encountered :
One of the SAN port was reset , consequently there were some transient
I/O error.
The other SAN port was OK, so the MPxIO multipathing layer should have
failover the I/O on the other path, without transmiting the error to the
VxDMP layer.
For some reason, it did not failover the I/O before VxVM caught it as
unrecoverable I/O error, disabling the subdisk and consequently the
filesystem.
Note the "giving up" message from scsi layer at 06:23:03 :
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x60
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 288/0x20
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 288/0x18
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x20
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x18
Sep 1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003794 (ssd165):
Sep 1 06:18:54 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003794 (ssd165):
Sep 1 06:19:05 myserver SCSI transport failed: reason 'timeout':
retrying command
Sep 1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e8015277700000127770000376d (ssd168):
Sep 1 06:21:57 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e8015277700000127770000376d (ssd168):
Sep 1 06:22:45 myserver SCSI transport failed: reason 'timeout':
retrying command
Sep 1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003787 (ssd166):
Sep 1 06:23:03 myserver SCSI transport failed: reason 'timeout':
giving up
Sep 1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM
vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer
300ce41c340 on device 0x1200000003a to DMP
Sep 1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM
vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean -
/dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/5935
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system disabled
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone -
/dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block
0/265984
It seems VxDMP gets the I/O error at the same time as MPxIO : I though
MPxIO would have conceal the I/O error until failover has occured, which
is not the case.
As a workaround, I increased the VxDMP
recoveryotion/fixedretry/retrycount tunable from 5 to 20 to give MPxIO a
chance to failover before VxDMP fails, but I still don't understand why
VxVM catch the scsi errors.
Any advice ?
thanks.
We encountered a failover issue on this configuration :
- Solaris 9 HW 9/05
- SUN SAN (SFS) 4.4.15
- Emulex with SUN generic driver (emlx)
- VxVM 5.0-2006-05-11a
- storage on HP SAN (XP 24K).
Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
support imposed the Solaris native solution for multipathing :
VxVM ==> VxDMP ==> MPxIO ==> FCP ...
We have 2 paths to the switch, linked to 2 paths to the storage, so the
LUNs have 4 paths, with active/active support.
Failover operation has been tested successfully by offlining each port
successively on the SAN.
We regulary have transient I/O errors (scsi timeout, I/O error retries
with "Unit attention"), due to SAN-side issues. Usually these errors are
transparently managed by MPxIO/VxVM without impact on the applications.
Now for the incident we encountered :
One of the SAN port was reset , consequently there were some transient
I/O error.
The other SAN port was OK, so the MPxIO multipathing layer should have
failover the I/O on the other path, without transmiting the error to the
VxDMP layer.
For some reason, it did not failover the I/O before VxVM caught it as
unrecoverable I/O error, disabling the subdisk and consequently the
filesystem.
Note the "giving up" message from scsi layer at 06:23:03 :
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x60
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 288/0x20
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 288/0x18
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x20
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x18
Sep 1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003794 (ssd165):
Sep 1 06:18:54 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003794 (ssd165):
Sep 1 06:19:05 myserver SCSI transport failed: reason 'timeout':
retrying command
Sep 1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e8015277700000127770000376d (ssd168):
Sep 1 06:21:57 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e8015277700000127770000376d (ssd168):
Sep 1 06:22:45 myserver SCSI transport failed: reason 'timeout':
retrying command
Sep 1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003787 (ssd166):
Sep 1 06:23:03 myserver SCSI transport failed: reason 'timeout':
giving up
Sep 1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM
vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer
300ce41c340 on device 0x1200000003a to DMP
Sep 1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM
vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean -
/dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/5935
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system disabled
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone -
/dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block
0/265984
It seems VxDMP gets the I/O error at the same time as MPxIO : I though
MPxIO would have conceal the I/O error until failover has occured, which
is not the case.
As a workaround, I increased the VxDMP
recoveryotion/fixedretry/retrycount tunable from 5 to 20 to give MPxIO a
chance to failover before VxDMP fails, but I still don't understand why
VxVM catch the scsi errors.
Any advice ?
thanks.
--
Sebastien DAUBIGNE
***@atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
Sebastien DAUBIGNE
***@atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix