Discussion:
[Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Sebastien DAUBIGNE
2010-09-16 10:03:17 UTC
Dear Vx-addicts,

We encountered a failover issue on this configuration:

- Solaris 9 HW 9/05
- SUN SAN (SFS) 4.4.15
- Emulex with SUN generic driver (emlx)
- VxVM 5.0-2006-05-11a

- storage on HP SAN (XP 24K).


Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
support imposed the Solaris native solution for multipathing:

VxVM ==> VxDMP ==> MPxIO ==> FCP ...

We have 2 paths to the switch, linked to 2 paths to the storage, so the
LUNs have 4 paths, with active/active support.
Failover operation has been tested successfully by offlining each port
successively on the SAN.

We regularly have transient I/O errors (SCSI timeouts, I/O error retries
with "Unit attention") due to SAN-side issues. Usually these errors are
handled transparently by MPxIO/VxVM without impact on the applications.

Now for the incident we encountered:

One of the SAN ports was reset; consequently, there were some transient
I/O errors.
The other SAN port was OK, so the MPxIO multipathing layer should have
failed the I/O over to the other path, without transmitting the error to
the VxDMP layer.
For some reason, it did not fail over the I/O before VxVM caught it as an
unrecoverable I/O error, disabling the subdisk and consequently the
filesystem.

Note the "giving up" message from the SCSI layer at 06:23:03:

Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode 288/0x60
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x60
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode 288/0x20
Sep 1 06:18:54 myserver vxdmp: [ID 917986 kern.notice] NOTICE: VxVM
vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode 288/0x18
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x20
Sep 1 06:18:54 myserver vxdmp: [ID 824220 kern.notice] NOTICE: VxVM
vxdmp V-5-0-111 disabled dmpnode 288/0x18
Sep 1 06:18:54 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003794 (ssd165):
Sep 1 06:18:54 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:19:05 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003794 (ssd165):
Sep 1 06:19:05 myserver SCSI transport failed: reason 'timeout':
retrying command
Sep 1 06:21:57 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e8015277700000127770000376d (ssd168):
Sep 1 06:21:57 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:22:45 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e8015277700000127770000376d (ssd168):
Sep 1 06:22:45 myserver SCSI transport failed: reason 'timeout':
retrying command
Sep 1 06:23:03 myserver scsi: [ID 107833 kern.warning] WARNING:
/scsi_vhci/***@g60060e80152777000001277700003787 (ssd166):
Sep 1 06:23:03 myserver SCSI transport failed: reason 'timeout':
giving up
Sep 1 06:23:03 myserver vxio: [ID 539309 kern.warning] WARNING: VxVM
vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error buffer
300ce41c340 on device 0x1200000003a to DMP
Sep 1 06:23:03 myserver vxio: [ID 771159 kern.warning] WARNING: VxVM
vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write error
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean -
/dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block 0/5935
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system disabled
Sep 1 06:23:03 myserver vxfs: [ID 702911 kern.warning] WARNING: msgcnt
3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone -
/dev/vx/dsk/mydg/vol1 file system meta data write error in dev/block
0/265984


It seems VxDMP gets the I/O error at the same time as MPxIO: I thought
MPxIO would have concealed the I/O error until failover had occurred, which
is not the case.

As a workaround, I increased the VxDMP
recoveryoption/fixedretry/retrycount tunable from 5 to 20 to give MPxIO a
chance to fail over before VxDMP fails the path, but I still don't
understand why VxVM catches the SCSI errors.
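
(For reference, this tunable is changed per enclosure with vxdmpadm, roughly
like this; hp_xp0 below is only a placeholder for the real enclosure name
reported by listenclosure:)

# vxdmpadm listenclosure all
# vxdmpadm setattr enclosure hp_xp0 recoveryoption=fixedretry retrycount=20
# vxdmpadm getattr enclosure hp_xp0 recoveryoption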

Any advice?

thanks.
--
Sebastien DAUBIGNE
***@atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
Victor Engle
2010-09-16 10:15:09 UTC
Which version of Veritas? Version 4.1 MP2 and version 5.x introduced a
feature called DMP fast recovery. It was probably supposed to be
called DMP fast fail, but "recovery" sounds better. It is supposed to
fail suspect paths more aggressively to speed up failover. But when
you only have one VxVM DMP path, as is the case with MPxIO, and
fast recovery fails that path, then you're in trouble. In version 5.x
it is possible to disable this feature.
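
On a release that recognizes the tunable, checking and disabling it looks
roughly like this (the output layout follows the gettune listings shown
further down this thread):

# vxdmpadm gettune dmp_fast_recovery
Tunable                        Current Value Default Value
------------------------------ ------------- -------------
dmp_fast_recovery                         on            on

# vxdmpadm settune dmp_fast_recovery=off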

Google DMP fast recovery.

http://seer.entsupport.symantec.com/docs/307959.htm

I can imagine there must have been some internal fights at Symantec
between product management and QA to get that feature released.

Vic
Sebastien DAUBIGNE
2010-09-16 13:40:44 UTC
Thank you Victor and William, it seems to be a very good lead.

Unfortunately, this tunable seems not to be supported in the VxVM version
installed on my system:

> vxdmpadm gettune dmp_fast_recovery
VxVM vxdmpadm ERROR V-5-1-12015 Incorrect tunable
vxdmpadm gettune [tunable name]
Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count,
dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open,
dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval, dmp_path_age,
or dmp_stat_interval

Something is odd, because my version is 5.0 MP3 Solaris SPARC, and according
to http://seer.entsupport.symantec.com/docs/316981.htm this tunable
should be available.

> modinfo | grep -i vx
38 7846a000 3800e 288 1 vxdmp (VxVM 5.0-2006-05-11a: DMP Drive)
40 784a4000 334c40 289 1 vxio (VxVM 5.0-2006-05-11a I/O driver)
42 783ec71d df8 290 1 vxspec (VxVM 5.0-2006-05-11a control/st)
296 78cfb0a2 c6b 291 1 vxportal (VxFS 5.0_REV-5.0A55_sol portal )
297 78d6c000 1b9d4f 8 1 vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
298 78f18000 a270 292 1 fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
--
Sebastien DAUBIGNE
***@atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
Joshua Fielden
2010-09-16 14:50:34 UTC
dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack and send path inquiry/status CDBs directly from the HBA in order to bypass long SCSI queues and recover paths faster. With a TPD (third-party driver) such as MPxIO, bypassing the stack means we bypass the TPD completely, and interactions such as this can happen. The vxesd (event-source daemon) is another 5.0/MP2 backport addition that's moot in the presence of a TPD.

From your modinfo, you're not actually running MP3. This technote (http://seer.entsupport.symantec.com/docs/327057.htm) isn't exactly your scenario, but looking for partially-installed pkgs is a good start to getting your server correctly installed; then the tuneable should work -- very early 5.0 versions had a differently-named tuneable I can't find in my mail archive ATM.
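
A quick way to check what is actually installed before chasing the tuneable
(both commands also appear elsewhere in this thread):

# pkginfo -l VRTSvxvm | grep -i version
# modinfo | grep -i vx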

Cheers,

Jf

Sebastien DAUBIGNE
2010-10-06 16:31:50 UTC
Hi,

I come back with my dmp_fast_recovery issue (VxDMP fails the path before
MPxIO gets a chance to fail over to the alternate path).
As stated previously, I am running 5.0 GA, and this tunable is not
supported in this release. However, I still don't know whether VxVM 5.0 GA
silently bypasses the MPxIO stack for error recovery.

Now I am trying to determine whether upgrading to MP3 will resolve this
issue (which occurred rarely).

Could anyone (maybe Joshua?) explain whether the behaviour of 5.0 GA without
the tunable is functionally identical to dmp_fast_recovery=0 or
dmp_fast_recovery=1? Maybe the mechanism has been implemented in 5.0
without the option to disable it (which could explain my issue)?

Joshua, you mentioned another tuneable for 5.0, but looking at the list I
can't find it:

> vxdmpadm gettune all
Tunable                        Current Value Default Value
------------------------------ ------------- -------------
dmp_failed_io_threshold                57600         57600
dmp_retry_count                            5             5
dmp_pathswitch_blks_shift                 11            11
dmp_queue_depth                           32            32
dmp_cache_open                            on            on
dmp_daemon_count                          10            10
dmp_scsi_timeout                          30            30
dmp_delayq_interval                       15            15
dmp_path_age                               0           300
dmp_stat_interval                          1             1
dmp_health_time                            0            60
dmp_probe_idle_lun                        on            on
dmp_log_level                              4             1

Cheers.
Venkata Sreenivasa Rao Nagineni
2010-10-06 17:08:08 UTC
Hi Sebastien,

In the first mail you mentioned that you are using MPxIO to control the XP24K array. Why are you using MPxIO here?

Thanks,
Venkata Sreenivasarao Nagineni,
Symantec
Dedhi Sujatmiko
2010-10-12 00:51:32 UTC
On Wed, 6 Oct 2010 10:08:08 -0700
Post by Venkata Sreenivasa Rao Nagineni
Hi Sebastien,
In the first mail you mentioned that you are using mpxio to control the XP24K array. Why are you using mpxio here?
I guess this part of his email said that "Multipathing is managed by MPxIO (not VxDMP) because the SAN team and HP
support imposed the Solaris native solution for multipathing"
--
***@gmail.com <***@gmail.com>
Sebastien DAUBIGNE
2010-09-16 15:10:28 UTC
Sorry, my mistake: the VxVM version is 5.0 GA, not 5.0 MP3.

The 316981 note states that fast_recovery is available in 5.0, but neither
the manpage, nor the administration guide, nor the vxdmpadm command
recognizes it.

However, I don't know whether 5.0 GA behaviour is equivalent to
dmp_fast_recovery=off or dmp_fast_recovery=on.
The note states "In the case of a single path failure, MPxIO does not
notify DMP of the error, therefore, dmp_fast_recovery has no effect.",
hence it seems this parameter is not an issue in my case (single path
failure).

Maybe I should try updating to the latest MP3 with dmp_fast_recovery=off.

--
Sebastien DAUBIGNE
***@atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
Sebastien DAUBIGNE
2010-09-17 08:13:02 UTC
Steven,

As I said yesterday, I made a mistake with my version, which is 5.0 GA.

According to http://seer.entsupport.symantec.com/docs/316981.htm,
dmp_fast_recovery should be applicable to 5.0 on Solaris:

Products Applied: Volume Manager for UNIX/Linux 4.1 MP2 (Solaris), 4.1 MP2
(Solaris) RP5, 5.0 (Solaris), 5.0 MP1 (Solaris), 5.0 MP3 (Solaris), 5.0 MP3
(Solaris) RP1, 5.0 MP3 (Solaris) RP2, 5.0 MP3 (Solaris) RP3

On 16/09/2010 20:31, Green, Steven wrote:
> Check the technote again ... It specifically lists only AIX as the
> applicable OS. Also, your modinfo output suggests you are not running MP3
> anyway. Here is my modinfo output from a Solaris 10 system running VxVM
> 5.0 MP3:
>
> [kultarr:root]: modinfo | grep -i vx
> 40 7be08000 3e4e0 183 1 vxdmp (VxVM 5.0MP3: DMP Driver)
> 42 7ba00000 209248 184 1 vxio (VxVM 5.0MP3 I/O driver)
> 44 7be073c0 c78 265 1 vxspec (VxVM 5.0MP3 control/status driv)
> 201 7be75228 cb0 266 1 vxportal (VxFS 5.0_REV-5.0MP3A25_sol port)
> 202 7aa00000 1d89e0 21 1 vxfs (VxFS 5.0_REV-5.0MP3A25_sol SunO)
> 224 7abe4000 a9e0 267 1 fdd (VxQIO 5.0_REV-5.0MP3A25_sol Qui)


--
Sebastien DAUBIGNE
***@atosorigin.com - +33(0)5.57.89.31.09
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
Marianne Van Den Berg
2010-09-17 19:41:55 UTC
Your version does not seem to be MP3. As you can see, modinfo does not pick up any MP level.

5.0 MP3 was released October 6, 2008.

Please send output of:
# pkginfo -l VRTSvxvm

Regards

M.

William Havey
2010-09-16 12:45:40 UTC
Sebastien,

I found the following in my notes:

- See http://seer.entsupport.symantec.com/docs/288497.htm and
  http://support.veritas.com/docs/276602

- If MPxIO is enabled on a host, the Veritas DMP tunable "dmp_fast_recovery"
  must be set to off, after SF is installed.

vxdmpadm gettune dmp_fast_recovery

Tunable                        Current Value Default Value
------------------------------ ------------- -------------
dmp_fast_recovery                         on            on

vxdmpadm settune dmp_fast_recovery=off    <<>> where stored?

I am not certain of the effect of not turning the tunable off, but is
dmp_fast_recovery set to off on your system?

Bill
Carl E. Ma
2010-09-21 01:20:53 UTC
Permalink
Hi,

We are using Volume Manager 5.0 on Solaris 9. I am having difficulty interpreting vxtrace output. What do "op" and "concurrency" stand for?

==example of output line===
4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid 6134
4254 END write vdev homevol01 op 4254 block 208575678 len 16 time 0
==end of example===

Doug has a script using fields $9 and $11 to determine whether an operation is random or sequential. Is it still valid in the latest VxVM?
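
(A sequential-vs-random test along those lines usually just compares each I/O's
start block with the previous I/O's block plus its length; on the END vdev
records those happen to be fields $9 and $11. A rough awk sketch of that idea,
a guess at the logic rather than Doug's actual script:

   awk '$2 == "END" && $4 == "vdev" {
            if (prev != "" && $9 == prev + plen) seq++; else rnd++
            prev = $9; plen = $11
        }
        END { print "sequential:", seq+0, "  random:", rnd+0 }' /tmp/vxtrace.out
)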

Thanks in advance,

carl
William Havey
2010-09-21 02:43:47 UTC
Permalink
"op" is short for "operation". Each line in the output of vxtrace is
prefixed with a number. This "op number" tells you which other lines forms a
complete trace of an I/O. Examine the output of vxtrace for all lines
containing "4275" and all these lines represent one I/O.
Post by Carl E. Ma
Hi,
We are using volume manager 5.0 on solaris 9. I have difficulty to
interpret vxtrace output. What does "op" and "concurrency" stand for?
==example of output line===
4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid 6134
4254 END write vdev homevol01 op 4254 block 208575678 len 16 time 0
==end of example===
Doug has a script using field $9 and $11 to determine whether this is a
random or sequential operation. Is it still valid in latest vxvm?
Thanks in advance,
carl
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
Carl Ma
2010-09-21 03:05:52 UTC
Permalink
Thanks, this is important. Is my understanding right that concurrency means this I/O
operation was split into 31 sub-tasks that are running at the same time?
"op" is short for "operation". Each line in the output of vxtrace is prefixed
with a number. This "op number" tells you which other lines forms a complete
trace of an I/O. Examine the output of vxtrace for all lines containing "4275"
and all these lines represent one I/O.
Post by Carl E. Ma
Hi,
We are using volume manager 5.0 on solaris 9. I have difficulty to interpret
vxtrace output. What does "op" and "concurrency" stand for?
==example of output line===
4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid
6134
4254 END write vdev homevol01 op 4254 block 208575678 len 16 time 0
==end of example===
Doug has a script using field $9 and $11 to determine whether this is a
random or sequential operation. Is it still valid in latest vxvm?
Thanks in advance,
carl
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
William Havey
2010-09-21 03:19:31 UTC
Permalink
I've not been able to find an understandable description of "concurrency".
But I do have example vxtrace output from a simple striped volume with a bad
stripe size:

vxtrace -g S1dg1 -f /tmp/appiolab1.out -o dev,disk | pg

1 START write vdev test block 35840 len 64 concurrency 1 pid 10931

2 START write disk c1t3d0s2 op 1 block 20224 len 16

3 START write disk c1t3d1s2 op 1 block 20224 len 16

4 START write disk c1t3d0s2 op 1 block 20240 len 16

5 START write disk c1t3d1s2 op 1 block 20240 len 16

2 END write disk c1t3d0s2 op 1 block 20224 len 16 time 0

3 END write disk c1t3d1s2 op 1 block 20224 len 16 time 0

4 END write disk c1t3d0s2 op 1 block 20240 len 16 time 1

5 END write disk c1t3d1s2 op 1 block 20240 len 16 time 1

1 END write vdev test op 1 block 35840 len 64 time 1

All the "op 1" statements define a complete I/O, which includes 4 separate
I/Os but the concurrency is 1. By your statement the concurrency should be
4.
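
For what it's worth, the four disk I/Os simply reflect the layout: two columns
with a 16-block stripe unit, so one 64-block write turns into four 16-block
column writes. A test volume like that could be created with something along
these lines (the layout is inferred from the trace, not taken from the actual
lab commands):

   # 2-column stripe with a deliberately small 16-block (8 KB) stripe unit
   vxassist -g S1dg1 make test 100m layout=stripe ncol=2 stripeunit=16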

I did find this statement somewhere: concurrency means "the number of
threads monitoring the I/O". But as I say, that statement is meaningless
to me.

Sorry I can be of only a negative sort of help on this.
Post by Carl Ma
Thanks, this is important. My understanding of concurrency means this I/O
operation was split into 31 sub-tasks and are running at the same time?
"op" is short for "operation". Each line in the output of vxtrace is
prefixed with a number. This "op number" tells you which other lines forms a
complete trace of an I/O. Examine the output of vxtrace for all lines
containing "4275" and all these lines represent one I/O.
Hi,
We are using volume manager 5.0 on solaris 9. I have difficulty to
interpret vxtrace output. What does "op" and "concurrency" stand for?
==example of output line===
4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid 6134
4254 END write vdev homevol01 op 4254 block 208575678 len 16 time 0
==end of example===
Doug has a script using field $9 and $11 to determine whether this is a
random or sequential operation. Is it still valid in latest vxvm?
Thanks in advance,
carl
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
------------------------------
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
Carl Ma
2010-09-22 01:25:47 UTC
Permalink
Symantec support was very helpful on this matter. They confirmed that
concurrency is the number of I/O operations on the volume.

For example:

4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid
6134

“Concurrency 31” means there are 31 write operations on volume – homevol01.

Cheers,

carl
Post by William Havey
I've not been able to find an understandable description of "concurrency".
But, I have an example vxtrace output from a simple stripe volume with a bad
vxtrace -g S1dg1 -f /tmp/appiolab1.out -o dev,disk | pg
1 START write vdev test block 35840 len 64 concurrency 1 pid 10931
2 START write disk c1t3d0s2 op 1 block 20224 len 16
3 START write disk c1t3d1s2 op 1 block 20224 len 16
4 START write disk c1t3d0s2 op 1 block 20240 len 16
5 START write disk c1t3d1s2 op 1 block 20240 len 16
2 END write disk c1t3d0s2 op 1 block 20224 len 16 time 0
3 END write disk c1t3d1s2 op 1 block 20224 len 16 time 0
4 END write disk c1t3d0s2 op 1 block 20240 len 16 time 1
5 END write disk c1t3d1s2 op 1 block 20240 len 16 time 1
1 END write vdev test op 1 block 35840 len 64 time 1
All the "op 1" statements define a complete I/O, which includes 4 separate
I/Os but the concurrency is 1. By your statement the concurrency should be 4.
I do have this statement I found somewhere: Concurrency means "the number of
threads monitoring the i/o". But like I say, that statement is meaningless to
me.
Sorry I can be of only a negative sort of help on this.
Post by Carl Ma
Thanks, this is important. My understanding of concurrency means this I/O
operation was split into 31 sub-tasks and are running at the same time?
Post by William Havey
"op" is short for "operation". Each line in the output of vxtrace is
prefixed with a number. This "op number" tells you which other lines forms a
complete trace of an I/O. Examine the output of vxtrace for all lines
containing "4275" and all these lines represent one I/O.
Post by Carl E. Ma
Hi,
We are using volume manager 5.0 on solaris 9. I have difficulty to
interpret vxtrace output. What does "op" and "concurrency" stand for?
==example of output line===
4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid
6134
4254 END write vdev homevol01 op 4254 block 208575678 len 16 time 0
==end of example===
Doug has a script using field $9 and $11 to determine whether this is a
random or sequential operation. Is it still valid in latest vxvm?
Thanks in advance,
carl
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
William Havey
2010-09-22 02:38:48 UTC
Permalink
31 write operations occurring in parallel? Or, perhaps, 31 concurrent I/Os,
some reads and some writes? If that is the case, then running vxbench with
rand_write and nthreads=5, for example, should result in concurrency=5 during
the run of the vxbench utility.

I am skeptical that this will actually occur. Looking at my examples, the
concurrency value and the number of I/O threads do not match.
Post by Carl Ma
Symantec support is very helpful on this matter. He confirmed that
concurrency is the number of I/O operation on the volume.
4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid 6134
“Concurrency 31” means there are 31 write operations on volume – homevol01.
Cheers,
carl
I've not been able to find an understandable description of "concurrency".
But, I have an example vxtrace output from a simple stripe volume with a bad
vxtrace -g S1dg1 -f /tmp/appiolab1.out -o dev,disk | pg
1 START write vdev test block 35840 len 64 concurrency 1 pid 10931
2 START write disk c1t3d0s2 op 1 block 20224 len 16
3 START write disk c1t3d1s2 op 1 block 20224 len 16
4 START write disk c1t3d0s2 op 1 block 20240 len 16
5 START write disk c1t3d1s2 op 1 block 20240 len 16
2 END write disk c1t3d0s2 op 1 block 20224 len 16 time 0
3 END write disk c1t3d1s2 op 1 block 20224 len 16 time 0
4 END write disk c1t3d0s2 op 1 block 20240 len 16 time 1
5 END write disk c1t3d1s2 op 1 block 20240 len 16 time 1
1 END write vdev test op 1 block 35840 len 64 time 1
All the "op 1" statements define a complete I/O, which includes 4 separate
I/Os but the concurrency is 1. By your statement the concurrency should be 4.
I do have this statement I found somewhere: Concurrency means "the number
of threads monitoring the i/o". But like I say, that statement is
meaningless to me.
Sorry I can be of only a negative sort of help on this.
Thanks, this is important. My understanding of concurrency means this I/O
operation was split into 31 sub-tasks and are running at the same time?
"op" is short for "operation". Each line in the output of vxtrace is
prefixed with a number. This "op number" tells you which other lines forms a
complete trace of an I/O. Examine the output of vxtrace for all lines
containing "4275" and all these lines represent one I/O.
Hi,
We are using volume manager 5.0 on solaris 9. I have difficulty to
interpret vxtrace output. What does "op" and "concurrency" stand for?
==example of output line===
4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31 pid 6134
4254 END write vdev homevol01 op 4254 block 208575678 len 16 time 0
==end of example===
Doug has a script using field $9 and $11 to determine whether this is a
random or sequential operation. Is it still valid in latest vxvm?
Thanks in advance,
carl
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
------------------------------
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
Terrie Douglas
2010-09-22 16:16:37 UTC
Permalink
Concurrency indicates the number of concurrent processes that are
detected. This number increases as additional processes are detected.



In the output of the vxtrace data, the number of concurrent processes
running against the volume can be determined by counting the number of
START vdev entries that do not have END vdev entries.
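
A rough sketch of that count over a saved capture, assuming the vdev record
layout shown earlier in this thread (it reports the figure as of the end of the
capture; stop the trace at the moment of interest to sample a point in time):

   # vdev I/Os that have a START record but no matching END record
   awk '$2 == "START" && $4 == "vdev" { open[$1] = 1 }
        $2 == "END"   && $4 == "vdev" { delete open[$1] }
        END { n = 0; for (op in open) n++; print n, "vdev I/Os still outstanding" }' /tmp/vxtrace.out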



(note this is from an older document)

hth





Thanks,

Terrie Douglas
Sr. Prin. Technical Support Engineer
Symantec Software Corporation
Email: ***@symantec.com

Phone: 650-527-3040
Customer Support: 1(800) 342-0652












From: veritas-vx-***@mailman.eng.auburn.edu
[mailto:veritas-vx-***@mailman.eng.auburn.edu] On Behalf Of William
Havey
Sent: Tuesday, September 21, 2010 7:39 PM
To: Carl Ma
Cc: Veritas-***@mailman.eng.auburn.edu
Subject: Re: [Veritas-vx] vxtrace output



31 write operations occurring in parallel? Or, perhaps, 31 concurrent
I/Os, some reads some writes? If this is the case using vxbench
specifying rand_write, nthreads=5, for example, should result in
concurrency=5 during the run of the vxbench utility.

I am skeptical that this will actually occur. I look in my examples and
concurrency and the number of I/O threads do not match.

On Tue, Sep 21, 2010 at 6:25 PM, Carl Ma <***@yahoo.ca> wrote:

Symantec support is very helpful on this matter. He confirmed that
concurrency is the number of I/O operation on the volume.

For example:

4275 START write vdev homevol01 block 5771246912 len 128 concurrency 31
pid 6134

"Concurrency 31" means there are 31 write operations on volume -
homevol01.

Cheers,

carl
Ashish Yajnik
2010-10-06 17:26:38 UTC
Permalink
MPxIO with VxVM is only supported with Sun storage. If you run into problems with MPxIO and SF on XP24K then support will not be able to help you. I would recommend using DMP with XP24K.

Ashish
--------------------------
Sent using BlackBerry


----- Original Message -----
From: veritas-vx-***@mailman.eng.auburn.edu <veritas-vx-***@mailman.eng.auburn.edu>
To: Sebastien DAUBIGNE <***@atosorigin.com>; undisclosed-recipients <"undisclosed-recipients:;"@mailman.eng.auburn.edu>
Cc: Veritas-***@mailman.eng.auburn.edu <Veritas-***@mailman.eng.auburn.edu>
Sent: Wed Oct 06 10:08:08 2010
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue

Hi Sebastien,

In the first mail you mentioned that you are using mpxio to control the XP24K array. Why are you using mpxio here?

Thanks,
Venkata Sreenivasarao Nagineni,
Symantec
Post by Joshua Fielden
-----Original Message-----
Sent: Wednesday, October 06, 2010 9:32 AM
To: undisclosed-recipients
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi,
I come back with my dmp_fast_recovery issue (VxDMP fails the path
before
MPxIO gets a chance to failover on alternate path).
As stated previously, I am running 5.0GA, and this tunable is not
supported in this release. However I still don't know if VxVM 5.0GA
silently bypasses the MPxIO stack for error recovery.
Now I try to determine if upgrading to MP3 will resolve this issue
(which rarely occured).
Could anyone (maybe Joshua ?) explain if the behaviour of 5.0GA without
tunable is functionally identical to dmp_fast_recovery=0 or
dmp_fast_recovery=1 ? Maybe the mechanism has been implemented in 5.0
without the option to disable it (this could explain my issue) ?
Joshua, you mentioned another tuneable for 5.0 but looking at the list
I
vxdmpadm gettune all
Tunable Current Value Default Value
------------------------------ ------------- -------------
dmp_failed_io_threshold 57600 57600
dmp_retry_count 5 5
dmp_pathswitch_blks_shift 11 11
dmp_queue_depth 32 32
dmp_cache_open on on
dmp_daemon_count 10 10
dmp_scsi_timeout 30 30
dmp_delayq_interval 15 15
dmp_path_age 0 300
dmp_stat_interval 1 1
dmp_health_time 0 60
dmp_probe_idle_lun on on
dmp_log_level 4 1
Cheers.
dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack
and send path inquiry/status CDBs directly from the HBA in order to
bypass long SCSI queues and recover paths faster. With a TPD (third-
party driver) such as MPxIO, bypassing the stack means we bypass the
TPD completely, and interactions such as this can happen. The vxesd
(event-source daemon) is another 5.0/MP2 backport addition that's moot
in the presence of a TPD.
From your modinfo, you're not actually running MP3. This technote
(http://seer.entsupport.symantec.com/docs/327057.htm) isn't exactly
your scenario, but looking for partially-installed pkgs is a good start
to getting your server correctly installed, then the tuneable should
work -- very early 5.0 versions had a differently-named tuneable I
can't find in my mail archive ATM.
Cheers,
Jf
-----Original Message-----
Sent: Thursday, September 16, 2010 7:41 AM
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Thank you Victor and William, it seems to be a very good lead.
Unfortunately, this tunable seems not to be supported in the VxVM
Post by William Havey
vxdmpadm gettune dmp_fast_recovery
VxVM vxdmpadm ERROR V-5-1-12015 Incorrect tunable
vxdmpadm gettune [tunable name]
Note - Tunable name can be dmp_failed_io_threshold, dmp_retry_count,
dmp_pathswitch_blks_shift, dmp_queue_depth, dmp_cache_open,
dmp_daemon_count, dmp_scsi_timeout, dmp_delayq_interval,
dmp_path_age,
or dmp_stat_interval
Something odd because my version is 5.0 MP3 Solaris SPARC, and
according
to http://seer.entsupport.symantec.com/docs/316981.htm this tunable
should be available.
Post by William Havey
modinfo | grep -i vx
38 7846a000 3800e 288 1 vxdmp (VxVM 5.0-2006-05-11a: DMP
Drive)
40 784a4000 334c40 289 1 vxio (VxVM 5.0-2006-05-11a I/O driver)
42 783ec71d df8 290 1 vxspec (VxVM 5.0-2006-05-11a
control/st)
296 78cfb0a2 c6b 291 1 vxportal (VxFS 5.0_REV-5.0A55_sol portal
)
297 78d6c000 1b9d4f 8 1 vxfs (VxFS 5.0_REV-5.0A55_sol SunOS 5)
298 78f18000 a270 292 1 fdd (VxQIO 5.0_REV-5.0A55_sol Quick )
Post by William Havey
Which version of veritas? Version 4/2MP2 and version 5.x introduced
a
Post by William Havey
feature called DMP fast recovery. It was probably supposed to be
called DMP fast fail but "recovery" sounds better. It is supposed to
fail suspect paths more aggressively to speed up failover. But when
you only have one vxvm DMP path, as is the case with MPxIO, and
fast-recovery fails that path, then you're in trouble. In version
5.x,
Post by William Havey
it is possible to disable this feature.
Google DMP fast recovery.
http://seer.entsupport.symantec.com/docs/307959.htm
I can imagine there must have been some internal fights at symantec
between product management and QA to get that feature released.
Vic
On Thu, Sep 16, 2010 at 6:03 AM, Sebastien DAUBIGNE
Post by Sebastien DAUBIGNE
Dear Vx-addicts,
- Solaris 9 HW 9/05
- SUN SAN (SFS) 4.4.15
- Emulex with SUN generic driver (emlx)
- VxVM 5.0-2006-05-11a
- storage on HP SAN (XP 24K).
Multipathing is managed by MPxIO (not VxDMP) because the SAN team
and HP
Post by William Havey
Post by Sebastien DAUBIGNE
VxVM ==> VxDMP ==> MPxIO ==> FCP ...
We have 2 paths to the switch, linked to 2 paths to the storage, so
the
Post by William Havey
Post by Sebastien DAUBIGNE
LUNs have 4 paths, with active/active support.
Failover operation has been tested successfully by offlining each
port
Post by William Havey
Post by Sebastien DAUBIGNE
successively on the SAN.
We regulary have transient I/O errors (scsi timeout, I/O error
retries
Post by William Havey
Post by Sebastien DAUBIGNE
with "Unit attention"), due to SAN-side issues. Usually these
errors are
Post by William Havey
Post by Sebastien DAUBIGNE
transparently managed by MPxIO/VxVM without impact on the
applications.
Post by William Havey
Post by Sebastien DAUBIGNE
One of the SAN port was reset , consequently there were some
transient
Post by William Havey
Post by Sebastien DAUBIGNE
I/O error.
The other SAN port was OK, so the MPxIO multipathing layer should
have
Post by William Havey
Post by Sebastien DAUBIGNE
failover the I/O on the other path, without transmiting the error
to the
Post by William Havey
Post by Sebastien DAUBIGNE
VxDMP layer.
For some reason, it did not failover the I/O before VxVM caught it
as
Post by William Havey
Post by Sebastien DAUBIGNE
unrecoverable I/O error, disabling the subdisk and consequently the
filesystem.
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxdmp V-5-0-112 disabled path 118/0x558 belonging to the dmpnode
288/0x60
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxdmp V-5-0-111 disabled dmpnode 288/0x60
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxdmp V-5-0-112 disabled path 118/0x538 belonging to the dmpnode
288/0x20
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxdmp V-5-0-112 disabled path 118/0x550 belonging to the dmpnode
288/0x18
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxdmp V-5-0-111 disabled dmpnode 288/0x20
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxdmp V-5-0-111 disabled dmpnode 288/0x18
Sep 1 06:18:54 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:19:05 myserver SCSI transport failed: reason
retrying command
Sep 1 06:21:57 myserver SCSI transport failed: reason
'tran_err': retrying command
Sep 1 06:22:45 myserver SCSI transport failed: reason
retrying command
Sep 1 06:23:03 myserver SCSI transport failed: reason
giving up
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxio V-5-3-0 voldmp_errbuf_sio_start: Failed to flush the error
buffer
Post by William Havey
Post by Sebastien DAUBIGNE
300ce41c340 on device 0x1200000003a to DMP
VxVM
Post by William Havey
Post by Sebastien DAUBIGNE
vxio V-5-0-2 Subdisk mydisk_2-02 block 5935: Uncorrectable write
error
msgcnt
Post by William Havey
Post by Sebastien DAUBIGNE
1 mesg 037: V-2-37: vx_metaioerr - vx_logbuf_clean -
/dev/vx/dsk/mydg/vol1 file system meta data write error in
dev/block 0/5935
msgcnt
Post by William Havey
Post by Sebastien DAUBIGNE
2 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/mydg/vol1 file system
disabled
msgcnt
Post by William Havey
Post by Sebastien DAUBIGNE
3 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone -
/dev/vx/dsk/mydg/vol1 file system meta data write error in
dev/block
Post by William Havey
Post by Sebastien DAUBIGNE
0/265984
It seems VxDMP gets the I/O error at the same time as MPxIO : I
though
Post by William Havey
Post by Sebastien DAUBIGNE
MPxIO would have conceal the I/O error until failover has occured,
which
Post by William Havey
Post by Sebastien DAUBIGNE
is not the case.
As a workaround, I increased the VxDMP
recoveryotion/fixedretry/retrycount tunable from 5 to 20 to give
MPxIO a
Post by William Havey
Post by Sebastien DAUBIGNE
chance to failover before VxDMP fails, but I still don't understand
why
Post by William Havey
Post by Sebastien DAUBIGNE
VxVM catch the scsi errors.
Any advice ?
thanks.
--
Sebastien DAUBIGNE
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
--
Sebastien DAUBIGNE
AtosOrigin Infogerance - AIS/D1/SudOuest/Bordeaux/IS-Unix
_______________________________________________
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-vx
_______________________________________________
Veritas-vx maillist - Veritas-***@mailman.eng.auburn.edu
http://ma
Victor Engle
2010-10-06 19:48:23 UTC
Permalink
This is absolutely false!

MPxIO is an excellent multipathing solution and is supported by all
major storage vendors including HP. This issue discussed in this
thread has to do with improper behavior of DMP when multipathing is
managed by a native layer like MPxIO.

Storage and OS vendors have no motivation to lock you into a veritas solution.

Or, Ashish, are you saying that Symantec is locking its
customers into DMP? Hitachi, EMC, NetApp, and HP all have supported
configurations which include VxVM and native OS multipathing stacks.

Thanks,
Vic


On Wed, Oct 6, 2010 at 1:26 PM, Ashish Yajnik
Post by Ashish Yajnik
MPxIO with VxVM is only supported with Sun storage. If you run into problems with MPxIO and SF on XP24K then support will not be able to help you. I would recommend using DMP with XP24K.
Ashish
--------------------------
Sent using BlackBerry
----- Original Message -----
Sent: Wed Oct 06 10:08:08 2010
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi Sebastien,
In the first mail you mentioned that you are using mpxio to control the XP24K array. Why are you using mpxio here?
Thanks,
Venkata Sreenivasarao Nagineni,
Symantec
Post by Joshua Fielden
-----Original Message-----
Sent: Wednesday, October 06, 2010 9:32 AM
To: undisclosed-recipients
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
  Hi,
I come back with my dmp_fast_recovery issue (VxDMP fails the path
before
MPxIO gets a chance to failover on alternate path).
As stated previously, I am running 5.0GA, and this tunable is not
supported in this release. However I still don't know if VxVM 5.0GA
silently bypasses the MPxIO stack for error recovery.
Now I try to determine if upgrading to MP3 will resolve this issue
(which rarely occured).
Could anyone (maybe Joshua ?) explain if the behaviour of 5.0GA without
tunable  is functionally identical to dmp_fast_recovery=0 or
dmp_fast_recovery=1 ? Maybe the mechanism has been implemented in 5.0
without the option to disable it (this could explain my issue) ?
Joshua, you mentioned another tuneable for 5.0 but looking at the list
I
 > vxdmpadm gettune all
             Tunable               Current Value  Default Value
------------------------------    -------------  -------------
dmp_failed_io_threshold               57600            57600
dmp_retry_count                           5                5
dmp_pathswitch_blks_shift                11               11
dmp_queue_depth                          32               32
dmp_cache_open                           on               on
dmp_daemon_count                         10               10
dmp_scsi_timeout                         30               30
dmp_delayq_interval                      15               15
dmp_path_age                              0              300
dmp_stat_interval                         1                1
dmp_health_time                           0               60
dmp_probe_idle_lun                       on               on
dmp_log_level                             4                1
Cheers.
Post by Joshua Fielden
dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack
and send path inquiry/status CDBs directly from the HBA in order to
bypass long SCSI queues and recover paths faster. With a TPD (third-
party driver) such as MPxIO, bypassing the stack means we bypass the
TPD completely, and interactions such as this can happen. The vxesd
(event-source daemon) is another 5.0/MP2 backport addition that's moot
in the presence of a TPD.
Post by Joshua Fielden
 
Christian Gerbrandt
2010-10-06 21:02:00 UTC
Permalink
We support several 3rd-party multipathing solutions, like MPxIO or EMC's PowerPath.
However, MPxIO is only supported on Sun-branded storage.
DMP has also been known to outperform other solutions in certain configurations.

When a 3rd-party multipathing driver is in use, DMP falls back into TPD mode (Third Party Driver) and lets the underlying multipathing do its job.
That's why you see just a single path per disk in VxVM, even though you know each disk has more than one physical path.
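
A quick way to confirm which mode DMP is in on a given host (the disk name is a
placeholder; output details vary by release):

   # enclosure/type shows whether DMP claims the array natively or via a TPD
   vxdmpadm listenclosure all
   # numpaths: 1 on a multipathed LUN usually means the TPD (MPxIO) owns the paths
   vxdisk list <dmpnodename>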

I would recommend installing the 5.0 MP3 RP4 patch and then checking again whether MPxIO is still misbehaving.
Or, ideally, switch over to DMP.
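
On a Solaris 9 / SFS host, that switch is roughly the following (a sketch to be
validated against the SFS and Storage Foundation documentation for these exact
releases; it needs a reconfiguration reboot):

   # /kernel/drv/scsi_vhci.conf : disable MPxIO globally, then do a reconfiguration reboot
   mpxio-disable="yes";
   # after the reboot, let VxVM/DMP rediscover and claim all four paths per LUN
   vxdctl enable
   vxdisk list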

-----Original Message-----
From: veritas-vx-***@mailman.eng.auburn.edu [mailto:veritas-vx-***@mailman.eng.auburn.edu] On Behalf Of Victor Engle
Sent: 06 October 2010 20:48
To: Ashish Yajnik
Cc: ***@atosorigin.com; "undisclosed-recipients:, "@mailman.eng.auburn.edu; Veritas-***@mailman.eng.auburn.edu
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue

This is absolutely false!

MPxIO is an excellent multipathing solution and is supported by all major storage vendors including HP. This issue discussed in this thread has to do with improper behavior of DMP when multipathing is managed by a native layer like MPxIO.

Storage and OS vendors have no motivation to lock you into a veritas solution.

Or, Ashish, are you saying that your Symantec is locking Symantec customers into DMP? Hitachi, EMC, NetApp and HP all have supported configurations which include vxvm and native OS multipathing stacks.

Thanks,
Vic
Post by Ashish Yajnik
MPxIO with VxVM is only supported with Sun storage. If you run into problems with MPxIO and SF on XP24K then support will not be able to help you. I would recommend using DMP with XP24K.
Ashish
--------------------------
Sent using BlackBerry
----- Original Message -----
undisclosed-recipients
Sent: Wed Oct 06 10:08:08 2010
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi Sebastien,
In the first mail you mentioned that you are using mpxio to control the XP24K array. Why are you using mpxio here?
Thanks,
Venkata Sreenivasarao Nagineni,
Symantec
Post by Joshua Fielden
-----Original Message-----
Sent: Wednesday, October 06, 2010 9:32 AM
To: undisclosed-recipients
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi,
I come back with my dmp_fast_recovery issue (VxDMP fails the path
before MPxIO gets a chance to failover on alternate path).
As stated previously, I am running 5.0GA, and this tunable is not
supported in this release. However I still don't know if VxVM 5.0GA
silently bypasses the MPxIO stack for error recovery.
Now I try to determine if upgrading to MP3 will resolve this issue
(which rarely occured).
Could anyone (maybe Joshua ?) explain if the behaviour of 5.0GA
without tunable is functionally identical to dmp_fast_recovery=0 or
dmp_fast_recovery=1 ? Maybe the mechanism has been implemented in 5.0
without the option to disable it (this could explain my issue) ?
Joshua, you mentioned another tuneable for 5.0 but looking at the
vxdmpadm gettune all
Tunable Current Value Default Value
------------------------------ ------------- -------------
dmp_failed_io_threshold 57600 57600
dmp_retry_count 5 5
dmp_pathswitch_blks_shift 11 11
dmp_queue_depth 32 32
dmp_cache_open on on
dmp_daemon_count 10 10
dmp_scsi_timeout 30 30
dmp_delayq_interval 15 15
dmp_path_age 0 300
dmp_stat_interval 1 1
dmp_health_time 0 60
dmp_probe_idle_lun on on
dmp_log_level 4 1
Cheers.
dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack
and send path inquiry/status CDBs directly from the HBA in order to
bypass long SCSI queues and recover paths faster. With a TPD (third-
party driver) such as MPxIO, bypassing the stack means we bypass the
TPD completely, and interactions such as this can happen. The vxesd
(event-source daemon) is another 5.0/MP2 backport addition that's
moot in the presence of a TPD.
Sebastien DAUBIGNE
2010-10-07 09:12:33 UTC
Permalink
Hi,

Thank you all for your feedback.

I am very surprised that MPxIO+DMP is only supported on Sun storage:
as stated in my very first message, the MPxIO solution was imposed by
our SAN team, following HP recommendations.

When we joined this SAN, I asked to go with DMP as the multipathing layer,
because we usually adopt that solution for all our
Solaris+VxVM+dedicated storage configurations, regardless of the storage
hardware: for instance, with EMC hardware we use DMP rather than PowerPath
and it works like a charm.
Unfortunately, the SAN team and HP told us that for Solaris servers,
including those with VxVM, we must use MPxIO or they would not
support the configuration, hence we used MPxIO.

Now for the issue, the question is still: will 5.0 bypass the MPxIO
layer for error detection, or is this mechanism only implemented
starting at MP2?
The idea is to be sure that this is a fast recovery issue and not
anything else.
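
One way to pin down the installed level before going further (package name as on
Solaris SPARC; the exact VERSION/PSTAMP strings depend on the MP level):

   # VERSION and PSTAMP show whether this is 5.0 GA or an MP/RP level
   pkginfo -l VRTSvxvm | egrep 'VERSION|PSTAMP'
   modinfo | grep -i vxio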

Cheers,
Post by Christian Gerbrandt
We support several 3rd party multipathing solutions, like MPxIO or EMCs PowerPath.
However, MPxIO is only supported on Sun branded Storages.
DMP has also been known to outperform other solutions in certain configurations.
When a 3rd party multipathing is in use, DMP will fail back into TPD mode (Third Party Driver), and let the underlaying multipathing do its job.
That's when you see just a single disk in VxVM, when you know you have more than one path per disk.
I would recommend to install the 5.0 MP3 RP4 patch, and then check again if MPxIO is still misbehaving.
Or ideally, switch over to DMP.
-----Original Message-----
Sent: 06 October 2010 20:48
To: Ashish Yajnik
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
This is absolutely false!
MPxIO is an excellent multipathing solution and is supported by all major storage vendors including HP. This issue discussed in this thread has to do with improper behavior of DMP when multipathing is managed by a native layer like MPxIO.
Storage and OS vendors have no motivation to lock you into a veritas solution.
Or, Ashish, are you saying that your Symantec is locking Symantec customers into DMP? Hitachi, EMC, NetApp and HP all have supported configurations which include vxvm and native OS multipathing stacks.
Thanks,
Vic
Post by Ashish Yajnik
MPxIO with VxVM is only supported with Sun storage. If you run into problems with MPxIO and SF on XP24K then support will not be able to help you. I would recommend using DMP with XP24K.
Ashish
--------------------------
Sent using BlackBerry
----- Original Message -----
undisclosed-recipients
Sent: Wed Oct 06 10:08:08 2010
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi Sebastien,
In the first mail you mentioned that you are using mpxio to control the XP24K array. Why are you using mpxio here?
Thanks,
Venkata Sreenivasarao Nagineni,
Symantec
Post by Joshua Fielden
-----Original Message-----
Sent: Wednesday, October 06, 2010 9:32 AM
To: undisclosed-recipients
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi,
I come back with my dmp_fast_recovery issue (VxDMP fails the path
before MPxIO gets a chance to failover on alternate path).
As stated previously, I am running 5.0GA, and this tunable is not
supported in this release. However I still don't know if VxVM 5.0GA
silently bypasses the MPxIO stack for error recovery.
Now I try to determine if upgrading to MP3 will resolve this issue
(which rarely occured).
Could anyone (maybe Joshua ?) explain if the behaviour of 5.0GA
without tunable is functionally identical to dmp_fast_recovery=0 or
dmp_fast_recovery=1 ? Maybe the mechanism has been implemented in 5.0
without the option to disable it (this could explain my issue) ?
Joshua, you mentioned another tuneable for 5.0 but looking at the
vxdmpadm gettune all
Tunable Current Value Default Value
------------------------------ ------------- -------------
dmp_failed_io_threshold 57600 57600
dmp_retry_count 5 5
dmp_pathswitch_blks_shift 11 11
dmp_queue_depth 32 32
dmp_cache_open on on
dmp_daemon_count 10 10
dmp_scsi_timeout 30 30
dmp_delayq_interval 15 15
dmp_path_age 0 300
dmp_stat_interval 1 1
dmp_health_time 0 60
dmp_probe_idle_lun on on
dmp_log_level 4 1
Cheers.
dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack
and send path inquiry/status CDBs directly from the HBA in order to
bypass long SCSI queues and recover paths faster. With a TPD (third-
party driver) such as MPxIO, bypassing the stack means we bypass the
TPD completely, and interactions such as this can happen. The vxesd
(event-source daemon) is another 5.0/MP2 backport addition that's
moot in the presence of a TPD.
Sebastien DAUBIGNE
2010-10-07 09:27:43 UTC
Permalink
I found this technote which confirmed your statement Christian :
http://www.symantec.com/business/support/index?page=content&id=TECH51507

"- Storage Foundation on Solaris sparc and X64 is supported with MPxIO
on Sun Storage hardware only. Storage Foundation does not support MPxIO
on non-sun storage arrays. For Non-Sun storage hardware, DMP is
required. If MPxIO is enabled on a host, the tunable dmp_fast_recovery
must be set to off: vxdmpadm settune dmp_fast_recovery=off."
Post by Sebastien DAUBIGNE
Hi,
Thank you all for your feedback.
as stated in my very first message, the MPxIO solution was imposed by
our SAN team, following HP recommendations.
When we joined this SAN, I asked to go with DMP for multipathing layer
because we usually adopt this solution for all our
Solaris+VxVM+dedicated storage configuration, regardless of the
storage hardware : for instance with EMC hardware we use DMP and not
Powerpath and it works like a charm.
Unfortunately the SAN team and HP told us that for Solaris servers
incluing thoses with VxVM, we must use MPxIO otherwise they would not
support it, hence we used MPxIO.
Now for the issue, the question is still : will 5.0 bypass the MPxIO
layer for error detection or is this functionality only implemented
starting at MP2 ?
The idea is to be sure that this is a fast recovery issue and not
anything else.
Cheers,
Post by Christian Gerbrandt
We support several 3rd party multipathing solutions, like MPxIO or EMCs PowerPath.
However, MPxIO is only supported on Sun branded Storages.
DMP has also been known to outperform other solutions in certain configurations.
When a 3rd party multipathing is in use, DMP will fail back into TPD
mode (Third Party Driver), and let the underlaying multipathing do
its job.
That's when you see just a single disk in VxVM, when you know you
have more than one path per disk.
I would recommend to install the 5.0 MP3 RP4 patch, and then check
again if MPxIO is still misbehaving.
Or ideally, switch over to DMP.
-----Original Message-----
Victor Engle
Sent: 06 October 2010 20:48
To: Ashish Yajnik
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
This is absolutely false!
MPxIO is an excellent multipathing solution and is supported by all
major storage vendors including HP. This issue discussed in this
thread has to do with improper behavior of DMP when multipathing is
managed by a native layer like MPxIO.
Storage and OS vendors have no motivation to lock you into a veritas solution.
Or, Ashish, are you saying that your Symantec is locking Symantec
customers into DMP? Hitachi, EMC, NetApp and HP all have supported
configurations which include vxvm and native OS multipathing stacks.
Thanks,
Vic
On Wed, Oct 6, 2010 at 1:26 PM, Ashish
Post by Ashish Yajnik
MPxIO with VxVM is only supported with Sun storage. If you run into
problems with MPxIO and SF on XP24K then support will not be able to
help you. I would recommend using DMP with XP24K.
Ashish
--------------------------
Sent using BlackBerry
----- Original Message -----
undisclosed-recipients
Sent: Wed Oct 06 10:08:08 2010
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi Sebastien,
In the first mail you mentioned that you are using mpxio to control
the XP24K array. Why are you using mpxio here?
Thanks,
Venkata Sreenivasarao Nagineni,
Symantec
Post by Joshua Fielden
-----Original Message-----
Sent: Wednesday, October 06, 2010 9:32 AM
To: undisclosed-recipients
Subject: Re: [Veritas-vx] Solaris-SFS / MPxIO / VxVM failover issue
Hi,
I come back with my dmp_fast_recovery issue (VxDMP fails the path
before MPxIO gets a chance to failover on alternate path).
As stated previously, I am running 5.0GA, and this tunable is not
supported in this release. However I still don't know if VxVM 5.0GA
silently bypasses the MPxIO stack for error recovery.
Now I try to determine if upgrading to MP3 will resolve this issue
(which rarely occured).
Could anyone (maybe Joshua ?) explain if the behaviour of 5.0GA
without tunable is functionally identical to dmp_fast_recovery=0 or
dmp_fast_recovery=1 ? Maybe the mechanism has been implemented in 5.0
without the option to disable it (this could explain my issue) ?
Joshua, you mentioned another tuneable for 5.0 but looking at the
vxdmpadm gettune all
Tunable Current Value Default Value
------------------------------ ------------- -------------
dmp_failed_io_threshold 57600 57600
dmp_retry_count 5 5
dmp_pathswitch_blks_shift 11 11
dmp_queue_depth 32 32
dmp_cache_open on on
dmp_daemon_count 10 10
dmp_scsi_timeout 30 30
dmp_delayq_interval 15 15
dmp_path_age 0 300
dmp_stat_interval 1 1
dmp_health_time 0 60
dmp_probe_idle_lun on on
dmp_log_level 4 1
Cheers.
dmp_fast_recovery is a mechanism by which we bypass the sd/scsi stack
and send path inquiry/status CDBs directly from the HBA in order to
bypass long SCSI queues and recover paths faster. With a TPD (third-
party driver) such as MPxIO, bypassing the stack means we bypass the
TPD completely, and interactions such as this can happen. The vxesd
(event-source daemon) is another 5.0/MP2 backport addition that's
moot in the presence of a TPD.