Hi,
I am doing POC testing and only get the message shown after the command below.
python vsanDiskFaultInjection.pyc -p -d naa.xxxxxxxxxxxx
When FRA is enabled, inject permanent errors only if there is a single DG.
I then waited 15 minutes and more, but still nothing happens (no errors are shown even after refreshing the vSphere Client).
Running on VMware ESXi, 7.0.0, 16324942
Any advice? (Note: I tried on a cache disk and on a capacity disk, one at a time, and also on a different host running the same ESXi version.)
Thanks
Hello andvm,
Does the node you are testing this on have more than one Disk-Group?
If so, can you test it on a node with only a single Disk-Group or does this invalidate the point of the test? (e.g. testing rebuild on the other Disk-Groups of that node when one is failed)
If that is not an option, there may be an alternative: that script *basically* just calls vsish commands. However, it is my understanding that recent versions have amended this, so it may no longer be that simple (e.g. the script has to do some other things as well due to changes/feature additions). Can you try injecting these directly?
Find the runtime path of the disk you want to impact with esxcfg-mpath -L, then use it to populate the vmhbaX:CX:TX:LX portion of the below:
# vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1
# vsish -e set /storage/scsifw/paths/vmhbaX:CX:TX:LX/injectError 0x03110300000002
One more caveat - I think there *may* have been some changes to the error codes that get injected (e.g. 0x03110300000002 here), such that different ones are used depending on certain factors (e.g. configuration and/or features enabled), so I can't promise the above will work and/or be 100% reflective of how this would behave in the wild for your specific configuration.
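For anyone who wants to avoid typos when filling in the vmhbaX:CX:TX:LX portion, here is a minimal Python sketch that only builds the two command strings and does not run anything. The helper names (find_runtime_paths, build_inject_commands) are made up for illustration. It assumes that each line of esxcfg-mpath -L output is whitespace-separated and contains both the runtime path name and the device identifier, and it reuses the error value quoted above, which may not be the right one for every version or configuration.

```python
import re

# Runtime path names look like vmhba1:C0:T2:L0 (adapter, channel, target, LUN).
PATH_RE = re.compile(r"^vmhba\d+:C\d+:T\d+:L\d+$")

# Error value quoted in this thread; assumption: it may differ per version/config.
DEFAULT_ERROR = "0x03110300000002"


def find_runtime_paths(mpath_output, device_id):
    """Return runtime path names from lines that mention device_id.

    Assumption: every line of `esxcfg-mpath -L` output is a row of
    whitespace-separated fields that includes the runtime path name
    and the device identifier (e.g. naa.xxxx).
    """
    paths = []
    for line in mpath_output.splitlines():
        tokens = line.split()
        if device_id in tokens:
            paths.extend(t for t in tokens if PATH_RE.match(t))
    return paths


def build_inject_commands(runtime_path, error=DEFAULT_ERROR):
    """Build (but do not execute) the two vsish commands for one path."""
    if not PATH_RE.match(runtime_path):
        raise ValueError("not a runtime path name: %r" % runtime_path)
    return [
        "vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1",
        "vsish -e set /storage/scsifw/paths/%s/injectError %s"
        % (runtime_path, error),
    ]


if __name__ == "__main__":
    # Hypothetical path, for illustration only; use the one from your host.
    for cmd in build_inject_commands("vmhba1:C0:T2:L0"):
        print(cmd)
```

The commands are printed rather than executed so they can be reviewed before being pasted into the ESXi shell of a test host.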
More information on the above and the other vsish calls here:
Disk Failures | vSAN 6.7 U 3 Proof of Concept Guide | VMware
Bob
Hi TheBobkin
I removed the 2nd disk group
Injected an error into a capacity disk - it was marked as Permanent Device failure - VM components on the specific disk marked as Absent - Resync started after a few mins
Injected an error into a cache disk - it was marked as Permanent Device failure - VM components on the disk group marked as Absent - Resync started after a few mins
I think this is enough to prove the concept.
Thanks