VMware Cloud Community
andvm
Hot Shot

Inject permanent disk failure

Hi,

I'm doing POC testing. Running the command below only returns the message shown after it.

python vsanDiskFaultInjection.pyc -p -d naa.xxxxxxxxxxxx

When FRA is enabled, inject permanent errors only if there is a single DG.

I then waited 15 minutes and more, but nothing happened (no errors shown, even after refreshing the vSphere Client).

Running on VMware ESXi, 7.0.0, 16324942

Any advice? (Note: I tried this on a cache disk and on a capacity disk, one at a time, and also on a different host running the same ESXi version.)

Thanks

Accepted Solutions
TheBobkin
Champion

Hello andvm,

Does the node you are testing this on have more than one Disk-Group?

If so, can you test it on a node with only a single Disk-Group, or does this invalidate the point of the test? (e.g. testing rebuild on the other Disk-Groups of that node when one has failed)

If that is not an option, there may be an alternative. That script *basically* just calls vsish commands. However, my understanding is that recent versions have changed this, so it may no longer be that simple (e.g. the script has to do some other things as well). Can you try injecting the errors directly?

Find the path of the disk you want to affect with esxcfg-mpath -L, then use that information to populate the vmhbaX:CX:TX:LX portion of the commands below:

# vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1

# vsish -e set /storage/scsifw/paths/vmhbaX:CX:TX:LX/injectError 0x03110300000002
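
To fill in the vmhbaX:CX:TX:LX part, something like the following helper may save some scrolling. It is only a sketch: `path_for_device` is a made-up name, and it assumes that each line of `esxcfg-mpath -L` output starts with the runtime path name and lists the device identifier (e.g. naa.xxxxxxxxxxxx) as one of the later fields. Check this against the actual output on your host before relying on it.

```shell
# Hypothetical helper: print the runtime path name(s) for one device.
# Reads `esxcfg-mpath -L` style lines on stdin.
# Assumption: field 1 is the runtime name (vmhbaX:CX:TX:LX) and the
# device identifier appears as a whole field later on the same line.
path_for_device() {
    dev="$1"
    awk -v dev="$dev" '{
        for (i = 2; i <= NF; i++) {
            if ($i == dev) { print $1; break }
        }
    }'
}
```

Usage would be along the lines of `esxcfg-mpath -L | path_for_device naa.xxxxxxxxxxxx`. A device with several paths prints one line per path.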

One more caveat: I think there *may* have been some changes to the error codes that get injected (e.g. 0x03110300000002 here), such that different ones are used depending on certain factors (e.g. configuration and/or enabled features). So I can't promise the above will work and/or be 100% reflective of how this would behave in the wild for your specific configuration.
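
Since the error code may differ per configuration, it can help to keep it as a parameter and review the exact commands before running anything. The sketch below only prints the two vsish commands from above and does not execute them. `print_inject_cmds` is a made-up name, the path check is a loose sanity check rather than real validation, and the default code is simply the one quoted in this post.

```shell
# Hypothetical dry-run helper: print (do not run) the two vsish commands.
# $1 = runtime path name, e.g. as reported by esxcfg-mpath -L
# $2 = error code to inject (optional; defaults to the value quoted above)
print_inject_cmds() {
    path="$1"
    code="${2:-0x03110300000002}"
    # Loose sanity check: rejects an unfilled vmhbaX:CX:TX:LX placeholder.
    case "$path" in
        vmhba[0-9]*:C[0-9]*:T[0-9]*:L[0-9]*) ;;
        *) echo "unexpected path name: $path" >&2; return 1 ;;
    esac
    echo "vsish -e set /reliability/vmkstress/ScsiPathInjectError 0x1"
    echo "vsish -e set /storage/scsifw/paths/$path/injectError $code"
}
```

Once the printed lines look right, they can be pasted into the ESXi shell by hand, which keeps the actual injection a deliberate step.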

More information on the above and the other vsish calls here:

Disk Failures | vSAN 6.7 U 3 Proof of Concept Guide | VMware

Bob

andvm
Hot Shot

Hi TheBobkin

I removed the 2nd disk group

Injected an error into a capacity disk - it was marked as Permanent Device failure - VM components on the specific disk marked as Absent - Resync started after a few mins

Injected an error into a cache disk - it was marked as Permanent Device failure - VM components on the disk group marked as Absent - Resync started after a few mins

I think this is enough to prove the concept.

Thanks