JohnNWCU
Contributor

Enabling HA causes NFS storage loss

Hello,

Short version: when I enabled HA on a cluster, all four hosts lost their NFS datastore. It became inaccessible, and every host had to be rebooted before it could see its storage again.

Long version: 4x Cisco M5 blades. Each host is fully up to date on ESXi 7.0 build 21313628. Each host is using nenic VIB v1.0.45, which is the latest Cisco-validated revision. When I enabled HA, everything appeared to be going as expected. Here are the tasks:

[Screenshot: vCenter task list showing the HA configuration tasks]

Once that was completed, VMs began going "inaccessible". I logged directly into the ESXi host UI and tried to browse the datastore, which also reported "not accessible".

Rebooting each host allowed it to see its storage again. This has happened before: after six weeks of troubleshooting, VMware said it was due to the nenic driver version (v1.0.35) and that the hosts needed to be updated to at least v1.0.42. That obviously wasn't the answer (we're now on the Cisco-validated v1.0.45), and I have another case open with them, with log bundles from before, during, and after the issue. Not sure what they're going to come up with this time, but my confidence is low.

Any help or insight would be greatly appreciated.

2 Replies
JohnNWCU
Contributor
Accepted Solution

I did figure out what the issue was, and the resolution. Our clusters had been set up with both datastores mapped to both clusters. After enabling HA on the first cluster, that cluster's HA master seems to have 'held' on to both datastores for heartbeating. So when I enabled HA on the second cluster, it's almost as if each cluster's HA master was fighting over the same datastores for heartbeating, causing the storage loss.

To combat this, I had to manually set the heartbeat datastore for each cluster, allowing only one datastore to be used per cluster. Once this was done, HA worked without issue.
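For anyone who wants to script the same change instead of clicking through the UI, here is a minimal pyVmomi sketch of the equivalent API call; the vCenter address, credentials, and the cluster/datastore names are placeholders for your own environment:

```python
# Minimal sketch: pin a vSphere HA cluster to one user-selected heartbeat
# datastore via the vSphere API (pyVmomi). All names and credentials below
# are placeholders -- substitute your own.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first inventory object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()

cluster = find_by_name(vim.ClusterComputeResource, "Cluster-A")
datastore = find_by_name(vim.Datastore, "NFS-Datastore-A")

# "userSelectedDs" tells HA to heartbeat only against the datastores in
# heartbeatDatastore, instead of auto-selecting candidates on its own.
das_config = vim.cluster.DasConfigInfo(
    hBDatastoreCandidatePolicy="userSelectedDs",
    heartbeatDatastore=[datastore])

spec = vim.cluster.ConfigSpecEx(dasConfig=das_config)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
Disconnect(si)
```

Repeat for the second cluster with its own datastore so the two clusters no longer share heartbeat datastores. Note that vSphere HA normally prefers two heartbeat datastores per cluster, so vCenter may raise a configuration warning when only one is pinned.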

depping
Leadership

Very strange; I've never seen (or heard about) this before. The HA folder (and the heartbeat files/region) on a datastore is typically named after the cluster ID, so this shouldn't be happening.
