
Aleksander Bukowinski's Blog

July 2020

I came across a problem where vSAN reported errors on one of the virtual disks (VMDKs) attached to the vSAN witness appliance (the cache disk).

The vSAN health check reported an “Operational Health Alarm”, and all of the VM objects showed “Reduced availability with no rebuild”.

Strangely enough, the underlying storage was not reporting any errors or problems, and other VMs on the same datastore were all fine.

As there was no time for a proper investigation, the decision was made to restart the witness appliance VM.

When the witness came back online everything went back to normal and the errors were gone.

One thing I noticed was that the witness appliance had accidentally been included in a snapshot-based (quiesced) backup policy, and a few hours before the incident a backup job had started. It crossed my mind that the problem might have something to do with quiesced snapshots.

 

I tried to reproduce this problem in my lab and managed to recreate the same issue.

I generated some I/O in my stretched vSAN cluster and triggered some resync operations by changing storage policy settings.

At the same time I started to create quiesced snapshots on the witness appliance VM. After a while I noticed the same error in my lab.

 

The following alarm appeared on the witness nested ESXi:

 

The vSAN health reported “Operational health error” – permanent disk failure:

 

And “vSAN object health” showed “Reduced availability with no rebuild”:

 

The witness was inoperable at this stage, and since it is a vital component of a stretched vSAN cluster, the whole environment was affected. Of course, existing VMs kept running, as vSAN is a robust solution and can handle such situations (quorum was still maintained). However, without “Force Provisioning” in the storage policy, no new objects (VMs, snapshots, etc.) could be created.

 

Further investigation of the logs (vmkernel.log and vmkwarning.log) on the witness appliance revealed problems accessing the affected disk (vmhba1:C0:T0:L0):
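When hunting for such entries, a quick filter over the log can help. Below is a minimal sketch; the sample log lines are invented for illustration, and the exact wording in a real vmkernel.log will differ:

```python
import re

# Device path from the vSphere UI for the affected cache disk.
DEVICE = "vmhba1:C0:T0:L0"

def find_device_errors(log_lines, device=DEVICE):
    """Return log lines that mention the device together with an
    error-looking keyword (failed/error/offline)."""
    pattern = re.compile(r"(failed|error|offline)", re.IGNORECASE)
    return [line for line in log_lines
            if device in line and pattern.search(line)]

# Made-up sample lines for demonstration only.
sample = [
    "2020-07-20T10:01:02Z cpu3: ScsiDeviceIO: Cmd to dev vmhba1:C0:T0:L0 failed H:0x5",
    "2020-07-20T10:01:03Z cpu1: NMP: device vmhba0:C0:T1:L0 is healthy",
    "2020-07-20T10:01:05Z cpu2: WARNING: LVM: device vmhba1:C0:T0:L0 error detected",
]
```

In practice the same filtering can of course be done with grep on the appliance itself; the sketch just shows the idea.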

 

That proved the problem was indeed related to the virtual disk and caused by the snapshot.

I tried to fix it by rescanning the storage adapter but to no avail, so I decided to reboot the appliance.

 

Once the appliance was online again, the “Operational health error” disappeared.

However, there were still 7 objects with “Reduced availability with no rebuild”.

 

After examining these objects, it turned out that the witness component was missing. Fortunately, it was quite easy to fix by using the “Repair Object Immediately” option in vSAN Health.

 

It looks like taking snapshots of the vSAN witness appliance not only makes no sense (I can’t think of a reason to do it) but can also cause problems in the environment.

 

There is a configuration parameter that can prevent such accidents from happening: “snapshot.maxSnapshots”.

If it is set to “0” at the VM level, it effectively disables snapshots for that VM, so I would strongly advise setting it on the vSAN witness appliance.
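As a sketch, the setting could also be prepared programmatically. The snippet below only builds the key/value entry and shows how it would slot into a VM's extraConfig list; actually pushing it to a VM would go through pyVmomi's ReconfigVM_Task (or PowerCLI's New-AdvancedSetting), which is not shown here:

```python
def snapshot_disable_option():
    """Return the VM advanced setting that disables snapshot creation."""
    return {"key": "snapshot.maxSnapshots", "value": "0"}

def apply_to_extra_config(extra_config):
    """Return a copy of a VM's extraConfig entries (modeled here as plain
    dicts) with the snapshot limit set to 0, replacing any existing
    snapshot.maxSnapshots entry."""
    opt = snapshot_disable_option()
    updated = [e for e in extra_config if e["key"] != opt["key"]]
    updated.append(opt)
    return updated
```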

Not long ago I had to prepare a procedure for full site maintenance in an environment based on a vSAN 6.7 stretched cluster.

This activity was caused by a planned power maintenance that would affect all ESXi servers in one site.

 

Of course, the assumption is that there are enough resources on the other site to host all VMs. This is a requirement for stretched cluster anyway.

The L3 network topology remains unchanged as the routers are not affected by the power maintenance.

As the majority of the VMs use FTT=1 (one copy per site + a witness in the third site), there are four potential scenarios for accomplishing this task.

 

NOTE: The site that is going to be powered off is called Site-A (preferred), the one that will stay up is called Site-B (non-preferred), and Site-C is where the vSAN witness resides.

 

Option A – Dual site mirroring

Change the storage policy for all VMs to enable “Dual site mirroring (stretched cluster)” and set “Failures to tolerate” to 1, which will provide additional site-local protection for the VMs.

Each VM will have 4 copies in total (2 in Site-A and 2 in Site-B, plus a witness in Site-C).

  • This scenario provides higher availability, lower risk, and protection even in the case of a hardware failure during maintenance. On the flip side, it requires additional storage space (2x in both sites) and may take a significant amount of time, as new copies have to be created.
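A rough back-of-the-envelope estimate of the space required can be sketched as follows. This is a simplification: it assumes RAID-1 mirroring for the local protection level and ignores witness components and slack-space overhead, so treat it as a lower bound, not a sizing tool:

```python
def raw_capacity_needed(vm_size_gb, site_mirroring=True, local_ftt=0):
    """Rough raw-capacity estimate for a VM under a vSAN
    stretched-cluster storage policy (RAID-1 assumed throughout)."""
    copies_per_site = local_ftt + 1          # RAID-1: FTT=1 -> 2 copies
    sites = 2 if site_mirroring else 1       # dual site mirroring doubles it
    return vm_size_gb * copies_per_site * sites
```

For a 100 GB VM, option A (site mirroring plus local FTT=1) needs roughly 400 GB of raw capacity across both sites, versus about 200 GB with site mirroring alone.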

 

Option B – No change

Do not change the storage policies for any VMs.

  • This option does not require extra space and is fast (no new copies), but it introduces higher risk, as there will be only one copy of the data available during the planned maintenance. Any hardware failure might result in loss of storage access for the VMs.

 

 

Option C - Hybrid

Change the storage policy only for selected VMs. These VMs will benefit from “Dual site mirroring (stretched cluster)” with “Failures to tolerate” set to 1. Other, less important VMs keep their policies unchanged, just like in option B.

  • This is a hybrid scenario that combines benefits and drawbacks of the two other options A and B.

 

 

Option D – Affinity

Change the storage policy and set the site affinity to Site-B: for all or some selected VMs, apply a policy with “None – keep data on Preferred (stretched cluster)”.

Because this operation will be done after Site-B is set as the preferred site, it will migrate data from Site-A to Site-B.

  • In this scenario all copies will be stored only in Site-B. Enough space in Site-B will be required, and migrating the data might take some time. There is also a potential risk: if the entire Site-B goes down after the migration, all copies will become inaccessible.

 

Site-A shut-down procedure

Regardless of the option selected the procedure looks as follows:

 

  1. Check the vSAN health and verify that everything is OK and that there are no ongoing resync operations.
  2. Verify that the VMs are compliant with their storage policies.
  3. Make sure that the vSAN witness will not be affected by the planned maintenance.
  4. As the site that is going to be shut down is currently “preferred” in vSAN, set Site-B as the preferred site.
  5. Switch DRS from “fully automated” to “partially automated”.
  6. Only for scenarios A, C and D: change the storage policy for the VMs and wait until the data migration/sync process is over.
  7. vMotion all VMs from Site-A to Site-B.
  8. Place the ESXi hosts in Site-A into maintenance mode using “Ensure accessibility”. Do not put more than one ESXi host into maintenance mode at a time.
  9. Switch DRS back to “fully automated”.
  10. Power off the ESXi hosts in Site-A.
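The maintenance-mode and power-off steps above can be sketched as a simple planner that encodes the “one host at a time” rule. The host names are placeholders, and actually executing the actions would go through PowerCLI or pyVmomi, which is not shown here:

```python
def plan_site_shutdown(site_a_hosts):
    """Illustrative planner for the host-related shutdown steps: every
    host enters maintenance mode ("Ensure accessibility") sequentially,
    and hosts are powered off only after all of them are in maintenance
    mode. Returns the ordered list of (action, host) tuples."""
    actions = []
    for host in site_a_hosts:
        # Sequential on purpose: wait for each host to finish entering
        # maintenance mode before starting the next one.
        actions.append(("enter_maintenance_ensure_accessibility", host))
    for host in site_a_hosts:
        actions.append(("power_off", host))
    return actions
```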

 

 

Site-A bring-up procedure

After the planned maintenance is over the following steps should be taken:

 

  1. Power on the ESXi hosts in Site-A and wait until they have reconnected to vCenter.
  2. Verify that the vSAN health is OK.
  3. Exit maintenance mode on the ESXi hosts – this should trigger migration of the VMs based on their VM/Host rules. Otherwise, migrate the appropriate VMs to Site-A manually.
  4. Only for scenarios A, C and D: change the VMs' storage policies back to the original settings and wait until the data sync process is over.
  5. Make Site-A the “preferred” site in vSAN again.

Recently I encountered a problem when installing NSX-T 2.5 (and 3.0) on ESXi 6.7u3.

The initial configuration failed with the following errors (NSX-T 2.5):  “Failed to install software on host. Create user [nsxuser] failed on …” or (NSX-T 3.0): “Failed to install software on host. Unable to add user on host…”

 

NSX-T 2.5.1:

 

NSX-T 3.0:

 

After some troubleshooting, it turned out that the problem was caused by the ESXi password and account lockout policy, which had been changed.

During the initial configuration, NSX-T creates a user (nsxuser) on the ESXi hosts. If the password policy is too restrictive, the password generated by NSX-T is not compliant and user creation fails. This results in the installation failure.

 

The quick solution to the problem is to temporarily change the password and lockout policy on ESXi hosts for the NSX-T installation.

This can be done by modifying the “Security.PasswordQualityControl” advanced parameter on the ESXi hosts.

After changing this parameter to the default value “retry=3 min=disabled,disabled,disabled,7,7” and using the “RESOLVE” button in NSX-T, the installation succeeded.
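For reference, this parameter string follows pam_passwdqc syntax: retry is the number of attempts, and min lists the minimum password lengths for passwords built from one to four character classes (plus passphrases), with “disabled” forbidding that kind of password. A small illustrative parser, handling only the retry= and min= fields, makes the structure explicit:

```python
def parse_password_quality(setting):
    """Parse an ESXi Security.PasswordQualityControl string (pam_passwdqc
    style) into its retry count and per-class-count minimum lengths.
    Illustrative sketch only: fields other than retry= and min= are
    ignored, and "disabled" is mapped to None."""
    result = {"retry": None, "min": []}
    for field in setting.split():
        key, _, value = field.partition("=")
        if key == "retry":
            result["retry"] = int(value)
        elif key == "min":
            result["min"] = [None if v == "disabled" else int(v)
                             for v in value.split(",")]
    return result
```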

 

Once NSX-T is installed on all ESXi hosts, the password policy can be changed back to its previous state.

 

More information regarding setting the password policy can be found here:

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.esxi.upgrade.doc/GUID-DC96FFDB-F5F2-43EC-8C73-05ACDAE6BE43.html