
Aleksander Bukowinski's Blog

July 28, 2020

Not long ago I had to write a procedure for full site maintenance in an environment based on a vSAN 6.7 stretched cluster.

This activity was caused by a planned power maintenance that would affect all ESXi servers in one site.

 

Of course, the assumption is that there are enough resources on the other site to host all VMs. This is a requirement for a stretched cluster anyway.

The L3 network topology remains unchanged as the routers are not affected by the power maintenance.

As the majority of the VMs use FTT=1 (one copy per site + witness in a third site), there are four potential scenarios to accomplish this task.

 

NOTE: The site that is going to be powered off is called Site-A (preferred), the one that will stay up is called Site-B (non-preferred), and Site-C is where the vSAN witness is located.

 

Option A – Dual site mirroring

Change the storage policy for all VMs to use “Dual site mirroring (stretched cluster)” with “Failures to tolerate” set to 1, which provides additional site-local protection for the VMs.

Each VM will have 4 copies in total (2 in Site-A and 2 in Site-B, plus the witness in Site-C).

  • This scenario provides higher availability, lower risk, and protection even in case of a hardware failure during maintenance. On the flip side, it requires additional storage space (x2 in both sites) and may take a significant amount of time, because new copies have to be created.

 

Option B – No change

Do not change the storage policies for any VMs.

  • This option does not require extra space and is fast (no new copies), but it introduces higher risk, as only one copy of the data will be available during the planned maintenance. Any hardware failure during that time might result in loss of storage access for the VMs.

 

 

Option C - Hybrid

Change the storage policy only for selected VMs. These VMs will benefit from “Dual site mirroring (stretched cluster)” with “Failures to tolerate” set to 1. Other, less important VMs will keep their policies unchanged, just like in option B.

  • This is a hybrid scenario that combines the benefits and drawbacks of options A and B.

 

 

Option D – Affinity

Change the storage policy and set the site affinity to Site-B. For all or some selected VMs, set the policy to “None – keep data on Preferred (stretched cluster)”.

Because this operation will be done after Site-B is set as preferred, it will migrate data from Site-A to Site-B.

  • In this scenario all copies will be stored only in Site-B. Enough space in Site-B will be required, and the process of migrating data might take some time. There is a potential risk involved: if the entire Site-B goes down after the migration, all copies will become inaccessible.
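
To compare the space impact of the options, a rough back-of-the-envelope estimate can help. The sketch below is only an illustration: the function and class names are made up, it assumes RAID-1 mirroring for the site-local protection in option A and a single copy in Site-B for option D, and it ignores witness components and any other overhead. Option C is modelled simply as a mix of A and B on a per-VM basis.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Footprint:
    """Raw capacity (GB) consumed in each data site."""
    site_a_gb: float
    site_b_gb: float


def footprint(option: str, vm_gb: float) -> Footprint:
    """Estimate raw capacity used by one VM under a given option.

    Assumptions (not from vSAN itself): RAID-1 for local protection,
    no witness components, no metadata or slack-space overhead.
    """
    if option == "A":  # dual site mirroring + local FTT=1: 2 copies per site
        return Footprint(2 * vm_gb, 2 * vm_gb)
    if option == "B":  # policy unchanged: 1 copy per site
        return Footprint(vm_gb, vm_gb)
    if option == "D":  # site affinity: data kept only in Site-B (1 copy assumed)
        return Footprint(0.0, vm_gb)
    raise ValueError(f"unknown option: {option!r} (option C is a per-VM mix of A and B)")


def total(vms: Iterable[Tuple[float, str]]) -> Footprint:
    """Sum the footprint of (size_gb, option) pairs, e.g. for the hybrid option C."""
    parts = [footprint(option, size_gb) for size_gb, option in vms]
    return Footprint(sum(p.site_a_gb for p in parts), sum(p.site_b_gb for p in parts))
```

For example, `total([(100, "A"), (50, "B")])` models option C with one protected 100 GB VM and one unchanged 50 GB VM.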

 

Site-A shut-down procedure

Regardless of the option selected the procedure looks as follows:

 

  1. Check the vSAN health and verify that everything is ok, and there are no on-going sync operations.
  2. Verify that VMs are compliant with their storage policies.
  3. Make sure that vSAN Witness will not be affected by the planned maintenance.
  4. As the site that is going to be shut down is “preferred” in vSAN, set Site-B as the preferred site.
  5. Only for scenarios A, C and D: Change the storage policy for the VMs and wait until the data migration/sync process is over.
  6. Switch DRS from “fully automated” to “partially automated”.
  7. vMotion all VMs from Site-A to Site-B.
  8. Place the ESXi hosts in Site-A into maintenance mode using “ensure accessibility”. Do not put more than one ESXi host into maintenance mode at a time.
  9. Switch DRS back to “fully automated”.
  10. Power off the ESXi hosts in Site-A.
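
The shut-down procedure can also be written down as a small runbook generator, which makes it explicit that the storage policy step only applies to options A, C and D. This is just a sketch: the function name is hypothetical and the step texts are paraphrased from the list above. It does not talk to vCenter at all.

```python
from typing import List


def shutdown_steps(option: str) -> List[str]:
    """Return the ordered Site-A shut-down steps for option A, B, C or D."""
    if option not in {"A", "B", "C", "D"}:
        raise ValueError(f"unknown option: {option!r}")

    steps = [
        "Check vSAN health and confirm there are no ongoing sync operations",
        "Verify that VMs are compliant with their storage policies",
        "Confirm the vSAN witness is not affected by the maintenance",
        "Set Site-B as the preferred site",
    ]
    # Option B leaves the storage policies untouched, so there is nothing to resync.
    if option in {"A", "C", "D"}:
        steps.append("Change VM storage policies and wait for the sync to finish")
    steps += [
        "Switch DRS from fully automated to partially automated",
        "vMotion all VMs from Site-A to Site-B",
        "Put Site-A hosts into maintenance mode (ensure accessibility), one at a time",
        "Switch DRS back to fully automated",
        "Power off the ESXi hosts in Site-A",
    ]
    return steps
```

Printing the result with `enumerate(shutdown_steps("B"), start=1)` gives a numbered checklist for the chosen option.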

 

 

Site-A bring-up procedure

After the planned maintenance is over the following steps should be taken:

 

  1. Power on the ESXi hosts in Site-A, wait until they are reconnected to the vCenter.
  2. Verify that the vSAN health is ok.
  3. Exit maintenance mode on the ESXi hosts. This should trigger migration of the VMs based on their VM/Host rules. Otherwise, migrate the appropriate VMs manually to Site-A.
  4. Only for scenarios A, C and D: Change the VM storage policies back to the original settings and wait until the data sync process is over.
  5. Make Site-A the "preferred" site in vSAN.

I recently encountered a problem when installing NSX-T 2.5 (and 3.0) on ESXi 6.7 U3.

The initial configuration failed with the following errors (NSX-T 2.5):  “Failed to install software on host. Create user [nsxuser] failed on …” or (NSX-T 3.0): “Failed to install software on host. Unable to add user on host…”

 

[Screenshot: error message in NSX-T 2.5.1]

[Screenshot: error message in NSX-T 3.0]

 

After some troubleshooting it turned out that the problem was caused by the ESXi password and account lockout policy, which had been changed from the default.

During initial configuration NSX-T creates a user (nsxuser) on the ESXi hosts. If the password policy is too restrictive, the password generated by NSX-T is not compliant and user creation fails. This results in the installation failure.

 

The quick solution to the problem is to temporarily change the password and lockout policy on ESXi hosts for the NSX-T installation.

This can be done by modifying the “Security.PasswordQualityControl” advanced parameter on the ESXi hosts.

After I changed this parameter to the default value “retry=3 min=disabled,disabled,disabled,7,7” and used the “RESOLVE” button in NSX-T, the installation succeeded.

 

Once NSX-T has been installed on all ESXi hosts, the password policy can be changed back to its previous state.
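
To get a feeling for what a given “Security.PasswordQualityControl” value allows, the `min=` part can be read as five fields: the minimum length for passwords with one character class, two classes, passphrases, three classes and four classes, where “disabled” means that kind of password is not allowed. The sketch below is a simplified illustration with made-up helper names. It ignores passphrases and the rule that an uppercase first character or a trailing digit does not count towards the character classes, so it is not a replacement for testing on a real host.

```python
import string
from typing import List, Optional

DEFAULT_POLICY = "retry=3 min=disabled,disabled,disabled,7,7"


def parse_min(policy: str) -> List[Optional[int]]:
    """Return the five min= fields as ints, with None where the field is 'disabled'."""
    for token in policy.split():
        if token.startswith("min="):
            fields = token[len("min="):].split(",")
            if len(fields) != 5:
                raise ValueError("min= must have exactly five fields")
            return [None if f == "disabled" else int(f) for f in fields]
    raise ValueError("no min= setting found in policy")


def char_classes(password: str) -> int:
    """Count character classes used: lowercase, uppercase, digits, punctuation."""
    checks = (str.islower, str.isupper, str.isdigit, lambda c: c in string.punctuation)
    return sum(any(check(c) for c in password) for check in checks)


def is_allowed(password: str, policy: str) -> bool:
    """Simplified check of password length against the class-based minimum."""
    n0, n1, _passphrase, n3, n4 = parse_min(policy)
    required = {1: n0, 2: n1, 3: n3, 4: n4}.get(char_classes(password))
    return required is not None and len(password) >= required
```

With the default policy, a 7-character password using three classes passes this check, while with a hypothetical stricter value such as “retry=3 min=disabled,disabled,disabled,disabled,15” an 8-character password using four classes does not. That is the kind of mismatch that can make a generated password fail.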

 

More information regarding setting the password policy can be found here:

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.esxi.upgrade.doc/GUID-DC96FFDB-F5F2-43EC-8C73-05ACDAE6BE43.html