
vSphere Replication can be used to protect the vCenter Server by replicating it to another location. It is not the best disaster recovery solution, but it works. We can create point-in-time copies (PIT), select an RPO, enable compression, etc. But what if vCenter actually breaks down and we want to recover it? And even worse, what if we have to recover it to a previous point-in-time copy (because the latest replicated state is corrupted)? The problem is that the vSphere Replication UI is embedded in the vCenter UI, so no vCenter UI means no vSphere Replication UI.

Well, there is a way to do it and I'm going to explain how it can be done.

 

First we need to know the datastore where vCenter was replicated:

 

And we need to know which point-in-time copy (a snapshot, in reality) we want to recover to. In this example we want to go back to the Oct 30, 2019 11:04 PIT.

As vCenter is no longer available, we have to log in to the ESXi host as root (via SSH) and change to the directory where the replicated data is stored. The idea is to recreate the vCenter VM and point it to the files that were created for the desired PIT. First we need to get the instance ID and the names of the .vmx and .nvram files (from the PIT) and use them to create the standard .vmx and .nvram files. All this information is stored in the .txt file in the same directory.
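A minimal sketch of these steps on the ESXi shell could look like the one below. The datastore name, directory and file names are only examples; the real names have to be taken from the .txt file mentioned above.

  # change to the directory on the target datastore where the replica is stored
  cd /vmfs/volumes/Datastore01/vc-5-01
  # the .txt file in this directory holds the instance id and the PIT file names
  cat *.txt
  # copy the PIT .vmx and .nvram files to the standard names (source names are illustrative)
  cp hbrcfg.GID-example.1234.vmx vc-5-01.vmx
  cp hbrcfg.GID-example.1234.nvram vc-5-01.nvram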

 

Note: The timestamps are in UTC, so in this case there is a -1 hour offset.

After that we have to do the same for the vCenter disks (vmdk). Just like on the screenshot below, based on the disk ID (disk.0) we replace vc-5-02_2.vmdk with the PIT replica disk (hbrdisk) in the vc-5-01.vmx file. This step is not required if we want to recover to the latest state of the vCenter (because the .vmx file already points to the correct vmdk disks).
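For illustration only, a hedged sketch of the same substitution done with sed on the ESXi shell; both file names below are placeholders, the real ones have to be taken from the existing .vmx file and the PIT information.

  # point the disk.0 entry at the PIT replica disk instead of the current vmdk
  sed -i 's/vc-5-02_2.vmdk/hbrdisk.RDID-example-0001.vmdk/' vc-5-01.vmx
  # verify the disk entries afterwards
  grep -i fileName vc-5-01.vmx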

We need to repeat that operation for all the other disks. Once this is done the .vmx file should look like this:

At this stage the .vmx file is reconfigured and we can register our replicated vCenter as a new VM. This can be done either with the vim-cmd command or via the ESXi UI.
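On the ESXi shell the registration could look roughly like this (the path and display name are examples):

  # register the reconfigured .vmx as a new VM and note the VM id that is returned
  vim-cmd solo/registervm /vmfs/volumes/Datastore01/vc-5-01/vc-5-01.vmx vc-5-01
  # confirm that the VM is now visible on the host
  vim-cmd vmsvc/getallvms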

The last step is to edit the VM settings and connect it to the correct port group. As the original vCenter is down, we can only connect the recovered vCenter VM to a standard vSwitch (vSS) port group or to an ephemeral port group on a distributed vSwitch (vDS). This is one of the reasons why we should always create this kind of port group on our management vDS.

 

With this step done the VM can be powered up and after a while the vCenter will be up and running again.

Note: It is important to answer "I moved it" if asked during the power-on operation, unless you want to get new uuid.bios and uuid.location values and a new MAC address.
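If the recovery is done from the command line rather than the ESXi UI, the power-on and the pending question can be handled roughly as follows; the VM id, question id and answer number are placeholders that have to be taken from the actual output.

  # power on the registered vCenter VM (42 is an example VM id from vmsvc/getallvms)
  vim-cmd vmsvc/power.on 42
  # list the pending question and note its id and the number of the "I moved it" answer
  vim-cmd vmsvc/message 42
  # answer the question with the choice that corresponds to "I moved it"
  vim-cmd vmsvc/message 42 _vmx1 2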

 

The last thing to do is to clean up and remove the remaining snapshots from the vCenter VM (consolidate).
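With vCenter up again this can be done from the vSphere UI, or directly on the host roughly like this (the VM id is again an example):

  # delete all snapshots of the VM and consolidate its disks
  vim-cmd vmsvc/snapshot.removeall 42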

 

I have created a short video that shows the whole process. It can be found on my LinkedIn profile (Aleksander Bukowinski).

I came across an issue where vSAN reported problems with one of the virtual disks (vmdk) connected to the vSAN witness appliance (the cache disk).

The vSAN health reported an “Operational Health Alarm” and all of the VM objects showed “Reduced availability with no rebuild”.

Strangely enough, the underlying storage was not reporting any errors or problems, and other VMs on the same datastore were all fine.

As there was no time to do proper investigation, the decision was made to restart the witness appliance VM.

When the witness came back online everything went back to normal and the errors were gone.

One thing that I noticed was that, by accident, the witness appliance had been included in a snapshot-based (quiesced) backup policy, and a few hours before the incident the backup job had started. It crossed my mind that this problem might have something to do with quiesced snapshots.

 

I tried to reproduce this problem in my lab and I managed to recreate the same issue.

I generated some IO in my stretched vSAN cluster and executed some resync operations by changing storage policy settings.

At the same time I started to create quiesced snapshots on the witness appliance VM. After a while I noticed the same error in my lab.
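For reference, the quiesced snapshots in my test were ordinary VM snapshots taken with the quiesce flag set. On an ESXi host this can be done roughly like this (the VM id and names are examples):

  # create a quiesced snapshot of the witness appliance VM
  # arguments: vmid, snapshot name, description, includeMemory (0/1), quiesced (0/1)
  vim-cmd vmsvc/snapshot.create 15 backup-test "quiesced snapshot test" 0 1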

 

The following alarm appeared on the witness nested ESXi:

 

The vSAN health reported “Operational health error” – permanent disk failure:

 

And “vSAN object health” showed “Reduced availability with no rebuild”:

 

The witness was inoperable at this stage, and as it is a vital component of a stretched vSAN cluster, the whole environment was affected. Of course, existing VMs kept running, as vSAN is a robust solution and can handle such situations (there was still quorum). However, without "Force Provisioning" in the storage policy, no new objects (VMs, snapshots, etc.) could be created.

 

Further investigation of the logs (vmkernel.log and vmkwarning.log) on the witness appliance revealed problems with access to the affected disk (vmhba1:C0:T0:L0).

 

That proved the problem was indeed virtual disk related and caused by the snapshot.

I tried to fix it by rescanning the storage adapter but to no avail, so I decided to reboot the appliance.
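A rescan like this can be triggered from the appliance's ESXi shell (the adapter name comes from the log entries above):

  # rescan the affected storage adapter
  esxcli storage core adapter rescan --adapter vmhba1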

 

Once the appliance was on-line again the “Operational health error” disappeared.

However, there were still 7 objects with “Reduced availability with no rebuild”.

 

After examining these objects, it turned out that the witness component was missing. Fortunately, it was quite easy to fix by using the “Repair Object Immediately” option in vSAN Health.

 

It looks like taking snapshots of the vSAN witness appliance not only makes no sense (I can't think of a reason to do it) but can also cause problems in the environment.

 

There is a configuration parameter that can prevent such accidents from happening - “snapshot.maxSnapshots”.

If it is set to “0” at the VM level it will effectively disable snapshots for that VM, so I would strongly advise setting it on the vSAN witness appliance.
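A minimal sketch of adding it directly to the appliance's configuration file from the ESXi shell, assuming the VM is powered off (the datastore path, directory and VM id are examples; the parameter can also be added as an advanced setting in the VM options):

  # append the parameter to the witness appliance's .vmx file
  echo 'snapshot.maxSnapshots = "0"' >> /vmfs/volumes/Datastore01/vsan-witness/vsan-witness.vmx
  # reload the VM configuration so the host picks up the change
  vim-cmd vmsvc/reload 15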

Not long ago I had to prepare a procedure for a full site maintenance in an environment based on a vSAN 6.7 stretched cluster.

This activity was caused by a planned power maintenance that would affect all ESXi servers in one site.

 

Of course, the assumption is that there are enough resources on the other site to host all VMs. This is a requirement for a stretched cluster anyway.

The L3 network topology remains unchanged as the routers are not affected by the power maintenance.

As the majority of the VMs use FTT=1 (one copy per site + a witness in a third site), there are four potential scenarios to accomplish this task.

 

NOTE: The site which is going to be powered off is called Site-A (preferred), the one that will stay up is called Site-B (non-preferred), and Site-C is where the vSAN witness is located.

 

Option A – Dual site mirroring

Change the storage policy for all VMs to enable “Dual site mirroring (stretched cluster)” and set “Failures to tolerate” to 1, which will provide additional site-local protection for the VMs.

Each VM will have 4 copies in total (2 in Site-A and 2 in Site-B, plus the witness in Site-C).

  • This scenario provides higher availability, lower risk, and protection even in case of a hardware failure during the maintenance. However, on the flip side, it requires additional storage space (x2 in both sites) and may take a significant amount of time (new copies have to be created).

 

Option B – No change

Do not change the storage policies for any VMs.

  • This option does not require extra space and is fast (no new copies), but it introduces higher risk as there will be only one copy of the data available during the planned maintenance. Any hardware failure might result in loss of storage access for the VMs.

 

 

Option C - Hybrid

Change the storage policy only for selected VMs. These VMs will benefit from “Dual site mirroring (stretched cluster)” and “Failures to tolerate” set to 1. Other, less important VMs will keep their policies unchanged, just like in option B.

  • This is a hybrid scenario that combines benefits and drawbacks of the two other options A and B.

 

 

Option D – Affinity

Change the storage policy to pin the data to Site-B: for all or selected VMs set the policy to “None – keep data on Preferred (stretched cluster)”.

Because this operation will be done after Site-B has been set as the preferred site, it will migrate the data from Site-A to Site-B.

  • In this scenario all copies will be stored only in Site-B. Enough space in Site-B will be required and the process of migrating data might take some time. There is also a potential risk involved: if after the migration the entire Site-B goes down, all copies will become inaccessible.

 

Site-A shut-down procedure

Regardless of the option selected the procedure looks as follows:

 

  1. Check the vSAN health and verify that everything is OK and that there are no on-going sync operations.
  2. Verify that the VMs are compliant with their storage policies.
  3. Make sure that the vSAN witness will not be affected by the planned maintenance.
  4. As the site that is going to be shut down is “preferred” in vSAN, set Site-B as preferred. After that operation Site-B becomes the “preferred” site.
  5. Switch DRS from “fully automated” to “partially automated”.
  6. Only for scenarios A, C and D: change the storage policy for the VMs and wait until the data migration/sync process is over.
  7. vMotion all VMs from Site-A to Site-B.
  8. Place the ESXi hosts in Site-A into maintenance mode using “Ensure accessibility” (see the sketch after this list). Do not put more than one ESXi host into maintenance mode at a time.
  9. Switch DRS back to “fully automated”.
  10. Power off the ESXi hosts in Site-A.
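For step 8, a hedged command-line equivalent of placing a host into maintenance mode with the “Ensure accessibility” vSAN option (normally this is done from the vSphere UI):

  # put the host into maintenance mode while keeping vSAN objects accessible
  esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility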

 

 

Site-A bring-up procedure

After the planned maintenance is over the following steps should be taken:

 

  1. Power on the ESXi hosts in Site-A and wait until they are reconnected to the vCenter.
  2. Verify that the vSAN health is OK.
  3. Exit the maintenance mode on the ESXi hosts (see the sketch after this list) – this should trigger migration of the VMs based on their VM/Host rules. Otherwise, migrate the appropriate VMs manually to Site-A.
  4. Only for scenarios A, C and D: change the VMs' storage policies back to the original settings and wait until the data sync process is over.
  5. Make Site-A the “preferred” site in vSAN again.
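For step 3, the command-line equivalent of exiting maintenance mode, if it is not done from the UI:

  # take the host out of maintenance mode once the site is back on-line
  esxcli system maintenanceMode set --enable false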

Recently I encountered a problem when installing NSX-T 2.5 (and 3.0) on ESXi 6.7u3.

The initial configuration failed with the following errors (NSX-T 2.5):  “Failed to install software on host. Create user [nsxuser] failed on …” or (NSX-T 3.0): “Failed to install software on host. Unable to add user on host…”

 

NSX-T 2.5.1:

 

NSX-T 3.0:

 

After some troubleshooting it turned out that the problem was caused by the ESXi password and account lockout policy, which had been changed.

During the initial configuration NSX-T creates a user (nsxuser) on the ESXi hosts. If the password policy is too restrictive, the NSX-T-generated password is not compliant and the user creation fails. This results in the installation failure.

 

The quick solution to the problem is to temporarily change the password and lockout policy on ESXi hosts for the NSX-T installation.

This can be done by modifying the “Security.PasswordQualityControl” advanced parameter on the ESXi hosts.

After changing this parameter back to the default value “retry=3 min=disabled,disabled,disabled,7,7” and using the “RESOLVE” button in NSX-T, the installation succeeded.
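The parameter can be changed in the vSphere UI under the host's advanced system settings, or per host from the ESXi shell, roughly like this:

  # show the current password policy
  esxcli system settings advanced list -o /Security/PasswordQualityControl
  # set it to the default value for the duration of the NSX-T installation
  esxcli system settings advanced set -o /Security/PasswordQualityControl -s "retry=3 min=disabled,disabled,disabled,7,7"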

 

Once NSX-T is installed on all ESXi hosts, the password policy can be changed back to its previous state.

 

More information regarding setting the password policy can be found here:

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.esxi.upgrade.doc/GUID-DC96FFDB-F5F2-43EC-8C73-05ACDAE6BE43.html