VMware Cloud Community
ToniCapablo
Enthusiast

Resync components in vSAN takes a long time!

I have a vCSA + 3 ESXi hosts in the same cluster, and a few days ago I checked the vSAN status and I see that many components are taking a very long time to successfully resync.

I tried the action in the vSphere Web Client to repair objects immediately, but it did not work.

I have seen many posts in this forum mentioning commands like vsan.resync... Do these commands affect the normal functioning of the VMs while they are running?

Is there a way to check how to repair these objects manually? On the other hand, the VMs, ESXi hosts and vCSA work OK, but when I try to clone a VM while its components are resyncing, it fails (I assume because the resync is in progress at that moment).

Thanks in advance. Regards!

TheBobkin
Champion

Hello Toni,

"I see that many components takes very long time to succesfully resync"

If you could be more specific it may help narrow this down:

- Are the same Objects resyncing as a few days ago, or different Objects?

- Is the resync 'looping' for some Objects (e.g. it starts at 200GB to resync, gets down to 50GB then goes back to 200GB)?

- Are the Objects in question Inaccessible? If a resync (e.g. repair) of an Object started and the only other full replica of the data was then lost, the resync will not progress as there is no data to read from.

- What is the given 'intent' of the resync? If you go to Cluster > Monitor > Resyncing Components, it should state the intent (e.g. compliance, rebalance, repair, etc.).

- What build version of vCenter and ESXi are in use here? (more helpful to state 'build:15820472' as opposed to '6.7' - see the commands below this list for a quick way to check)

- If you go to Cluster > Monitor > vSAN > Health - do you have any triggered red alerts? If you do, then please attach/PM a screenshot of this with the drop-down details shown.
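
For the build question above, either of these run on an ESXi host over SSH will print the exact version and build number:

# On each ESXi host:
vmware -vl
esxcli system version get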

"I see many post in this forum and I check that there would be commands like vsan.resync... It affects to normal functionality of VMs in the moment you put these commands?"

Commands such as vsan.resync_dashboard, vsan.disks_stats and vsan.obj_status_report are basically just 'get' commands and do not cause any impact when run - if you can, please attach/PM the output of these 3 commands run against the cluster in question.
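
For reference, these are run from RVC on the vCSA - a minimal sketch (the SSO user, vCenter address and inventory path are placeholders, substitute your own):

rvc 'administrator@vsphere.local'@vcenter.example.local
> cd /vcenter.example.local/Datacenter/computers/Cluster
> vsan.resync_dashboard .
> vsan.disks_stats .
> vsan.obj_status_report .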

"but for example when I try to clone a VM with resync components already working, there is no way, it fails to do the process (I understand because in this moment is doing resync components)."

There shouldn't be an issue with cloning a VM/vmdk while it is resyncing; this points to a more severe issue than resync simply being slow - potentially the Objects are Inaccessible or in an otherwise impaired data state.

Please attach/PM the output of this grep run against the vobd.log on all 3 hosts:

# grep -iE 'checks|apd|perm|heartbeat|lost|uplink' /var/log/vobd.log
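
If it is easier, you can collect all 3 in one go from any workstation with SSH access to the hosts (the hostnames below are placeholders):

for h in esxi01 esxi02 esxi03; do
  ssh root@$h "grep -iE 'checks|apd|perm|heartbeat|lost|uplink' /var/log/vobd.log" > $h-vobd.txt
done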

Bob

ToniCapablo
Enthusiast

Hi Bob! Thank you for the quick reply. I will try to answer all the questions:

"I see that many components takes very long time to succesfully resync"

If you could be more specific it may help narrow this down:

- Are the same Objects resyncing as a few days ago, or different Objects?

Same Objects

- Is the resync 'looping' for some Objects (e.g. it starts at 200GB to resync, gets down to 50GB then goes back to 200GB)?

Right now it shows 243.49 GB to resync. When I refresh, the number changes back and forth, similar to a loop.

- Are the Objects in question Inaccessible? If a resync (e.g. repair) of an Object started and the only other full replica of the data was then lost, the resync will not progress as there is no data to read from.

Some VMs are in reduced availability, but they are working OK.

- What is the given 'intent' of the resync? If you go to Cluster > Monitor > Resyncing Components, it should state the intent (e.g. compliance, rebalance, repair, etc.).

For some reason, backups of the VMs (via the vSphere Data Protection appliance deployed in this solution) do not work for the VMs that are resyncing.

- What build version of vCenter and ESXi are in use here? (more helpful to state 'build:15820472' as opposed to '6.7')

ESXi version: 6.5.0 Update 1 (Build 5969303)

vCSA version: 6.5.0 (Build 7312210)

vSAN version (Health Service): 6.6.1

- If you go to Cluster > Monitor > vSAN > Health - do you have any triggered red alerts? If you do, then please attach/PM a screenshot of this with the drop-down details shown.

[screenshot attached: pastedImage_6.png]

"I see many post in this forum and I check that there would be commands like vsan.resync... It affects to normal functionality of VMs in the moment you put these commands?"

Commands such as vsan.resync_dashboard, vsan.disks_stats and vsan.obj_status_report are basically just 'get' commands and do not cause any impact when run - if you can, please attach/PM the output of these 3 commands run against the cluster in question.

1. vsan.resync_dashboard (I omitted VM names for security reasons)

/vCenter IP Address/Datacenter/computers/CLUSTER> vsan.resync_dashboard .
2020-05-08 14:01:53 +0200: Querying all VMs on vSAN ...
2020-05-08 14:01:53 +0200: Querying all objects in the system from esxi01-... ...
2020-05-08 14:01:53 +0200: Got all the info, computing table ...
+----------------------------------------------------------------------------------------+-----------------+---------------+
| VM/Object                                                                              | Syncing objects | Bytes to sync |
+----------------------------------------------------------------------------------------+-----------------+---------------+
| one_vm                                                                                 | 1               |               |
|    [vsanDatastore] 9ca7e05a-310b-c9db-05c0-98f2b325f0e0/one_vm.vmdk                    |                 | 63.72 GB      |
| two_vm                                                                                 | 1               |               |
|    [vsanDatastore] a17d065b-bee5-42e5-aaaa-98f2b325f0e0/two_vm.vmdk                    |                 | 21.10 GB      |
| three_vm                                                                               | 1               |               |
|    [vsanDatastore] 467a0d5b-5a9c-2a0e-4d32-98f2b325f0e0/three_vm.vmdk                  |                 | 95.98 GB      |
| four_vm                                                                                | 1               |               |
|    [vsanDatastore] 019cd95a-e9a9-3f07-0589-1c98ec1de210/four_vm.vmdk                   |                 | 56.03 GB      |
| vcenter....                                                                            | 1               |               |
|    [vsanDatastore] 578b585a-41e8-78b3-4e5f-98f2b325f0e0/vcenter....vmdk                |                 | 6.59 GB       |
+----------------------------------------------------------------------------------------+-----------------+---------------+
| Total                                                                                  | 5               | 243.41 GB     |
+----------------------------------------------------------------------------------------+-----------------+---------------+

2. vsan.disks_stats (I omitted ESXi names for security reasons)

+---------------------+--------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+
|                     |                          |       | Num  | Capacity  |         |          | Physical   | Physical | Physical | Logical    | Logical | Logical  | Status  |
| DisplayName         | Host                     | isSSD | Comp | Total     | Used    | Reserved | Capacity   | Used     | Reserved | Capacity   | Used    | Reserved | Health  |
+---------------------+--------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+
| mpx.vmhba1:C2:T0:L0 | esxi01-...               | SSD   | 0    | 894.25 GB | 0.00 %  | 0.00 %   | N/A        | N/A      | N/A      | N/A        | N/A     | N/A      | OK (v5) |
| mpx.vmhba1:C2:T1:L0 | esxi01-...               | MD    | 16   | 846.94 GB | 19.16 % | 2.75 %   | 2540.81 GB | 19.18 %  | 1.79 %   | 8942.50 GB | 4.42 %  | 0.26 %   | OK (v5) |
| mpx.vmhba1:C2:T2:L0 | esxi01-...               | MD    | 16   | 846.94 GB | 19.16 % | 1.32 %   | 2540.81 GB | 19.18 %  | 1.79 %   | 8942.50 GB | 3.14 %  | 0.12 %   | OK (v5) |
| mpx.vmhba1:C2:T3:L0 | esxi01-...               | MD    | 15   | 846.94 GB | 19.16 % | 1.32 %   | 2540.81 GB | 19.18 %  | 1.79 %   | 8942.50 GB | 7.79 %  | 0.12 %   | OK (v5) |
+---------------------+--------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+
| mpx.vmhba1:C2:T0:L0 | esxi02-...               | SSD   | 0    | 894.25 GB | 0.00 %  | 0.00 %   | N/A        | N/A      | N/A      | N/A        | N/A     | N/A      | OK (v5) |
| mpx.vmhba1:C2:T1:L0 | esxi02-...               | MD    | 11   | 846.94 GB | 10.52 % | 1.31 %   | 2540.81 GB | 10.53 %  | 3.53 %   | 8942.50 GB | 1.88 %  | 0.12 %   | OK (v5) |
| mpx.vmhba1:C2:T3:L0 | esxi02-...               | MD    | 19   | 846.94 GB | 10.52 % | 5.34 %   | 2540.81 GB | 10.53 %  | 3.53 %   | 8942.50 GB | 3.65 %  | 0.51 %   | OK (v5) |
| mpx.vmhba1:C2:T2:L0 | esxi02-...               | MD    | 12   | 846.94 GB | 10.52 % | 3.94 %   | 2540.81 GB | 10.53 %  | 3.53 %   | 8942.50 GB | 3.25 %  | 0.37 %   | OK (v5) |
+---------------------+--------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+
| mpx.vmhba1:C2:T0:L0 | esxi03-...               | SSD   | 0    | 894.25 GB | 0.00 %  | 0.00 %   | N/A        | N/A      | N/A      | N/A        | N/A     | N/A      | OK (v5) |
| mpx.vmhba1:C2:T2:L0 | esxi03-...               | MD    | 16   | 846.94 GB | 8.89 %  | 3.44 %   | 2540.81 GB | 8.89 %   | 3.45 %   | 8942.50 GB | 1.78 %  | 0.33 %   | OK (v5) |
| mpx.vmhba1:C2:T3:L0 | esxi03-...               | MD    | 14   | 846.94 GB | 8.89 %  | 2.03 %   | 2540.81 GB | 8.89 %   | 3.45 %   | 8942.50 GB | 2.39 %  | 0.19 %   | OK (v5) |
| mpx.vmhba1:C2:T1:L0 | esxi03-...               | MD    | 14   | 846.94 GB | 8.89 %  | 4.88 %   | 2540.81 GB | 8.89 %   | 3.45 %   | 8942.50 GB | 2.45 %  | 0.46 %   | OK (v5) |
+---------------------+--------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+

3. vsan.obj_status_report /vCenter IP Address/Datacenter Name/computers/Cluster Name

2020-05-08 14:18:15 +0200: Querying all VMs on vSAN ...
2020-05-08 14:18:15 +0200: Querying all objects in the system from esxi01-... ...
2020-05-08 14:18:16 +0200: Querying all disks in the system from esxi01-.... ...
2020-05-08 14:18:16 +0200: Querying all components in the system from esxi01-....es ...
2020-05-08 14:18:17 +0200: Querying all object versions in the system ...
2020-05-08 14:18:20 +0200: Got all the info, computing table ...

Histogram of component health for non-orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
| 3/3 (OK)                            |  38                          |
| 4/4 (OK)                            |  3                           |
| 6/6 (OK)                            |  1                           |
+-------------------------------------+------------------------------+
Total non-orphans: 42

Histogram of component health for possibly orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
| 0/3 (Unavailable)                   |  1                           |
+-------------------------------------+------------------------------+
Total orphans: 1

Total v1 objects: 0
Total v2 objects: 0
Total v2.5 objects: 0
Total v3 objects: 0
Total v5 objects: 43

"but for example when I try to clone a VM with resync components already working, there is no way, it fails to do the process (I understand because in this moment is doing resync components)."

There shouldn't be an issue with cloning a VM/vmdk while it is resyncing; this points to a more severe issue than resync simply being slow - potentially the Objects are Inaccessible or in an otherwise impaired data state.

Please attach/PM the output of this grep run against the vobd.log on all 3 hosts:

# grep -iE 'checks|apd|perm|heartbeat|lost|uplink' /var/log/vobd.log

I attached the three log files (one for each ESXi host).

TheBobkin
Champion

Hello Toni,

The vobd.logs are flooded with messages relating to unrecoverable checksum issues - the build and configuration you are running may indicate that you are hitting this issue, which was patched nearly 3 years ago:

VMware Knowledge Base

I would advise, as the KB does, engaging vSAN GSS for further troubleshooting - note that simply patching will not resolve the issue.
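
For reference, a rough way to gauge how prevalent these messages are on each host (the search term is just my assumption of what to look for):

# Count the vobd.log lines that mention checksum errors
grep -ic 'checksum' /var/log/vobd.log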

Bob

ToniCapablo
Enthusiast

Thanks for the support and quick reply.

I will consider your advice; I hope it will be useful. The main problem is that this virtualization environment is in production, so before doing these types of actions (an update, for example) I must secure everything first.

Regards!

ToniCapablo
Enthusiast

Anyway... in the case of patching the three ESXi hosts... one question:

For patching, I understand it is better to first isolate one ESXi host (evacuate all VMs from it) and then patch that host. But while you are patching, and once the patch is done, will the rest of the VMs (running on the other, not-yet-patched hosts) keep working? Please tell me if you have experience with this.

Then repeat the same on the other ESXi hosts until all the VMs end up in a new environment, with all three hosts patched.
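
Something like this per host is what I have in mind (only a sketch - the patch bundle path is just an example, and I would need to double-check the esxcli options for 6.5):

# Enter maintenance mode, keeping vSAN objects accessible
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility
# Apply the patch bundle (path is only an example)
esxcli software vib update -d /vmfs/volumes/datastore1/ESXi650-update-bundle.zip
# Reboot, then take the host out of maintenance mode
reboot
esxcli system maintenanceMode set -e false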

Am I right with these assumptions?

Regards,

TheBobkin
Champion

Hello Toni,

"Then, repeat the same in the other esxis untill you will have VM in a new environment, with thre esxi patched."

As I said above, simply updating ESXi won't fix the issues with your data if it is already in an impaired state.

Whether the issues with the data here are a result of the issue I mentioned or of something else (hardware/driver/firmware etc.) is generally not so easy to determine.

However, regardless of the cause, the most cautious approach to resolving this would be to clone (not SvMotion) the current VMs residing on this cluster to another datastore (VMFS/vSAN/NAS), wipe the current vSAN clean (e.g. recreate all Disk-Groups once they are empty and the hosts are patched) and move everything back. I am aware you said cloning VMs wasn't working; this is likely due to the impaired state of the data. You *may* be able to clone the VMs by cloning the Storage Policy currently applied to them into a new Storage Policy, adding a rule for checksumDisabled to it, and applying this new Storage Policy to the VMs before cloning them.

Bob

ToniCapablo
Enthusiast

Hi Bob...

I tried to move just one VM (there are many VMs with this issue of components resyncing too slowly) and, even after changing the storage policy, it did not work.

After a couple of minutes, I got this error:

[screenshot attached: pastedImage_0.png]

I created the new policy with checksum disabled. The rest of the parameters are the defaults, copied from the vSAN Default Storage Policy:

Rule-set 1

Storage Type: VSAN

Primary level of failures to tolerate: 1

Number of disk stripes per object: 1

Force provisioning: No

Object space reservation (%): 0

Flash read cache reservation (%): 0.0

Disable object checksum: Yes

Maybe I must (re)configure another parameter in this new policy in order to migrate the VMs to the other datastore?

Regards and thanks in advance for the replies!

TheBobkin
Champion

Hello Toni,

Please note that I advised cloning the impacted VMs off to external storage, not SvMotioning them - though regardless, if one fails, the other likely would too.

There should be nothing else you need to change in that Storage Policy - please validate that it has been successfully applied to the Objects/VMs (though unfortunately this may not be possible with the data in an impaired state). If moving the data isn't possible (and backup+restore of the vmdks is not working) and the data is functional at the Guest-OS level, I would advise adding a new vmdk (stored on an external datastore) to the VMs and moving the data within the Guest-OS to this new vmdk.
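
One way to validate what is actually applied per-Object is vsan.vm_object_info in RVC - a sketch (the inventory path is a placeholder; the policy attributes printed for each Object should reflect the new rule, e.g. checksumDisabled, once applied):

> vsan.vm_object_info /vcenter.example.local/Datacenter/vms/one_vm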

As I initially said, I would strongly advise that you open a support request with vSAN GSS - there are a number of potential things that may be possible to do/check/fix which are internal-only and which I am thus not at liberty to go into here.

Bob
