Re: Performance Issues following an object format ...

RobWindham · ‎09-15-2022

I have a 4-node hybrid VxRail setup. We recently updated from 6.7 to 7.0 U3, which went smoothly, and everything was working fine. A few weeks after that update, I noticed Skyline was reporting an object format health alert, stating that a number of objects and about 60TB of data required an object format change. I came to find out that this was expected due to new storage features in vSAN 7. So I clicked the button to do the object format change over a weekend, which ended up taking about 4 days to complete.

Ever since then, our performance, specifically SQL database performance, has been significantly worse. Users who access the application experience extreme slowness, queries take longer to run, etc. There is definitely more disk latency in the cluster than I am used to seeing. There are no changes in our environment, other than that object format change, that could explain such a 180 in performance. I'm working with Dell on this, but they are not really taking seriously the fact that this started following that change. Right now they are just offering some best practices and tips to improve performance, which I'm happy to follow. But nothing has helped so far, and it's frustrating because prior to this everything had been running fine with minimal issues for years.

I just thought I'd check here to see if anyone could offer any reason why performance would decrease following the object format change. If so, maybe there is a more direct fix. If anything, I would have expected things to get better after this. Right now I'm wishing I could just reverse it, though I'm assuming that is not possible. I have tried rebooting the hosts, but that has not helped.

bryanvaneeden · ‎09-16-2022

I went through the same thing about 2 months ago. In our case it was a 200TB resync, but everything was done within a weekend and there has not been a single performance issue since. So unfortunately I cannot help with this, but I thought I'd share my experience.

However, as far as I know you cannot revert a disk format change, unless perhaps you wipe the hosts entirely and reinstall them, and even then I am unsure. The changes between the mentioned versions are only metadata changes, so not that much has changed as far as I know.

Have you already filed a case with VMware GSS to try and figure out a solution?

Visit my blog at https://vcloudvision.com!

TheBobkin · ‎09-19-2022

@RobWindham, the only things that I can think of that might result in worse performance after object format upgrade would be:

1. Everything (>255GB) gets deep-reconfigured as part of this process, i.e. it rewrites all of the data components in new locations before discarding the original layout's components. These components may now be placed in 'hot-spots', e.g. multiple very write-intensive components residing on the same disk/Disk-Group where they didn't previously. Note that vSAN doesn't distribute components based on their IO usage/profile, only based on their size and the available space on disks/Disk-Groups/nodes, while taking SPBM compliance into account.
Similarly, data components could have been moved to a disk/Disk-Group that has some issue, e.g. physical disk latency, Power-On Resets or some other local storage or controller issue. The former could be validated by changing the storage policy in such a way that it triggers another deep-reconfiguration and moves the components again (e.g. increase stripe-width), but whether this helps depends on where they are moved to (and whether this is much different from the current component locations). The latter should be checked by Dell EMC by reviewing the logs of the hosts and/or validating whether any 'Backend' latency is observed in the vSAN performance graphs.

2. Imbalance from a DOM-Owner perspective: there may be an imbalance in either how many objects one/some DOM-Owners in the cluster own or how intensively those objects are used, i.e. like the above, there is a hot-spot (but at a different layer). You could identify the DOM-Owner and UUID of one of the vmdk objects you have measurably poor performance on (via RVC or the output of 'esxcli vsan debug object list --all'), get a new DOM-Owner elected for it, and see if performance improves:
# vsish -e set /vmkModules/vsan/dom/ownerAbdicate <object uuid>

3. This cluster has some issue totally unrelated to the object format change which is impairing performance.
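
As a rough aid for points 1 and 2, the saved output of 'esxcli vsan debug object list --all' could be tallied per DOM-Owner and per disk. The sketch below is only illustrative: the sample text is made up, and the field labels it searches for ("Owner:", "Disk UUID:") are assumptions about that output which should be checked against a real host before relying on it. The counts say nothing about how IO-intensive each object is, so a skewed tally is a hint for where to look, not proof of a hot-spot.

```python
import re
from collections import Counter

# Made-up sample. The labels "Object UUID:", "Owner:" and "Disk UUID:" are
# ASSUMED to resemble the real esxcli output; verify them on your own hosts.
SAMPLE = """\
Object UUID: 11111111-aaaa
   Owner: esx01.example.local
   Configuration:
      RAID_1
         Component: c1
           Component State: ACTIVE,  Disk UUID: disk-A
         Component: c2
           Component State: ACTIVE,  Disk UUID: disk-B
Object UUID: 22222222-bbbb
   Owner: esx01.example.local
   Configuration:
      Component: c3
        Component State: ACTIVE,  Disk UUID: disk-A
Object UUID: 33333333-cccc
   Owner: esx02.example.local
   Configuration:
      Component: c4
        Component State: ACTIVE,  Disk UUID: disk-A
"""

OWNER_RE = re.compile(r"^\s*Owner:\s*(\S+)", re.MULTILINE)
DISK_RE = re.compile(r"Disk UUID:\s*([^\s,]+)")


def tally(text):
    """Return (objects per DOM-Owner, components per disk) as Counters."""
    owners = Counter(OWNER_RE.findall(text))
    disks = Counter(DISK_RE.findall(text))
    return owners, disks


if __name__ == "__main__":
    owners, disks = tally(SAMPLE)
    for name, count in owners.most_common():
        print(f"{name}: {count} objects owned")
    for uuid, count in disks.most_common():
        print(f"{uuid}: {count} components")
```

To use it on real data, redirect the command's output to a file on the host, copy it off, and pass the file contents to `tally()` in place of `SAMPLE`.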

RobWindham · ‎09-19-2022

Thanks for the reply. I'll go to VMware next if I need to. Starting with Dell since it's technically their product on top of the vSAN. Is your system all-flash? I'm assuming that is the reason it took you a weekend to resync 200TB and 4 days for me to resync 60.
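
For a sense of scale, the two resyncs can be compared as average cluster-wide throughput. This is back-of-the-envelope arithmetic only, assuming decimal units (1 TB = 10^12 bytes), that "a weekend" means roughly 2 days, and that the 60TB and 200TB figures reflect the data actually moved.

```python
def avg_resync_mb_per_s(terabytes, days):
    """Average throughput in MB/s, using decimal units (1 TB = 1e12 bytes)."""
    seconds = days * 24 * 3600
    return terabytes * 1e12 / seconds / 1e6


hybrid = avg_resync_mb_per_s(60, 4)      # 60TB over about 4 days
all_flash = avg_resync_mb_per_s(200, 2)  # assumption: a weekend is ~2 days

if __name__ == "__main__":
    print(f"hybrid:    {hybrid:.1f} MB/s")
    print(f"all-flash: {all_flash:.1f} MB/s")
    print(f"ratio:     {all_flash / hybrid:.1f}x")
```

Under those assumptions the hybrid cluster averaged roughly 174 MB/s and the all-flash one roughly 1157 MB/s, a gap of about 6.7x.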

RobWindham · ‎09-19-2022

Thanks for the info. One of the actions I've taken since this all started was to create a new higher-performance policy for the DB servers that increased striping and made them all thick provisioned. I assigned the policy, which definitely forced another resync. Unfortunately, I have no way to really put the same kind of load on these things that live users do when they run the application. So at some point soon we'll probably swing the application back into the vSAN datacenter and hope it's back to normal. If not, I'll look into the other possibilities you mentioned. Right now Dell is leaning towards an application issue since they can identify latency at the VM level but cannot see any corresponding performance issues on the physical disks. But the correlation between the format change and the problems starting is still too close to ignore. So if there still is a problem I'll try to rule out everything on your list before pushing this back to the DBAs. Thanks again though for pointing me in some different directions.

TheBobkin · ‎09-19-2022

@RobWindham "I'll go to VMware next if I need to. Starting with Dell since it's technically their product on top of the vSAN."

Do note that if you purchased your licensing and S&S from Dell EMC (as is the case for 99% of VxRail solutions) then support for this is with Dell EMC (as that is what was purchased) - if there is something unexpected/unexplainable from the vSAN perspective and/or L1-2 troubleshooting hasn't determined the cause of the issue, then Dell EMC may engage VMware vSAN teams via B2B.

bryanvaneeden · ‎09-25-2022

Yes, our environment is all-flash. I really do hope you get this resolved.

Performance Issues following an object format change