VMware Cloud Community
sysadmzzz
Contributor

vsan resync performance

Hi,

I'm testing the vSAN feature with ESXi 6.5.0 U1. To start, just to see the behavior, I built three nodes, each with an HPE H220 HBA, a 256GB SSD, and a 2TB SATA SSD.

To try dedupe, I enabled and then disabled dedupe & compression, and have now been waiting four days for a resync of only ~500GB to complete.

My resync rate is now 1.4MB/sec and disk write latency is around 300ms. That is way too slow, considering I can't even create a new VM during this resync.

I know the H220 is not certified for all-flash and my SATA SSDs are not certified for vSAN, but can it really be this slow?

It seems overly sensitive to the hardware.

Any comments are welcome.

TheBobkin
Champion

Hello,

First off, welcome to vSAN Communities!

We hope you find useful info here; if you find any comments helpful, please consider marking them as such.

Using hardware that is not certified for vSAN, or not certified for the specific purpose it is being used for, is extremely likely to have unpredictable results. From my experience (working with this product 10+ hours a day for over a year!), these can range from sub-standard performance all the way to a practically non-functional cluster or data loss.

A key indication that a hardware component will likely have a poor outcome is when it is certified for one use (like the H220 for Hybrid) but not for others - this basically tells us it was tested for both and either failed to meet the necessary standards or had real, unfixable issues that blocked certification.

The same can be expected with SSDs that are certified for cache-tier in Hybrid but not All-Flash.

Are the cache-tier SSDs SATA? SATA devices have a shallow queue depth and thus are not ideal for this purpose; I wouldn't advise them for the capacity tier either if it can be avoided.
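If you want to sanity-check this on your hosts, something like the following from the ESXi shell should show the per-device queue depth - just a rough sketch, the naa ID below is a placeholder for one of your SSDs:

#esxcli storage core device list -d naa.xxxxxxxxxxxxxxxx | grep -i "Queue Depth"

SATA devices typically report a far lower 'Device Max Queue Depth' than SAS or NVMe devices do.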

It may be possible to determine whether it is the controller, the cache-tier SSDs or the capacity-tier SSDs that are the true bottleneck here using vSAN Observer, but it is very likely a combination of all of these:

kb.vmware.com/kb/2064240

Enabling Dedupe + Compression initiates a rolling on-disk upgrade which involves evacuating (or deleting, if fewer than 4 nodes) the disk-groups on each node and migrating/rebuilding all the data, one node at a time, until all are complete.

Thus the question arises - are you sure the same data has been resyncing for 4 days, or has more data been added as it resyncs?

You can check this using RVC or the Web Client by looking at the Objects being resynced and noting them.

virten.net/2017/05/vsan-6-6-rvc-guide-part-1-basic-configuration/
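As a rough sketch (adjust the path to your own Datacenter and Cluster names), from RVC on the vCenter Server you can watch exactly which Objects still have bytes left to resync:

#rvc administrator@vsphere.local@localhost
> vsan.resync_dashboard /localhost/YourDatacenter/computers/YourCluster

If the bytes-to-sync for the same Objects keeps shrinking (however slowly) then it is the same data resyncing; if new Objects keep appearing then data is being added as it goes.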

Bob

sysadmzzz
Contributor

Thanks TheBobkin!

In short, you're saying that non-certified devices may result in unusable performance with vSAN, right?

Yes, I'm sure the resync data hasn't changed much during the resync process; it contains only test VMs.

TheBobkin
Champion

Hello,

In short - Yes, but I felt this needed a *few* more details so that it is clear.

At the same time, though, it could be encountering issues such as HBA resets/aborts or congestion that *may* have temporary or long-term solutions which could improve performance and stability. Start by looking at the Health check (Cluster > Monitor > vSAN > Health), then look in dmesg or vmkernel.log for 'congestion threshold reached' messages or H:0x7 / H:0x8 SCSI Sense codes, for example with the quick check below.
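Just a sketch - adjust the paths/patterns as needed - run this from the ESXi shell on each host:

#grep -iE "H:0x7|H:0x8" /var/log/vmkernel.log | tail -n 5
#grep -i congestion /var/log/vmkernel.log | tail -n 5

If either of these returns a steady stream of recent hits while the resync is running, the controller and/or devices are likely the ones struggling.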

Regarding what you mentioned about not being able to create VMs during this process - it is possible, but they will need either an FTT=0 (Failures To Tolerate) Storage Policy (SP) or an SP with 'Force Provisioning' in its rule-set (this basically allows Objects to be created with reduced availability, which will be made fully compliant once enough resources are usable again).
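The proper place to create this is Home > Policies and Profiles > VM Storage Policies in the Web Client, but just to illustrate what such a rule-set looks like (a sketch only - no need to change your host defaults), you can see the expression format in the host default policy:

#esxcli vsan policy getdefault

An FTT=0 + Force Provisioning rule-set would read something like (("hostFailuresToTolerate" i0) ("forceProvisioning" i1)).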

Bob

sysadmzzz
Contributor

Thanks for your response, Bob!

Cluster > Monitor > vSAN > Health shows nothing specific about congestion, apart from warnings for vSAN object health and vSAN cluster configuration consistency.

Looking at vmkernel.log, I found the following, but I'm not sure whether it's what you're referring to.

2017-08-16T07:40:09.080Z cpu44:2082573)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.500a075109048e2b:2

2017-08-16T07:42:48.844Z cpu38:65639)ScsiDeviceIO: 2948: Cmd(0x4396fd592200) 0x1a, CmdSN 0x8b23c from world 0 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

I'll try an SP with the 'Force Provisioning' Rule-set.

Thanks again!

TheBobkin
Champion

Hello,

Those specific messages can be safely ignored.

Here is a good resource for translating Sense code data:

virten.net/vmware/esxi-scsi-sense-code-decoder/

This short script will show you the number of each type of Sense code in the current log:

#grep H: /var/log/vmkernel.log | awk -F failed '{print $2}' | sort | uniq -c

Try taking a look with vSAN Observer; it can be a great tool for identifying bottlenecks.
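If you haven't run it before, it can be launched from RVC on the vCenter Server roughly like this (adjust the cluster path; assuming the defaults, the live web UI comes up on port 8010):

> vsan.observer /localhost/YourDatacenter/computers/YourCluster --run-webserver --force

Then browse to https://<vCenter>:8010 and check the disk-level (deep-dive) views for your cache and capacity devices.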

Bob

sysadmzzz
Contributor

Hi Bob,

Here is the output of your command. It looks okay.

    153  H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

I'm looking at vSAN Observer and have a question... what does PLOG IORETRY IOPS mean?

It shows a high number on two of the three nodes.

(screenshot attached: pastedImage_1.png)
