VMware Cloud Community
AlexanderLiucka
Enthusiast

vSAN: "Lost access to volume" during HCIBench tests

VMware 6.7 U3 - 6.7.0.40000 - 14367737

ESXi 6.7 U3 - 14320388

vSAN on-disk format version - 10.0

4-node vSAN cluster with 2x 10 Gbit/s optical fiber NICs.

HCIBench 2.2.1

I'm receiving "Lost access to volume" error messages during vSAN tests with HCIBench.

The HCIBench settings were Easy Run and VDBENCH.

Now the interesting part starts.

The "Lost access to volume" errors appear on the SECOND test, vdb-8vmdk-100ws-4k-100rdpct-100randompct-4threads, exactly 16 seconds after the test starts, and ONLY when the test phase includes Virtual Disk Preparation ZERO. If you do not delete the guest VMs and just reuse them, there are no errors at all.
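For anyone comparing runs, the test name itself encodes the workload. Below is a minimal Python sketch that splits such a name into its parts. The field meanings (number of VMDKs, working-set percentage, block size, read percentage, random percentage, threads) are my reading of the name quoted above, not an official HCIBench specification, and `parse_test_name` is a hypothetical helper.

```python
import re

# Assumed layout, inferred only from the test name quoted in this post.
TEST_NAME = re.compile(
    r"vdb-(?P<vmdks>\d+)vmdk-(?P<working_set_pct>\d+)ws-"
    r"(?P<block_size>\d+[kKmM]?)-(?P<read_pct>\d+)rdpct-"
    r"(?P<random_pct>\d+)randompct-(?P<threads>\d+)threads"
)


def parse_test_name(name):
    """Split a test name into a dict of workload parameters."""
    match = TEST_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised test name: {name!r}")
    # Keep the block size as text (e.g. "4k"); everything else is an integer.
    return {
        key: value if key == "block_size" else int(value)
        for key, value in match.groupdict().items()
    }


params = parse_test_name("vdb-8vmdk-100ws-4k-100rdpct-100randompct-4threads")
print(params)
```

Read this way, the failing test is a 4k, 100% read, 100% random workload with 4 threads over 8 VMDKs.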

Because the "Lost access to volume" message also says "due to connectivity issues", and I have not found any other error in the logs that would explain connectivity issues, nor any congestion, I started time-stamped vmkping logging from all ESXi hosts to all vSAN vmkernel adapters. The vmkping logs show no packet loss on the vSAN interfaces, from 1 minute before the errors through the whole error period.
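A check like the one described above can be sketched in Python. This assumes each log line is an ISO timestamp followed by ordinary ping reply text containing `icmp_seq=<n>`; that line format and the helper name `missing_sequences` are assumptions for illustration, so adjust the pattern to however your own logging wrapper writes the lines.

```python
import re
from datetime import datetime

# Assumed line format: "<ISO timestamp> 64 bytes from <ip>: icmp_seq=<n> ..."
REPLY = re.compile(r"^(?P<ts>\S+) .*icmp_seq=(?P<seq>\d+)")


def missing_sequences(lines, start, end):
    """Return icmp_seq numbers that got no reply inside [start, end].

    Only gaps between the first and last reply seen in the window are
    reported; lines that do not look like replies are ignored.
    """
    seen = set()
    for line in lines:
        match = REPLY.match(line)
        if match is None:
            continue
        timestamp = datetime.fromisoformat(match.group("ts"))
        if start <= timestamp <= end:
            seen.add(int(match.group("seq")))
    if not seen:
        return []
    return [seq for seq in range(min(seen), max(seen) + 1) if seq not in seen]
```

Note that an empty result for a window with no replies at all does not mean "no loss", so it is worth also confirming that the window contains the expected number of replies.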

These "Lost access to volume" errors are reported only once per VM during the whole test cycle, but for all VMs running in that cluster at that moment (including VMs other than the HCIBench guest VMs). The length of the error period (from Lost to Recovered) differs every time but is usually short, from 1 to 3 seconds; only once have I seen it last 8 seconds. It looks as if all the running VMs are hit by something at the same time. I have not found any indication of problems inside the running VMs during this "Lost access to volume" period.
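To measure those Lost-to-Recovered periods consistently, something like the following Python sketch could be used. It assumes each event line starts with an ISO timestamp and contains either "Lost access to volume <id>" or "restored access to volume <id>"; the timestamp prefix, the exact recovery wording, and the helper name `outage_durations` are assumptions, so adapt the pattern to the events as they appear in your environment.

```python
import re
from datetime import datetime

# Assumed event wording and timestamp prefix; adjust to your own logs.
EVENT = re.compile(
    r"^(?P<ts>\S+) .*?"
    r"(?P<kind>Lost access to volume|restored access to volume) (?P<vol>\S+)"
)


def outage_durations(lines):
    """Pair each 'Lost' event with the next 'restored' event per volume.

    Returns a list of (volume, seconds) tuples in order of recovery.
    """
    lost_at = {}
    outages = []
    for line in lines:
        match = EVENT.match(line)
        if match is None:
            continue
        timestamp = datetime.fromisoformat(match.group("ts"))
        volume = match.group("vol")
        if match.group("kind").startswith("Lost"):
            # Keep the earliest unmatched 'Lost' timestamp for this volume.
            lost_at.setdefault(volume, timestamp)
        elif volume in lost_at:
            seconds = (timestamp - lost_at.pop(volume)).total_seconds()
            outages.append((volume, seconds))
    return outages
```

Running this over the events from several test cycles would show whether the periods really cluster around 1 to 3 seconds and whether all volumes lose access at the same instant.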

It took me a while to recognize these errors as a pattern and to narrow this strange problem down so that it can be reproduced every time on my test cluster. To me this looks like some weird vSAN bug.

I have tried stressing my test cluster in ways other than HCIBench, but the "Lost access to volume" problem occurs only with the HCIBench test.

bmrkmr
Enthusiast

Alexander,

I can tell you that we had this "lost access to volume" issue when we ran the config mentioned in the VMware Knowledge Base

i.e. R730 w/HBA330 w/FW16.17.00.03

When the load got higher, the messages indicated that the backend somehow could not cope with it. Downgrading the FW helped, as the KB article promised.

In some cases VMs had issues performing disk writes. Infrequently, we still get a warning in vSAN clusters with heavily IO-bound VMs.
