VMware vCenter Server 6.7 U3 - 6.7.0.40000 - build 14367737
ESXi 6.7 U3 - build 14320388
vSAN on-disk format version – 10.0
4-node vSAN cluster with 2 x 10 Gbit optical fiber NICs.
HCIBench 2.2.1
I'm receiving "Lost access to volume" error messages during vSAN tests with HCIBench. The HCIBench settings were Easy Run with VDBENCH.
Now comes the interesting part.
These "Lost access to volume" messages show up on the SECOND test (vdb-8vmdk-100ws-4k-100rdpct-100randompct-4threads), exactly 16 seconds after the start of the test, and ONLY when the test phase includes Virtual Disk Preparation ZERO. If you do not delete the guest VMs and just reuse them, there are no errors at all.
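For reference, the test name encodes the vdbench workload: 8 vmdks, 100% working set, 4k blocks, 100% reads, 100% random, 4 threads. A minimal sketch of what such a vdbench parameter file roughly looks like (the sd/wd/rd names and the lun path are illustrative placeholders, not HCIBench's actual generated file):

    # Illustrative vdbench parameters: 4k blocks, 100% read, 100% random, 4 threads.
    # The sd/wd/rd names and the lun path are placeholders.
    sd=sd1,lun=/dev/sda,openflags=o_direct
    wd=wd1,sd=sd*,xfersize=4k,rdpct=100,seekpct=100
    rd=rd1,wd=wd1,iorate=max,elapsed=3600,interval=1,threads=4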
Because the "Lost access to volume" message also says "due to connectivity issues", and I have not found any other error in the logs that would explain connectivity issues, nor any congestion, I started logging timestamped vmkpings from all ESXi hosts to all vSAN vmkernel adapters. I found no packet loss on the vSAN interfaces in the vmkping logs, from one minute before the errors through the whole error period.
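For anyone who wants to repeat that check, here is a minimal sketch of the kind of loop I mean, run from an ESXi shell (the target IPs and the vmk1 interface are placeholders for your environment, and -s 8972 assumes a 9000 MTU; use -s 1472 for a standard 1500 MTU):

    # Timestamped vmkpings to every vSAN vmkernel IP; -d forbids fragmentation
    # so MTU problems show up as failures rather than silent fragmentation.
    for ip in 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14; do
        while true; do
            echo "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> /tmp/vmkping-$ip.log
            vmkping -I vmk1 -d -s 8972 -c 1 $ip >> /tmp/vmkping-$ip.log 2>&1
            sleep 1
        done &
    done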
These "Lost access to volume" errors are reported only once per VM, but for all VMs running in the cluster at that moment (including VMs other than the HCIBench guest VMs), throughout the whole test cycle run. The error window (from the "Lost" to the "Recovered" message) is different every time, but it is usually short, from 1 to 3 seconds; only once have I seen it last 8 seconds. It looks as if something hits all the running VMs at the same time. I have found no indication of problems inside the running VMs during this "Lost access to volume" period.
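To measure those windows I correlated the lost/restored event pairs in the host logs; a minimal sketch, assuming the events land in vobd.log as they normally do on ESXi:

    # List the lost/restored event pairs with their timestamps so the length
    # of each outage window can be measured.
    grep -E "Lost access to volume|restored access to volume" /var/log/vobd.log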
It took me a while to recognize these errors as a pattern and to narrow down how to reproduce this strange problem every time on my test cluster. To me this looks like some weird vSAN bug.
I have tried to stress my test cluster in other ways than HCIBench, but the "Lost access to volume" problem shows up only with the HCIBench test.
Alexander,
I can tell you that we had this "lost access to volume" issue when we ran the config mentioned in the VMware Knowledge Base article, i.e. R730 with HBA330 at FW 16.17.00.03.
When the load got higher, the messages indicated that the backend somehow could not cope with it. Downgrading the firmware helped, as the KB article promised.
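If you want to confirm which firmware the controller is actually running from the ESXi shell, one way on LSI-based HBAs such as the HBA330 (lsi_msgpt3 driver) is the driver's key-value dump; a sketch, assuming your driver registers its info there:

    # Dump driver key-value info and pick out the firmware version lines
    # (works for LSI-based HBAs that register keyval nodes).
    /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a | grep -i -A2 firmware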
In some cases VMs had issues performing disk writes... infrequently, we still get the warning in vSAN clusters with heavily IO-bound VMs.
Thank you for your suggestion.
I have seen your suggestion, and I have also seen the "Lost access to volume messages with vSAN (59220)" VMware Knowledge Base article and many other "Lost access to volume" articles.
As I stated, I have found a specific situation in which I receive this error message, and it looks like the false positive message described as resolved in "Lost access to volume messages with vSAN (59220)": "The issue has been resolved completely of the false messages in vSAN 6.7 Update 1 onwards."
Your suggestion addresses a very different problem from mine. As I said earlier, I can't reproduce my problem with any other kind of high load, only with HCIBench.
I have also tested ESXi 6.5 U2, 6.5 U3, 6.7, 6.7 U1 and 6.7 U2 for stability, again with HCIBench, and I have seen the same problem in all of these versions; at that point, however, I had not yet noticed that it happens only under those specific circumstances/settings.
The reason for this extensive testing was that I had upgraded my production cluster's resources (RAM, SSD and HDD) and ESXi version, and started to have problems. The components used for the past 3 years were moved into my test lab cluster, where they had been working without any issue during those 3 years. During this testing I found problems with my Intel 82576 1 Gb network adapters, as mentioned in "dead I/O on igb-nic (ESXi 6.7)" (https://communities.vmware.com/thread/612777). Testing the older ESXi versions extensively, I found the same problem with the Intel 82576 NIC in all of them, but because the VM traffic ran over LACP with 1 x 82576 and 3 x I350-T4, I did not see the error messages when the NIC dropped out of the LACP bundle. For the vSAN networking I used two NICs, 1 x 82576 and 1 x I350-T4, which were not in LACP; they were connected to two different vSAN vmkernel adapters on different (air-gapped) IP networks, etc.
I have since upgraded the 1 Gb BASE-T network (6 x 1 Gb NICs) to 10 Gb optical fiber (2 x 10 Gb NICs). After the network upgrade, the network-related error messages such as "vSphere HA agent on XXXXXXXXXX in cluster YYYYYY in ZZZZZZZZ cannot reach some management network addresses of other hosts", "vmnic* is down", "Lost uplink redundancy", "Uplink redundancy degraded" and "vSAN Health Test 'Network latency check' status changed from 'green' to 'yellow'" were gone. Now I don't have any network-related error messages in the logs.
Now the only thing left to clean up in my test cluster is this "Lost access to volume" error, which, as I said earlier, I think is the false positive message described as resolved in "Lost access to volume messages with vSAN (59220)": "The issue has been resolved completely of the false messages in vSAN 6.7 Update 1 onwards."