In a VMware cluster with HA and DRS enabled, we constantly experience hosts disconnecting from the vCenter Server Appliance.
Versions: vSphere 6.5, build 8294253
vCenter appliance 126.96.36.19900
I added the parameter
config.vpxd.heartbeat.notRespondingTimeout = 120
and restarted the vCenter services, following this link.
However, we continue to experience hosts disconnecting from vCenter.
Attached is a picture showing our server Calfaquen disconnecting, but the event occurs with all servers.
I would appreciate your support.
Is this a new deployment, or an old one where the problem has only now started to occur? Are the ESXi hosts and the vCenter appliance on the same subnet, or is the vCSA on a different subnet and behind a firewall?
Check the following VMware KB article for additional settings that you need to investigate: VMware Knowledge Base
I would suggest checking the events for the VCSA to see whether a backup, snapshot, or any other task was running on the VCSA at the time the host became disconnected.
In large, enterprise-scale environments this usually happens with wide management subnets, network congestion, low heartbeat timeout values, or when vCenter is too busy with many tasks in its queue.
I have also seen this issue while creating snapshots or backups of the vCenter Server. High storage latency can cause it as well.
The vCenter appliance is on a different subnet than the ESXi hosts. I also modified the vpxa timeout on the ESXi hosts and on the vCenter appliance.
I also modified the teaming at the vSwitch level, because there was only one active adapter; all four adapters are now configured as active.
At the VMkernel port group level, I also modified the teaming so that there is only one active adapter and three standby adapters.
However, the ESXi hosts continue to be disconnected.
I would appreciate any further support you can offer.
I would suggest checking the performance graphs of both the ESXi host and the vCenter server at the time the issue occurs, and looking for high latency or high CPU/RAM usage.
Also check the timestamp of the alert and see if there was any backup job running at that time.
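The timestamp comparison can be done by hand, but with many alerts a small script helps. Below is a minimal Python sketch, assuming you have already collected the alert time and the start/end times of your backup or snapshot jobs yourself. The job names, the times, and the 5-minute margin are made-up illustration values, not data from this environment.

```python
from datetime import datetime, timedelta


def jobs_near(alert_time, jobs, margin=timedelta(minutes=5)):
    """Return the names of jobs that were running at alert_time.

    jobs is a list of (name, start, end) tuples. A job counts as
    overlapping if alert_time falls between start - margin and end + margin.
    """
    return [
        name
        for name, start, end in jobs
        if start - margin <= alert_time <= end + margin
    ]


# Hypothetical example data (not taken from the thread).
alert = datetime(2018, 6, 1, 2, 14)
jobs = [
    ("vcsa-backup", datetime(2018, 6, 1, 2, 0), datetime(2018, 6, 1, 2, 30)),
    ("vm-snapshots", datetime(2018, 6, 1, 4, 0), datetime(2018, 6, 1, 4, 20)),
]

print(jobs_near(alert, jobs))
```

An empty result for every disconnect alert would suggest that backups are not the trigger.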
If not, the issue may be layer 3 network latency, especially if you have a firewall doing the inter-VLAN routing.
To capture the network traffic, you can run pktcap-uw --vmk vmk# -o file.pcap in the ESXi shell and then open the captured file in Wireshark, which makes the contents of the pcap easier to view.
Also you can check the hostd.log file and look for heartbeats.
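If the log is large, a short script can pull out the relevant lines once you have copied the file to a workstation. This is only a sketch in Python: it does a plain case-insensitive search for the word "heartbeat" and makes no assumption about the actual hostd.log line format. The sample lines are invented for illustration.

```python
def heartbeat_lines(lines, keyword="heartbeat"):
    """Yield (line_number, text) for every line that mentions keyword.

    The match is a simple case-insensitive substring test.
    """
    needle = keyword.lower()
    for number, line in enumerate(lines, start=1):
        if needle in line.lower():
            yield number, line.rstrip("\n")


# Invented sample lines, NOT real hostd.log output.
sample = [
    "first line, unrelated message\n",
    "second line mentions a Heartbeat event\n",
    "third line, unrelated message\n",
]

for number, text in heartbeat_lines(sample):
    print(number, text)
```

For a real file, pass an open file object instead of the sample list, for example `heartbeat_lines(open("hostd.log", errors="replace"))`.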
You can run the below command and leave the SSH window open:
pktcap-uw --vmk vmk0 -o /tmp/test.pcap
(Replace "vmk0" with the vmk# of the ESXi host's management VMkernel adapter if it is not vmk0.)
The packet capture will continuously write the VMkernel adapter's traffic to the test.pcap file. When you want to stop it, just press Ctrl-C multiple times. (Do not stop it with Ctrl-Z, as that may leave the process running in the background, and it will not release the output file.)
Then open the file using Wireshark which is quite user friendly and easy to use.
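As a complement to Wireshark, the capture can also be scanned for long silences in the heartbeat traffic. The Python sketch below rests on several assumptions that you should verify for your environment: that the capture is in the classic pcap format (not pcapng), that the link type is Ethernet without VLAN tags, and that the host-to-vCenter heartbeats are IPv4/UDP packets sent to port 902. The function names are my own.

```python
import struct

HEARTBEAT_PORT = 902  # assumption: heartbeats go to UDP port 902

# Classic pcap magic numbers mapped to (byte order, timestamp fraction unit).
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1e-6),
    b"\xa1\xb2\xc3\xd4": (">", 1e-6),
    b"\x4d\x3c\xb2\xa1": ("<", 1e-9),
    b"\xa1\xb2\x3c\x4d": (">", 1e-9),
}


def udp_timestamps(pcap_bytes, port=HEARTBEAT_PORT):
    """Return capture times (seconds) of IPv4/UDP packets sent to port."""
    if pcap_bytes[:4] not in _MAGICS:
        raise ValueError("not a classic pcap file")
    endian, unit = _MAGICS[pcap_bytes[:4]]
    linktype = struct.unpack_from(endian + "I", pcap_bytes, 20)[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    times = []
    offset = 24  # size of the pcap global header
    while offset + 16 <= len(pcap_bytes):
        ts_sec, ts_frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        frame = pcap_bytes[offset + 16 : offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 42 or frame[12:14] != b"\x08\x00":
            continue  # too short, or not IPv4
        ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
        if frame[14 + 9] != 17 or len(frame) < 14 + ihl + 4:
            continue  # not UDP, or truncated
        dst_port = struct.unpack_from("!H", frame, 14 + ihl + 2)[0]
        if dst_port == port:
            times.append(ts_sec + ts_frac * unit)
    return times


def gaps_longer_than(times, threshold):
    """Return (start_time, gap_seconds) for gaps above threshold seconds."""
    return [(a, b - a) for a, b in zip(times, times[1:]) if b - a > threshold]
```

For example, `gaps_longer_than(udp_timestamps(open("test.pcap", "rb").read()), 60)` lists every pause of more than 60 seconds between matching packets. The 60-second threshold is an arbitrary starting point; pick a value that fits your timeout setting.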
Does this only affect management communication between hosts and vCenter? Or are the VMs no longer accessible via network, too? Does this affect all hosts at the same time or only one host?
Which physical network card is installed in the hosts? We recently had a case where individual hosts repeatedly lost their network connection completely. The cause was a faulty driver for the Intel X710 cards (driver i40en; the bug was fixed in version 1.7.1).