sbmk
Contributor

vSphere 7.0 : regular host disconnections

Hi,

Since we installed a brand new vSphere 7 infrastructure, we have been experiencing a strange problem: regularly, the state of the ESXi hosts flashes to disconnected but comes back to OK immediately.

At first I thought it was a random problem, but after third-party applications (backup, PowerCLI) started complaining about lost connections and closed sockets, I started to investigate further.

And I managed to find this in "/var/log/vpxa.log" (the situation is exactly the same on all ESXi hosts):

2020-10-08T09:08:00Z vmauthd[24697538]: Log for vmauthd version=7.0.0 build=build-16324942 option=Release

2020-10-08T09:08:00Z vmauthd[24697538]: Msg_SetLocaleEx: HostLocale=UTF-8 UserLocale=NULL

2020-10-08T09:08:00Z vmauthd[24697538]: Could not expand environment variable HOME.

2020-10-08T09:08:00Z vmauthd[24697538]: Could not expand environment variable HOME.

2020-10-08T09:08:00Z vmauthd[24697538]: DictionaryLoad: Cannot open file "/usr/lib/vmware/config": No such file or directory.

2020-10-08T09:08:00Z vmauthd[24697538]: DictionaryLoad: Cannot open file "~/.vmware/config": No such file or directory.

2020-10-08T09:08:00Z vmauthd[24697538]: DictionaryLoad: Cannot open file "~/.vmware/preferences": No such file or directory.

2020-10-08T09:08:00Z vmauthd[24697538]: lib/ssl: OpenSSL using FIPS_drbg for RAND

2020-10-08T09:08:00Z vmauthd[24697538]: lib/ssl: protocol list tls1.2

2020-10-08T09:08:00Z vmauthd[24697538]: lib/ssl: protocol list tls1.2 (openssl flags 0x17000000)

2020-10-08T09:08:00Z vmauthd[24697538]: lib/ssl: cipher list ECDHE+AESGCM:RSA+AESGCM:ECDHE+AES:RSA+AES

2020-10-08T09:08:00Z vmauthd[24697538]: lib/ssl: curves list prime256v1:secp384r1:secp521r1

2020-10-08T09:08:00Z vmauthd[24697538]: Connect from remote socket (VCENTER_IP:54246).

2020-10-08T09:08:00Z vmauthd[24697538]: Connect from VCENTER_IP

2020-10-08T09:08:00Z vmauthd[24697538]: recv() FAIL: 1.

2020-10-08T09:08:00Z vmauthd[24697538]: VMAuthdSocketRead: read failed.  Closing socket for reading.

2020-10-08T09:08:00Z vmauthd[24697538]: Read failed.

2020-10-08T09:08:00Z vmauthd[24697538]: VMAuthdSocketWrite: No socket.

Of course, there is an actual address in place of "VCENTER_IP".

It occurs precisely every minute, at the moment the status changes to disconnected in the UI, so I think it is related to heartbeat communications.
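To confirm the one-minute period from the log itself rather than by eye, the gaps between failure lines can be computed. The following is a minimal sketch, assuming each log line starts with an ISO timestamp ending in "Z" as in the excerpt above; the function name `failure_intervals` is hypothetical, and every sample line except the first is invented for illustration.

```python
from datetime import datetime

def failure_intervals(log_lines, marker="recv() FAIL"):
    """Return the gaps, in seconds, between consecutive lines containing marker."""
    stamps = []
    for line in log_lines:
        if marker not in line:
            continue
        # The first whitespace-delimited token is the timestamp,
        # e.g. 2020-10-08T09:08:00Z (fractional seconds are tolerated too).
        token = line.split(maxsplit=1)[0]
        stamps.append(datetime.fromisoformat(token.rstrip("Z")))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# Only the first line comes from the post; the other two are hypothetical.
sample = [
    "2020-10-08T09:08:00Z vmauthd[24697538]: recv() FAIL: 1.",
    "2020-10-08T09:09:00Z vmauthd[24697601]: recv() FAIL: 1.",
    "2020-10-08T09:10:00Z vmauthd[24697655]: recv() FAIL: 1.",
]
print(failure_intervals(sample))
```

Feeding it the real log file (for example `failure_intervals(open(path))`) should show whether the gaps are a steady 60 seconds or only roughly so, which helps separate a timer-driven cause from a random network one.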

The VCSA has been restarted, and there is no firewall between the VCSA and the ESXi hosts; they all live on an almost dedicated 10Gb LAN (the connection loss also occurs for the ESXi host the VCSA is running on).

As it seems to impact only management (VMs are working flawlessly), and as long as no command is issued at the moment the problem occurs, it was possible to live with that.

But as it is now impacting backup and other appliances, it is a growing problem.

I was not able to find the cause of this, and the network guys tell me everything is OK on their side, so I am trying to find some help here.

Regards

Lalegre
Commander

Hey sbmk,

This guy is actually facing exactly the same issue as you, and it was because of unsupported devices: "Adding ESXi host to vCenter fails"

Do you know if all your components on the hosts are on the HCL to support vSphere 7?

sbmk
Contributor

Thanks for your reply.

I have already read that post, and in the end he explains it was in fact an IP address conflict. Fortunately, I am not facing that.

Our servers are stock Dell PowerEdge R740 machines, using the Dell certified ESXi image. As far as I know, they are fully compliant.

Lalegre
Commander

Yes, that was the issue for one guy, and then there was also the disk issue.

What do you see in the vmkernel.log?

ZibiM
Enthusiast

Every minute?

Check the firewall rules between the vCenter Server and the ESXi hosts.

It looks like UDP 902 from the ESXi hosts to vCenter is not allowed through the firewall.

UDP 902 is the heartbeat that each and every ESXi host sends every 60s to the vCenter Server.

If vCenter does not receive it, it marks the host as disconnected.
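The rule described above can be modelled in a few lines. This is only an illustrative sketch of a heartbeat timeout, not vCenter's actual implementation: the class name `HeartbeatMonitor`, the host name, and the 60-second timeout value are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Toy model: a host counts as disconnected once its heartbeat is overdue."""

    def __init__(self, timeout_s=60.0):
        # Assumed timeout; the real value is a vCenter setting not shown in this thread.
        self.timeout_s = timeout_s
        self.last_seen = {}  # host name -> time of the last heartbeat, in seconds

    def record(self, host, now):
        """Note that a heartbeat from host arrived at time now."""
        self.last_seen[host] = now

    def state(self, host, now):
        """Return 'connected' or 'disconnected' for host at time now."""
        last = self.last_seen.get(host)
        if last is None or now - last > self.timeout_s:
            return "disconnected"
        return "connected"

monitor = HeartbeatMonitor(timeout_s=60.0)
monitor.record("esxi-01", now=0.0)          # hypothetical host name
print(monitor.state("esxi-01", now=30.0))   # heartbeat still fresh
print(monitor.state("esxi-01", now=61.0))   # heartbeat overdue
```

The model shows why a single dropped or late datagram is enough to flip the host state briefly: the state recovers as soon as the next heartbeat is recorded, which matches a status that flashes to disconnected and immediately returns to OK.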

sbmk
Contributor

Thanks for your answer.

Yes, I know that, because we first had a firewall between the VCSA and the ESXi hosts, and some wrong rules caused problems: the hosts were connected, but after a reboot, for example, they were disconnected and unable to come back to the right status.

We realized a firewall failure could prevent us from easily managing our infrastructure, so all ESXi hosts and the VCSA are now in the same network, with no firewall between them.

And then this new regular fault appeared...
