VMware Cloud Community
bleuze
Enthusiast

Connection timeout issues after update to 6.7U3

We have only two ESXi hosts and they are almost identical: a Dell PowerEdge T630 and a PowerEdge R730. On Sunday I updated both from ESXi 6.7 to 6.7U3. I first updated the vCenter appliance to 6.7U3, then remediated each host using the Dell ESXi image VMware-VMvisor-Installer-6.7.0.update03-14320388.x86_64-DellEMC_Customized-A00.iso.

After the update, the T630 seems to be working almost normally (with just a few delays), but the R730 host has much longer communication delays. Thankfully, the VMs appear to be running and working properly. Our issues show up when managing the hosts and running backups. The symptoms we've seen so far are:

  • Our backup server (Veeam Backup) is USUALLY UNABLE to successfully scan the R730 host for VMs and disks; it times out. About 5% of the time I try, it succeeds.
  • Our backup server is USUALLY ABLE to successfully scan the T630 host for VMs and disks. About 5% of the time I test, it fails.
  • If I bring up the web-based management console to connect directly to the host (not through vCenter Server), the first logon attempt always fails (times out); the second attempt always works for the T630 and usually works for the R730, but it takes a long time.
  • Sometimes after the web management interface comes up for the R730, it does not correctly display the hostname in the browser tab.
  • Before the update to 6.7U3, both hosts would successfully log in on the first attempt.
  • Most of the time, the vCenter Server appliance (which is currently running on the R730 host) connects to the R730 host.
  • SOMETIMES in the web management interface for vCenter Server, it displays the VMs for the T630 host, but I cannot shut down or reboot VMs from the interface. If I connect to a VM via Windows Remote Desktop and shut it down that way, I can usually see in the vCenter interface that the VM has shut down, but sometimes it will not even show me the updated state of the VMs.

I don't know where to start looking in the logs. If someone has a suggestion, I will post logs here.
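In case it helps anyone else hunting similar symptoms: the management-agent logs on the host are probably the place to start for connection timeouts, not just vmkernel.log. A rough first pass from an SSH session on the host (the paths are the standard ESXi 6.7 locations; the runnable line below uses a sample log entry so the snippet works anywhere):

```shell
# Standard ESXi 6.7 log locations on the host:
#   /var/log/vmkernel.log  - kernel, storage, and driver events
#   /var/log/hostd.log     - host management agent (direct host web UI logins)
#   /var/log/vpxa.log      - vCenter agent (vCenter <-> host traffic)
# A quick first pass is grepping each for errors and timeouts, e.g.:
#   grep -icE 'error|timed out' /var/log/hostd.log
# Demonstrated here on a sample line so the command itself is reproducible:
printf '2019-10-07T12:00:01Z warning hostd[1001] Operation timed out\n' \
  | grep -icE 'error|timed out'
```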

1 Solution

Accepted Solutions
bleuze
Enthusiast

The issue is resolved. The disk enclosure error messages in the vmkernel logs were a distraction; they were unrelated. The issue was that I was running Dell OpenManage on the hosts. The instant I uninstalled OpenManage, everything cleared up. This is a known VMware issue: VMware Knowledge Base
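For anyone searching later, a rough sketch of checking for and removing the OpenManage VIB from the host shell (the exact VIB name varies by OpenManage release, so treat this as a sketch and use whatever name the list command actually prints; the runnable line below filters a made-up sample row):

```shell
# Hypothetical cleanup sketch -- list the installed VIBs first, then remove
# the OpenManage one by the name shown:
#   esxcli software vib list | grep -i openmanage
#   esxcli software vib remove -n OpenManage    # reboot the host afterwards
# The grep filter itself, demonstrated on a made-up sample row:
printf 'OpenManage  9.1.0-0000  Dell  PartnerSupported  2019-01-15\n' \
  | grep -ci 'openmanage'
```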


2 Replies
bleuze
Enthusiast

I've attached vmkernel.log from our R730 host.

I see an entry every 10 seconds:

lsi_mr3: megasas_hotplug_work:495: lsi_mr3: Event : Enclosure PD 00(c 00/p0) is unstable

That would be our external enclosure. I will look into disconnecting it to see if that changes anything, but first I have to figure out somewhere else to send backups, as that is where they are currently stored.
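A quick way to confirm how often that entry (or any other) actually repeats is to strip the timestamp and count duplicate messages. Sketched below on two sample lines in the same shape as the real entries, so the pipeline runs anywhere:

```shell
# Rough frequency check: drop the timestamp (field 1), then count duplicates.
# On the host: cut -d' ' -f2- /var/log/vmkernel.log | sort | uniq -c | sort -rn | head
# Demonstrated on two sample lines shaped like the real vmkernel entries:
printf '%s\n' \
  '2019-10-06T10:00:00Z cpu3: lsi_mr3: Event : Enclosure PD 00(c 00/p0) is unstable' \
  '2019-10-06T10:00:10Z cpu3: lsi_mr3: Event : Enclosure PD 00(c 00/p0) is unstable' \
  | cut -d' ' -f2- | sort | uniq -c | sort -rn
```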

I have not looked in vmkernel.log before this upgrade, so I can't say for sure whether these messages appeared before. I assume they did, because ever since installing this enclosure I have noticed that the iDRAC reports:

Communication with Enclosure 0 on Connector 0 of RAID Controller in Slot 7 is intermittent.

about every 10 minutes or so. It is still reporting this; nothing has changed. The disks have always worked, though, without issues. This is an old Dell MD1200 connected through a Perc H30 adapter, which is older hardware no longer supported on newer Dell servers, but it works... and didn't seem to cause any issues with ESXi 6.7.

Even if that is an issue, it doesn't explain why we see some symptoms on our T630, which has no unsupported hardware. I've attached its vmkernel.log as T630-vmkerne.log.
