I ran into an issue yesterday with NFS connectivity to my NetApp backend SAN. I had recently changed some NFS-related advanced settings (HeartbeatDelta and HeartbeatMaxFailures) to NetApp-recommended values on all cluster hosts, so I backed those settings out on one host and rebooted it, thinking they might have fouled something up. That made no difference and created a new problem: with the NFS datastores unavailable, I could not get the ESXi 4.1 host to reconnect to vCenter. The host eventually threw the error below and remained 'disconnected' in vCenter 4.1.
The error I received when the host booted back up was "A general system error has occurred: internal error: vmodl.fault.HostCommunication"
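For reference, on ESXi 4.1 those NFS heartbeat advanced settings can be read and reverted from the Tech Support Mode shell with esxcfg-advcfg. The option names and values below are a sketch based on the commonly cited NetApp guidance, not pulled from this host, so verify what actually exists on your build before applying anything:

```shell
# Show the current NFS heartbeat settings (option names as commonly
# cited in NetApp's vSphere-on-NFS guidance; confirm on your build)
esxcfg-advcfg -g /NFS/HeartbeatMaxFailures
esxcfg-advcfg -g /NFS/HeartbeatFrequency

# Example of setting them back (illustrative values only; check the
# current NetApp technical report for your ONTAP/vSphere release)
esxcfg-advcfg -s 10 /NFS/HeartbeatMaxFailures
esxcfg-advcfg -s 12 /NFS/HeartbeatFrequency
```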
It turned out that our storage admin had modified a setting on the virtual interface on the SAN which apparently did not jibe with the way the switch was configured. The hosts could still establish the connection to the NFS datastores, but saw severe latency on any read or write. Latency may not even be the right word: it took over five minutes just to list the contents of a directory when browsing a datastore from the host. This caused all kinds of problems on ESXi, and apparently made the host time out when trying to reconnect to vCenter.
Has anyone else seen an issue like this? Is reconnecting a host to vCenter dependent upon datastore latency? I had assumed the host would connect and simply be unable to access the datastores. It's fixed now (by correcting the SAN settings); I'm just trying to understand why it happened.
Is reconnecting a host to vCenter dependent upon datastore latency?
They're two independent components. There is an agent installed on ESX (vpxa) that allows vCenter to connect. The host has a separate VMkernel port configured on some vSwitch for NFS / iSCSI datastore connections. They aren't really related, unless they both sit on the same vSwitch (management network and VMkernel port), but that would be a switch configuration issue.
I would log in to the host directly with the VI Client and check that host's logs. If you can't connect to the host, I would suspect the NIC connections. Did you enable jumbo frames?
Yes, jumbo frames are enabled on the pSwitch, the vSwitch, and the vmknic. Changing the MTU size between the pSwitch and each FAS head is what caused the issues, I believe. I'd have to verify.
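One quick way to verify jumbo frames end to end from the host side is vmkping with the don't-fragment flag set. The IP below is a made-up placeholder for one of the NFS interfaces on the FAS:

```shell
# Send an 8972-byte ICMP payload (9000-byte MTU minus 28 bytes of
# IP/ICMP headers) with fragmentation disallowed. If any hop in the
# path is not passing jumbo frames, this fails while a plain vmkping
# to the same address still succeeds.
vmkping -d -s 8972 192.168.1.50   # placeholder: an NFS vmkernel target
```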
The management connection and the vmknic used for storage are not on the same vSwitch, and while the host was unable to connect to vCenter I could still connect to it directly with the VI Client. I've been under the impression that, as you said, vCenter connectivity and unavailable NFS datastores should be unrelated. However, as soon as the NFS connection was restored, the host connected to vCenter with no errors. That seemed out of the ordinary.
Changing the MTU size between the pswitch and each FAS head is what caused issues, I believe. I'd have to verify.
Yeah, that's why I asked; I suspected as much. MTU 9000 really is jumbo frames, and if your switches don't support it end to end, you will have problems. At least you got it resolved.
There are also differing opinions (mine included) on whether a higher MTU really does much for speed. Fewer packets helps, but that's a switch utilization problem, not a NIC problem. Newer switches have higher bandwidth, which solves the problem anyway, so you probably don't even need jumbo frames.
However, as soon as the NFS connection was restored I was able to connect the host to vCenter with no errors. That seemed out of the ordinary.
Timeout settings on ESX are set very high by default, like 1440 seconds (about 24 minutes). Not sure who qualified those defaults as "OK", but they obviously don't do administration on a daily basis. Networks should respond instantly. If it takes longer than 10 seconds for anything to respond (not complete, just respond), that's a problem. You shouldn't have to wait minutes for storage to come back; it's either there or it isn't. 10 seconds should be quite enough, because if you aren't getting a response by then, the VMs relying on those connections are already dead... Houston, we have a problem.
So you can look into changing those settings to get a much quicker response, but ESX was most likely just sitting there waiting for a response (which should not tie up the host).
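As a rough sanity check on how long a host can sit waiting, the NFS heartbeat settings multiply out quickly. This is a back-of-the-envelope sketch, not VMware's documented algorithm: it assumes the datastore is only declared unavailable after HeartbeatMaxFailures missed heartbeats spaced HeartbeatFrequency seconds apart, plus one final HeartbeatTimeout. The values are the ones commonly cited in NetApp's guidance, used here purely for illustration:

```python
def worst_case_wait(frequency_s: int, timeout_s: int, max_failures: int) -> int:
    """Approximate seconds before an unresponsive NFS datastore is
    marked unavailable: heartbeats are sent every frequency_s seconds,
    max_failures of them must be missed, then one last timeout expires."""
    return max_failures * frequency_s + timeout_s

# Illustrative NetApp-style values: frequency 12 s, timeout 5 s,
# 10 allowed failures -> roughly two minutes of waiting.
print(worst_case_wait(frequency_s=12, timeout_s=5, max_failures=10))  # 125
```

The point being: raising MaxFailures for resilience directly stretches out how long the host hangs on a dead or slow mount, which is the trade-off being argued above.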
The thing is, the switches do support jumbo frames. The issue turned out to be in the LACP grouping and MTU settings between the NetApp heads and the Juniper switches (the LACP domain on the far side of the connection). From the ESXi host to the switch, everything was fine. The only symptom on the host side was that the NFS datastores were timing out; ESXi doesn't know anything about the LACP configuration beyond the physical switch.
I'm still not sure why an NFS datastore timeout would prevent a host from successfully connecting to vCenter. Does the vCenter connection process enumerate the datastores on the host and time out if they respond with extremely high latency?