After a recent vCenter migration we are seeing the following events in our cluster Tasks & Events:
vCenter Server is connected to a master HA agent running on host ...
vCenter Server is disconnected from a master HA agent running on host ...
These two entries always appear together at just about the same time every 5 minutes. The server in question for each cluster is the master node.
vCenter Version / Build: 5.0.0 623373
Cluster Version / Build: ESX v4.1.0 (702113) / ESXi v5.0.0 (623860)
The clusters all appear to be functioning normally and the events are just info / warning - however it would be nice to get to the bottom of them. We recently migrated our vCenter instance to a new VM on a different subnet. We did go through and ensure that all of the hosts were properly disconnected / re-connected after the IP change.
I have started looking through the local logs on the master nodes in question and have found one entry that appears to align with the timing of the above events:
vpxa.log:
2012-06-12T10:04:25.374Z [FFCBFAC0 error 'SoapAdapter.HTTPService'] HTTP Transaction failed on stream TCP(local=127.0.0.1:0, peer=127.0.0.1:61618) with error N7Vmacore15SystemExceptionE(Connection reset by peer)
I have read through some KBs regarding issues with DNS servers etc. I have confirmed that all DNS servers are reachable (this looks like the localhost address anyway... not sure what DNS would do with that).
Has anyone seen this? More importantly, has anyone resolved it?
Hello.
Note: Discussion successfully moved from VMware ESXi 5 to Availability: HA & FT
Which datastore did you select for the HA heartbeat? Can you try changing the datastore and observe? Sometimes storage connectivity can also cause such events to be generated.
Check the HA agent logs on the hosts for errors (/var/log/vmware/fdm/fdm*log)
Elisha
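A minimal sketch of that log check, assuming the fdm logs have been copied to (or are readable from) a machine with Python 3. The function name `find_error_lines` is hypothetical, and the regular expression only assumes the bracketed `[<thread id> <level> '<component>']` field seen in the log lines quoted in this thread:

```python
import glob
import re

# Level field as it appears in the quoted log lines,
# e.g. "[58A70B90 error 'SoapAdapter.HTTPService']".
LEVEL_RE = re.compile(r"\[[0-9A-Fa-f]+ (\w+) '")


def find_error_lines(pattern="/var/log/vmware/fdm/fdm*log", level="error"):
    """Yield (path, line) for every log line whose level field equals `level`."""
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as fh:
            for line in fh:
                m = LEVEL_RE.search(line)
                if m and m.group(1) == level:
                    yield path, line.rstrip("\n")
```

Calling `for path, line in find_error_lines(): print(path, line)` would list the error-level entries; pass a different glob pattern if the logs live elsewhere.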
I chose "Select any of the cluster datastores" as the datastore heartbeating option.
It should be noted that I am seeing this environment-wide: 5 production clusters and 2 test clusters (attached to a different vCenter instance).
The log snippet I posted above is actually present in both the fdm.log and the vpxa.log (at least it is the only one that seems easily relatable to the event):
2012-06-11T07:43:28.788Z [58A70B90 error 'SoapAdapter.HTTPService'] HTTP Transaction failed on stream TCP(local=127.0.0.1:0, peer=127.0.0.1:59492) with error N7Vmacore15SystemExceptionE(Connection reset by peer)
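To check whether this entry really recurs on the same 5-minute cadence as the connect/disconnect events, the gaps between occurrences can be computed from the timestamps. This is only a sketch, assuming Python 3 and the timestamp format shown in the lines above; `error_intervals` is a hypothetical helper name:

```python
import re
from datetime import datetime

# Timestamp at the start of each line, e.g. "2012-06-12T10:04:25.374Z".
# Fractional seconds are dropped; whole seconds are enough for this check.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z")


def error_intervals(lines, needle="HTTP Transaction failed"):
    """Return the gaps in seconds between consecutive lines containing `needle`."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        m = TS_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Passing an open log file (for example `error_intervals(open("vpxa.log", errors="replace"))`) returns a list of gaps; values clustering around 300 would suggest a timer of roughly 5 minutes somewhere in the path.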
Change the datastore as a check.
I would also suggest filing a support request if you have support on this environment.
Might also want to consider upgrading to ESXi 5.0 U1
Unless I have the build numbers wrong, I believe both of those are at the 5.0.0 U1 level.
vCenter Version / Build: 5.0.0 623373
Cluster Version / Build: ESX v4.1.0 (702113) / ESXi v5.0.0 (623860)
http://www.vmware.com/support/vsphere5/doc/vsp_esxi50_u1_rel_notes.html
ESXi 5.0 Update 1 | 15 MAR 2012 | Build 623860
http://www.vmware.com/support/vsphere5/doc/vsp_vc50_u1_rel_notes.html
vCenter Server 5.0 Update 1 | 15 March 2012 | Build 623373
(Love the books BTW...)
Hi all,
We had exactly the same issue, caused by a firewall between the vCenter and ESXi host. The default connection keep-alive timeout on our firewall was set to 300 seconds, which is why we saw this error every 5 minutes. It seems that this connection (CIM on port 5989) has to stay open all the time; otherwise you will see the error every 5 minutes. Hopefully this hint will also solve your issue. Sorry for my English, I am not a native speaker...
Regards
sn4psh0t
That is a very valid point. Nice one, and thanks for adding it to this thread. Absolutely no need to apologize... most of us are not native speakers 🙂
By the way, we increased the value mentioned above from 300 seconds to 6000 and the error went away. A value lower than 6000 may also be adequate, but I couldn't find time to experiment with it.
I think it's a firewall issue, so check that first. Additional note: check the switch as well and turn off any anti-DDoS protection.