I am seeing the following happen more and more in our environment:
Log entries appear in this order:
1. VMs show as disconnecting in the logs
2. Host shows as not responding
3. "Cannot synchronize host" - this is a default alarm, yet it doesn't show up in the "triggered alarms"
4. "Host connection failure" triggered by event 887153 - this is a default alarm, yet it doesn't show up in the "triggered alarms"; and can someone please help me with event ID 887153? I can't find info on it anywhere...
5. "Host connection failure" triggers an action
6. "Host connection failure" sends an SNMP trap
Then, 40 minutes later... (sometimes only seconds later, sometimes minutes; in this specific case 40 minutes, but it varies without rhyme or reason)
7. VMs start to connect
8. Established a connection (this obviously refers to the host, but that's all you see in the entry)
and that's it...
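For what it's worth, the gap between disconnect and reconnect can be measured straight from the log timestamps. A minimal sketch, assuming a simplified, hypothetical log format (real vpxd log lines look different, so the parsing would need adjusting):

```python
from datetime import datetime

# Hypothetical, simplified log lines; timestamps are ISO-8601.
LOG = """\
2018-01-10T02:14:03Z info  Host host01 is not responding
2018-01-10T02:54:07Z info  Established a connection with host01
"""

def disconnect_gap(log_text):
    """Return minutes between 'not responding' and 'Established a connection'."""
    down = up = None
    for line in log_text.splitlines():
        stamp = datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%SZ")
        if "not responding" in line and down is None:
            down = stamp
        elif "Established a connection" in line:
            up = stamp
    if down and up:
        return (up - down).total_seconds() / 60
    return None

print(disconnect_gap(LOG))  # the ~40-minute gap described above
```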
So, where is the red-to-green SNMP trap? Or the yellow-to-green trap you should see, as described in the VMware Knowledge Base?
The alarm is supposed to send an SNMP trap by default when its state changes back to green, but nowhere in the logs does it show the alarm turning green. Therefore no trap is being sent to clear the condition in our monitoring system. And since the alarms aren't showing up as triggered, they aren't being cleared automatically either.
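One way to check whether vCenter ever records the alarm as triggered on the entity at all is to query the AlarmManager through the API. A sketch using pyVmomi, where the hostname and credentials are placeholders and the live query is guarded behind a flag so the filtering helper can run on its own (I'm assuming `GetAlarmState` semantics here; verify against your vCenter's API version):

```python
# Sketch: check whether vCenter actually records a triggered alarm.
# The helper below is pure logic; the pyVmomi calls are a hedged sketch
# with placeholder hostname/credentials.

def unresolved_alarms(alarm_states):
    """Return alarm states still red or yellow, i.e. never cleared to green."""
    return [s for s in alarm_states if s.overallStatus in ("red", "yellow")]

if __name__ == "__main__":
    import sys
    if "--live" in sys.argv:
        # Requires: pip install pyvmomi; names below are placeholders.
        from pyVim.connect import SmartConnect, Disconnect
        si = SmartConnect(host="vcenter.example.com",
                          user="user", pwd="password")
        content = si.content
        # GetAlarmState(entity) returns AlarmState objects for an entity;
        # passing the root folder should cover the whole inventory.
        states = content.alarmManager.GetAlarmState(content.rootFolder)
        for s in unresolved_alarms(states):
            print(s.alarm.info.name, s.overallStatus)
        Disconnect(si)
```

If the alarm never appears in this list at all while the SNMP trap still fires, that would match what the logs above suggest: the action runs without the alarm state ever being recorded on the entity.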
I want to know the underlying cause of these alerts so it can be stopped. Also, the references to these alerts being triggered when a host comes out of standby mode are irrelevant; we don't use standby mode, so it has nothing to do with this.
I wonder, though: perhaps this is because our blades are in a balanced power mode and not static high performance? Thoughts?
Did this start with VCSA 6.5? Was it fine with 6.0? See https://www.reddit.com/r/vmware/comments/7ht23s/vsphere_65u1_cluster_hosts_disconnecting_and/
I have a case in with engineering. Any more info you could provide would be helpful.
Hey Mark, thanks for replying. We aren't using the VCSA for our 6.5 installation; it's Windows with SQL, but we do have VCSA 6.0.2 running.
First, it appears there are endless posts from people experiencing this, so it is not isolated. Based on that alone I doubt very much this is a DNS or firewall issue. If a firewall were preventing a connection to the host, it wouldn't be connected at all, or other major issues would be occurring; just look at the ports in use between vCenter and the host and you will see what I mean.
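To illustrate the point about ports: a quick sketch that checks TCP reachability of the main vCenter-to-host ports, 443 and 902 (the hostname below is a placeholder; if a firewall were blocking these, the host could never stay connected in the first place):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and DNS failures
        return False

# Placeholder hostname; substitute a real ESXi host to test.
for port in (443, 902):
    state = "open" if port_open("esxi01.example.com", port) else "closed/filtered"
    print(port, state)
```

If both ports check out from the vCenter side yet the host still flaps, that argues against the firewall theory.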
I see some differences between our issues. For instance, your backups are failing as a result of this connection loss; I haven't seen anything fail from this yet. For us it is more about the false alerts. HA triggers false alerts all the time, to the point where you ignore them regardless. Very sad, right? So I decided not to leverage the majority of alerts and just stick to the core basics: network, storage, and host connection to vCenter, figuring this should cover what we need should something truly happen. But with these host connection alerts happening every night now, this is something I will also have to disable. We do have a few others in place, like snapshot alerts, disk usage, etc., but my experience working in the trenches has taught me that the built-in VMware alerting just sucks.
This issue wasn't really a problem in our earlier versions of vCenter, but I do see posts from people where it was, and there are VMware KB articles about it occurring.
I don't mind false alerts if they can be mitigated. For example, every time our backup system mounts a LUN to the host for a hot-add backup operation and then pulls that LUN gracefully afterward, we get "Cannot connect to storage" alerts triggered. Big bug if you ask me. Bogus. Monitoring the storage connection is critical, and yet we get these false alerts; I call them false because the LUNs are being removed gracefully. Unmount and delete a LUN and you will see the alert appear. So we got around the alerts by filtering on the LUN IDs of our backup system, so they don't send off pages or emails to our support staff.
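The filtering idea above can be sketched in a few lines. The naa IDs and alert text here are hypothetical placeholders for whatever your backup system and monitoring pipeline actually produce:

```python
# Suppress pages for storage alerts that mention the backup system's LUNs.
# These IDs are made-up placeholders.
BACKUP_LUN_IDS = {"naa.600a098038303053", "naa.600a098038303054"}

def should_page(alert_text):
    """Page only when the alert doesn't reference a known backup LUN."""
    return not any(lun in alert_text for lun in BACKUP_LUN_IDS)

alerts = [
    "Cannot connect to storage naa.600a098038303053 on host01",  # backup LUN
    "Cannot connect to storage naa.600a098038309999 on host02",  # real issue
]
for a in alerts:
    print(("PAGE" if should_page(a) else "suppress"), "-", a)
```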
I'm curious, how large is your environment? I wonder if this is a scalability issue, but with our vCenter running 4 vCPUs at a 3.5% average ready time and 24 GB of memory, I find scalability hard to believe as the cause.
There is a KB about hosts waking up from standby mode, but we don't use it. I've never heard of anyone who does, but I guess there are some folks out there who don't have enough money to pay the electric bill, hahaha. I only work for organizations with deep pockets after bad experiences with a small org; enterprise organizations don't play with power-saving functions.
Which does bring up a possibility: I wonder whether having the blade or server set to a power-savings mode could influence this. I can't say. We are using the blade power-saving mode now; I have informed the powers that be that this is not a best practice for VMware and that we are changing it, but in a large environment this is time-consuming work. As we upgrade the hosts to 6.5 I am having the team get this done.
I will post back when I discover anything new that may be of help with getting to the bottom of this.
And here we go: a double whammy this time, two alarms going off, but the host in question is fine. It is not disconnected, it is responding just fine, performance is great if anything, and there are no issues accessing it or navigating around it.
There is nothing showing in the logs more recent than what you see above, and nothing that would indicate a triggered alarm has been cleared.
There is no triggered event showing under triggered events. Was there ever? If yes, how was it cleared?
This is only one example of what I see every day, and as a result I have had to disable all notifications for this alarm until I can understand the following:
1. What is really happening to trigger this, i.e. the root cause?
2. What is going on with the alarms? They aren't triggering, and in fact the log above shows an SNMP trap being sent when the alarm is configured for email notification...