I have a Linux guest VM that was reset early one morning because of the above HA message, after coming out of a VDP backup snapshot. I'm wondering whether it should have been reset, and what, if anything, I should do to make sure my HA settings are appropriate for the cluster or VM, or whether I need to make other changes so the issue doesn't recur at an inconvenient time.
Below are the events as listed in vSphere for the VM in question. It is very large, and there could have been a lot of guest disk I/O during the snapshot that then had to be consolidated.
The guest doesn't appear to be showing any ill effects; I'm just not sure why it was reset.
Time | Event Description | Type | My Notes |
---|---|---|---|
01:23:14 am | Task Create virtual machine snapshot | info | Within VDP backup window |
05:31:49 am | Task Remove snapshot | info | |
06:22:37 am | vSphere HA cannot reset this virtual machine | warning | |
06:22:38 am | Alarm vSphere HA virtual machine monitoring error changed from Gray to Red | info | |
06:22:38 am | Alarm vSphere HA virtual machine monitoring error on GUESTVM triggered an action | info | |
06:22:38 am | Alarm vSphere HA virtual machine monitoring error an SNMP trap was sent | info | Guest VM is in the DMZ and on a different VLAN from vCenter and the hosts |
06:23:16 am | vSphere HA cannot reset this virtual machine | warning | |
06:23:52 am | Virtual machine disks consolidation succeeded | info | |
06:23:59 am | Message from ESXiHost: Install the VMware Tools package inside this virtual machine | info | VMware Tools was already installed and matches the host |
06:23:59 am | This virtual machine reset by vSphere HA: VMware Tools heartbeat failure: A screen shot is saved in /datastore/vm/vm-1.png | info | Guest OS was still running in the image, no screen of death |
06:23:59 am | Alarm vSphere HA virtual machine monitoring action changed from Green to Yellow | info | |
Given HA and the VM are set up as follows
There were some disk latency warnings before the backup snapshots were created, but no loss of paths to either the guest VM datastore or the backup destination was reported.
Am I right in thinking that HA shouldn't have been triggered by the disk I/O from the snapshot consolidation, even if it was taking a long time?
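To put numbers on the timeline, here is a minimal Python sketch that works out the gaps between the events listed in the table. The dictionary keys are labels made up for this illustration, and it assumes all events fall on the same morning and that the "Remove snapshot" task marks the start of consolidation.

```python
from datetime import datetime

# Timestamps copied from the event table; the key names are illustrative labels.
EVENTS = {
    "snapshot_created": "01:23:14",
    "snapshot_remove_started": "05:31:49",
    "first_ha_warning": "06:22:37",
    "disks_consolidated": "06:23:52",
    "vm_reset_by_ha": "06:23:59",
}


def seconds_between(start: str, end: str) -> int:
    """Return the number of seconds from start to end (same day, HH:MM:SS)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# How long the snapshot stayed open before removal began.
snapshot_open_s = seconds_between(
    EVENTS["snapshot_created"], EVENTS["snapshot_remove_started"]
)
# How long the removal/consolidation ran.
consolidation_s = seconds_between(
    EVENTS["snapshot_remove_started"], EVENTS["disks_consolidated"]
)
# Gap between the first HA warning and the actual reset.
warning_to_reset_s = seconds_between(
    EVENTS["first_ha_warning"], EVENTS["vm_reset_by_ha"]
)
# Gap between consolidation finishing and the reset.
consolidated_to_reset_s = seconds_between(
    EVENTS["disks_consolidated"], EVENTS["vm_reset_by_ha"]
)

print(snapshot_open_s, consolidation_s, warning_to_reset_s, consolidated_to_reset_s)
```

On these figures the snapshot was open for a little over four hours, the removal ran for about 52 minutes, and the reset landed 7 seconds after consolidation finished.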
This is not due to "regular HA" but to "VM Monitoring", which is part of the HA cluster configuration. Normally it should not restart the VM unless:
1) there is no VMware Tools heartbeat
and
2) there is no storage or network I/O.
It is almost as if the VM froze completely for a substantial amount of time, which can of course happen during heavy tasks.
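The two conditions above can be sketched as a simple rule. This is only a simplified model of the logic described in this reply, not VMware's actual implementation; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VmObservation:
    """What the monitor is assumed to know about a VM (hypothetical fields)."""
    seconds_since_last_heartbeat: float  # time since the last VMware Tools heartbeat
    io_seen_in_stats_interval: bool      # any storage or network I/O attributed to the VM


def should_reset(obs: VmObservation, failure_interval_s: float) -> bool:
    """Reset only when BOTH conditions hold: no heartbeat AND no I/O."""
    heartbeat_lost = obs.seconds_since_last_heartbeat >= failure_interval_s
    return heartbeat_lost and not obs.io_seen_in_stats_interval


# Heartbeats missing, but the VM is still doing I/O: no reset.
busy_vm = VmObservation(seconds_since_last_heartbeat=90, io_seen_in_stats_interval=True)
# Heartbeats missing and no I/O seen: reset.
frozen_vm = VmObservation(seconds_since_last_heartbeat=90, io_seen_in_stats_interval=False)

print(should_reset(busy_vm, failure_interval_s=30))
print(should_reset(frozen_vm, failure_interval_s=30))
```

The point of the second condition is that a missing heartbeat alone is not treated as a failure while the VM still shows activity.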
Would that freeze be a result of the snapshot consolidation? I wouldn't say the guest itself was heavily loaded with its own work at that time of the morning.
and
Does snapshot consolidation not count as disk I/O for the purposes of VM Monitoring? Or is that because it is hypervisor-level disk I/O rather than disk I/O the guest OS is aware of?
I wouldn't say there was much guest disk I/O between the time the snapshot removal task started and the time it completed, but I would have expected some log data to be written during that period.
There were some firewall block notifications for external IP addresses before and during the consolidation step, as well as localhost (127.0.0.1) messages from a local service-monitoring process, which were logged before the guest rebooted. The last message is about 2 minutes before the reboot, so the guest itself didn't look frozen.
Would 127.0.0.1 traffic count as network I/O? Or, since it is loopback traffic and not routed, does it not count for the purposes of VM Monitoring?
Snapshot consolidation isn't counted as that kind of disk I/O, as it isn't billed to the actual VM. The VM could indeed have been frozen by the snapshot consolidation. Did you look at the screenshot that was captured along with the reset? It should be in the VM directory, and it should at least show that the guest hadn't bluescreened.
Ya the screenshot looked fine, no bluescreen or warnings
That shouldn't happen... You could set it to "low sensitivity"; that way the window in which it checks for heartbeats is larger. You could also increase the I/O monitoring time frame by using das.iostatsInterval. But I would try changing the sensitivity first (low = 120-second window, high = 30-second window).
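As a rough illustration of why the sensitivity setting matters, the sketch below checks which presets would treat a given heartbeat gap as a failure. It uses only the two window values quoted above; the function name and the sample gap lengths are made up for the example, and it ignores the separate I/O check.

```python
# Heartbeat windows in seconds, as quoted in this reply.
FAILURE_INTERVAL_S = {"high": 30, "low": 120}


def presets_that_would_trip(heartbeat_gap_s: float) -> list[str]:
    """Return the sensitivity presets whose window is exceeded by the gap."""
    return sorted(
        name
        for name, window_s in FAILURE_INTERVAL_S.items()
        if heartbeat_gap_s >= window_s
    )


# Hypothetical heartbeat gaps, in seconds.
for gap in (10, 45, 150):
    print(gap, presets_that_would_trip(gap))
```

A stall of 45 seconds, for example, would exceed the high-sensitivity window but not the low-sensitivity one.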
Are there other ways to troubleshoot a VM Monitoring reset to see if it was due to oversensitivity?
Not really. You can check the log files from HA's perspective and then the log files within the VM, as the guest OS log files should also give you an idea of what was going on at the time. That is about all I know, but maybe our support team can help if you have all the relevant log files.