VMware Cloud Community
mobcdi
Enthusiast

vmware tools heartbeat failure: guest reset but should it have?

I have a Linux guest VM that was reset early one morning because of the HA message above, just after coming out of a VDP backup snapshot. I am wondering whether it should have been reset and what, if anything, I should do: either make sure my HA settings are appropriate for the cluster and VM, or make other changes so the issue doesn't happen when least appropriate.

Below are the events as listed in vSphere for the VM in question. The VM is very large and there could have been a lot of guest disk I/O to consolidate during the snapshot.

The guest doesn't appear to be showing any ill effects; I'm just not sure why it was reset.

Time | Event Description | Type | My Notes
-----|-------------------|------|---------
01:23:14 am | Task: Create virtual machine snapshot | info | Within VDP backup window
05:31:49 am | Task: Remove snapshot | info |
06:22:37 am | vSphere HA cannot reset this virtual machine | warning |
06:22:38 am | Alarm: vSphere HA virtual machine monitoring error changed from Gray to Red | info |
06:22:38 am | Alarm: vSphere HA virtual machine monitoring error on GUESTVM triggered an action | info |
06:22:38 am | Alarm: vSphere HA virtual machine monitoring error, an SNMP trap was sent | info | Guest VM in DMZ and on a different VLAN to vCenter or hosts
06:23:16 am | vSphere HA cannot reset this virtual machine | warning |
06:23:52 am | Virtual machine disks consolidation succeeded | info |
06:23:59 am | Message from ESXiHost: Install the VMware Tools package inside this virtual machine | info | VMware Tools was already installed and matches host
06:23:59 am | This virtual machine reset by vSphere HA: VMware Tools heartbeat failure: A screenshot is saved in /datastore/vm/vm-1.png | info | Guest OS was still running in the image, no screen of death
06:23:59 am | Alarm: vSphere HA virtual machine monitoring action changed from Green to Yellow | info |

HA and the VM are set up as follows:

  • HA Cluster Settings:
    • cluster default vm restart priority: Medium
    • Guest restart priority: High
    • Datastore heartbeat: 2 datastores (1 hosting the guest vm the other hosting the vdp appliance)
  • VM Settings
    • Linux Guest
    • vmxnet 3 vNic connected to DMZ vlan
    • vm version 7

There were some disk latency warnings before the backup snapshots were created, but no loss of paths to either the guest VM datastore or the backup destination was reported.

Am I right in thinking that HA shouldn't have been triggered by the disk I/O from the snapshot consolidation, even if it was taking a long time?

8 Replies
depping
Leadership

This is not due to "regular HA" but due to "VM Monitoring" which is a part of the HA cluster config. Normally it should not restart the VM unless:

1) there is no vmware tools heartbeat

and

2) there is no Storage or Network IO.

It is almost like the VM froze completely for a substantial amount of time... which can happen of course with heavy tasks.
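The two conditions above can be sketched in a few lines of Python. This is only an illustration of the "no heartbeat AND no I/O" rule as described in this reply, not VMware's actual implementation. The function name, its parameters, and the 120-second default for the I/O window are assumptions made for the example; the 30-second heartbeat window corresponds to the "high" sensitivity setting.

```python
def should_reset(seconds_since_heartbeat: float,
                 seconds_since_guest_io: float,
                 failure_interval: float = 30.0,
                 io_stats_interval: float = 120.0) -> bool:
    """Hypothetical model of the VM Monitoring reset decision.

    failure_interval: heartbeat window in seconds (30 = high sensitivity).
    io_stats_interval: I/O look-back window in seconds (120 is an assumed default).
    """
    # Condition 1: no VMware Tools heartbeat within the failure interval.
    heartbeat_lost = seconds_since_heartbeat > failure_interval
    # Condition 2: no storage or network I/O attributed to the VM in the I/O window.
    io_idle = seconds_since_guest_io > io_stats_interval
    # A reset is considered only when BOTH conditions hold.
    return heartbeat_lost and io_idle


print(should_reset(45, 300))  # heartbeat lost and no recent I/O
print(should_reset(45, 10))   # heartbeat lost, but the VM did I/O recently
print(should_reset(5, 300))   # heartbeat still arriving
```

In this model, recent guest I/O alone is enough to prevent a reset even when heartbeats have stopped.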

KraL
Enthusiast

Hello,

I am currently on leave until 10 May inclusive.

Regards.

mobcdi
Enthusiast

Would that freeze be a result of the snapshot consolidation? I wouldn't say the guest itself was heavily loaded with its own work at that time of the morning.

and

Does snapshot consolidation not count as disk I/O for the purposes of VM monitoring? Or is that because it's hypervisor-aware disk I/O rather than guest-OS-aware disk I/O?

I wouldn't say there was much guest disk I/O between the start of the snapshot removal task and its completion, but I would have expected some log data to be written during that period.

There were some firewall block notifications for external IP addresses before and during the consolidation step, as well as localhost 127.0.0.1 messages from a local service monitoring process, which were logged before the guest rebooted. The last message is about 2 minutes before the reboot, so the guest itself didn't look frozen.

Would 127.0.0.1 traffic count as network I/O? Or, since it's loopback traffic and not routed, does it not count for the purposes of VM monitoring?

depping
Leadership

Snapshot consolidation isn't counted as that kind of disk I/O, as it isn't billed to the actual VM. The VM could indeed have been frozen due to the snapshot consolidation. Did you look at the screenshot that was captured along with the reset? It should be in the VM directory, and it should at least show that the VM hadn't bluescreened.

mobcdi
Enthusiast

Ya the screenshot looked fine, no bluescreen or warnings

depping
Leadership

That shouldn't happen... You could set it to "low sensitivity"; that way the window in which it checks whether heartbeats have been received is larger. You could also increase the I/O monitoring time frame by using das.iostatsInterval. But I would try changing the sensitivity first (low = 120-second window, high = 30-second window).
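To make the effect of the sensitivity setting concrete, here is a rough Python sketch that checks which levels would ride out a heartbeat gap of a given length. The low (120 s) and high (30 s) windows are the ones quoted above; the 60-second value for "medium" is an assumption on my part, and the dictionary and function names are made up for the example.

```python
# Heartbeat windows in seconds per sensitivity level.
# "low" and "high" come from the reply above; "medium" = 60 is an assumption.
FAILURE_INTERVAL_SECONDS = {"high": 30, "medium": 60, "low": 120}


def levels_tolerating(heartbeat_gap_seconds: float) -> list[str]:
    """Return the sensitivity levels whose window covers the given heartbeat gap."""
    return [
        level
        for level, window in FAILURE_INTERVAL_SECONDS.items()
        if heartbeat_gap_seconds <= window
    ]


print(levels_tolerating(90))   # ['low']
print(levels_tolerating(45))   # ['medium', 'low']
print(levels_tolerating(150))  # []
```

So a guest that stalls for around 90 seconds during consolidation would, under these assumed windows, only be safe from a heartbeat-based reset at low sensitivity.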

mobcdi
Enthusiast

Are there other ways to troubleshoot a VM Monitoring reset to see if it was due to oversensitivity?

depping
Leadership

Not really. You can check the log files from HA's perspective and then the log files within the VM, as the guest OS log files should also give you an idea of what was going on at the time. That is about all I know, but maybe our support team can help if you have all the relevant log files.
