VMware Cloud Community
HolySirSalad
Contributor

Multiple VM HA resets near simultaneously

Hi there,

I've been doing some digging around but as I am still pretty green to VMware and the inner workings I'm not sure where to focus my attention.

Our environment consists of two clusters w/ HA & DRS in geographically separate sites under one vCenter. We run a mixture of Linux and Windows VMs, everything with the latest VMware Tools. We have HA monitoring for VMs enabled.

Last night a ton of VMs were reset due to "VMware Tools heartbeat failure". As I understand it, this is normal behavior if a VM actually crashes: the heartbeats stop, the BSOD or kernel panic is captured in a screenshot, and the VM is reset. However, in this event I very much doubt that is what occurred...

2:15 am - site 2/host 1 - linux vm 1 (light database)

2:16 am - site 1/host 3 - linux vm 2 (Debian mirror)

2:17 am - site 2/host 5 - linux vm 3 (web server)

2:17 am - site 2/host 5 - linux vm 4 (vSphere Management Assistant)

2:19 am - site 1/host 1 - linux vm 5 (running Debian template)

2:30 am - site 2/host 1 - linux vm 1 (light database)

2:31 am - site 1/host 3 - windows vm (SAN manager)

2:35 am - site 1/host 1 - linux vm 5 (running Debian template)

This is out of 41 VMs, 12 of which are Windows from 2000 to 2008R2, the rest Linux. So not all VMs, not all hosts, but both sites, and seemingly grouped at similar times. I'd be inclined to look for a network event, but HA monitoring between the VMs and their hosts shouldn't rely on the network at all. These VMs don't even have the same VLANs mapped.
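
To check the "grouped at similar times" impression, the events listed above can be bucketed into bursts. This is a minimal Python sketch, not part of any VMware tooling. The function name `group_bursts` and the 5-minute gap threshold are assumptions chosen purely for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Reset events transcribed from the list above: (time, site, host, vm).
EVENTS = [
    ("02:15", 2, 1, "linux vm 1"),
    ("02:16", 1, 3, "linux vm 2"),
    ("02:17", 2, 5, "linux vm 3"),
    ("02:17", 2, 5, "linux vm 4"),
    ("02:19", 1, 1, "linux vm 5"),
    ("02:30", 2, 1, "linux vm 1"),
    ("02:31", 1, 3, "windows vm"),
    ("02:35", 1, 1, "linux vm 5"),
]


def group_bursts(events, max_gap_minutes=5):
    """Split time-ordered events into bursts separated by more than max_gap_minutes."""
    bursts = []
    previous = None
    for event in events:
        stamp = datetime.strptime(event[0], "%H:%M")
        # Start a new burst on the first event or after a long quiet gap.
        if previous is None or stamp - previous > timedelta(minutes=max_gap_minutes):
            bursts.append([])
        bursts[-1].append(event)
        previous = stamp
    return bursts


bursts = group_bursts(EVENTS)
# VMs that were reset more than once during the night.
repeat_vms = sorted(vm for vm, n in Counter(e[3] for e in EVENTS).items() if n > 1)
# Distinct (site, host) pairs that saw at least one reset.
hosts_hit = sorted({(site, host) for _, site, host, _ in EVENTS})

for burst in bursts:
    print(burst[0][0], "to", burst[-1][0], "-", len(burst), "resets")
print("Reset more than once:", repeat_vms)
print("Hosts involved:", hosts_hit)
```

With that threshold the resets fall into two bursts roughly a quarter of an hour apart, and two of the VMs appear in both, which would be worth comparing against any scheduled jobs.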

Any advice on where I can start researching this problem would be very much appreciated!

Thanks

Ross

4 Replies
peterdabr
Hot Shot

This is really weird...

The heartbeat between a VM with VMware Tools installed and its host does not rely on networking, so a network outage shouldn't be the culprit.

I would check /var/log/vmware/aam/ on any of the hosts involved in the incident for any clues....
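
If the log directory is copied off the host to a workstation, a small script can pull out the lines around the reset times. This is only a sketch. The helper name `find_lines` is hypothetical, and since the timestamp format used in those logs is not shown here, the search strings are whatever substrings you choose to look for.

```python
from pathlib import Path


def find_lines(log_dir, needles):
    """Return (file name, line) pairs for every line containing any needle."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as handle:
            for line in handle:
                if any(needle in line for needle in needles):
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

For example, `find_lines("aam-copy", ["02:15", "02:16"])` would scan a local copy named `aam-copy` (an assumed directory name) for those two substrings. Adjust the strings to match how the logs actually record time.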

Also, as a precautionary measure, I would update all hosts with the latest patches and make sure all hosts within each cluster are at the same patch level. Almost every major update release has some minor/major updates to HA.

Let us know if you discover anything else in the logs.

Peter D.

AndreTheGiant
Immortal

Very strange issue, and on different VM types.

Do you have any scheduled operations at those times?

Is there a VUM baseline on the guest OS?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
HolySirSalad
Contributor

Hi guys,

Thanks for the suggestions. Our VUM is set up only to update the hosts, and everything is running ESXi 4.1 and had been for at least a week or so prior to the incident. I don't see any updates available for 4.1 within the VUM baselines, but I'm not sure if they would automatically show up (I do see updates for the Nexus 1000V for 4.1, however).

I dumped a diagnostic package from one of the hosts to take a peek, but I don't see anything from before the host was last rebooted. We discovered a memory issue that required a few firmware updates on our IBM blades (may have been affected by http://www-947.ibm.com/support/entry/portal/docdisplay?brand=5000019&lndocid=MIGR-5075489), which might have been causing some issues, as a couple of backups run in the middle of the night.

I wouldn't imagine the logs get dumped when a host is rebooted unless there was no local disk, but do they vanish when the host is placed into maintenance mode? (I'm guessing so, as the HA agent is unconfigured.)

Thanks for your help!

Ross

depping
Leadership

Couple of basic things here:

1) VM Monitoring doesn't rely on the network. The heartbeat is captured by hostd.

2) If and when VMware Tools fails, VM Monitoring will check if there was any I/O on network or storage over the last 30 seconds. If that is the case the VMs will not be rebooted.

3) Look at the Events tab of the VM to see what triggered the restart. If it was HA, there should be an event stating that it was restarted by HA.
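
The decision described in points 1 and 2 can be sketched roughly as follows. This is a simplified model for reasoning about the behavior, not VMware's implementation. The names `VmState` and `should_reset` are hypothetical, and the heartbeat `failure_interval` default is an assumption, since the real thresholds depend on the cluster's VM Monitoring settings. Only the 30-second I/O window comes from point 2.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    seconds_since_heartbeat: float  # time since the last VMware Tools heartbeat
    seconds_since_io: float         # time since the last network or storage I/O


def should_reset(vm, failure_interval=30.0, io_window=30.0):
    """Reset only if heartbeats are lost AND no recent I/O was observed."""
    heartbeat_lost = vm.seconds_since_heartbeat > failure_interval
    io_seen = vm.seconds_since_io <= io_window
    return heartbeat_lost and not io_seen


# Heartbeat lost but the guest is still doing I/O: left alone.
print(should_reset(VmState(seconds_since_heartbeat=45, seconds_since_io=5)))
# Heartbeat lost and no I/O in the window: reset.
print(should_reset(VmState(seconds_since_heartbeat=45, seconds_since_io=120)))
```

Under this model, a VM that was reset must have shown neither heartbeats nor I/O for the whole window, which suggests looking for something that stalled the guests rather than a VMware Tools problem alone.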



Duncan

VMware Communities User Moderator | VCDX

-


Now available: Paper - vSphere 4.0 Quick Start Guide (via amazon.com): http://www.amazon.com/gp/product/1439263450?ie=UTF8&tag=yellowbricks-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1439263450 | PDF (via lulu.com): http://www.lulu.com/product/download/vsphere-40-quick-start-guide/6169778

Blogging: http://www.yellow-bricks.com | Twitter: http://www.twitter.com/DuncanYB
