VMware Cloud Community
Gaurav_Awasthi
Enthusiast

VMs restarted by HA

Hello,

We have a cluster of four ESXi 4.1 (U2) hosts. We witnessed an incident where all VMs on two of the hosts were powered off and then restarted by HA at almost the same time.

However, both ESXi hosts were up during this time.

We did see some network-related issues in the logs, such as:

Apr 22 12:52:11 vmkernel: 2:13:39:28.611 cpu18:864567)NetPort: 982: enabled port 0x4000008 with mac 00:50:56:aa:59:37
Apr 22 12:52:11 vobd: Apr 22 12:52:11.525: 221968611538us: [vob.net.dvport.uplink.transition.up] Uplink: vmnic0 is up. Affected dvPort: 1266/d6 e9 2a 50 f9 a7 bc 81-7f 40 06 49 a6 6e 49 27. 1 uplinks up.
Apr 22 12:52:11 vobd: Apr 22 12:52:11.525: 221968611549us: [vob.net.dvport.uplink.transition.up] Uplink: vmnic1 is up. Affected dvPort: 1266/d6 e9 2a 50 f9 a7 bc 81-7f 40 06 49 a6 6e 49 27. 2 uplinks up.
Apr 22 12:52:11 vmkernel: 2:13:39:28.624 cpu20:864568)NetPort: 2256: resuming traffic on DV port 1261
Apr 22 12:52:11 vmkernel: 2:13:39:28.624 cpu20:864568)NetPort: 982: enabled port 0x5000007 with mac 00:50:56:aa:59:38

However, since the host isolation response is set to "Leave VM powered on", can anyone help with what could have caused the VMs to restart?
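To correlate the restarts with the network blips, it may help to pull every uplink link-state transition out of the host logs around that window. A minimal sketch below; the file is a stand-in for wherever vobd messages land on your build, and the "down" line is an illustrative guess at the counterpart of the "up" messages shown above:

```shell
# Stand-in sample for the real host log (e.g. the file vobd writes to).
cat > /tmp/vobd-sample.log <<'EOF'
Apr 22 12:51:58 vobd: [vob.net.dvport.uplink.transition.down] Uplink: vmnic0 is down.
Apr 22 12:52:11 vobd: [vob.net.dvport.uplink.transition.up] Uplink: vmnic0 is up.
Apr 22 12:52:11 vobd: [vob.net.dvport.uplink.transition.up] Uplink: vmnic1 is up.
EOF

# Pull out all uplink up/down transitions, in order, to see how long the links were down.
grep -E 'uplink\.transition\.(up|down)' /tmp/vobd-sample.log
```

If the "down" and the matching "up" bracket the time HA kicked in, that points at a network outage rather than a host crash.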

Thanks

Gaurav_Awasthi
Enthusiast

We are using NAS-based storage for the VMs and are getting the following errors on the vmnics used for NAS storage:

Apr 19 23:27:53 vmkernel: 0:00:15:07.818 cpu42:4138)MCE: 1367: Status bits: "Corrected DRAM ECC Error on cpu 42 physical address 0x402011d030 "
Apr 19 23:27:58 vmkernel: 0:00:15:13.308 cpu6:4102)NetDiscover: 732: Too many vlans for srcPort 0x5000003; won't track vlan 2676
Apr 19 23:28:24 vmkernel: 0:00:15:38.528 cpu37:4133)CDPThrottled: CdpReviewFrames: CDP packet overrun for uplink vmnic16.
I am also attaching the vmkernel log file, which could provide more info.
depping
Leadership

It could be various things, but the most obvious:

The hosts were isolated on both the storage network and the management network. Because the hosts were fully isolated, to the rest of the cluster they appeared dead. The reasons are:

1) no heartbeat coming in

2) not responding to a ping to the management interface

3) when trying to restart the VMs, the VM files are not locked

So the remaining hosts in the cluster will be able to restart the VMs...
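Point 3 can be checked directly: on NFS datastores, ESXi tracks ownership with `.lck-*` files alongside the VM's files, so if none are present (or the holder is gone), another host can take the lock and power the VM on. A rough sketch of the check; the directory and lock-file name below are made up for illustration, and on a real host you would look under the VM's folder on the NFS datastore:

```shell
# Simulate a VM folder on an NFS datastore (placeholder for
# /vmfs/volumes/<nfs-datastore>/<vm-name>/ on a real host).
VMDIR=$(mktemp -d)
touch "$VMDIR/vm1.vmx" "$VMDIR/vm1-flat.vmdk" "$VMDIR/.lck-0123abcd"  # hypothetical lock file

# Lock files present -> some host still holds the lock;
# empty output -> the files are free for another host to lock and restart.
ls -a "$VMDIR" | grep '^\.lck-'
```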