VMware Cloud Community
salasos3
Enthusiast

Unresponsive ESXi & VMs

Hi all,

Last week an ESXi host disconnected from vCenter and became completely unresponsive, as did the VMs running on it (disconnected state, not pingable). HA did not restart the VMs on another host and no alert was triggered, so we did not receive any incident for about 3.5 hours. After 3.5 hours we received an incident for the VM restarts (by HA), and that is when we became aware of the issue.

Looking at the ESXi logs, I see that the host reconnected and became responsive again after 3.5 hours. HA responded as well, so the VMs were restarted on another host and an alarm was triggered for the VM restart.

Have you seen something like this?

I'm wondering what could have caused the ESXi host and its VMs to become unresponsive to the point that HA neither responded nor became aware that the VMs needed to be restarted on another ESXi host, and no alarms were triggered.

-Alarm action is enabled.

-VM down/disconnected alarms are active.

-ESXi down/disconnected alarm is enabled.

-I don't have vmkernel.log, hostd.log, etc., as they were rotated that same night.

-This is what I have so far:

***************************First Lost access to volume*************************

Date Time: 03/01/2021, 1:49:05 PM
Type: Information
Target: ************
Description: Ask VMware...
03/01/2021, 1:49:05 PM Lost access to volume 5****08a-a34*****-*ad7-a03******e10 (DatastoreName) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
Related events:
There are no related events.


Description:
03/01/2021, 1:53:04 PM Bursting event esx.problem.vmfs.heartbeat.recovered occurred 5 times since Monday, March 1, 2021 7:51:37 PM UTC

Description:
03/01/2021, 1:53:04 PM Event burst of esx.problem.vmfs.heartbeat.recovered ended



**********Then we see many more "Lost access to volume" events, more than 100 of them***************************

Description: Ask VMware...
03/01/2021, 1:53:05 PM Lost access to volume 5a**9530-c9****04-0**8-a036******10 (DatastoreName) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Description: Ask VMware...
03/01/2021, 1:53:05 PM Successfully restored access to volume 5a**9530-c9****04-0**8-a036******10 (DatastoreSName) following connectivity issues.


Date Time: 03/01/2021, 1:58:04 PM
Type: Information
Target: ************
Description:
03/01/2021, 1:58:04 PM Event burst of esx.problem.vmfs.heartbeat.recovered ended


*****************The host becomes unresponsive********************************


Date Time: 03/01/2021, 2:30:58 PM
Type: Error
Target: ************
Description:
03/01/2021, 2:30:58 PM Host ************ in DatacenterName is not responding
Event Type Description:
Connection to the host has been lost
Possible Causes:

The host is not in a state where it can respond


Date Time: 03/01/2021, 2:30:58 PM
Type: Information
Target: ************
Description:
03/01/2021, 2:30:58 PM Alarm Host connection and power state on ************ changed from Green to Red


*******************vSphere HA was unresponsive*****************

Date Time: 03/01/2021, 3:39:19 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:39:19 PM The vSphere HA availability state of the host ************ in cluster in ClusterName in DatacenterName has changed to Unreachable!
Event Type Description:
This event is logged when the availability state of a host has changed.


Date Time: 03/01/2021, 3:39:28 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:39:28 PM The vSphere HA availability state of the host ************ in cluster in ClusterName in DatacenterName has changed to Host Failed
Event Type Description:
This event is logged when the availability state of a host has changed.


Date Time: 03/01/2021, 3:39:28 PM
Type: Error
Target: ************
Description:
03/01/2021, 3:39:28 PM vSphere HA detected a possible host failure of host ************ in cluster ClusterName in datacenter DatacenterName
Event Type Description:
This event is logged when vSphere HA detects a possible host failure.


Date Time: 03/01/2021, 3:39:28 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:39:28 PM Alarm vSphere HA host status on ************ changed from Gray to Red


************HA becomes responsive again, around the time when the host and VMs became responsive******************

Date Time: 03/01/2021, 3:45:52 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:45:52 PM vSphere HA agent on host ************ in cluster ClusterName in datacenter DatacenterName is healthy


Date Time: 03/01/2021, 3:45:53 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:45:53 PM Alarm vSphere HA host status on ************ changed from Red to Green


*************The host recovered connectivity to vCenter.********************************

Date Time: 03/01/2021, 3:45:58 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:45:58 PM Connected to ************ in DatacenterName

Date Time: 03/01/2021, 3:45:58 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:45:58 PM Alarm Host connection and power state on ************ changed from Red to Green


************Networking alarms referencing the external/physical switch*************************

Date Time: 03/01/2021, 3:46:39 PM
Type: Error
Target: ************
Description:
03/01/2021, 3:46:39 PM Not all VLAN MTU settings on the external physical switch allow the vSphere Distributed Switch maximum MTU size packets to pass on the uplink port 408 in vSphere Distributed Switch on host ************ in DatacenterName.


Date Time: 03/01/2021, 3:46:39 PM
Type: Error
Target: ************
Description:
03/01/2021, 3:46:39 PM Not all the configured VLANs in the vSphere Distributed Switch were trunked by the physical switch connected to uplink port 408 in vSphere Distributed Switch on host ************ in DatacenterName.

Date Time: 03/01/2021, 3:46:39 PM
Type: Error
Target: ************
Description:
03/01/2021, 3:46:39 PM Not all VLAN MTU settings on the external physical switch allow the vSphere Distributed Switch maximum MTU size packets to pass on the uplink port 409 in vSphere Distributed Switch on ************ in DatacenterName.

Date Time: 03/01/2021, 3:46:39 PM
Type: Error
Target: ************
Description:
03/01/2021, 3:46:39 PM Not all the configured VLANs in the vSphere Distributed Switch were trunked by the physical switch connected to uplink port 409 in vSphere Distributed Switch on host ************ in DatacenterName.


******************VMnics flapping and recovering connectivity****************************

Date Time: 03/01/2021, 3:46:45 PM
Type: Warning
Target: ************
Description: Ask VMware...
03/01/2021, 3:46:45 PM Physical NIC vmnic0 linkstate is down.

Date Time: 03/01/2021, 3:46:45 PM
Type: Information
Target: ************
Description: Ask VMware...
03/01/2021, 3:46:45 PM Physical NIC vmnic0 linkstate is up.

Date Time: 03/01/2021, 3:46:45 PM
Type: Warning
Target: ************
Description: Ask VMware...
03/01/2021, 3:46:45 PM Physical NIC vmnic1 linkstate is down.

Date Time: 03/01/2021, 3:46:45 PM
Type: Information
Target: ************
Description: Ask VMware...
03/01/2021, 3:46:45 PM Physical NIC vmnic1 linkstate is up.

Date Time: 03/01/2021, 3:46:45 PM
Type: Warning
Target: ************
Description: Ask VMware...
03/01/2021, 3:46:45 PM Physical NIC vmnic2 linkstate is down.

Date Time: 03/01/2021, 3:46:45 PM
Type: Information
Target: ************
Description: Ask VMware...
03/01/2021, 3:46:45 PM Physical NIC vmnic2 linkstate is up.


*********************The host recovers DS connectivity.***********************************

Date Time: 03/01/2021, 3:46:45 PM
Type: Information
Target: ************
Description: Ask VMware...
03/01/2021, 3:46:45 PM File system DSName, 5****f63-d33****-*5e6-a03**ed3e10 on volume 5a568f61-7ce7cfa0-a040-a0369fed3e10 has been mounted in rw mode on this host.


Date Time: 03/01/2021, 3:49:48 PM
Type: Warning
Target: ************
Description:
03/01/2021, 3:49:48 PM Event burst of esx.audit.vmfs.volume.mounted started

Date Time: 03/01/2021, 3:50:34 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:50:34 PM Bursting event esx.audit.vmfs.volume.mounted occurred 44 times since Monday, March 1, 2021 9:49:45 PM UTC

Date Time: 03/01/2021, 3:50:34 PM
Type: Information
Target: ************
Description:
03/01/2021, 3:50:34 PM Event burst of esx.audit.vmfs.volume.mounted ended
Related events:
There are no related events.
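
To make the sequence easier to reason about, here is a minimal Python sketch that prints the gap between the key events. The timestamps are copied from the events above; the short labels and the choice of which four events to compare are my own assumptions, not something vCenter produces.

```python
from datetime import datetime

# Timestamp format used in the vCenter event list pasted above.
FMT = "%m/%d/%Y, %I:%M:%S %p"

# (timestamp from the event list, shorthand label chosen for this sketch)
EVENTS = [
    ("03/01/2021, 1:49:05 PM", "first 'Lost access to volume'"),
    ("03/01/2021, 2:30:58 PM", "host is not responding"),
    ("03/01/2021, 3:39:28 PM", "vSphere HA state: Host Failed"),
    ("03/01/2021, 3:45:58 PM", "host reconnected to vCenter"),
]

def gaps(events):
    """Return (earlier label, later label, timedelta) for consecutive events."""
    parsed = [(datetime.strptime(ts, FMT), label) for ts, label in events]
    return [(a, b, t1 - t0) for (t0, a), (t1, b) in zip(parsed, parsed[1:])]

for earlier, later, delta in gaps(EVENTS):
    print(f"{earlier} -> {later}: {delta}")
```

Adding more rows to `EVENTS` (for example the vmnic link state changes) extends the timeline without changing the function.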
nachogonzalez
Commander

Hey, hope you are doing fine.
Can you share your HA configuration?

