VMware Cloud Community
drheim
Enthusiast

Multiple ESXi 5.1 U1 hosts locked up completely; console stopped working


We have had a few ESXi datacenters running fine for the past few years. We usually upgrade to the latest version about 6 months after its release. We upgraded from 5.1 to 5.1 Update 1 about 3 months ago and had not had any issues until now. The other day, three ESXi hosts froze completely; even the local console stopped working on them. Some of the affected hosts were in different clusters as well. On those hosts, some VMs continued to run fine and some were not responding at all. We did have HA turned off, and DRS was not able to help. Our SAN utilization did spike and probably caused this issue, but we are not sure where the spike came from. What is weird is that some hosts were absolutely fine while others, accessing the same storage, were completely frozen at the console. Even after the storage spike was over and the SAN was working correctly again, we could not put the problematic hosts into maintenance mode, and we had to hold down the power button to reboot them. We do run ESXi from local storage on each server. If anyone has seen SAN issues cause ESXi hosts to lock up completely like that, let me know. It might be common, but it was a first for me.

Thanks,

Dan

Accepted Solutions
admin
Immortal

Hi Dan,

I have seen SAN issues make a host go into a not-responding state, and here is the reason behind it.

What usually happens during SAN issues is a crash of the management agents. Most of the time the culprit is the hostd service on the host.

When the storage is unavailable, hostd on ESXi will still keep trying to open a connection to the disk device by issuing commands such as read capacity and read requests to validate that the partition tables are set. If SCSI sense codes are not returned from a device (you are unable to contact the storage array, or the storage array does not return the supported SCSI codes), the device ends up in something like an All-Paths-Down (APD) state, and the ESXi host continues to send I/O requests until they time out, resulting in the crash of hostd.

Since only the management agents have crashed, you will not be able to connect to the host through them. However, the VMs will still be running fine; you should be able to ping them and even RDP into some of them.
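
One rough way to tell this state apart from a host that is really down is to probe it from another machine. The sketch below is only an illustration: it assumes the host's management interface listens on TCP 443 and that you know the address of at least one VM running on that host. `tcp_port_open` and `classify_host` are hypothetical helper names, and an open port by itself does not prove that hostd is healthy.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def classify_host(mgmt_port_open, vm_reachable):
    """Map two observations to a rough diagnosis (heuristic only)."""
    if mgmt_port_open:
        return "management port answers; check hostd/vpxa logs"
    if vm_reachable:
        return "VMs up but management unreachable; agents may be hung"
    return "host and VMs unreachable; check console, hardware, network"
```

For example, `classify_host(tcp_port_open("esxi01.example.com", 443), tcp_port_open("vm01.example.com", 3389))` would combine a probe of the host with an RDP probe of one of its VMs; both names are placeholders.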

Why did some hosts crash and some not? It could be that hostd on the unaffected hosts still had a few worker threads left for other I/O, which let those hosts stay connected.

So in this situation, the best thing to try is to restart the management agents from the DCUI or over SSH and see if that brings the host back online.
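
For reference, a minimal sketch of doing the SSH restart from an admin workstation. It assumes SSH is enabled on the host and that the ESXi init scripts `/etc/init.d/hostd` and `/etc/init.d/vpxa` are the agents you want to restart; the helper names and the `dry_run` default are hypothetical choices for illustration. If the storage is still in an APD state the restart itself may hang, so treat this as something to try, not a guaranteed fix.

```python
import subprocess

# Init scripts for the two management agents on ESXi.
AGENT_SCRIPTS = ["/etc/init.d/hostd", "/etc/init.d/vpxa"]


def build_restart_commands(host, user="root"):
    """Return one ssh argv list per management agent to restart."""
    return [["ssh", f"{user}@{host}", f"{script} restart"]
            for script in AGENT_SCRIPTS]


def restart_agents(host, user="root", dry_run=True, timeout=120):
    """Print each command; run it only when dry_run is False."""
    return_codes = []
    for argv in build_restart_commands(host, user):
        print(" ".join(argv))
        if not dry_run:
            # Raises subprocess.TimeoutExpired if the restart hangs.
            done = subprocess.run(argv, timeout=timeout)
            return_codes.append(done.returncode)
    return return_codes
```

Calling `restart_agents("esxi01.example.com")` only prints the two commands; pass `dry_run=False` to actually run them.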

Hope this was helpful.

Thanks,
Avinash

drheim
Enthusiast

That sounds like a good explanation for what happened.  Thanks.
