I am trying to figure out an issue with one of our ESXi hosts.
I was in vCenter clearing alerts when, all of a sudden, one of the hosts showed "Not responding".
I hopped on the ESXi host directly and sure enough it was up and running. Everything seemed fine, but when I checked the VMs, I noticed I wasn't able to get on the console. I just got a spinning hourglass.
After checking a few things, I tried to disconnect the host and reconnect but it was failing at around 80% and saying it timed out.
Sure enough, now it showed a disconnected state in vCenter so I removed it from inventory and tried to re-add it.
Now it was just flat out refusing the connection, saying the host was not reachable. I went back to the host and sure enough, the web GUI was returning the error below:
503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x000000d09c704f80] _serverNamespace = / action = Allow _port = 8309)
I was still able to SSH into the machine, and I tried to restart the hostd and vpxa services per the VMware articles, but nothing changed. vpxa initially failed to restart, complaining about a missing watchdog.pid file, but the file is there in the directory.
I restarted all the management agents with services.sh restart and everything restarted, but I still cannot connect to the host via the web GUI or vCenter.
I was thinking about shutting down all the VMs, but the commands to list the VMs via SSH fail with connection refused:
Failed to login: Connection refused: The remote service is not running, OR is overloaded, OR a firewall is rejecting connections.
I am at a complete loss, as the guest VMs are running fine but I cannot get the host to restart the management services or connect via vCenter.
Log in to the guest OS and shut down every single VM. After that, restart the host (we do it through iDRAC/iLO because ESXi has never completed the reboot by itself in such conditions). Yeah, about once a year we see this behavior.
It is possible that the VMs will not power off completely even if you shut them down through the OS, and then HA may kick in if you kill the host.
I strongly recommend you update your ESXi host, as you mentioned it is running ESXi 7 U1 and the current update is ESXi 7 U2d.
Most of the time, issues like these are related to older patches. But the only workaround in your case is to get a downtime window for the VMs running on that host, shut them down one by one using RDP/SSH from within the guest OS, and power cycle the ESXi host. Once it is up, the first thing you'd want to do is apply the latest patch to it.
All the best!!
I see. Thank you for that information. Yes, that was the route I was going to take as far as shutting down the VMs is concerned.
The weird thing is, after trying everything, I was finally about to start shutting down VMs last night when I checked the host URL one last time and, out of the blue, it was back up and allowing logins. Even the SSH session was accepting all the commands that had been erroring out before with connection refused...
Such a weird issue. I think the restart of the services just took forever for some reason, even though the status showed they were running.
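In case it helps anyone who hits the same thing: since the services seemed to need a long time to come back, a small polling script run from a workstation can tell you when the host UI stops returning 503, instead of refreshing the page by hand. This is only a rough sketch under my own assumptions. The function names, the 30-minute timeout, the 30-second interval and the example URL are all made up for illustration, and it skips certificate verification on the assumption that the host uses a self-signed certificate.

```python
import ssl
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 10.0) -> int:
    """Return the HTTP status code for url (hypothetical helper).

    Certificate checks are disabled on the ASSUMPTION that the host
    presents a self-signed certificate; do not do this on untrusted networks.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout_s, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is still useful.
        return err.code


def wait_until_ready(
    probe: Callable[[], int],
    timeout_s: float = 1800.0,   # assumed: give the services up to 30 minutes
    interval_s: float = 30.0,    # assumed: check every 30 seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns something other than 503.

    Returns True once the probe answers with any non-503 status, or False
    if timeout_s elapses first. Connection errors (refused, timed out, TLS
    failures) are treated the same as "not ready yet".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            status = probe()
        except OSError:
            status = None  # connection refused, timeout, or similar
        if status is not None and status != 503:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

You would call it with something like `wait_until_ready(lambda: http_status("https://esxi-host.example/ui/"))`, where the hostname is a placeholder. A non-503 answer only means the web endpoint is responding again; it does not prove that vpxa has reconnected to vCenter.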