VMware Cloud Community
eldorado1384
Contributor

Server not responding made machines invalid and not migratable

Hello,
I have a vCenter cluster of 6 HP ProLiant DL380/DL580 Gen10 servers, all running ESXi 7.0.2, build 17630552.
DRS and HA were enabled.
One day I logged in to vCenter and one of the servers was "not responding". Many VM migrations were in progress from all servers to others; some succeeded and some failed. I disabled DRS and no new migrations started.
After an hour the server reconnected, but a few of the VMs on it showed the status "unknown", with "invalid" in front of their names. The VMs themselves were working fine, and all services, DBs and websites were alive.
I tried to migrate VMs off this server, but the operation failed for all of them, both the VMs in the invalid state and the VMs that looked healthy.
01.png

02.png
I must say that VMs that were powered off before this happened moved successfully to another server.

Then I logged in to ESXi directly, and the VMs were invalid there too.
03.png

I tried to unregister the VMs from vCenter; it failed.
I tried to unregister them from ESXi; it said "unregistered successfully", but the VMs were still there.
I tried to unregister them from the CLI; the operation failed there too and nothing happened.

Then I tried to open the storage to see what was going on in the VM folders, but the server with the problem could not browse the storage (SAN storage).
04.png
Even from the CLI the server could not browse the storage (this while the VMs were running from that same storage and all services were alive).

Finally I stopped all services on the VMs on the troubled server and rebooted ESXi. Because of HA, all VMs were moved to other servers, and after the reboot they came online in normal status, with no "invalid" or "unknown" state.

After rebooting ESXi, I tested the troubled server and migrated VMs to and from it in every state of network connectivity (each server has 4 network cables: 2 go to a switch for Ethernet and 2 to a data switch). There was no problem!

This condition created a huge amount of work for our team, who had to stop the services of the VMs on the troubled server and reboot ESXi.
I searched for this and found nothing similar in the VMware Knowledge Base.

FT is not enabled yet, because in a test it added a lot of latency to the test machine's network connection. We need to test that more.


Accepted Solutions
eldorado1384
Contributor

So, thanks to everyone who replied (no one)

I worked hard on this and finally managed to fix the problem.

 

At first I restarted the Management Agent, but then I got this error when opening the ESXi web console:

503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x000000691a805070] _serverNamespace = / action = Allow _port = 8309)

The fix everyone suggested was to restart these services,

so I enabled SSH and ran this on the host:

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

But no luck. vpxa would not start, even though it showed as on in the service list.

So I restarted all services:

services.sh restart

Again no luck, and I still could not load the ESXi web console.

Then I tried this:

/etc/init.d/vpxa stop

services.sh restart

Yes! The problem was fixed: I could connect the host to vCenter again, and storage access was OK in ESXi.
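For anyone who wants to repeat the sequence, here is a minimal sketch that wraps the two commands that worked (stop vpxa first, then restart all services) in a small script. The `DRY_RUN` switch and the `run` and `recover_agents` helper names are my own additions for illustration, not part of ESXi; only the two commands themselves come from the steps above. With `DRY_RUN` left at its default of 1, the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: replays the order that worked in this thread.
# DRY_RUN is a hypothetical safety switch (default 1 = print, do not execute).
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# recover_agents: stop vpxa first, then restart all services.
# The restart is attempted even if the stop step reports a failure.
recover_agents() {
    run /etc/init.d/vpxa stop
    run services.sh restart
}

recover_agents
```

To actually execute the commands on the host, run the script with `DRY_RUN=0` set in the environment. The order matters here: in my case `services.sh restart` on its own did not help until vpxa had been stopped first.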

 

This problem took me about 2 weeks. Maybe this solution can help you.
