VMware Cloud Community
Sreehari41
Contributor

ESXi host with vSAN went unreachable after power outage

Hi,

We recently had a power outage in our environment. We have 3 ESXi hosts added to a vSAN cluster, configured with a distributed switch and LACP enabled. After the power outage, one of the ESXi hosts shows as not responding; I tried to ping its IP but couldn't reach it, and tried to reach the iDRAC port, but still couldn't reach it. As the cluster is at a remote location, we had to connect a monitor to it to check the configuration.

Kindly suggest what the issue might be and how to resolve it.

TheBobkin
Champion

@Sreehari41, if you cannot SSH to it or access the DCUI via iDRAC, then you really have no option other than having someone go on site to check it.

Sreehari41
Contributor


Hi,
Thanks for your reply. An update to the previous post: the ESXi console shows a yellow screen with no IP address or other details of the underlying hardware. The F2 and F11 options are showing up. While logged into the shell, I can see that the VMkernel interface (vmk0) shows Enabled as false, and all configurations are still there.
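
For reference, that interface state can be confirmed from the shell (assuming vmk0 is the management VMkernel interface, as is the default):

esxcfg-vmknic -l
esxcli network ip interface list

Both list the VMkernel interfaces; a disabled vmk0 shows Enabled as false while its IP and portgroup settings remain intact.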

 

Lalegre
Virtuoso

@Sreehari41,

Interesting, switch to shell and run:

esxcfg-nics -l
esxcli network ip interface list

What is the output? 

stvdnd
Contributor

Was there a vCenter on that host that was running off the vSAN?

If so... that could pretty well be your problem.

I had that same problem more than once... and finally ended up finding that vCenter was preventing the management interface from being responsive. It appears that vCenter was taking ownership of it... again... it was pretty frustrating.

I looked at switches, MTU, VMkernel, vDS... like I have looked EVERYWHERE... could not do anything.

Could not ping it... could not do anything at all...

When I logged into the iLO (HPE) and opened a console, then looked at the DCUI of vSphere... I could not even do anything to the configuration of the interface, as it was greyed out...

All other interfaces were good... I could ping etc... no problem... only the management one was completely dead...

Since then I run my vCenter off a dedicated datastore... no more on the vSAN... never had any issue since...

And I do a lot of BACKUPS... like... every time I work on it... and I keep a copy of the OVA/ISO very close...

After all... this is what backups are for hehehehe....

Hope this helped...

 

Sreehari41
Contributor

Hi, it shows vmk0 as disabled, and all the configurations are still there.
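
If vmk0 is only administratively disabled, one thing worth trying from the shell (a sketch, assuming vmk0 is the management interface) is re-enabling it directly and then restarting the management network from the DCUI ("Restart Management Network"):

esxcli network ip interface set -i vmk0 -e true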
Sreehari41
Contributor

Hi, yes, as you said, it is on a vSAN datastore. Is there a way to fix this problem?
TheBobkin
Champion

@Sreehari41, can you press Alt+F1 on the ESXi splash screen and indicate what you see logged? E.g. is it indicating a boot-device or other hardware issue, or something else?

stvdnd
Contributor

Hello.

Yep... that is exactly what I was seeing as well. Anything I tried to get that interface up didn't work... and believe me, I tried EVERYTHING...

I will try to look into my "how to" tonight and come back to you...

It is a stupid situation, especially when your vCenter is on vSAN and the vSAN datastore is still visible if you log on to each host individually (except the "hung" one).

I was able to see all the VMDKs of the vCenter across my hosts. I did try to unregister it from the host that had the "hung" interface... and re-register it on another host (since the files were spread across)... even this didn't fly...

Cheers.

Russell122
Contributor

A stable VMware ESXi 6.7.0 host is now unreachable after a power outage.

The host crashed overnight during an unattended 2-hour power outage that exhausted all network UPSes after the backup generator failed to kick in.

The host NIC and switch port show link and activity, and the host can ping itself, but every network ping to/from the host shows 100% packet loss.

Shell commands from the host console, "esxcli network nic list" and "esxcli network ip interface ipv4 get", show proper status and IP address, yet the host's ARP cache shows no neighbors.

Troubleshooting so far includes swapping in multiple Cat5 patch cables, using alternate ports (on both switch and host), and even swapping out the host's switch; all LED link and system statuses continue to appear normal, yet the host cannot ping, be pinged, or see its NAS. The original cable/port/switch have been restored to pre-failure status, exactly as everything was before.
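
(For anyone retracing this, the empty ARP cache and basic reachability can be rechecked from the host shell; vmk0 and the gateway address below are placeholders:

esxcli network ip neighbor list
vmkping -I vmk0 <gateway-ip>

vmkping forces the ping out of a specific VMkernel interface, which rules the other vmknics out of the picture.)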

It's a small network; the core network includes a firewall and the usual switch-gear elements, but it also relies on a proxy VM that lives on the failed host. So I did wonder if this issue could be a paradox: the firewall/network needs the proxy, but the proxy doesn't exist because the host can't reach its boot image on the NAS. My guess, however, would be no, as the proxy is primarily an internet-focused resource anyway.

This may be a big ask for my first question, but I'm entirely stumped; it just doesn't make sense. I'm capable at some things, but I'm still just programming and help desk... and we lost our IT and security expert... so I'm IT until there's a replacement.

Thank you all in advance; maybe someone has experienced this situation before. I sincerely appreciate any insight that would help in bringing our host back online.

 

 

My very best,

Russell

stvdnd
Contributor

> A stable VMware ESXi 6.7.0 host is now unreachable after a power outage.

Same as me.

> The host crashed overnight during an unattended 2-hour power outage that exhausted all network UPSes after the backup generator failed to kick in.

Same as me.

> The host NIC and switch port show link and activity, and the host can ping itself, but every network ping to/from the host shows 100% packet loss.

Same as me.

> Shell commands from the host console show proper status and IP address, yet the host's ARP cache shows no neighbors.

Same as me.

> Troubleshooting so far includes swapping in multiple Cat5 patch cables, using alternate ports (on both switch and host), and even swapping out the host's switch; all statuses continue to appear normal, yet the host cannot ping, be pinged, or see its NAS.

Same as me.

> It's a small network... it also relies on a proxy VM that lives on the failed host. So I did wonder if this issue could be a paradox: the firewall/network needs the proxy, but the proxy doesn't exist because the host can't reach its boot image on the NAS.

Not same as me; on my end, any access to the public-facing "playground" is all physical. The investigation and TAP tools are VMs, which would be bad to lose, but they would have no impact on a potential "unwanted" loss of power.

Everything else also runs on NSX.

> I'm capable at some things, but I'm still just programming and help desk... and we lost our IT and security expert... so I'm IT until there's a replacement.

You know... stuff happens when it's not the time. Also, I'm pretty sure that the little line at the bottom of your employment contract (the one that says "and all other relevant tasks") is biting you now... Been there... Just be 100% certain to take notes... like... a TON of notes on everything you do and talk about, with who, where, when... I am dead serious... DO NOT DO ANYTHING IF YOU DON'T TAKE NOTES... that could save your "reputation"... You talked about a production environment... have everyone see you treat it like it was your own life... BEEN THERE AS WELL...

Like... it's my good friend -->Murphy<-- that had me learn the hard way.

VMware is pretty good up until you cut power without doing a proper shutdown, but that... well... you already know.

Remember that vCenter is the one used to create and manage vDSes... your next potential real challenge will be to reinstate vCenter so it can still manage those vDSes. If you re-create vCenter without a backup... you will have to add the hosts to the new vCenter, which won't have the same vDSes and will therefore... be problematic.
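
If the management vmknic is stranded on a vDS that nothing can manage anymore, one escape hatch (a sketch only; vSwitchTemp, MgmtTemp, vmnic0, and the IP/netmask are placeholders, and an uplink may first need to be detached from the vDS) is to move vmk0 onto a temporary standard switch from the host shell:

esxcfg-vswitch -Q vmnic0 -V <dvport-id> <dvswitch-name>    (frees an uplink from the vDS, if needed)
esxcli network vswitch standard add -v vSwitchTemp
esxcli network vswitch standard uplink add -v vSwitchTemp -u vmnic0
esxcli network vswitch standard portgroup add -v vSwitchTemp -p MgmtTemp
esxcli network ip interface remove -i vmk0
esxcli network ip interface add -i vmk0 -p MgmtTemp
esxcli network ip interface ipv4 set -i vmk0 -t static -I <ip> -N <netmask>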

From what I have experienced, vSAN should still be up in your case (except on the host that went down). That is a good thing which works in your favor... for now.

I would, at the same time that you work on this, have someone else make a backup of EVERYTHING and every VM, if not already done. On your end, you can make a backup of the host's configuration by logging in with SSH and running "vim-cmd hostsvc/firmware/backup_config". By having backups, you will not lose everything. (You can even run that command on the faulty host if you have iDRAC/iLO/IPMI access; it saves the configuration locally, and you will have a small window to get it out, but you can log in and transfer it to another location once you have access back on the management interface.)
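
For reference, the backup command prints a download URL (replace the * with the host's IP or FQDN to fetch the bundle), and the matching restore runs from maintenance mode; something like:

vim-cmd hostsvc/firmware/backup_config
(output: Bundle can be downloaded at : http://*/downloads/.../configBundle.tgz)

vim-cmd hostsvc/maintenance_mode_enter
vim-cmd hostsvc/firmware/restore_config /tmp/configBundle.tgz

Note that the restore triggers an immediate host reboot, so plan for it.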

If you have a backup of vCenter, and a copy of the ISO of the version it was running at the moment it died... like... down to the patch level...

Mount it on a workstation that has access to a good host on the network (like you would mount an ISO on a Win10 VM), then proceed with a fresh install on a regular datastore. If any patches were applied to vCenter, apply them first; after that's done, restore the backup of vCenter...

At that point, you will be in a much better posture to work toward putting vCenter back on the vDS.

Again, I will look for my doc and come back a little later.

Cheers.

Sreehari41
Contributor

Hi, another update on this issue: even after taking all the precautions for shutting down the hosts, including entering maintenance mode first, I still face the same issue.

RajeevVCP4
Expert

This is normal behavior after a power outage; you need to hard-reboot the host from the data center.

Rajeev Chauhan
VCIX-DCV6.5/VSAN/VXRAIL
Please mark as helpful or correct if my answer is useful to you.
Sreehari41
Contributor

Didn't work.

Sreehari41
Contributor

No hardware issue as such.

 

Sreehari41
Contributor

The solution is already there: open the ESXi DCUI, remove vmk0 through the CLI, and reboot the server. This allows us to successfully reset the management network. After the management network is enabled again, re-assign the IP address and host configuration and add the host back to vCenter. But the issue is that this happens often: I am working on a 31-cluster configuration, and after improper shutdowns I face this issue in most of the clusters. We need a permanent solution.
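
For anyone landing on this thread, a sketch of that reset from the DCUI shell (the portgroup name, addresses, and gateway are placeholders; if vmk0 lives on a vDS, re-add it with the --dvs-name/--dvport-id options instead of -p):

esxcli network ip interface remove -i vmk0
esxcli network ip interface add -i vmk0 -p "Management Network"
esxcli network ip interface ipv4 set -i vmk0 -t static -I <ip> -N <netmask>
esxcli network ip route ipv4 add -n default -g <gateway>

followed by a reboot, then re-add the host to vCenter.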
