VMware Cloud Community
psyker6
Contributor
Jump to solution

ESXi 6.0 Network Connectivity Issue

Hi all,

Having an odd problem here. In a nutshell, I'm reconfiguring the vNICs on the hosts in a particular environment. They are all part of a UCS domain; I took the pre-existing service profile template, made a clone, and modified the NICs accordingly. To get the vmnics to enumerate correctly in vCenter, I removed each host from the DvSwitches, dropped it out of vCenter, reset the system configuration, and reconfigured it before adding it back in. So far this has worked fine for four of my hosts: the profile seems fine, everything pings, and the VMs are humming.

And then there's this guy. I did the same steps as before, but after reconfiguring the host, I can't get it to ping (and, of course, add it back into vCenter). I have tried everything! I am 100% certain I have the correct management adapters selected and that the network configuration is the same as before. I have restarted the management network, rebooted the host, tried one management NIC at a time (I have two for teaming), done a network settings restore, applied the OLD service profile and reconfigured, put another host into maintenance mode/shut it down and used its network configs, and reinstalled ESXi on the host... all to no avail. (I also confirmed with nslookup that the DNS records are there and correct.)

To me, it seems like a physical NIC on the blade server or the chassis interface is faulty, but here's the curveball: when I run a steady ping against the host and restart the management network, the host pings successfully once or twice, in what appears to be the window between the management network stopping and restarting. Additionally, when running the management network test, the gateway and both DCs fail, but the hostname resolves. Any ideas so I can get this thing back into production?
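For reference, these are roughly the checks I was running from the ESXi shell; vmk0 and 192.168.1.1 are placeholders for the actual management vmkernel interface and gateway:

# List the physical NICs: confirm both management vmnics show link "Up" and the expected MACs
esxcli network nic list

# Show the management vmkernel interface and its IPv4 config
esxcli network ip interface list
esxcli network ip interface ipv4 get

# Test connectivity out of the management vmkernel port specifically
vmkping -I vmk0 192.168.1.1

# Bounce the management vmkernel interface (roughly what "Restart Management Network" does)
esxcli network ip interface set -e false -i vmk0
esxcli network ip interface set -e true -i vmk0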

Many thanks,

Grant

3 Replies
vijayrana968
Virtuoso
Jump to solution

Please check /var/log/vmkernel.log on the ESXi host for clues about what is happening there at the time of the disconnect.
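For example, something along these lines from the ESXi shell (the vmnic names are whatever the host's management uplinks actually are):

# Watch the log live while restarting the management network
tail -f /var/log/vmkernel.log

# Or search for link-state and uplink events after the fact
grep -iE "vmnic|link state|uplink" /var/log/vmkernel.log | tail -n 50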

BluIT
Contributor
Jump to solution

Have you tried anything from the UCS console? I know it may sound strange, but have you tried re-discovering the blade and then applying the profile made from the template back onto it? Also, are you using the Cisco UCS ENIC drivers for ESXi?
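To check which enic driver the host is actually using, something like this from the ESXi shell should show it (vmnic0 is just an example uplink name):

# Driver name/version in use by a given uplink
esxcli network nic get -n vmnic0

# Installed enic VIB (driver package) version
esxcli software vib list | grep -i enic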

psyker6
Contributor
Jump to solution

Thanks for your responses. I tried scouring the vmkernel.log file for any answers or indications of why the management adapters were failing, but I couldn't locate anything concrete... plus, I wasn't totally sure what I was looking for. I also tried re-acknowledging the blade, and then completely decommissioned and re-discovered it, both without success.

However, I was finally able to resolve the matter by resetting the MAC addresses on the two management adapters. I found this in a roundabout way, by creating a clone of the service profile and adding a third management NIC. When configuring them in ESXi, having all three selected allowed the host to ping... then, trying one NIC at a time, one adapter failed every time, one was sporadic, and the third (new) adapter pinged successfully every time. So, like I said, I bound the old template (with two adapters) back to the blade profile, reset the MAC addresses, and both worked fine.

Now the question looms as to why THOSE MAC addresses wouldn't work. I've verified they weren't attached to any standard/distributed switch ports... I'm just concerned that down the road, if a new blade pulls those MACs, they are going to be 'orphaned' somewhere on the network and cause more issues. From an SSH session on the host, I ran

find /vmfs/volumes -type f -iname "*.vmx" -exec grep -im1 "MY_MAC" {} \; -print

and it listed out pretty much every VM's .vmx file in the cluster along with an error, "Device or resource busy." Not sure if somehow the old MACs got hard-set inside the master image (it's a VDI environment) and propagated during a recompose or what? Any insight would be helpful, but ultimately, the issue is resolved [for now]...
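For anyone checking for a lingering MAC the same way, here is a slightly tidier version of that search that keeps the locked-file errors out of the output; MY_MAC is a placeholder for the actual address:

# Search every .vmx for a given MAC, discarding "Device or resource busy" noise from locked files
find /vmfs/volumes -type f -iname "*.vmx" -exec grep -il "MY_MAC" {} \; 2>/dev/null

# Compare against the MACs the host's uplinks and vmkernel ports are actually using
esxcli network nic list
esxcli network ip interface list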

Thanks again,

Grant