traknerud
Contributor

Wrong management network IP

I'm currently setting up a test environment with six hosts and a small SAN. This is not exactly working as expected, and I would like some advice.

Each host has two dual-port NICs: two ports are connected to the SAN and two to the office LAN, both pairs trunked.

During initial setup everything worked fine. I could reach each host from vSphere, add an extra vSwitch for iSCSI, and so on.

But when I proceeded to add the hosts to vCenter, four of them could not be found by name or IP. Nor could I ping them from my management PC.

Checking directly in the console of the host, I noticed that the management IP was wrong. For some reason four out of six identical hosts had decided to switch IP to the SAN network, even though the IP was set to static and management was not selected/activated when I added the SAN network.
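
For reference, this is roughly how I've been checking the VMkernel addresses from the ESXi Shell (vmk0 is management and vmk1 the SAN port on my hosts):

    esxcfg-vmknic -l                       # lists each vmk with its IP, netmask and port group
    esxcli network ip interface ipv4 get   # shows the IPv4 config (static/DHCP) per vmk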

So I changed the IP back and restarted the management network (still using the console). This had a positive effect on one host, while three were still unreachable from vSphere.
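
As far as I can tell, the shell equivalent of what I did in the console is something like this (the address below is just a placeholder, not my real management IP):

    # put the static management IP back on vmk0
    esxcli network ip interface ipv4 set -i vmk0 -I 192.168.20.11 -N 255.255.255.0 -t static
    # same effect as "Restart Management Network" in the DCUI
    esxcli network ip interface set -e false -i vmk0
    esxcli network ip interface set -e true -i vmk0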

I returned (reluctantly) to the server room, and to my surprise I found that the unreachable hosts had the correct IP according to the console. But they remained unreachable, so I finally rebooted all hosts. This changed nothing, and three out of six hosts remained unreachable from vSphere and ping.

Finally I connected my management PC to the SAN switch and tried to manage the missing hosts using their storage IP. Surprisingly, this worked. But I didn't want a setup with half my hosts being managed from the SAN network. After a few failed attempts I decided to remove the extra vSwitch that I had added for SAN access from each host. After all, that was when my problems started.

This did of course disconnect the hosts from the SAN network, but had no positive effect as far as management goes. In the end all I achieved was to lose all management connectivity on both available networks. After rebooting the hosts, playing around with which NICs are active, changing IPs, etc., there is still no change: three hosts can only be reached from the console.

Does anybody know why this happened, and why only to some hosts?

Any advice as to how I can get the remaining hosts back online so I can manage them and add them to my vCenter?

All help is appreciated.

a_p_
Leadership

Welcome to the Community,

Please provide some more details about the network configuration, like virtual and physical switches, VLANs, subnets, and how you configured the management network. You said "Both trunked"; what exactly did you do?

André

traknerud
Contributor

Each host is connected to my physical Cisco switch, which gives access to a SAN, using two ports. On the switch all ESX hosts are in VLAN 20. The two connections from each host are in the same LAG.

On the host (while I could still connect with vSphere) I added a standard vSwitch (vSwitch1). I then created a network called "SAN", activated vMotion only, and set an IP compatible with my SAN. On the vSwitch I added vmnic2 and vmnic3, and set NIC teaming to "Route based on IP hash" and "Link status only", exactly as described in http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100404...

With this config I got one VMkernel port under vSwitch1 and no VM network. The VMkernel port contains my "SAN" network, which uses vmk1 with the appropriate SAN IP address.
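
For completeness, what I did in the client amounts to roughly the following on the command line (vSwitch, port group and vmnic names are from my setup, the SAN address is a placeholder; vMotion itself was enabled in the client):

    esxcli network vswitch standard add -v vSwitch1
    esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic2
    esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic3
    esxcli network vswitch standard portgroup add -v vSwitch1 -p SAN
    esxcli network ip interface add -i vmk1 -p SAN
    esxcli network ip interface ipv4 set -i vmk1 -I 10.10.10.11 -N 255.255.255.0 -t static
    # "Route based on IP hash" + "Link status only", as in the KB article
    esxcli network vswitch standard policy failover set -v vSwitch1 --load-balancing=iphash --failure-detection=link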

Each host is also connected to my office LAN. Physically, two NIC ports are connected to a ProCurve switch, and the corresponding ports on the switch are in the same VLAN. I'm also using "Trunk" (HP-style trunking), and of course they are both in the same trunk group. On the host I used vSwitch0 and configured NIC teaming as described above. In addition to the "VM Network" I have a VMkernel port with the management network, using vmk0 on the IP I wanted to use to manage the host.

Both these separate networks seem to function fine. There are no errors, and traffic between hosts, servers and clients runs as one would expect. Inspecting the logs on the two physical switches shows no problems.

But even though both networks are functioning, I can't get some of the hosts to accept management from the office LAN side. From the SAN it's working fine, but that is not how I wish to run things. Besides, the point is not merely to get it working, but to understand why. And right now I can't grasp why this is happening, and why it's only a problem on three out of six identical hosts...
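
In case it helps, these are the outputs I've been comparing between a working and a non-working host from the shell:

    esxcfg-vswitch -l                   # vSwitches, port groups and their uplinks
    esxcfg-vmknic -l                    # which vmk has which IP and port group
    esxcli network ip route ipv4 list   # default gateway(s) as seen by the VMkernel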

tomtom901
Commander

So, if I understand correctly:

  • Full connectivity is available between all 6 hosts? The hosts share the same IP subnet?
  • Only 3 of the 6 hosts can be managed from the office subnet, which is a different subnet than where the hosts are located?
  • The other 3 hosts cannot be managed from the office subnet, but they can be managed from the SAN subnet? Can the 3 working hosts also be managed from the SAN subnet?

Didn't you (by accident) configure multiple default gateways? If you have access to the DCUI (yellow screen), you can reset the entire host (or network) configuration there, which will restore the network settings back to one simple VSS with a VMkernel port enabled for management. Something worth trying is to remove the trunking on the HP switch and set the load balancing policy on your ESXi hosts to "Route based on originating virtual port ID", just as a simple test. No switch configuration is needed for that.
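
If you want to test that quickly from the ESXi Shell, something along these lines should do it (vSwitch0 and "Management Network" are the default names; adjust if yours differ):

    # check whether more than one default gateway is configured
    esxcli network ip route ipv4 list
    # switch load balancing to originating port ID for testing
    esxcli network vswitch standard policy failover set -v vSwitch0 --load-balancing=portid
    esxcli network vswitch standard portgroup policy failover set -p "Management Network" --load-balancing=portid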

Also, do you run vMotion traffic over the same network as the SAN traffic? I'd advise against that, but that isn't the issue.

traknerud
Contributor

Your understanding is almost correct. The so-called office subnet is the same subnet the hosts are located in, and yes, I can only manage three hosts from that network. The other three were manageable from the SAN network, until I removed the virtual SAN switch from them, hoping that I could force management back to the office network. That failed...

I did try to reset the entire host, but with no success.

Anyway, I have now disconnected all NICs but one from the three troublesome hosts. This gave me once again access to manage them using the office network IP. This indicates that the trunking was wrong, but I can't see how, since it's identical to the working hosts.
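
While pulling cables I've been checking link state and uplink assignment with:

    esxcli network nic list   # vmnic link state, speed and driver
    esxcfg-vswitch -l         # which vmnics each vSwitch is actually using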

I then reconnected the second port in the office network trunk, restarted the management network, and had a successful ping. Finally, I once again added the second vSwitch with the SAN network.

So far everything is working. But I don't know for how long, since the reason for my previous failure is still unknown. As far as I can tell every setting is the same as before; I've just done it twice.

One possible reason I can think of is that vmk0 and vmk1 somehow got switched on the "problem hosts". This cannot be verified, since I've deleted the whole original second vSwitch from these hosts and added it again.
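
For the record, had the original config still been there, the mapping could have been checked with:

    esxcli network ip interface list   # shows each vmk with its port group and vSwitch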

Another possibility is some small setting on the vSwitch for NIC teaming. I'm almost sure I did the exact same thing on all six, but since the physical switches are unchanged and everything works now, I could be wrong.

I don't intend to run vMotion on the SAN network. It was just activated to see if it had any impact on management. Sorry for being unclear on that.

tomtom901
Commander

Great to see that you got it working. As far as the teaming is concerned, I wouldn't do this on the HP switches. The times I've dealt with it, it caused more issues than good. If you just configure them as two plain network ports on the switch side, ESXi can load balance the traffic itself when you configure both vmnics as active uplinks.
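
In other words, leave the ProCurve ports unbundled and do something like this on each host (I'm assuming vmnic0 and vmnic1 are the LAN-side uplinks on your hosts):

    # both LAN uplinks active, load balancing done by ESXi itself
    esxcli network vswitch standard policy failover set -v vSwitch0 --active-uplinks=vmnic0,vmnic1 --load-balancing=portid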

traknerud
Contributor

After reconfiguring the networks everything worked fine, until today, when I added another iSCSI share. Two hosts suddenly disconnected from vCenter and refused to come back online. They were two of the three problem hosts mentioned earlier. I ended up rebooting them from the console, after which I could access them directly using vSphere, but still not from vCenter.

That was when I noticed a difference in the network config that I had missed previously. I had focused on setting NIC teaming under the vSwitch, ignoring the possibility of having different teaming options on the management network port group. All three problem hosts had different NIC teaming settings than the vSwitch they belonged to. Since this was the only difference between my hosts that I could find, I decided to change the setting to make it match the vSwitch settings on all six of them.
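
The mismatch is easy to spot from the shell as well, by comparing these two outputs on each host:

    esxcli network vswitch standard policy failover get -v vSwitch0
    esxcli network vswitch standard portgroup policy failover get -p "Management Network"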

The config I ended up with for NIC teaming on both the vSwitch and the management network was (the shell equivalent is below the list):

Load balancing - route based on IP hash

Network failover detection - Link status only

Notify switches - yes

Failback - no

Override switch failover order - unchecked
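
For reference, the same settings applied from the shell should look roughly like this (vSwitch0 and "Management Network" being the default names on my hosts):

    # vSwitch-level teaming policy
    esxcli network vswitch standard policy failover set -v vSwitch0 --load-balancing=iphash --failure-detection=link --notify-switches=true --failback=false
    # same values on the management port group so it matches the vSwitch
    esxcli network vswitch standard portgroup policy failover set -p "Management Network" --load-balancing=iphash --failure-detection=link --notify-switches=true --failback=false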

With this setup my hosts have been running stable for six hours now. I've stressed them with vMotion, adding and moving datastores, installing VMs, etc. I still can't be sure whether this final change has solved the problem, and I've found no documentation that describes NIC teaming on the management network to verify my settings. But so far, so good.

If this continues I guess the problem is solved, and the only question that remains is why these settings were different between the hosts, considering that I set them up in the same manner and made no changes to this specific config until today.
