VMware Cloud Community
DrBeau
Contributor

Odd problem with vmk network for vMotion

UPDATE: Problem has been fixed, but I wouldn't say the mystery has been solved. Please see my entry near the end of this thread for further details.

(Summary at bottom.)

Current setup:

H/W: 4x Cisco UCS B200 blades with VIC (Palo)

S/W: vSphere 5 (ESXi 5.0u1, vCenter 5.0)

vmKernel setup:

vmk0 - Management Network (mgmt traffic only) - vSwitch0 - vmnic0, vmnic5; no vlan tagging (on Network 2); IP: 10.x.x.101...104

vmk1 - vMotion (vMotion traffic only) - vSwitch1 - vmnic4 (primary), vmnic9 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.1...4/24

vmk2 - FaultTol (FT log traffic only) - vSwitch1 - vmnic9 (primary), vmnic4 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.64...67/24

*note: 10.a.a.x/24 is not a routed network

Problem and steps already taken:

vMotion is no longer working. Any vMotion task fails at 9% with "The VMotion failed because ESX hosts were not able to connect over the VMotion network". Initially, vMotion worked (as evidenced by the auto-migrations it did the first night). The one thing that changed: Host1 was originally on 5.0.0 while the other 3 were on 5.0u1, and I later blew away Host1 and rebuilt it from scratch on 5.0u1. I doubt this caused the issue; I just thought it should be put out there.

This cluster is not really in production yet, but I've been vetting it to make sure everything is good before I give it the green light. Of the 4 hosts currently in the cluster, I cannot vMotion between any of them. On the tech support console, vmkping can hit the local host's vMotion IP but is unable to hit any other host's (the same applies to the FT log ports). I went through the network config once again to confirm that all was as it should be. For additional testing, I made a VM Network on the hosts, using a config as similar as I could to the vMotion ports (vSwitch1, vmnic4 primary, vmnic9 stand-by), and created two VMs on separate hosts. The VMs, assigned IPs 10.a.a.51 and 10.a.a.61, were able to ping the vMotion and FT IPs of their own hosts (as I would have expected). The odd thing: the VMs are also able to ping one another just fine. To my mind, this establishes that network connectivity is solid...

What am I missing? Any ideas out there?

Summary: vmkping is only able to hit the local host's vMotion vmk IP. VMs on the same network as the vMotion vmk port can ping one another from separate hosts, but can still only ping their local host's vmk IPs. I'm confused, since this seems to validate the physical network setup.

17 Replies
weinstein5
Immortal

Welcome to the Community - one thing sticks out for me -

vMotion (vMotion traffic only) - vSwitch1 - vmnic4 (primary), vmnic9 (stand-by); no vlan tagging (on vlan 3); IP: 10.a.a.1...4/24

FaultTol (FT log traffic only) - vSwitch1 - vmnic9 (primary), vmnic4 (stand-by); no vlan tagging (on vlan 3); IP: 10.a.a.64...67/24

*note: 10.a.a.x/24 is not a routed network

You indicate no VLAN tagging, but then you say it is on vlan3 - so is there VLAN tagging or not? Because if the physical switch is configured for vlan3 but the packets are not tagged for vlan3, then traffic will not be transported through the switch - to test, assign a VLAN tag at the vmkernel port for vMotion and see if the vmkping works - if it does, that is the issue -

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
DrBeau
Contributor

Yes, sorry about that. I guess I could have just said Network 2 and Network 3 or something along those lines. They are VLANs on our core switches, but the ESX hosts have no idea, as the physical links are single, untagged networks. Does that make sense?

I'll also change the main post to clarify.

kjb007
Immortal

Did you double-check to make sure vMotion was only checked on the appropriate vmkernel interface?

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
DrBeau
Contributor

kjb007 wrote:

Did you double-check to make sure vMotion was only checked on the appropriate vmkernel interface?

Yup. Each vmk interface has only one role checked.

vmk0 - Management Traffic

vmk1 - vMotion

vmk2 - Fault Tolerance Logging

weinstein5
Immortal

Does the ESX host connect to the core switch, or is there an intermediary switch that it connects to? If there is an intermediary, do both hosts connect to the same switch? - because if the traffic goes back to the core switch where the vlan is defined, then you will need to tag the packets at the switch - so you will have to set the vlan tag at the vmkernel port -

kjb007
Immortal

Your vMotion and FT vmkernel ports are on the same network, can you remove the FT logging vmkernel port, and see if vMotion is working again?

-KjB

DrBeau
Contributor

weinstein5 wrote:

Does the ESX host connect to the core switch, or is there an intermediary switch that it connects to? If there is an intermediary, do both hosts connect to the same switch? - because if the traffic goes back to the core switch where the vlan is defined, then you will need to tag the packets at the switch - so you will have to set the vlan tag at the vmkernel port -

It's a bit of an abnormal situation due to the Cisco UCS blade chassis. It's difficult to quickly explain if you aren't familiar with it. The blades are managed through a Cisco 6248 Fabric Interconnect, which connects to SAN and network resources. For all intents and purposes, this "switch" is the end-point for the blades. The network in question exists on that switch and is tagged on a trunk going to the core switch (identical to all the other networks I have going to these servers).

Also remember, like I said, I put a Virtual Machine network on the same vSwitch with no vlan tagging, and these two VMs are able to communicate with one another from separate hosts. They can, however, only ping their local host's vmk1 (vMotion) IP.

kjb007 wrote:

Your vMotion and FT vmkernel ports are on the same network, can you remove the FT logging vmkernel port, and see if vMotion is working again?

I've tried this, but unfortunately there is no change. In separate steps I removed the IP from vmk2 (Fault Tolerance), then removed the Fault Tolerance tag, and finally removed the interface altogether.

rickardnobel
Champion

DrBeau wrote:


vmKernel setup:

vmk0 - Management Network (mgmt traffic only) - vSwitch0 - vmnic0, vmnic5; no vlan tagging (on Network 2); IP: 10.x.x.101...104

vmk1 - vMotion (vMotion traffic only) - vSwitch1 - vmnic4 (primary), vmnic9 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.1...4/24

vmk2 - FaultTol (FT log traffic only) - vSwitch1 - vmnic9 (primary), vmnic4 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.64...67/24

*note: 10.a.a.x/24 is not a routed network

Do I understand correctly that you have the same IP network (just different ranges) for these vmkernel networks? If so, there are probably connectivity issues coming from that (internal IP routing takes precedence over the vMotion checkbox when connecting to the other hosts).

You should try to separate these functions onto three different IP subnets.
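
As a rough illustration of the point above, here is a minimal Python sketch that checks a planned addressing scheme for vmkernel interfaces whose subnets overlap. The interface labels and addresses are made up for the example (loosely modelled on the masked ranges in this thread), and the script only compares the values it is given; it does not query an ESXi host.

```python
import ipaddress
from itertools import combinations

# Illustrative values only, not the poster's real configuration.
vmk_interfaces = {
    "vmk0 (Management)": "10.0.0.102/16",
    "vmk1 (vMotion)": "10.1.1.2/24",
    "vmk2 (FT logging)": "10.1.1.65/24",
}


def find_shared_subnets(interfaces):
    """Return pairs of interface names whose IP networks overlap."""
    networks = {
        name: ipaddress.ip_interface(cidr).network
        for name, cidr in interfaces.items()
    }
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


for a, b in find_shared_subnets(vmk_interfaces):
    print(f"{a} and {b} share a subnet")
```

With these sample values, only the vMotion and FT logging interfaces are reported, which is the situation described in the original post. Moving FT logging to its own subnet (for example 10.1.2.0/24) would make the list come back empty.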

My VMware blog: www.rickardnobel.se
logiboy123
Expert

One troubleshooting step would be to log into the console of each host and perform a vmkping on the management layer. This will help you see whether the hosts can see each other properly. Incorrectly configured DNS is often a cause of vMotion failures, so I would also confirm that it is functioning as expected.

kb.vmware.com/kb/1003728

Regards,

Paul

DrBeau
Contributor

logiboy123 wrote:

One troubleshooting step would be to log into the console of each host and perform a vmkping on the management layer. This will help you see whether the hosts can see each other properly. Incorrectly configured DNS is often a cause of vMotion failures, so I would also confirm that it is functioning as expected.

Here are the results of the vmkping (from Host2)

# vmkping host3.company.inc

PING host3.company.inc (10.0.0.103) [This is Host3's management IP]

(I get three responses)

# vmkping 10.1.1.3 [This is Host3's vmk1 (vMotion) IP]

(I get three failures)

a_p_
Leadership

Just to make sure you didn't miss rickardnobel's previous post.

You should use different networks for Management, vMotion and Fault Tolerance.

André

DrBeau
Contributor

a_p_ wrote:

Just to make sure you didn't miss rickardnobel's previous post.

You should use different networks for Management, vMotion and Fault Tolerance.

Yes, they are separate (or mostly were, now they are separate).

(these are not my actual networks, but representative)

* Management is on 10.0.0.x with a 16-bit mask (255.255.0.0). It is fully routed and is used by many other management-type things besides VMware. DNS for the hosts point to these IPs.

* vMotion is on 10.1.1.x with a 24-bit mask (255.255.255.0). This is a completely isolated network (unrouted). Currently, the only things on this network are my vmknics and the 2 testing VMs I created just for pings/testing in this network.

* Fault Tolerance was on 10.1.1.x/24 as well. Currently, I've completely gotten rid of all the FT vmknics. Once I get vMotion working again, I will put FT on a new network (like 10.1.2.x/24), but at this point all I care about is vMotion.

MagnetBoy
Enthusiast

check firewall:

  1. host configuration
    1. Security Profile under Software
    2. click on properties under firewall
    3. make sure vMotion is checked for all Hosts!

One question: did you vMotion from the previous v5.0 to the v5.0u1 ones?

I have a UCS working perfectly with EMC VNX5300


VMware Certified Professional – Datacenter Virtualization (vSphere 5)
DrBeau
Contributor

MagnetBoy wrote:

check firewall:
  1. host configuration
    1. Security Profile under Software
    2. click on properties under firewall
    3. make sure vMotion is checked for all Hosts!

Checked, and they're all good.


MagnetBoy wrote:

One question: did you vMotion from the previous v5.0 to the v5.0u1 ones?

I have a UCS working perfectly with EMC VNX5300

I have vMotion-ed from the old 5.0 host to the 5.0u1 hosts. I've also vMotion-ed from the 5.0u1 to other 5.0u1 hosts. Honestly, it was as if one day they worked, and the next day they didn't. What's weird to me is that the network side seems to be working. I've never had a problem like this that seemed to be something on the VMware side of things.

MagnetBoy
Enthusiast

Try this:

  1. Delete vSwitch1 in all Hosts
  2. Reboot
  3. Create vSwitch1 in all hosts.
    1. Add the two network adapters nic4 and nic9 to vSwitch1
    2. Add a vmkernel port "VMotion"
      1. vmotion enabled and vlan id
      2. give it the ip address, etc.
      3. NIC teaming
        1. Override switch failover order:
          1. Active vmnic4
          2. Unused vmnic9
  4. Try this configuration.
  5. Next, swap the adapters:
    1. Override switch failover order:
      1. Active vmnic9
      2. Unused vmnic4

Each ESXi host involved in vMotion must meet...

  • Shared storage for the VM files that is accessible by both the source and target ESXi host.
  • Ethernet network interface card with a VMkernel port defined and enabled for vMotion on each ESXi host.

Message was edited by: MagnetBoy

DrBeau
Contributor

Update: The problem has been fixed, but I wouldn't say the mystery has been solved. Sadly, none of the solutions in this thread worked. Thanks for the ideas, though. Here's what I did to get it working:

  1. I rebuilt Host1, Host3, and Host4 from a base ESXi 5.0u1 install. I'm booting from SAN using thin volumes that are based on a master gold volume (this may be important). I did not rebuild Host2 because it was the first 5.0u1 host I created (it's also where my few VMs in this cluster are running). *note:* Host2 uses its own thin volume based on the master. It is not, itself, the master volume.
  2. After booting the freshly cloned OSes for hosts 1, 3, and 4, I performed a System Configuration reset (bottom command on the ESXi console). I have a feeling this is the important step.
  3. I proceeded to create each host with identical configurations to Host2 (as was done before).
  4. Once fully set up, vMotion between all hosts worked (and has been working) fine.

Thanks for the help in this thread. Thanks to the suggestions here, I've created a new network for Fault Tolerance traffic that is separate from my vMotion network.

MikeMenyalkin
Contributor

Had the same issue and tried everything suggested above except rebuilding hosts.

It turned out that my Management Network ports had the same MAC addresses. Once that was fixed, vMotion worked great. Check this out before rebuilding your hosts: VMware KB: vmk0 management network MAC address is not updated when NIC card is replaced or vmkernel ...
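
For anyone wanting to run the same check across several hosts, here is a small Python sketch that flags vmkernel interfaces sharing a MAC address. The host names and MAC values are hypothetical placeholders; in practice you would paste in the MAC shown for each host's vmk0 in its network settings. It is only a comparison helper and does not talk to the hosts.

```python
from collections import defaultdict

# Hypothetical inventory: (host, vmkernel interface, MAC address).
vmk_macs = [
    ("host1", "vmk0", "00:25:b5:00:00:0a"),
    ("host2", "vmk0", "00:25:B5:00:00:0A"),
    ("host3", "vmk0", "00:25:b5:00:00:0c"),
]


def find_duplicate_macs(entries):
    """Group interfaces by normalised MAC; keep only MACs used more than once."""
    by_mac = defaultdict(list)
    for host, vmk, mac in entries:
        # Normalise case and separator so the same MAC always compares equal.
        by_mac[mac.lower().replace("-", ":")].append((host, vmk))
    return {mac: owners for mac, owners in by_mac.items() if len(owners) > 1}


for mac, owners in find_duplicate_macs(vmk_macs).items():
    print(mac, "is used by", ", ".join(f"{h}/{v}" for h, v in owners))
```

In this sample data, host1 and host2 are reported because their vmk0 MACs differ only in letter case.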
