UPDATE: Problem has been fixed, but I wouldn't say the mystery has been solved. Please see my entry near the end of this thread for further details.
summary at bottom
Current setup:
H/W: 4x Cisco UCS B200 blades with VIC (Palo)
S/W: vSphere 5 (ESXi 5.0u1, vCenter 5.0)
vmKernel setup:
vmk0 - Management Network (mgmt traffic only) - vSwitch0 - vmnic0, vmnic5; no vlan tagging (on Network 2); IP: 10.x.x.101...104
vmk1 - vMotion (vMotion traffic only) - vSwitch1 - vmnic4 (primary), vmnic9 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.1...4/24
vmk2 - FaultTol (FT log traffic only) - vSwitch1 - vmnic9 (primary), vmnic4 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.64...67/24
*note: 10.a.a.x/24 is not a routed network
Problem and steps already taken:
vMotion is not (no longer) working. Any vMotion task fails at 9% with "The VMotion failed because ESX hosts were not able to connect over the VMotion network". Initially, vMotion worked (as evident by the auto-migrations it did the first night). What changed was that Host1 was on 5.0.0 and the other 3 were on 5.0u1. I later blew away Host1 and recreated it from scratch on 5.0u1. I doubt this caused the issue; I just thought it should be put out there.
This cluster is not really in production yet, but I've been vetting it to make sure everything is good before I give it the green light. Of the 4 hosts currently in the cluster, I can not vMotion between any of them. On the tech support console, vmkping can hit the local host vMotion IP but is unable to hit any other host (this applies with the FT log ports as well). I went through the network config once again to establish that all was as it should be. In order to do additional testing, I made a VM Network on the hosts, using as similar config as I could to the vMotion ports (vS1, vmnic4 PRI, vmnic9 SB) and created two VMs on separate hosts). The VMs, assigned IPs 10.a.a.51 and 10.a.a.61 were able to ping the vMotion and FT IPs of their hosts (as I would have expected). The odd thing: The VMs are also able to ping one-another just fine. This seems to establish in my mind that network connectivity is solid...
What am I missing? Any ideas out there?
Summary: vmkping can only hit the local host's vMotion vmk IP. VMs on the same network as the vMotion vmk port can ping one another from separate hosts, yet each host's vmkping still only reaches its own vmk IPs. I'm confused, since the VM-to-VM pings seem to validate the physical network setup.
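The per-host test described above can be scripted roughly as below. ESXi picks the outgoing vmkernel interface from its routing table, so on this setup pinging a 10.a.a.x address exercises vmk1 automatically (newer ESXi builds also accept -I vmkX to pin the interface explicitly). This is a sketch using the placeholder addresses from the post, not a definitive procedure:

```shell
# Run from each host's tech support console (sketch; the 10.a.a.x
# addresses are the placeholders used in this post, substitute your own).
# Because 10.a.a.0/24 exists only on vmk1, vmkping routes out the
# vMotion interface without any extra flags on ESXi 5.0.
for peer in 10.a.a.1 10.a.a.2 10.a.a.3 10.a.a.4; do
    echo "--- pinging $peer"
    vmkping -c 3 "$peer"
done
```

Expected result on a healthy cluster: replies from all four addresses; in my case only the local host's own address answers.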
Welcome to the Community - one thing sticks out for me -
vMotion (vMotion traffic only) - vSwitch1 - vmnic4 (primary), vmnic9 (stand-by); no vlan tagging (on vlan 3); IP: 10.a.a.1...4/24
FaultTol (FT log traffic only) - vSwitch1 - vmnic9 (primary), vmnic4 (stand-by); no vlan tagging (on vlan 3); IP: 10.a.a.64...67/24
*note: 10.a.a.x/24 is not a routed network
You indicate no VLAN tagging, but then you say it is on VLAN 3. So is there VLAN tagging or not? If the physical switch is configured for VLAN 3 but the packets are not tagged for VLAN 3, then traffic will not be transported through the switch. To test, assign a VLAN tag on the vmkernel port for vMotion and see if vmkping works; if it does, that is the issue.
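The tagging experiment can also be done from the command line. This is a sketch using ESXi 5.x esxcli syntax; the portgroup name "vMotion" is an assumption, so check yours first:

```shell
# Sketch: temporarily tag the vMotion portgroup with VLAN 3 to test
# whether the upstream switch expects tagged frames.
# "vMotion" is an assumed portgroup name; list the real names with:
esxcli network vswitch standard portgroup list
# Apply the tag:
esxcli network vswitch standard portgroup set -p vMotion --vlan-id 3
# ...retry vmkping to a peer's vMotion IP, then revert if it changes nothing:
esxcli network vswitch standard portgroup set -p vMotion --vlan-id 0
```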
Yes, sorry about that. I guess I could have just said Network 2 and Network 3 or something along those lines. They are VLANs on our core switches, but the ESX hosts have no idea, as the physical links present single, untagged networks. Does that make sense?
I'll also change the main post to clarify.
Did you double-check to make sure vMotion was only checked on the appropriate vmkernel interface?
-KjB
Did you double-check to make sure vMotion was only checked on the appropriate vmkernel interface?
Yup. Each vmk interface has only one role checked.
vmk0 - Management Traffic
vmk1 - vMotion
vmk2 - Fault Tolerance Logging
Does the ESX host connect to the core switch, or is there an intermediary switch? If there is an intermediary, do both hosts connect to the same switch? If the traffic goes back to the core switch where the VLAN is defined, then you will need to tag the packets for the switch, so you will have to set the VLAN tag on the vmkernel port.
Your vMotion and FT vmkernel ports are on the same network, can you remove the FT logging vmkernel port, and see if vMotion is working again?
-KjB
Does the ESX host connect to the core switch, or is there an intermediary switch? If there is an intermediary, do both hosts connect to the same switch? If the traffic goes back to the core switch where the VLAN is defined, then you will need to tag the packets for the switch, so you will have to set the VLAN tag on the vmkernel port.
It's a bit of an abnormal situation due to the Cisco UCS blade chassis. It's difficult to quickly explain if you aren't familiar with it. The blades are managed through a Cisco 6248 Fabric Interconnect, which connects to SAN and network resources. For all intents and purposes, this "switch" is the end-point for the blades. The network in question exists on that switch and is tagged on a trunk going to the core switch (identical to all the other networks I have going to these servers).
Also remember, as I said, I put a virtual machine network on the same vSwitch with no VLAN tagging, and two VMs on that network are able to communicate with one another from separate hosts. Yet they can only ping their local host's vmk1 (vMotion) IP.
Your vMotion and FT vmkernel ports are on the same network, can you remove the FT logging vmkernel port, and see if vMotion is working again?
I've tried this, but unfortunately there is no change. In separate steps I removed the IP from vmk2 (Fault Tolerance), then removed the Fault Tolerance tag, and finally removed the interface altogether.
DrBeau wrote:
vmKernel setup:
vmk0 - Management Network (mgmt traffic only) - vSwitch0 - vmnic0, vmnic5; no vlan tagging (on Network 2); IP: 10.x.x.101...104
vmk1 - vMotion (vMotion traffic only) - vSwitch1 - vmnic4 (primary), vmnic9 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.1...4/24
vmk2 - FaultTol (FT log traffic only) - vSwitch1 - vmnic9 (primary), vmnic4 (stand-by); no vlan tagging (on Network 3); IP: 10.a.a.64...67/24
*note: 10.a.a.x/24 is not a routed network
Do I understand the setup correctly, that you have the same IP network (just different ranges) for these vmkernel networks? If so, there are probably connectivity issues coming from that (internal IP routing takes precedence over the vMotion checkbox when connecting to the other hosts).
You should try to separate these functions onto three different IP subnets.
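The overlap described here is easy to verify: with a /24 mask, the .1 and .64 ranges land in the same network, so the vMotion and FT vmkernel ports share a subnet. A small POSIX-shell check, using the representative addresses that appear later in the thread:

```shell
# Compute the /24 network address of two hosts and compare them.
# 10.1.1.1 (vMotion) and 10.1.1.64 (FT) are the representative
# addresses from this thread, not real ones.
ip_to_int() {
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}
mask=$(( 0xFFFFFF00 ))                       # 255.255.255.0, i.e. /24
net_vmotion=$(( $(ip_to_int 10.1.1.1) & mask ))
net_ft=$(( $(ip_to_int 10.1.1.64) & mask ))
if [ "$net_vmotion" -eq "$net_ft" ]; then
    echo "same subnet: both vmkernel ports share 10.1.1.0/24"
else
    echo "different subnets"
fi
```

Since both network addresses come out equal, the two vmkernel functions really are on one subnet, which is what the advice above warns about.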
One troubleshooting step would be to log into the console of each host and perform a vmkping at the management layer. This will help you to see if the hosts can see each other properly. Incorrectly configured DNS is often an issue with vMotion failures, so I would also confirm that name resolution is functioning as expected.
kb.vmware.com/kb/1003728
Regards,
Paul
One troubleshooting step would be to log into the console of each host and perform a vmkping at the management layer. This will help you to see if the hosts can see each other properly. Incorrectly configured DNS is often an issue with vMotion failures, so I would also confirm that name resolution is functioning as expected.
Here are the results of the vmkping (from Host2)
# vmkping host3.company.inc
PING host3.company.inc (10.0.0.103) [This is Host3's management IP]
(I get three responses)
# vmkping 10.1.1.3 [This is Host3's vmknic1 (vMotion) IP]
(I get three failures)
Just to make sure you didn't miss rickardnobel's previous post.
You should use different networks for Management, vMotion and Fault Tolerance.
André
Just to make sure you didn't miss rickardnobel's previous post. You should use different networks for Management, vMotion and Fault Tolerance.
Yes, they are separate (or mostly were; they are fully separate now).
(these are not my actual networks, but representative)
* Management is on 10.0.0.x with a 16-bit mask (255.255.0.0). It is fully routed and is used by many other management-type things besides VMware. DNS for the hosts point to these IPs.
* vMotion is on 10.1.1.x with a 24-bit mask (255.255.255.0). This is a completely isolated network (unrouted). Currently, the only things on this network are my vmknics and the 2 testing VMs I created just for pings/testing in this network.
* Fault Tolerance was on 10.1.1.x/24 as well. Currently, I've completely gotten rid of all the FT vmknics. Once I get vMotion working again, I will put FT on a new network (like 10.1.2.x/24), but at this point all I care about is vMotion.
check firewall:
One question: did you vMotion from the previous v5.0 host to the v5.0u1 ones?
I have a UCS working perfectly with EMC VNX5300
check firewall:
- host configuration
- Security Profile under Software
- click on properties under firewall
- make sure vMotion is checked for all Hosts!
Checked, and they're all good.
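For reference, the equivalent check from the ESXi shell (a sketch using ESXi 5.x esxcli syntax):

```shell
# Sketch: verify the vMotion firewall ruleset is enabled on each host.
esxcli network firewall ruleset list | grep -i vmotion
# If the ruleset shows "false", enable it:
esxcli network firewall ruleset set -r vMotion -e true
```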
One question: did you vMotion from the previous v5.0 host to the v5.0u1 ones?
I have a UCS working perfectly with EMC VNX5300
I have vMotion-ed from the old 5.0 host to the 5.0u1 hosts. I've also vMotion-ed between 5.0u1 hosts. Honestly, it was as if one day they worked, and the next day they didn't. What's weird to me is that the network side seems to be working. I've never had a problem like this that seemed to be something on the VMware side of things.
Try this:
Each of the ESXi hosts that are involved in vMotion must meet...
Update: The problem has been fixed, but I wouldn't say the mystery has been solved. Sadly, none of the solutions in this thread worked, though thanks for the ideas. Here's what I did to get it working:
Thanks for much of the help in this thread. Due to the suggestions here, I've created a new network for Fault Tolerance traffic that is separate from my vMotion network.
Had the same issue and tried everything suggested above except rebuilding hosts.
It came up that my Management Network ports had the same MAC addresses. Once that was fixed, vMotion works great. Check this out before rebuilding your hosts: VMware KB: vmk0 management network MAC address is not updated when NIC card is replaced or vmkernel ...
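The duplicate-MAC condition that KB describes can be spotted from the ESXi shell. A sketch: run this on each host and compare the vmk0 MACs across hosts; if two hosts report the same one, the KB's scenario applies:

```shell
# Sketch: list vmkernel interfaces with their MAC addresses on this
# host, then compare the vmk0 MAC across all hosts in the cluster.
# Duplicate vmk0 MACs after a NIC or blade swap match the KB above.
esxcli network ip interface list | grep -iE '^vmk|MAC Address'
```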