The routing table within ESXi takes precedence over whatever you have configured within your portgroups.
This was not an issue in classic ESX, as the Service Console and the VMkernel each had their own separate routing table to deal with..
And I do not believe ICMP will reveal such an issue, so pinging may seem fine..
Try configuring your vMotion vmk interface on a different subnet, and I'm sure it will all be good...
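To illustrate the point, here is a minimal sketch of a subnet-overlap check (the interface names and addresses are hypothetical examples, not pulled from any host): when two vmkernel interfaces sit in overlapping subnets, a single routing table cannot cleanly distinguish them.

```python
import ipaddress

# Hypothetical vmkernel interface assignments (example addresses only).
vmk_interfaces = {
    "vmk0 (management)": ipaddress.ip_interface("192.168.128.10/24"),
    "vmk1 (vMotion)":    ipaddress.ip_interface("192.168.128.20/24"),
}

def overlapping_vmks(interfaces):
    """Return pairs of interfaces whose connected subnets overlap."""
    items = list(interfaces.items())
    clashes = []
    for i, (name_a, if_a) in enumerate(items):
        for name_b, if_b in items[i + 1:]:
            if if_a.network.overlaps(if_b.network):
                clashes.append((name_a, name_b))
    return clashes

print(overlapping_vmks(vmk_interfaces))
# With both vmks in 192.168.128.0/24 the pair is reported as a clash;
# moving vMotion to e.g. a 192.168.255.0/24 address clears it.
```

Running the same check after renumbering the vMotion vmk to a different subnet returns an empty list.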
/Rubeck
I don't believe it is changeable on the ESXi host; however, I suspect your problem may be upstream if you are having problems contacting hosts after they have been vMotioned.
As a vMotion completes, the destination host sends a notification (a RARP broadcast) that the MAC address is no longer on the original host, and the switch uses it to update its CAM table with the MAC's new location.
I think, from the description of your problem, that your switch is not accepting that notification and the CAM table is not being updated. I have had this problem myself, and it required a software update on the switch stack.
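For reference, the move notification is a reverse-ARP (RARP) frame broadcast with the VM's MAC as the source, so switches along the path relearn where that MAC lives. Below is a minimal sketch of what such a frame looks like on the wire (pure byte construction for illustration; the exact frames an ESXi host emits may differ in detail, and the MAC used here is a made-up example):

```python
import struct

def build_rarp_announcement(vm_mac: bytes) -> bytes:
    """Build an Ethernet RARP request frame of the kind a destination
    host broadcasts after vMotion so switches relearn the VM's MAC."""
    broadcast = b"\xff" * 6                             # all-ones dst MAC
    ethertype_rarp = struct.pack("!H", 0x8035)          # RARP EtherType
    header = broadcast + vm_mac + ethertype_rarp        # 14-byte Ethernet header
    # RARP payload: hw type 1 (Ethernet), proto 0x0800 (IPv4),
    # hw addr len 6, proto addr len 4, opcode 3 (reverse request)
    payload = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    payload += vm_mac + b"\x00" * 4                     # sender MAC / IP
    payload += vm_mac + b"\x00" * 4                     # target MAC / IP
    return header + payload

frame = build_rarp_announcement(b"\x00\x50\x56\xaa\xbb\xcc")
```

In a Wireshark capture this shows up as a broadcast RARP frame sourced from the VM's MAC immediately after the migration completes.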
Hope this helps.
Thank you for your help, greatly appreciated. I will forward this to our network team and see what they think.
I was told they did a Wireshark capture to see what was going on, and they saw the following for a vMotion (as I recall):
Host A's NIC talks to the virtual machine's NIC on Host B to kick off the vMotion; however, all the ACKs are returned by Host B's NIC.
Not sure if it is relevant, but it was as if the host were "spoofing" the VM NIC in order to answer on its behalf.
If something else comes to mind, feel free to add on.
Thanks!
I found an article that describes the same issue:
http://serverfault.com/questions/197918/clearing-arp-cache-on-esxi-4-1
Unfortunately no answers there either...
Definitely sounds network-related. Do you have some kind of security on the network that is stopping proxy ARPs from being propagated?
The hosts effectively proxy-ARP on behalf of the virtual machines running on board them, and when the machines move, the new host acts as a new proxy-ARP point.
If the network sees this as a "man-in-the-middle" attack, as if someone were trying to assume the identity of the machine, you may have a security mechanism stopping it.
Usually your vMotion network is at the back end and not directly addressable from the outside world, so I would check that you do not have an IDS/IPS with network reactions programmed in and running, as this could cause the problem until the CAM table ages out the MAC, at which point the new point will be accepted.
Just another thought for you: have you opened a case with VMware support?
Woot! That would make sense; I hadn't thought about that. I'll ask if any firewall rules are in use that would prevent this from happening.
Also, no, we haven't opened anything with VMware support yet; last time they were less than helpful regarding another issue we had related to SRM, so we prefer to use them as a last resort if possible.
I'll check with the security team to see if they have some rules in place for proxy arps.
Thanks again, I'll keep this forum posted (pun intended!).
I was able to hear back from the network team about security.
It seems like there is no security in place to block anything...
The issue is that the traffic is coming back from the wrong interface on the server.
The question is why the RFP ACKs keep coming back from the service console interface instead of the VMotion interface? That seems to be the problem.
Ideas and thoughts are welcome!
Ok, so no block in place, and traffic returning on the wrong interface.
So I am assuming that you have something like a 3-NIC configuration? 1 management, 1 vMotion, 1 virtual machines?
Have you made sure that you do not have vMotion enabled on the management NIC, as it may be selecting that NIC as the shortest path?
Running out of ideas, sorry!
The question is why the RFP ACKs keep coming back from the service console interface instead of the VMotion interface? That seems to be the problem.
Ideas, thoughts are welcome
Your vmk interfaces are on different subnets, right? If not, that would be the reason..
/Rubeck
We have 6 hosts, some of them have 4 nics, some have only 2.
I checked if vMotion was enabled on the mgt NIC but it wasn't.
No worries, I didn't mean to pick your brain so hard, but thank you very much for your time and ideas.
If I end up getting to the bottom of this, I'll post it here.
While drawing out the vmkernel and Service Console configuration of each host, I noticed that 2 of the 6 server hosts had their NIC teaming misconfigured.
Servers 3 and 4 were using the same active NIC for vMotion and management traffic.
I thought, yeah, great, finally! However, my coworker said that the traffic causing the issue was coming from servers 1 and 2.
How can that be? I'm puzzled here :smileyconfused:
I'll trigger some vMotions early tomorrow morning to see if there are any improvements, and I will report here.
As always, any thoughts are welcome. :smileycool:
I believe they are not; however, I could be wrong. I'll double-check and report back here. Thank you for the tip, Rubeck!
I did some vMotions this morning and, bang, flooding.
Interestingly enough, if I keep pinging the vMotion NICs, there is no flooding.
It seems that as long as the vMotion NIC's MAC is in the switch's CAM table, everything is fine. As soon as the entry is cleared and a vMotion occurs, flooding follows. :smileyconfused:
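The behavior described above matches standard Layer 2 forwarding (a toy model for illustration, not anything VMware-specific): when a destination MAC is missing from the CAM table, the switch floods the frame out of every port, and continuous pings keep the entry learned so the flood never happens.

```python
class ToySwitch:
    """Minimal model of L2 forwarding: known MACs go out one port,
    unknown MACs are flooded to all ports except the ingress port."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.cam = {}  # MAC -> port

    def learn(self, mac, port):
        self.cam[mac] = port

    def age_out(self, mac):
        self.cam.pop(mac, None)

    def forward(self, dst_mac, in_port):
        if dst_mac in self.cam:
            return {self.cam[dst_mac]}      # unicast to the known port
        return self.ports - {in_port}       # unknown destination: flood

sw = ToySwitch(ports=[1, 2, 3, 4])
sw.learn("vm-mac", 2)
print(sw.forward("vm-mac", 1))   # known MAC: delivered to one port
sw.age_out("vm-mac")             # entry aged out / never relearned
print(sw.forward("vm-mac", 1))   # unknown MAC: flooded to all other ports
```

Pinging the vMotion NIC continuously keeps its MAC entry refreshed, which lines up with the observation that the flooding only appears once the entry has been cleared.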
Is your vMotion happening on its own network segment and on its own NICs?
The vMotion happens on its own NIC, but on the same subnet as the management console.
IPs are 192.168.128.X for the Service Console and 192.168.128.Y for the vmkernel.
I can't explain why it would matter to have vMotion on its own network segment, since the switch doesn't currently get flooded while I ping the vMotion NICs continuously. :smileyconfused:
The routing table within ESXi takes precedence over whatever you have configured within your portgroups.
This was not an issue in classic ESX, as the Service Console and the VMkernel each had their own separate routing table to deal with..
And I do not believe ICMP will reveal such an issue, so pinging may seem fine..
Try configuring your vMotion vmk interface on a different subnet, and I'm sure it will all be good...
/Rubeck
Thank you Rubeck for your help.
We have changed the subnet to 192.168.255.x for the vMotion NIC; however, we cannot test right away without disrupting the users, so more testing will occur tomorrow morning.
I will keep everyone posted tomorrow.
Crossing my fingers... :smileysilly:
So far I have only been able to run a brief vMotion test this morning, and it seems like everything is fine now. I have re-enabled DRS for the day and will see if it holds, but I believe that changing the subnet to 192.168.255.x for the vMotion NIC did the trick.
Thank you Rubeck!
I'll get back to the thread at the end of the day to report status.
Rubeck, it seems you have earned the points; no more issues reported with vMotion. Thank you! :smileycool:
However, this doesn't tell us the root cause of the issue, and it seems like a bug on the VMware side.
Other thoughts:
One of the things going on is that the vMotion interface and the Service Console share a TCP/IP stack, which they never used to do pre-5 if I recall correctly, even when they were in the same vSwitch.
We think that, coupled with using the same address space, this simply allows the Service Console to go ahead and respond to the vMotion traffic.
Our guess is that changing the subnet for the vMotion traffic is what forces that traffic to be answered only by the vMotion interface on that box.
We think that moving the Service Console to another vSwitch on each of the server hosts would also do the trick without having to manage two different subnets, as that would have the effect of putting those Service Consoles and vMotion interfaces in different TCP/IP stacks.
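The theory above is consistent with plain longest-prefix routing over a single shared table (a toy model; the interface names and addresses are illustrative, not taken from the hosts): when management and vMotion vmks sit in the same connected subnet, one route entry wins for both, so replies leave via the management interface.

```python
import ipaddress

# Hypothetical single shared routing table: one connected route per vmk.
# With both vmks in 192.168.128.0/24, whichever entry is preferred
# (here: the one listed first) answers for the whole subnet.
routes = [
    (ipaddress.ip_network("192.168.128.0/24"), "vmk0 (management)"),
    (ipaddress.ip_network("192.168.128.0/24"), "vmk1 (vMotion)"),
]

def egress_interface(dst, table):
    """Pick the first longest-prefix match, as a single shared stack would."""
    matches = [(net, name) for net, name in table
               if ipaddress.ip_address(dst) in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(egress_interface("192.168.128.50", routes))   # management wins
# After renumbering vMotion to 192.168.255.0/24, its subnet gets a
# distinct route, and vMotion peers are reached via vmk1 instead.
```

With the vMotion vmk renumbered into its own subnet, the lookup for a vMotion peer matches only the vMotion route, which mirrors the fix that worked in this thread.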
More testing will follow.
Anyway, for now it works.
Thanks everyone!
Happy to hear you've got it working....:-)
It actually doesn't seem like a bug when reading the multi-homing KB here: http://kb.vmware.com/kb/2010877
But that KB kind of contradicts the guide for setting up multi-NIC vMotion, where multiple vmknics have to be configured with IPs belonging to the same subnet: http://kb.vmware.com/kb/2007467
I don't really know what the deal is here...
/Rubeck
