Hi,
Disclaimer: I don't have in-depth knowledge of VMware, so please excuse any wrong wording, misconceptions, and ignorance in the post below.
The current topology is as follows:
Each ESXi 4.1 Update 3 host (DL380 G8) is connected to both layer 2 switches.
On each host, the vSwitch has the two NICs configured as active/active, with the default NIC teaming method (route based on the originating virtual port ID).
Everything else is default.
The switches are connected via a trunk link (not stacked).
I have two Windows Server 2008 R2 VMs in the same subnet and have enabled IPv6 on them (default setting).
When the two VMs are on the same physical host, ping -6 destination_ipv6_address works (I just use the link-local address).
When the two VMs are on different hosts, ping fails with a "destination unreachable" message, which usually means the neighbor discovery process fails (analogous to ARP in IPv4, where the source VM can't get the destination VM's MAC address).
When the two VMs are on the same physical host, a packet capture shows that the neighbor solicitation message is sent to an IPv6 multicast address.
When they are not on the same physical host, a packet capture on the destination VM shows that it never receives the IPv6 multicast packet.
I then connected two DL380 G8s to the switches in a similar way, installed Windows Server 2008 R2 on them directly without virtualisation, and ping -6 works perfectly.
My questions are:
- Am I missing a configuration somewhere to allow IPv6 multicast to work? Or even to remove any "logic" and just treat it as a broadcast?
On network switches you can do this by turning off IGMP (or, for IPv6, MLD) snooping, which makes the switch flood multicast like broadcast packets.
I can't find a similar setting anywhere under ESXi, though.
- I saw an "enable IPv6" option on ESXi, but I assume this is only useful if the host itself wants to participate in IPv6, and therefore not applicable to my particular case?
The only similar issue I found from searching is on the link below, which suggests hardcoding the neighbor table on the VMs, which is not ideal.
I can confirm, though, that hardcoding the neighbor table on both VMs works. So the problem seems to lie in how ESXi vSwitches handle IPv6 multicast traffic.
Ideas and insights are greatly appreciated.
Ed
- Am I missing a configuration somewhere to allow IPv6 multicast to work? Or even to remove any "logic" and just treat it as a broadcast?
On network switches you can do this by turning off IGMP (or, for IPv6, MLD) snooping, which makes the switch flood multicast like broadcast packets.
I can't find a similar setting anywhere under ESXi, though.
No, it should work out of the box. The vSwitch makes forwarding decisions based on the layer 2 addresses and shouldn't care about the protocols above.
Can both your VMs communicate via IPv4/layer2-broadcasts when spread across hosts?
I've tested this myself on ESXi 5.1 U1 with two Windows 2008 R2 VMs with the vmxnet3 vNIC, and ICMPv6 Neighbor Discovery for link-local IPv6s works fine on the same host as well as across hosts.
The destination IPv6 address of the solicitation message is the usual solicited-node multicast address of the target, and the layer 2 destination MAC is derived from it, falling in the IPv6 multicast MAC range with the 33:33 prefix.
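To illustrate that derivation, here is a small sketch per RFC 4291 (solicited-node address) and RFC 2464 (multicast MAC mapping); the link-local address used at the bottom is a hypothetical example, not taken from this setup:

```python
import ipaddress

def solicited_node(addr: str):
    """Derive the solicited-node multicast address (RFC 4291) and the
    layer-2 multicast MAC it maps to (RFC 2464) for an IPv6 address."""
    a = ipaddress.IPv6Address(addr)
    # Solicited-node address: ff02::1:ff00:0/104 plus the low 24 bits
    snma = ipaddress.IPv6Address(
        int(ipaddress.IPv6Address("ff02::1:ff00:0")) | (int(a) & 0xFFFFFF)
    )
    # Multicast MAC: 33:33 followed by the low 32 bits of the destination IPv6
    low32 = int(snma) & 0xFFFFFFFF
    mac = "33:33:" + ":".join(f"{(low32 >> s) & 0xFF:02x}" for s in (24, 16, 8, 0))
    return str(snma), mac

# Hypothetical link-local address; prints the solicited-node address and MAC
print(solicited_node("fe80::3c4a:92ff:feb2:2bfc"))
```

If a capture on the sending VM shows the correct 33:33 destination MAC but the frame never reaches the peer, the drop is happening somewhere between the two vSwitch uplinks.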
- I saw an "enable IPv6" option on ESXi, but I assume this is only useful if the host itself wants to participate in IPv6, and therefore not applicable to my particular case?
Correct, that is only for the host management network itself and not related to VM traffic.
Hi,
Thank you for looking at the issue.
Over the weekend, I managed to test the issue in two other environments.
- Current stage environment with the problem:
Physical DL380 G8 servers
Catalyst switches
ESXi 4.1 Update 3
Failure in inter-host ICMPv6
- Dev environment:
Physical DL380 G7 servers
Catalyst switches
ESXi 5.5
Success in inter-host ICMPv6
- Prod environment:
DL380 G6 blades with HP Virtual Connect
Catalyst switches
ESXi 4.1 Update 3
Success in inter-host ICMPv6
I've also found a possible culprit:
A combination of DL380G8 and vmware:
https://communities.vmware.com/message/2321555
For now, I am creating a startup script on the source and target VMs that hardcodes each other's MAC addresses in the neighbor table.
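For anyone else needing this workaround: on Windows Server 2008 R2 a static neighbor entry can be added with netsh (the interface name, IPv6 address, and MAC below are placeholders to adapt):

```shell
:: Add a permanent neighbor entry so Neighbor Discovery is not needed
:: for this peer (interface name, address and MAC are placeholders)
netsh interface ipv6 add neighbors "Local Area Connection" fe80::3c4a:92ff:feb2:2bfc 3c-4a-92-b2-2b-fc

:: Verify -- the entry should be listed with state "Permanent"
netsh interface ipv6 show neighbors "Local Area Connection"
```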
What worries me is that we just purchased DL380 G8 blades that will replace the production blades; luckily the issue only affects IPv6 multicast.
In all environments I can confirm that IPv4 multicast works; this is how the first part of "ping -6 target_host" succeeds, resolving the name over LLMNR.
The second part, where the Neighbor Solicitation message is sent to an IPv6 multicast address, is the broken part.
I'll keep hunting, and I'll try to test this on our prod cluster as well.
- Prod environment:
DL380 G6 blades with HP Virtual Connect
Catalyst switches
ESXi 4.1 Update 3
Success in inter-host ICMPv6
we just purchased DL380 G8 blades
A bit confusing here. I suppose you're referring to HP BL blades? A DL (380) system is a rack server, not a blade.
- Current stage environment with the problem:
Physical DL380 G8 servers
Catalyst switches
ESXi 4.1 Update 3
Failure in inter-host ICMPv6
Looking at the other thread, did you make sure the firmware and drivers are up to date? What NICs do you have in your 380 gen8 servers where you experience the problem?
Check and post the output of esxcfg-nics -l and ethtool -i vmnicX for each physical NIC on these hosts.
You can find more recent drivers for ESXi 4.1 here or on the HP support site:
Hi,
MKguy wrote:
- Prod environment:
DL380 G6 blades with HP Virtual Connect
Catalyst switches
ESXi 4.1 Update 3
Success in inter-host ICMPv6
we just purchased DL380 G8 blades
A bit confusing here. I suppose you're referring to HP BL blades? A DL (380) system is a rack server, not a blade.
It is indeed BL380; apologies for the shoddy copy-and-paste job.
- Current stage environment with the problem:
Physical DL380 G8 servers
Catalyst switches
ESXi 4.1 Update 3
Failure in inter-host ICMPv6
Looking at the other thread, did you make sure the firmware and drivers are up to date? What NICs do you have in your 380 gen8 servers where you experience the problem?
Check and post the output of esxcfg-nics -l and ethtool -i vmnicX for each physical NIC on these hosts.
You can find more recent drivers for ESXi 4.1 here or on the HP support site:
Hmm, they are actually DL360 G8s, which have the same Broadcom NICs as the DL380s.
Here is the output:
~ # esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:03:00.00 tg3 Up 1000Mbps Full 3c:4a:92:b2:2b:fc 1500 Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic1 0000:03:00.01 tg3 Up 1000Mbps Full 3c:4a:92:b2:2b:fd 1500 Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic2 0000:03:00.02 tg3 Up 1000Mbps Full 3c:4a:92:b2:2b:fe 1500 Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
vmnic3 0000:03:00.03 tg3 Up 1000Mbps Full 3c:4a:92:b2:2b:ff 1500 Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet
~ # ethtool -i vmnic0
driver: tg3
version: 3.123b.v40.1
firmware-version: 5719-v1.29 NCSI v1.0.80.0
bus-info: 0000:03:00.0
~ # ethtool -i vmnic1
driver: tg3
version: 3.123b.v40.1
firmware-version: 5719-v1.29 NCSI v1.0.80.0
bus-info: 0000:03:00.1
~ # ethtool -i vmnic2
driver: tg3
version: 3.123b.v40.1
firmware-version: 5719-v1.29 NCSI v1.0.80.0
bus-info: 0000:03:00.2
~ # ethtool -i vmnic3
driver: tg3
version: 3.123b.v40.1
firmware-version: 5719-v1.29 NCSI v1.0.80.0
bus-info: 0000:03:00.3
Let me check the drivers. I assume this is something I have to install from the console?
Cheers
Ed
I'm not sure whether this will really solve your specific problem, but it's worth a try to update the NIC firmware and driver.
Seems like this is a HP NC 331FLR NIC (DL gen8 default 4-port NIC with the BCM5719 chip).
There's no update binary you can run from 4.1, but you can update all firmware components with the current HP Service Pack for ProLiant image:
Or boot the server into a live-Linux of your choice and use the Linux update binary:
http://www.hp.com/swpublishing/MTX-ec0e18db6a8e4d978b57aa95d1
These will update the 331FLR NIC to Boot Code version 1.37/NCSI 1.2.37.
Then update the tg3 driver in ESXi with this offline bundle to 3.129d.v40.1:
You need the offline bundle file (BCM-tg3-3.129d.v40.1-offline_bundle-1033618.zip) from this package. You can import it into vCenter Update Manager for easier deployment, or (probably) install it from the ESXi shell with esxupdate --bundle=/tmp/BCM-tg3-3.129d.v40.1-offline_bundle-1033618.zip
I'm a bit rusty in the ESXi 4.1 CLI department though; you might have to resort to the vihostupdate utility, or install it remotely with PowerCLI's Install-VMHostPatch:
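A rough sketch of those remote alternatives, assuming the bundle has already been downloaded (host name and paths are placeholders; double-check the exact parameters against the vSphere 4.1 documentation):

```shell
# vSphere CLI (run from a vMA appliance or a vCLI workstation, not on the host):
vihostupdate --server esxi01.example.com --username root --install --bundle /tmp/BCM-tg3-3.129d.v40.1-offline_bundle-1033618.zip

# PowerCLI alternative; -HostPath points at a copy of the bundle already
# uploaded to a datastore on the host (path is a placeholder):
# Install-VMHostPatch -VMHost esxi01.example.com -HostPath /vmfs/volumes/datastore1/BCM-tg3-3.129d.v40.1-offline_bundle-1033618.zip
```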
Awesome,
Thank you for detailing the upgrade process.
Will get this scheduled, if it ever gets approved (not sure we can get by with one less host for now; it may have to be postponed until we get another box).
Thanks again for your help. I now have a workable workaround as well as a plan to try to fix it.
Ed