LeslieBNS9
Enthusiast

C7000 FlexFabric dropping pings intermittently to only some VM guests


Hoping someone can help here as we haven't had much luck with HP support so far.

We have a C7000 chassis that has 2 HP VC FlexFabric 10Gb/24-Port Modules. We have 4 blades in the chassis running ESXi 5.5 Update 1.


On occasion we lose pings to/from some of the VM guests running on the chassis. This does not happen to all VMs, only some VMs some of the time. I am unable to reproduce the problem on demand; I can only troubleshoot it as it happens.

We have narrowed the problem down to the ESXi hosts on the chassis. I can migrate the VM among the chassis hosts and the problem persists, but if I migrate the VM to a host not inside the chassis, all is fine.

Right now my go-to fix for the problem is to migrate the problem VM off the chassis.

We have done the following things to try and resolve this issue.

1. Upgraded all firmware for VC, OA and the Blades to the recommended firmware specs provided by HP. http://vibsdepot.hp.com/hpq/recipes/HP-VMware-Recipe.pdf

2. Reinstalled the ESXi hosts using HP's custom ISO

3. Validated that all drivers conform to HP's recommendations http://vibsdepot.hp.com/hpq/recipes/HP-VMware-Recipe.pdf

4. We noticed that the Bay 1 FlexFabric module is showing as subordinate/invalid, so we replaced the module in Bay 1. It is still showing as subordinate/invalid.

5. Validated that the network configuration in the chassis is accurate.

This is getting very frustrating to troubleshoot with HP support. They are convinced it's not a hardware problem, even though the FlexFabric module in Bay 1 still shows subordinate/invalid and the problem persists.

Does anyone have any ideas on what else to try here?


Accepted Solutions
LeslieBNS9
Enthusiast

We finally narrowed down the issue.

The problem was due to a load balancer on the network that we were having problems with. There is a customer advisory for this issue:

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/kb/docDisplay?javax.portlet.begCache...

When VC is in VLAN tunnel mode, it maintains a single MAC address table for the tunneled VC network even though it encompasses multiple VLANs (this is normal). The result is that when a host (physical or VM) inside the VC domain sends a broadcast like an ARP, it goes out the VC uplink on one VLAN, traverses the load balancer, and is re-broadcast on the load-balanced VLAN. If that VLAN is also carried on the VC uplink port, the MAC address of the host is learned outside of VC. Like any 802.1d bridge, VC then discards subsequent traffic sent to that host's MAC address and received on the VC uplink, since it has learned that the MAC address resides outside the domain. The normal MAC address aging time in VC and most other switches is 5 minutes, so this condition persists until the entry ages out.

The only way around this is to use Shared Uplink Sets or, if possible, disable gratuitous ARP on the load balancers.
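The learning/discard behaviour described above can be sketched as a toy shell model (the MAC and port names are invented; this is only to illustrate why traffic is black-holed until the entry ages out):

```shell
# Toy model of the 802.1d learning behaviour: VC learns each source MAC on
# the port where it was last seen, and discards frames destined for a MAC
# it has learned on the uplink.
learn()  { eval "port_$(echo "$1" | tr ':' '_')=$2"; }      # learn MAC $1 on port $2
lookup() { eval "echo \${port_$(echo "$1" | tr ':' '_')}"; }
forward() {  # what VC does with a frame arriving on the uplink for MAC $1
  if [ "$(lookup "$1")" = "uplink" ]; then echo discard; else echo deliver; fi
}

mac=aa:bb:cc:dd:ee:ff
learn "$mac" downlink-bay1   # VM's MAC first learned on its blade-facing port
forward "$mac"               # prints: deliver
# The VM's ARP broadcast loops through the load balancer and re-enters on
# the VC uplink, so the MAC is re-learned there:
learn "$mac" uplink
forward "$mac"               # prints: discard (until the ~5 minute aging)
```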

31 Replies
LeslieBNS9
Enthusiast

I forgot to mention that we also put the NICs in VMware in active/standby instead of active/active.

vmrulz
Hot Shot

This is a very similar configuration to what I'm running. Not sure how you're configured, but I'll tell you part of what we do and see if that helps.

We have 2 10GbE ports on the FlexFabric in Bay 1 in an LACP trunk to the upstream switch, and 2 ports in Bay 2 in an LACP trunk to the upstream switch (HP IRF cluster). You create a SUS out of each pair of uplinks and carve VLAN networks out of those SUSes. Note we are running VC firmware 4.10 and VCEM.

This presents two physical NICs via Virtual Connect to each host. If you make the mistake of trying to team these together at the ESXi end, networking will be wonky (because you can't team across VC modules, only across ports on a given module). We run ours in an active/standby configuration.

We also set up vMotion networking as direct connects between the 2 enclosures, which makes for the fastest vMotions I've ever witnessed.

HTH

Ron

JPM300
Commander

Hey Leslie,

When this ping issue occurs, can the ESXi host ping the VM successfully, or does it drop packets as well?

How do you carve up your Flex adapter for VMware?  Is each vmnic 1Gb, 2Gb, 4Gb?
How is the vSwitch setup done in this 4-server environment?

When you vMotion the VM having pinging issues to a host outside the blade chassis, does the VM stay on the same network/VLAN?

If you compare a Wireshark capture of the VM while running on the HP blade chassis to a quick capture taken when it isn't running on the blade chassis, do you get the same kind of results/broadcasts?

Does this happen to all VMs or just some when this does occur?

Does it only happen on one side of the Flex, or have both sides been acting up?  That is, which FlexFabric module is your vmnic routing to when this occurs, and if you put the VM on a different vmnic that routes to the other module, does the problem still occur?  You can bring down NICs in VMware with the following commands to test this: esxcli network nic down -n vmnicX   and   esxcli network nic up -n vmnicX.  Please test these commands on a NIC with no VMs running on it first, to make sure no outages occur.

VMware KB: Forcing a link state up or down for a vmnic interface on ESXi 5.x
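A rough sketch of that isolation test (the vmnic numbers and the target address are placeholders; run from the ESXi shell, one side at a time):

```shell
# Force traffic onto one FlexFabric module at a time and watch for drops.
# In this example vmnic0 routes to Bay 1 and vmnic1 to Bay 2.
esxcli network nic list                 # confirm current link states first

esxcli network nic down -n vmnic1       # leave only the Bay 1 path up
ping -c 20 192.0.2.50                   # ping the problem VM, note any loss
esxcli network nic up -n vmnic1

esxcli network nic down -n vmnic0       # now leave only the Bay 2 path up
ping -c 20 192.0.2.50
esxcli network nic up -n vmnic0
```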

I also found this:

Vsphere 5.5 and Emulex OneConnect 10Gb NIC trouble

Let us know,

Sorry for all the questions!

LeslieBNS9
Enthusiast

Ours is set up just like you have listed yours. The only difference is we are running VC 4.20.

LeslieBNS9
Enthusiast

JPM300 thanks for your reply.

How do you cut up your flex adapter to VMware?  Is each VMNIC 1GB, 2GB, 4GB?

The FlexNICs connect at 10Gb to the blades. Then we have 2 ports coming out of each FlexFabric bay with LACP enabled.


How is the vSwitch setup done in this 4-server environment?

There are 2 vmnics, 1 to each FlexFabric bay. We have the vmnics configured active/standby.
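For reference, that active/standby ordering can also be pinned from the CLI (the vSwitch and vmnic names are examples; the same setting lives under the vSwitch NIC Teaming policy in the client):

```shell
# Make vmnic0 (Bay 1) active and vmnic1 (Bay 2) standby on vSwitch0
esxcli network vswitch standard policy failover set \
    --vswitch-name=vSwitch0 \
    --active-uplinks=vmnic0 \
    --standby-uplinks=vmnic1

# Verify the resulting teaming policy
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
```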

When you vMotion the VM having pinging issues to a host outside the blade chassis, does the VM stay on the same network/VLAN?

Yes, it stays on the same network, even on the same backend switches that the C7000 connects to.

If you compare a Wireshark capture of the VM while running on the HP blade chassis to a quick capture taken when it isn't running on the blade chassis, do you get the same kind of results/broadcasts?

I have not tried getting a network capture together. I'll go ahead and get some captures going, although reviewing them will probably make my head explode.

Does this happen to all VMs or just some when this does occur?

It does not happen to all VMs. It only happens to a few at a time, and usually after a few hours it resolves itself, or I just migrate the VMs to non-blade hosts. Then I have to wait to see if it happens again.

Does it only happen to one side of the flex, or have both sides been acting up?  What I mean by this is which flexadapter is your VMNIC routing to when this occurs and if you put the VM on a different VMNIC that routes to the other flexadapter does the problem still occur? 

It happens from either side. We've been able to confirm this by putting the vmnics in active/standby. We ran for a while with the vmnic attached to Bay 1 as the active NIC, and then later with the vmnic to Bay 2 as active. Both resulted in the same problem.

I also found this:

Vsphere 5.5 and Emulex OneConnect 10Gb NIC trouble


I am going to try enabling the legacy driver now. It just so happens the problem is actively occurring at this moment, so I can confirm whether this helps.

LeslieBNS9
Enthusiast

Just tried the legacy be2net driver and that didn't resolve the problem.

JPM300
Commander

Hmmm, it seems like you have done most of the troubleshooting that comes to mind. I would see what the Wireshark capture shows, as it almost sounds like you are getting brief broadcast storms due to a bad part, possibly a bad Flex adapter. With your HP support case, I would lean on them to start replacing parts if the Wireshark capture comes up dry, or maybe also engage VMware support. It will probably end up back in HP's court, but there's no harm in trying.

LeslieBNS9
Enthusiast

I took some Wireshark captures from the source and destination (the problem machine). I also migrated the problem machine off the chassis and captured everything while all was working fine. After reviewing the results, nothing jumps out at me as a problem.

I can tell that the dropped pings are not even getting to their destination: every ping the destination receives, it replies to. So the packet is getting dropped before it even reaches the problem VM. What's interesting is that the behavior only presents itself when pinging to the problem VM.

Server A - Machine I am using to troubleshoot the environment

Server B - Problem VM

When I ping from A -> B you see the ping drops. But when I ping from B -> A nothing is dropping.

[attached screenshot: pingresponses.png]

Just have to narrow down where the packet is dropping, I guess. That would narrow down the problem. I'm thinking of setting up a Network Analyzer port in Virtual Connect to validate whether the packet is even getting there.
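When logging repeated ping runs from different capture points, a tiny helper like this (the function name is made up) can pull the loss percentage out of the standard ping summary line so the numbers are easy to compare over time:

```shell
# Extract the loss percentage from ping's summary line on stdin
loss_pct() {
  sed -n 's/.*, \([0-9.]*\)% packet loss.*/\1/p'
}

# Example with a canned summary line:
echo "20 packets transmitted, 17 received, 15% packet loss, time 19012ms" | loss_pct
# prints: 15
```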

I'm already leaning heavily on HP support to replace more parts. The fact that the FlexFabric module still shows "Invalid" even after being replaced concerns me, but they are leaning back, suggesting it's not a hardware problem. They can't give me an answer as to why the module is showing Invalid, though.

I also opened a case with VMware to see if they can assist in any way.

LeslieBNS9
Enthusiast

Looks like I can't configure Virtual Connect to run the Network Analyzer against the uplink ports, only against the ports hooked up to the blades. I'm going to get a capture going from the ESXi hosts and see if the packets show up there. After that, I'll set up a network analyzer port on our core switches to see if they are sending the packet over to the chassis. The only problem with all of this is that we are running 10Gb, and I'm not sure I have a server that can keep up with that.

JPM300
Commander

Yup all sounds good Leslie, good work!

It almost looks like some of the packets are dropping at the switch for some reason, almost as if some packets don't get tagged, or the switch doesn't think they belong in that network.  Very odd.  Keep us posted!

LeslieBNS9
Enthusiast

The more I dive into this, the more I'm convinced something is going on with the chassis. I just need to convince HP of that.

I gathered some network captures at various points throughout the network to narrow down where the packet loss is occurring. Here are the results.

  • VM on an ESXi host OUTSIDE the chassis to a VM on an ESXi host OUTSIDE the chassis: no ping drops.
  • VM on an ESXi host OUTSIDE the chassis to a VM on an ESXi host INSIDE the chassis: ping drops.
  • VM on an ESXi host INSIDE the chassis to a VM on an ESXi host INSIDE the chassis (not the same host as the source VM): ping drops.
  • VM on an ESXi host INSIDE the chassis to a VM on the SAME ESXi host INSIDE the chassis: no ping drops.

In the 2 cases with no ping drops, the packets do not traverse the switching on the C7000. In the 2 cases where we see drops, they do. It HAS to be something on the C7000.
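For anyone reproducing the matrix above: on ESXi 5.5 the capture points can be taken on the hosts themselves rather than needing a 10Gb analyzer server (interface names and file paths are examples):

```shell
# Capture at the host's uplink, where frames enter/leave the VC module
pktcap-uw --uplink vmnic0 -o /tmp/uplink-bay1.pcap &

# Capture on the management vmkernel port for comparison
tcpdump-uw -i vmk0 -w /tmp/vmk0.pcap &

# ...run the ping tests from each source/destination pair, then stop:
kill %1 %2
```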

LeslieBNS9
Enthusiast

vmrulz -- Do you use Shared Uplink Sets or do you do VLAN Tunneling with an Ethernet network?

vmrulz
Hot Shot

As I mentioned, we use SUSes.

LeslieBNS9
Enthusiast

So you don't let VMware manage the networks with VLAN tags? From what I can tell, a SUS strips the VLAN tags. We are using Ethernet networks with VLAN tunneling enabled.

vmrulz
Hot Shot

SUSes have been around longer than tunneling (I've used VC since version 1.x, which was quite an adventure... not!) and I've never seen the need to use tunneling. I simply carve out an Ethernet network for each VLAN from the SUS, then map those as multiple networks to the physical ports presented to the host. In ESXi, just create port groups for each VLAN on the same vSwitch. The cookbook goes over both methods, as I recall. We use VCEM (don't waste your money on that product), but the same settings are in each individual module's VCM.  The graphic below shows all the Bay 1 Ethernet networks that were carved out of "SUSA", as I call it.

[attached screenshot: pastedImage_0.png]

LeslieBNS9
Enthusiast

I think I may have to try this then. I'm assuming it retains the VLAN tag to VMware? So I'd still specify the VLAN ID in the port group on the VMware side?

[attached screenshot: vmwareportgroup.png]
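If the tags do come through, the port-group side would stay the same as today; from the CLI, the usual commands would be (the port group name, vSwitch name, and VLAN ID are examples):

```shell
# Create a port group and tag it with its VLAN ID on the standard vSwitch
esxcli network vswitch standard portgroup add \
    --portgroup-name="VLAN100" --vswitch-name=vSwitch0
esxcli network vswitch standard portgroup set \
    --portgroup-name="VLAN100" --vlan-id=100
```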

vmrulz
Hot Shot

Oh yes, the tags are still passed along for VMware. Ping me if you have any questions on setting it up. I think this is the scenario I used from the cookbook:

[attached screenshot: pastedImage_0.png]

LeslieBNS9
Enthusiast

Ok that makes more sense to me now. I am just going to hate adding ALL of my VLANs to the Shared Uplink Set. Seems like such a pain. I will try that out next.

vmrulz
Hot Shot

Yeah, that does suck... somebody created crazy class B networks at my current job, which makes for fewer VLANs but many more problems. <smacks forehead with hand>

You might look at the VCM CLI rather than the GUI if you have a lot of repetitive work like that to do. http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CBwQFjAA&url=http%3A%2F%2Fh20564...
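For example, a throwaway generator for the repetitive network definitions (the exact "add network" parameter names should be checked against your firmware's VC CLI guide; the naming scheme and function name here are invented):

```shell
# Emit one VCM CLI "add network" line per VLAN for a given Shared Uplink Set
gen_vc_networks() {
  sus=$1; shift
  for vlan in "$@"; do
    echo "add network VLAN${vlan} UplinkSet=${sus} VLanID=${vlan}"
  done
}

gen_vc_networks SUS_A 100 101 200
# prints three "add network" lines ready to paste into the VCM CLI
```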
