VMware Cloud Community
LeslieBNS9
Enthusiast

C7000 FlexFabric dropping pings intermittently to only some VM guests

Hoping someone can help here as we haven't had much luck with HP support so far.

We have a C7000 chassis that has 2 HP VC FlexFabric 10Gb/24-Port Modules. We have 4 blades in the chassis running ESXi 5.5 Update 1.


On occasion we will lose pings to/from some of the VM guests running on the chassis. It does not affect all VMs, only some of them some of the time. I am unable to reproduce the problem on demand; I can only troubleshoot it as it happens.

We have narrowed the problem down to the ESXi hosts on the chassis. If I migrate an affected VM among the chassis hosts the problem persists; if I migrate it to a host outside the chassis, everything is fine.

Right now my go-to fix is to migrate the affected VM off the chassis.
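To catch the drops with timestamps (so they can be correlated with which host a VM is running on), something like the following minimal Python sketch could be left running; the hostnames below are hypothetical placeholders, not our actual VMs:

# Minimal ping-loss logger: one echo per VM every few seconds, log drops with a timestamp.
import datetime
import subprocess
import time

VMS = ["vm-on-chassis-01", "vm-on-chassis-02", "vm-off-chassis-01"]  # placeholder names

while True:
    for vm in VMS:
        # One ICMP echo with a 2-second timeout (Linux ping syntax).
        result = subprocess.run(["ping", "-c", "1", "-W", "2", vm],
                                stdout=subprocess.DEVNULL)
        if result.returncode != 0:
            print(f"{datetime.datetime.now().isoformat()} DROP {vm}")
    time.sleep(5)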

We have done the following to try to resolve this issue:

1. Upgraded all firmware for the VC modules, OA, and blades to the versions recommended by HP: http://vibsdepot.hp.com/hpq/recipes/HP-VMware-Recipe.pdf

2. Re-installed the ESXi hosts using HP's custom ISO

3. Validated that all drivers conform to HP's recommendations http://vibsdepot.hp.com/hpq/recipes/HP-VMware-Recipe.pdf

4. We noticed that the Bay 1 FlexFabric module was showing as subordinate/invalid, so we replaced the module in Bay 1, but it still shows as subordinate/invalid.

5. Validated that the network configuration in the chassis is accurate.

This is getting very frustrating to troubleshoot with HP support. They are convinced it is not a hardware problem, even though the FlexFabric module in Bay 1 still shows subordinate/invalid and the problem persists.

Does anyone have any ideas on what else to try here?

31 Replies
LeslieBNS9
Enthusiast

I switched over to Shared Uplink Sets tonight. I'm going to monitor tonight and tomorrow to see if that resolves the problem.

LeslieBNS9
Enthusiast

So far, since I implemented SUS, the problem has not presented itself. The problem has been intermittent, though, and I cannot force it to happen, so I just have to wait and see whether it occurs again to know for sure. It usually starts occurring again within a week.

LeslieBNS9
Enthusiast

It's been two weeks since we switched to SUS and we have not seen the problem occur again. I am continuing to work with HP support to confirm whether the issue only appears when using VLAN tunneling. SUS won't work for us long term because of the limit of 162 networks that can be mapped in a server profile, and we are very close to that threshold.

Heidrick
Contributor

We had a similar issue with pings dropping and found that we were overrunning a port group on the line card of the uplink switch connecting to the VC modules. The problem was usually noticed when there was a lot of traffic on the system.

We are using 4 x 10Gb uplinks from the 4507, split between two line cards. Each line card's backplane is rated at 48 Gb/s, but that bandwidth is divided among the 12 ports in groups of 3, giving each port group 12 Gb/s of throughput. Here's a simple diagram of the port groups, line card 1: 000 111 222 333 and line card 2: 444 555 666 777. We had the uplinks for the VC1 module plugged into the first two ports of group 000 and the first two of group 444.

When looking at the switches we found overrun errors, but only after upgrading all the firmware on the C7000 and blade servers to 4.9x; before the firmware upgrade those errors weren't present on the switch, yet the problem persisted. We moved the second uplink of each VC module to groups 111 and 555 respectively, which eliminated the overrun errors, and the ping problem disappeared.
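To illustrate the math above, here is a rough Python sketch of that port-group layout (assuming 12 ports per line card in groups of 3, each group sharing roughly 12 Gb/s); it simply flags when two 10Gb uplinks land in the same group:

# Rough oversubscription check for the assumed line-card port grouping.
GROUP_SIZE = 3
GROUP_BANDWIDTH_GBPS = 12
UPLINK_SPEED_GBPS = 10

def port_group(port):
    # Ports numbered 1..12 on a line card; groups of three share one buffer/bandwidth pool.
    return (port - 1) // GROUP_SIZE

def check(uplink_ports):
    groups = {}
    for p in uplink_ports:
        groups.setdefault(port_group(p), []).append(p)
    for g, ports in sorted(groups.items()):
        demand = len(ports) * UPLINK_SPEED_GBPS
        state = "OVERSUBSCRIBED" if demand > GROUP_BANDWIDTH_GBPS else "ok"
        print(f"group {g}: ports {ports} -> {demand} Gb/s ({state})")

check([1, 2])   # both uplinks in the first group: 20 Gb/s into a 12 Gb/s pool
check([1, 4])   # spread across two groups: within each group's budget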

LeslieBNS9
Enthusiast

We re-enabled VLAN tunneling over the weekend to see if the problem would return. Immediately after enabling VLAN tunneling the problem started up again. I moved back to SUS and the problem went away.

We have a support call with HP soon to take some network captures and pin down where the packets are being dropped, so we will see what comes of that.

JPM300
Commander

Thanks for the update, LeslieBNS9. Let us know what HP has to say when they find a resolution for you.

LeslieBNS9
Enthusiast

We finally narrowed down the issue.

The problem was caused by a load balancer on the network we were having trouble with. There is an HP customer advisory for this issue:

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/kb/docDisplay?javax.portlet.begCache...

When VC is in VLAN tunnel mode, it maintains a single MAC address table for the tunneled VC network even though it encompasses multiple VLANs (this is normal). The result is that when a host (physical or VM) inside the VC domain sends a broadcast like an ARP, it is sent out the VC uplink on one VLAN, traverses the load balancer, and is rebroadcast on the load-balanced VLAN. If that VLAN is also sent to the VC uplink port, the MAC address of the host is learned outside of VC. Like any 802.1d bridge, subsequent traffic sent to that host's MAC address and received on the VC uplink is discarded, because VC has learned that the MAC address resides outside the domain. The normal MAC address aging time in VC and most other switches is 5 minutes, so this condition persists until the entry ages out.
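As a rough illustration of that behavior (a toy model, not VC's actual implementation), a simple 802.1d-style MAC table with a 5-minute aging timer shows why traffic for the host is discarded until the entry ages out:

# Toy 802.1d-style MAC table: learn on ingress, age out after 5 minutes,
# discard frames whose destination was learned on the same port they arrive on.
import time

AGING_SECONDS = 300  # typical default MAC aging time

class MacTable:
    def __init__(self):
        self.entries = {}  # mac -> (port, learned_at)

    def learn(self, mac, port):
        self.entries[mac] = (port, time.time())

    def lookup(self, mac):
        entry = self.entries.get(mac)
        if entry and time.time() - entry[1] < AGING_SECONDS:
            return entry[0]
        self.entries.pop(mac, None)  # aged out or never learned
        return None

    def forward(self, dst_mac, ingress_port):
        out = self.lookup(dst_mac)
        if out == ingress_port:
            return "discard"  # destination believed to be on the side the frame came from
        return out or "flood"

table = MacTable()
# The VM's ARP loops through the load balancer and re-enters on the uplink,
# so the VM's MAC (placeholder value) is learned on the uplink instead of its downlink.
table.learn("aa:bb:cc:dd:ee:ff", "uplink")
# Return traffic for the VM also arrives on the uplink -> discarded for up to ~5 minutes.
print(table.forward("aa:bb:cc:dd:ee:ff", "uplink"))  # -> discard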

The only ways around this are to use Shared Uplink Sets or, if possible, to disable gratuitous ARP on the load balancers.

JPM300
Commander

Thanks for the update. Just for reference, what kind of load balancers are you using?

LeslieBNS9
Enthusiast

NetScaler VPXs. We are going to try some configuration changes on our NetScalers to disable gratuitous ARP.

LeslieBNS9
Enthusiast

We added subnet IP addresses (SNIPs) to our NetScalers; previously we were using just mapped IP addresses (MIPs). Apparently the way the NetScalers handle these different types of IP addresses caused the ARP requests to go out on all interfaces. Our problem is now resolved.

JPM300
Commander

Oh nice, thanks for the update. How did you end up tracking that one down?

LeslieBNS9
Enthusiast

We did some packet sniffing on the traffic from the NetScaler and noticed the gratuitous ARPs were only happening for some IP addresses, specifically for all of the services on the NetScaler that were listed as down. We had decommissioned some services on the NetScaler but had not gone back and cleaned anything up.
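For reference, here is a minimal sketch of how such gratuitous ARPs could be spotted with the Scapy library (a sketch only, not our actual capture; the NetScaler MAC below is a hypothetical placeholder):

# Watch ARP traffic and flag gratuitous ARPs (sender IP == target IP) from a given MAC.
from scapy.all import ARP, sniff

NETSCALER_MAC = "00:50:56:00:00:01"  # placeholder MAC of the load balancer

def check_garp(pkt):
    if ARP in pkt:
        arp = pkt[ARP]
        # In a gratuitous ARP the sender and target protocol addresses match.
        if arp.psrc == arp.pdst and arp.hwsrc.lower() == NETSCALER_MAC:
            print(f"GARP from {arp.hwsrc} for {arp.psrc}")

# Capture ARP traffic only; requires root privileges.
sniff(filter="arp", prn=check_garp, store=False)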

Then we wondered how to work around the issue for legitimate services that are only temporarily down (e.g. during reboots). Our Citrix guy did a bunch of research from there and found the mapped/subnet IP issue.

Now that we have that all resolved, we are going to implement VLAN tunneling again and see how things go. :)
