fmcphail
Contributor

NIC Teaming ESX3.5U4 - Extreme x450 Summit Stack - HP c7000 Blade

Hi, I've come across an issue with either VMware or my Extreme switches.

Not being sure where the problem is, and having had not-so-great interest from VMware and Extreme, I thought I'd ask the community about the issue.

I have a set of eight Extreme switches set up in a stack: 2 x x450a and 6 x x250e (PoE). I have created a set of LAGs (aggregated groups) on my x450 switches, all forced to 1000 FULL, as are the NICs in ESX. We have two Ethernet pass-through modules in the rear of the c7000.

For example:

configure ports 1:1 auto on speed 1000 duplex full

configure ports 1:1 auto-polarity off

configure ports 2:1 auto on speed 1000 duplex full

configure ports 2:1 auto-polarity off

enable sharing 1:1 grouping 1:1, 2:1 algorithm address-based L3

Have also tried:

enable sharing 1:1 grouping 1:1, 2:1 algorithm address-based L3 lacp
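
For anyone comparing configs, the usual ExtremeXOS commands to check the resulting LAG (just a sketch, using the same port numbers as above):

show sharing                     # lists the load-sharing groups and which member ports are active
show lacp                        # LACP state; only relevant for the lacp variant above
show ports 1:1,2:1 information   # confirms link state, speed and duplex on the member ports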

In the ESX configuration for each host, I have then set the NICs to be teamed, with load balancing based on IP hash.
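
For reference, the uplinks get attached on each host roughly as below, from the ESX 3.5 service console (a sketch; vmnic0/vmnic1 and vSwitch0 are placeholders for the actual adapters and vSwitch):

# attach both physical NICs to the same vSwitch so they can be teamed
esxcfg-vswitch -L vmnic0 vSwitch0
esxcfg-vswitch -L vmnic1 vSwitch0
# verify both uplinks now show against the vSwitch
esxcfg-vswitch -l
# the load-balancing policy itself ("Route based on ip hash") is then set on the
# vSwitch's NIC Teaming tab in the VI Client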

When I attempt to push traffic across these links I see packet loss of between 2% and 12%. This occurs when transferring data or sending ICMP requests from the VMs on the ESX hosts to a non-ESX host. If I do the same test to a VM on another ESX host (e.g. VM1 on ESX01 to VM2 on ESX02) I have no packet loss. I also have no packet loss when testing entirely outside of the ESX hosts. In addition, I don't appear to lose any traffic when transferring data to the ESX host itself.

Thus:

LAPTOP > SWITCH > ESX HOST > VM = packet loss

LAPTOP > SWITCH > ESX HOST = no packet loss

VM > ESX HOST 01 > SWITCH > ESX HOST 02 > VM = no packet loss

VM > ESX HOST > SWITCH > LAPTOP = packet loss
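
(For completeness, the loss figures come from simple data transfers and ICMP tests along the lines of the sketch below; the addresses are placeholders for our actual VM and host IPs.)

# from the laptop (or any non-ESX Linux box) towards a VM and towards its ESX host
ping -c 200 <vm-ip>        # shows the 2-12% loss
ping -c 200 <esx-host-ip>  # no loss
# then the same test in the reverse direction, from inside a VM towards the laptop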

Looking through our switch logs I can't see any issues, and Extreme Networks have confirmed that the switch port setup is correct. VMware support have had a look at the logs and advised that there are no issues they can see.

I've replaced all the cables, re-cabled twice, and had three other people look over the cabling to ensure the correct ports were connected and we didn't have a mismatched port combo. STP has been enabled on the switches and showed no issues. The servers in the c7000 (BL460 and BL480 servers) have all had the latest firmware packs applied as well.

Just wondering if anyone has seen this issue and had it resolved.

Thanks.

deecha
Contributor

Hi:

I am curious to know whether you found a solution to your problem?

Do you notice a problem like this on non-aggregated ports connected to the VMware machines?

Thanks

deecha
Contributor

Try upgrading to vSphere 4.0 and see if it makes the problem go away.

I also noticed that you have grouped ports from two different slots. I don't know if this could pose an issue.

tcutts
Enthusiast

I am currently seeing the same issue on our vSphere ESX 4 servers, connected to an Extreme switch stack. Our physical servers are in a single c7000 chassis with dual GigE pass-through modules for connection to the switches.

We now suspect, in our case, that this is a hardware fault, nothing to do with VMware, and the reason is simple:

We reinstalled one of the blade servers with a regular Linux OS, using only one of its NICs. It still has terrible network performance and experiences packet loss. We patched it into a different switch, and it was the same deal. So it's not the switches, and it's not VMware. Our current theory is a bad midplane; I've tried removing each of the pass-throughs in turn (which of course drops one of the network paths to each vSwitch) but the performance hit remains, so unless both pass-throughs are faulty (doubtful) that leaves the midplane.
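
In case anyone wants to reproduce that per-blade test, a quick way to quantify it (a sketch, not exactly what we ran; it assumes iperf is available on the blade and on a test box outside the chassis, and the host name is a placeholder):

# on a test box outside the c7000
iperf -s
# on the reinstalled blade, using its single NIC
iperf -c <test-box> -t 30 -i 5   # throughput collapses when the problem is present
ping -c 500 <test-box>           # the packet loss shows up here as well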

Of course, we're now left with planning how to replace the midplane without total downtime. The most likely course of action is to temporarily deploy some additional servers elsewhere as ESX servers in the same cluster, and migrate the VMs off in order to (a) test whether the performance is back to normal in their new temporary home and (b) replace the midplane if necessary.

Regards,

Tim

fanu
Contributor

Hi,

We probably have the same problem on 2x c7000 enclosures.

Our config is: 2x c7000, with 16x BL460c G1 and 4x HP 1Gb Ethernet Pass-Thru Module for c-Class BladeSystem in each.

When we plug 15 network cards from one enclosure into the same switch: no problem.

When we plug 16 network cards from one enclosure into the same switch: 2 to 10% packet loss.

When we plug 17 network cards from one enclosure into the same switch: 50% packet loss.

Did your midplane replacement solve the problem?

Regards,

François FANUEL

tcutts
Enthusiast

Yes, in our case the midplane replacement solved the issue completely. Do some of the tests I did though, to try to pin it down. Re-deploying one of the blades with a normal OS was the kicker for us.

Regards,

Tim

fanu
Contributor

Thanks a lot for your quick answer.

Can you provide us with your HP case number? That may really speed up the diagnosis on HP's side.

Thanks a lot for your help and your time.

Regards,

François FANUEL

tcutts
Enthusiast

I'm afraid to say that my optimism was misplaced. The problem recurred about two weeks ago, with exactly the same symptoms. So I did the same as before: one by one I put the hosts into maintenance mode and moved them to a second chassis. I left only one machine behind (not an ESX host, but a Debian host). Once all the ESX hosts were gone, I tested the network performance of the remaining Debian host, and it had returned to normal, as had the network performance of all the VMs in their new home.

Therefore, I am no longer convinced this is a hardware problem, per se, even though it looked like it.

I'm flummoxed again. Clearly, something happens which makes network performance abysmal. Ping times to virtual machines become extremely erratic, and the performance of some protocols becomes truly dreadful (reading from an external NFS server at only 81 kilobytes per second, for example). As soon as all the ESX servers were moved to a different chassis (but still connected to the same switch), reading from the same NFS server was more than a thousand times faster.

What could this be? I don't know. It's possible the network cards are getting into a bad state and rebooting the physical hosts clears it. It's also possible that something ESX does confuses the midplane of the chassis, or the GigE pass-through modules, but I doubt that.

Are you using VMware's drivers for the network card, or are you using the supplemental drivers from HP's website? I'm using HP's drivers.

One suggestion a colleague made concerns the TCP offload settings. We have a lot of HP blades in other roles, and colleagues have had trouble with TCP offload. I'm planning to have a look at this today, if I can find out where to configure it; detailed documentation seems to be rather thin on the ground.
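
For what it's worth, this is the sort of thing I'll be trying (a sketch only; I haven't confirmed these are the right knobs for our drivers):

# on a plain Linux blade (e.g. the Debian host): check and disable TCP segmentation offload
ethtool -k eth0            # show the current offload settings
ethtool -K eth0 tso off    # turn TSO off for a test
# on ESX, hardware TSO can be toggled via an advanced setting, something like:
esxcfg-advcfg -g /Net/UseHwTSO    # read the current value
esxcfg-advcfg -s 0 /Net/UseHwTSO  # disable it (I believe a reboot is needed for it to take effect)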

fanu
Contributor

Hi,

We finally found the real cause of our similar problem.

Remember, we've got 2x c7000 enclosures with 16 blades & 4x 16-port GigE pass-through modules in each.

The problem happens when more than 15 network ports from one enclosure (regardless of which pass-through module) are connected to one switch. With 16 ports connected we get about 5% packet loss; with 17 ports connected we get about 50% packet loss ... and when we shut 2 ports (back to 15 ports connected) we're back to normal.

So we opened a case with HP support. The level-2 engineer asked for the serial numbers of our pass-through modules and found that all of our modules are in a faulty batch.

If it can help, the internal HP solution ID for this problem is: emr_na-c01711775-4

HP support sent us new modules. We'll change them at the beginning of January. I'll report back on whether that solves the issue.

Regards,

François FANUEL

tcutts
Enthusiast

That's very interesting indeed. We do have all 16 ports of our pass-throughs in use. Does the problem occur as soon as all 16 ports are connected, or do they all have to be actively in use? Do you happen to have the serial numbers of your faulty modules to hand, so I can compare them with ours? Feel free to send me a private message if you don't want them recorded publicly...

Tim

fmcphail
Contributor

Hi, we eventually had our pass-through modules replaced. I literally had to beg HP for assistance, and when it wasn't forthcoming I asked a friend, who lent me one for a week. With the new pass-through in place, the problem disappeared.

After this we were eventually able to get HP to swap out the pass-through modules. We've more or less been problem free since then. (More or less...)
