VMware Cloud Community
golddiggie
Champion

Intel i350 T4, ESXi 5 changes without prompting

We have several IBM x3650 M3 (and two x3550 M3) servers with quad-port Intel i350 pNICs that are having issues. The problem seems to be isolated mostly to port groups that have a single VLAN ID assigned to them. This includes the Management Network and vMotion networks on some hosts, and a VM network on others.

I've been working with IBM on this, and they're claiming it's not a hardware issue. I'm also working with VMware support (SRs are open) trying to pin down where the issue originates. I do find it interesting that this isn't happening on ALL the x3650 M3 hosts; we have two four-host clusters (for VDI) that are not having this issue.

The issue shows up as the VMs on the affected port group suddenly becoming unavailable: they don't respond to pings, they can't get out to the network, etc. If the port group has a VLAN ID set, I have to change it to either None (0) or All (4095) in order to restore the VMs' network connection. This happens without any warning or prompting. On some hosts I've had to change the port group VLAN ID, then later change it back.

As you can imagine, this is very frustrating. Not to mention it makes US look bad.

At this time, I have a host in maintenance mode so we can test with it. I've changed the management port group on it both ways and can reproduce the issue easily (for that group). Since it's not hosting VMs, I can't exactly test the vMotion network connection.
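One way to exercise the vMotion path without any VMs on the host is to ping over the VMkernel stack from the ESXi shell. A minimal sketch, with a hypothetical destination IP:

```shell
# List the VMkernel interfaces and their IPs; note which one carries vMotion:
esxcli network ip interface ipv4 get

# Ping another host's vMotion address using the VMkernel TCP/IP stack
# (10.10.20.12 is a made-up example address):
vmkping 10.10.20.12
```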

Any ideas?? Come on people, we can't be the only ones having this issue.

2 Replies
a_p_
Leadership

As a first step, I'd try to find out whether the issue is port group, vSwitch, or uplink (physical port) based. When this happens, are the VMs on that same port group still able to ping each other?
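To compare those settings quickly across hosts, you can dump them from the ESXi shell. A sketch (vSwitch0 is an example name, substitute your own):

```shell
# Port groups, their vSwitch, VLAN ID, and attached clients:
esxcli network vswitch standard portgroup list

# vSwitch-level settings (uplinks, MTU, ports):
esxcli network vswitch standard list

# Teaming/failover policy on the vSwitch:
esxcli network vswitch standard policy failover get -v vSwitch0

# Security policy (promiscuous mode, MAC changes, forged transmits):
esxcli network vswitch standard policy security get -v vSwitch0
```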

How are the policies set on the port group and the vSwitch? And how did you configure the physical switch ports?

André

golddiggie
Champion

Had the network guys involved from the start. We've ruled out the network switches as the problem.

I had them change one of the host's Management and vMotion physical ports to trunk mode (the same VLANs on both sets, instead of just one VLAN on each). I can now set the port group to use VLAN IDs and it works, at least for now. My main concern is that every fix I implement seems to be only temporary.
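For reference, the switch-side change amounts to something like the following (Cisco IOS syntax shown as one example; the interface and VLAN numbers are hypothetical):

```
interface GigabitEthernet1/0/10
 description ESXi host uplink - Mgmt/vMotion
 switchport mode trunk
 switchport trunk allowed vlan 10,20
 spanning-tree portfast trunk
```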

I'm in the process of updating the NIC drivers in ESXi 5 Update 1, as per the VMware tech... Once that's done, I'll take that host out of maintenance mode (after configuring the rest of the vSwitches and port groups) and test it.
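The driver update itself can be done from the ESXi shell. A rough sketch; the bundle filename and datastore path below are hypothetical, and the actual VIB comes from the VMware/Intel download for the igb driver:

```shell
# Check the current driver and its version for one of the i350 ports:
esxcli network nic list
ethtool -i vmnic0

# Install the updated driver from an offline bundle, then reboot the host:
esxcli software vib install -d /vmfs/volumes/datastore1/igb-driver-offline-bundle.zip
reboot
```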

I've also done a deep dive on the vSwitch and port group settings on the hosts. Everything matches up.

This is also happening on servers that have been online for some time now (well over a year); they only started having the issue fairly recently (within the last month or two).

We did have a run of NICs that shipped with the IBM servers that were NFG and had to be replaced right off the bat (nine of them, across four hosts). I've also had other issues with the servers we've been getting, which means IBM doesn't have a nice place in my heart.

I'm also not happy about how the vendor doesn't seem to test the hosts thoroughly enough before shipping them to us. It's not IBM themselves, since the company the servers are purchased through puts them together. I've had several arrive where add-on cards were not working (at all), and many arrive missing hard drive filler plates. The filler plates are minor, but not seeing hardware in the BIOS/firmware is rather big. I don't care if the diagnostic tool sees the hardware (at some level); if it doesn't show in the BIOS, then ESXi won't see it.
