VMware Cloud Community
vertices86
Contributor

Intel X520-DA2 dropping randomly - SR-IOV different between hosts

Greetings!

Hoping someone can help me out with this, I've spent hours on it....

This is for my homelab so some hardware is unsupported. 

I have two OptiPlex 7080s with mostly identical hardware: same memory, CPU, and NICs, just some different local drives. Call them esx1 and esx2. I have combed through the BIOS on both; the versions are identical and up to date, and every setting matches. The problem is on esx2, where one of the ports on the X520-DA2 card just drops. It stops passing all traffic, even though the managed switch it's connected to still shows link. ESXi says the link is down. Rebooting the host brings it back up, but it drops again within a couple of days at most.

I have reviewed this VMware article (can't link it), as it sounds similar: "Physical network connectivity lost on servers with activated ASPM (2076374)"

I have verified that ASPM is disabled on both hosts as per the article above. I have swapped the identical SFP+ DACs. I have swapped switch ports. I have swapped the X520 out for another one. The problem is always on the same port, which leads me to believe it's an issue with ESXi. I did have to edit the firmware with ethtool to allow non-Intel DACs, but I did that to all 6 of these cards I have, and all the rest are fine. esx1 never has a problem; it's always esx2. Both are freshly installed and running VMware ESXi, 7.0.3, 20842708. Running "esxcli network nic get -n vmnic0" returns identical results on both hosts.
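
For anyone who wants to check the "identical results" comparison mechanically rather than by eye, here is a minimal Python sketch. It assumes the command output from each host has been saved to a text file first. The function name and the esx1/esx2 labels are my own, purely for illustration, and it makes no assumptions about the output format beyond it being plain text.

```python
import difflib


def diff_host_output(text_a, text_b, name_a="esx1", name_b="esx2"):
    """Return unified-diff lines between two captured command outputs.

    An empty list means the two outputs match once trailing whitespace
    and leading/trailing blank lines are ignored.
    """
    # Normalise trailing whitespace so copy/paste artefacts do not show up as diffs.
    lines_a = [line.rstrip() for line in text_a.strip().splitlines()]
    lines_b = [line.rstrip() for line in text_b.strip().splitlines()]
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""
        )
    )
```

Read the two saved files, pass their contents in, and print the returned lines. An empty result means no differences were found. The same helper would work for any other command output captured from both hosts.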

The only difference I can find is that on esx1, it says SR-IOV is "not capable" for both X520s under PCI Devices. On esx2, it shows SR-IOV as disabled, but configurable. I have no idea if this is relevant, and I have no idea why on one system it's not capable but on the other it's just disabled.

Anyone have any clue before I flatten this esx2 host and start over? I bought both of these and set them up identically, so I'm having a real hard time discovering why they appear to have some differences.

Thanks in advance.

 

3 Replies
a_p_
Leadership

Only a guess, but maybe worth a try.

I experienced a similar issue with Intel NICs some time ago. What solved it for me was disabling ACPI power savings by changing the ESXi Power Management policy from "Balanced" to "High Performance". This may of course increase power consumption, but since you say this happens every few days, it might help in troubleshooting the issue.

André

vertices86
Contributor

Thank you, I just changed both from Balanced to High Performance. Any ideas why one X520 would be SR-IOV capable and the other not? The firmware is identical on all of my X520s (whatever the latest was from Dell the other day), and both are using the ixgben driver. I can't find any differences in the BIOS between esx1 and esx2.

vertices86
Contributor

Well, that didn't work. It still dropped, and fairly quickly this time. Nothing in the logs indicates the root cause. Things hum along fine, then all of a sudden I get:

Date Time: 01/17/2023, 8:30:25 PM
Type: Warning
Target: esx2
Description: Physical NIC vmnic2 linkstate is down.
Related events: There are no related events.

While it's down, the card's link lights are still on and the switch also shows the port as up. A reboot fixes it again for a while. This, coupled with the fact that the SR-IOV options differ for some unknown reason, is just driving me batty. 😞