I have come across an interesting issue with a new HPE platform. The system is running within a C7000 BladeSystem, with BL460c Gen9 blades.
We have noticed some performance degradation on the iSCSI connections (using the Software iSCSI initiator). This traffic runs over vmnic1 and vmnic2; the relevant details from the NIC list are below.
vmnic1 0000:06:00.1 elxnet Up | Up | 10000 Full | 32:a6:05:e0:00:be 1500 Emulex Corporation HPE FlexFabric 20Gb 2-port 650FLB Adapter |
vmnic2 0000:06:00.2 elxnet Up | Up | 10000 Full | 32:a6:05:e0:00:bd 1500 Emulex Corporation HPE FlexFabric 20Gb 2-port 650FLB Adapter |
Each NIC is reporting 10000 Mb full; however, I am not able to set the speed on the ESXi server. vmnic1 reports the following advertised link modes:
[root@ESX:~] esxcli network nic get -n vmnic1
Advertised Auto Negotiation: true
Advertised Link Modes: 1000BaseKR2/Full, 10000BaseKR2/Full, 20000BaseKR2/Full, Auto
Auto Negotiation: true
Whereas vmnic2 reports the following modes:
[root@ESXi2b-14:~] esxcli network nic get -n vmnic2
Advertised Auto Negotiation: false
Advertised Link Modes: 20000None/Full
Auto Negotiation: false
This is confusing, as the settings for these are identical within OneView. Both NICs are using firmware 12.0.1110.11 from SPP 2018.06.0. The HPE ESXi image has been used, which includes driver version 12.0.1115.0; this combination shows as compatible in the VMware Compatibility Guide (I/O Device Search).
Has anyone else seen this issue? If I try to set the speed/duplex manually via esxcli, it fails with the following error in vmkernel.log:
2018-08-14T23:49:41.361Z cpu20:65677)WARNING: elxnet: elxnet_linkStatusSet:7471: [vmnic2] Device is not privileged to do speed changes
As a result, when using HCIBench to test the storage throughput, the 95th-percentile latency reads excessively high when traffic traverses vmnic2: 95%tile_LAT = 3111.7403 ms.
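To compare the two uplinks quickly, something along these lines at the ESXi shell should pull the relevant fields side by side (standard esxcli plus busybox sh; the grep pattern just selects the fields that matter here):

# Compare autoneg and advertised modes across the two iSCSI uplinks
for nic in vmnic1 vmnic2; do
  echo "=== $nic ==="
  esxcli network nic get -n $nic | grep -E "Advertised|Auto Negotiation"
done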
Any thoughts??
Interesting! Can you share the complete output of the commands below?
esxcli network nic get -n vmnic1
esxcli network nic get -n vmnic2
Cheers,
Supreet
Sure thing.
[root@ESX:~] esxcli network nic get -n vmnic1
Advertised Auto Negotiation: true
Advertised Link Modes: 1000BaseKR2/Full, 10000BaseKR2/Full, 20000BaseKR2/Full, Auto
Auto Negotiation: true
Cable Type:
Current Message Level: 4631
Driver Info:
Bus Info: 0000:06:00:1
Driver: elxnet
Firmware Version: 12.0.1110.11
Version: 12.0.1115.0
Link Detected: true
Link Status: Up by explicit linkSet
Name: vmnic1
PHYAddress: 1
Pause Autonegotiate: true
Pause RX: true
Pause TX: true
Supported Ports:
Supports Auto Negotiation: true
Supports Pause: true
Supports Wakeon: true
Transceiver: external
Virtual Address: 00:50:56:59:d7:63
Wakeon: MagicPacket(tm)
[root@ESX:~] esxcli network nic get -n vmnic2
Advertised Auto Negotiation: false
Advertised Link Modes: 20000None/Full
Auto Negotiation: false
Cable Type:
Current Message Level: 4631
Driver Info:
Bus Info: 0000:06:00:2
Driver: elxnet
Firmware Version: 12.0.1110.11
Version: 12.0.1115.0
Link Detected: true
Link Status: Up by explicit linkSet
Name: vmnic2
PHYAddress: 0
Pause Autonegotiate: true
Pause RX: true
Pause TX: true
Supported Ports:
Supports Auto Negotiation: false
Supports Pause: true
Supports Wakeon: false
Transceiver: external
Virtual Address: 00:50:56:58:05:51
Wakeon: None
Really hoping that this isn't something simple that I have missed.
Thanks, Ben.
I also tried to set the interface to 10Gb Full via esxcli:
esxcli network nic set -n vmnic2 -S 10000 -D full
It failed, as expected:
2018-08-15T10:26:55.023Z cpu17:68364 opID=e4ebaba5)Uplink: 14445: Setting speed/duplex to (10000 FULL) on vmnic2.
2018-08-15T10:26:55.024Z cpu47:65677)WARNING: elxnet: elxnet_linkStatusSet:7419: [vmnic2] Speed 10000 is not supported on this phy interface (0xc)
I have a case open with HPE on this too, interesting indeed.
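For reference, if anyone experiments with this themselves, the NIC should be able to be put back to auto-negotiation with the -a flag while watching the driver's reaction in the log, roughly like so:

# Restore auto-negotiation on vmnic2, then watch for elxnet warnings
esxcli network nic set -n vmnic2 -a
tail -f /var/log/vmkernel.log | grep -i elxnet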
Per my understanding, the issue could be the following:
esxcli network nic get -n vmnic1
Bus Info: 0000:06:00:1 --> PF 1
esxcli network nic get -n vmnic2
Bus Info: 0000:06:00:2 --> PF 2
In multi-channel mode, the same physical port is shared among multiple PFs. PF 1 could be the primary PF, and PF 2 could be treated as a non-primary PF.
The Emulex firmware might not allow the non-primary PFs to modify port-level settings such as auto-negotiation.
This would avoid multiple PFs choosing different settings, which is not possible since the physical port is the same. That could be why we are seeing the following error in the logs:
2018-08-14T23:49:41.361Z cpu20:65677)WARNING: elxnet: elxnet_linkStatusSet:7471: [vmnic2] Device is not privileged to do speed changes
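To see how the vmnics map onto the PCI functions (and therefore which ones might be the primary PFs), something like the below should be enough; note that which function is primary is my assumption based on the Emulex design, not something esxcli reports directly:

# One physical adapter, multiple PCI functions (PFs)
lspci | grep -i emulex
# PCI address bound to each vmnic (last digit = function number)
esxcli network nic list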
Good that you have already involved HPE on this. I would be very eager to know what they have to say about it.
Please consider marking this answer as "correct" or "helpful" if you think your questions have been answered.
Cheers,
Supreet
Thanks for the input so far Supreet.
In our case, vmnic1 and vmnic2 are using two different physical ports, as they leave the chassis via different interconnects.
Still chasing HPE on this; I'm sending a sizeable collection of log files over to them now. I'll keep you posted with their response.
Cheers, Ben.
Ahh! Will be eagerly waiting to know how this pans out.
Cheers,
Supreet
Just guessing.
André
Morning André
Cheers, Ben.
I have had an interesting development in this that I thought I would share.
HPE are working on this now, though I don't expect a resolution any time soon.
We are using the HPE customised ESXi 6.5 U2 image, which includes elxnet driver version 12.0.1115.0.
Driver Info:
Bus Info: 0000:06:00:2
Driver: elxnet
Firmware Version: 12.0.1110.11
Version: 12.0.1115.0
Running this version of the driver, the NIC doesn't advertise all of the correct speeds:
[root@ESXi1a-21:~] esxcli network nic get -n vmnic2
Advertised Auto Negotiation: false
Advertised Link Modes: 20000None/Full
Auto Negotiation: false
Other NICs on the same host, however, display the correct speed advertisements:
[root@ESXi1a-21:~] esxcli network nic get -n vmnic1
Advertised Auto Negotiation: true
Advertised Link Modes: 1000BaseKR2/Full, 10000BaseKR2/Full, 20000BaseKR2/Full, Auto
Auto Negotiation: true
If I install ESXi 6.5 U2 via a direct download from VMware, this installs elxnet driver version 11.1.91.0.
Driver Info:
Bus Info: 0000:06:00:2
Driver: elxnet
Firmware Version: 12.0.1110.11
Version: 11.1.91.0
Running this version of the driver, the NIC likewise doesn't advertise all of the correct speeds:
[root@localhost:~] esxcli network nic get -n vmnic2
Advertised Auto Negotiation: false
Advertised Link Modes: 20000None/Full
Auto Negotiation: false
If I use the HPE 6.0 U3 image, it installs elxnet driver version 12.0.1115.0, which exhibits the same issue as the 6.5 U2 image.
Now for the interesting part. If I install ESXi 6.0 U3 natively from the VMware website, elxnet driver version 10.2.309.6v is included.
Driver Info:
Bus Info: 0000:06:00:2
Driver: elxnet
Firmware Version: 12.0.1110.11
Version: 10.2.309.6v
This driver version reports the correct available speeds:
[root@localhost:~] esxcli network nic get -n vmnic2
Advertised Auto Negotiation: true
Advertised Link Modes: 1000baseT/Full, 10000baseT/Full, 20000baseT/Full
Auto Negotiation: false
Nothing else has changed at all on the system, other than the ESXi image that has been used.
I'm curious whether anyone else has ever come across this. It seems to be a potential driver issue, but if so, I don't understand how it hasn't been noticed in the past.
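For anyone wanting to check which elxnet build their image actually shipped with, or to swap in a different driver build for testing, the procedure is roughly the below (the VIB path is a placeholder for whichever build you download, and a reboot is needed for the new module to load):

# Confirm the installed elxnet driver VIB
esxcli software vib list | grep -i elx
# Install a different driver build for testing (placeholder path)
esxcli software vib install -v /tmp/VMW-ESX-6.5.0-elxnet-<version>.vib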
Very interesting! What if you install the latest VMware native driver on 6.5 U2? Does the issue persist? This is just to isolate whether it is a problem with all versions of the elxnet async driver.
Cheers,
Supreet
It certainly does, Supreet.
For the sake of playing devil's advocate, I even installed 6.7, and the native driver there also exhibits the same problem.
HPE are due to get access to lab hardware today/tomorrow to start replicating. I'll keep you posted!
Would love to know how this ends. Thank you for keeping us posted.
Cheers,
Supreet
I am seeing the same thing; let me know what you find out. Not being able to have a consistent host profile is driving me nuts. FYI, I am using:
Driver: elxnet
Firmware Version: 12.0.1110.11
Version: 11.4.1205.0
Interesting; good to hear that we are not alone with this issue. HPE have gone very quiet on this one at the moment, but I will keep the thread up to date as and when updates come through.
I have some progress from HPE!
They have now been able to replicate the fault and have acknowledged that this could well be a driver issue.
The issue has now been escalated from the L2 engineers to the L3 engineers for further testing. They have also said that they will be looking for other customers that have reported this issue globally. If anyone has this issue, please log a support request with HPE, then drop me an email/message on VMTN and I'll pass you my incident number to reference with HPE so they can tie the cases together.
Cheers, Ben.
Hello, I have this same issue, but we do not use iSCSI; instead we use FC. We experience disconnections from our redundant paths to our SAN, and only the hosts with the hardware from the title are affected. Any updates from HPE?
Still very much a work in progress with HPE at the moment. I'm still pushing them; the latest is that they need to work with VMware and the hardware vendor, with the potential for a new driver to be developed.
Interesting to hear that I am not the only one seeing this issue. If you have the ability to log this with HPE, more cases with the same issue will strengthen the case.
I will keep this thread up to date with anything useful as and when it arises.
Any word from HPE on this? We are running into the same issue. An interesting little twist in what we are seeing: if we put a significant amount of load on vmnic2 or vmnic3, the links drop completely. HPE hasn't been able to pin down the issue for us yet.
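In case it helps with reproduction, this is roughly how we watch for the drops during a load test (standard esxcli, nothing vendor-specific assumed):

# Watch for link flaps and driver warnings while the test runs
tail -f /var/log/vmkernel.log | grep -iE "link|elxnet"
# Per-NIC counters before/after a run, to spot errors or pause storms
esxcli network nic stats get -n vmnic2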
I have had some feedback, but nothing of any significance. They then proceeded to close my case, despite my request for more time to test and produce more evidence. While there seems to be some merit to the details below, it doesn't answer why I see the latency.
--------------------------------------------------------------------------------------------------------------------------------------------------
This behaviour is expected in the case of multi-channel modes.
The same physical port will be shared among multiple logical functions in multi-channel mode.
For example, Port #A is associated with the even-numbered logical functions (i.e. 0, 2, 4, 6, etc.)
and Port #B is associated with the odd-numbered logical functions (i.e. 1, 3, 5, 7, etc.).
The Emulex firmware design is such that only the primary logical functions (i.e. logical function 0 for Port #A and logical function 1 for Port #B) are privileged to modify port-level features like PortSpeed, Autoneg, etc.
This is to avoid multiple logical functions choosing different settings, which is not possible since the physical port is the same.
That is the reason the driver is not advertising negotiation for the non-primary logical functions.
---------------------------------------------------------------------------------------------------------------------------------------------------
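If I read their explanation against the Bus Info values from earlier in the thread, the mapping would look something like this (my own interpretation, illustrative only):

0000:06:00.0 --> even logical function 0 --> Port #A, primary (privileged to change speed)
0000:06:00.1 --> odd logical function 1 --> Port #B, primary (vmnic1: settings can be changed)
0000:06:00.2 --> even logical function 2 --> Port #A, non-primary (vmnic2: "not privileged")
0000:06:00.3 --> odd logical function 3 --> Port #B, non-primary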
I have pulled a server from the cluster and intend to do some further testing on this. However, recent workload has prevented focus on the issue; I am hoping to have some time for testing in the next week or so.