VMware Cloud Community
Kevin_Brown
Contributor

vSwitch drops adapters ESX 3.0.1/3.0.2

Hello, ran across an interesting issue yesterday and wondering if anyone else has seen it.

Upgraded the memory in 4 Dell 6850 servers yesterday from 32GB to 64GB. While booting, 3 of them restarted at least once after running a reconfiguration process during the ESX load. (It may have happened on all 4, but we were not watching the console on the first server.) Anyway, when the last 3 of the 4 servers finally booted, we noticed that one of the 2 vSwitches that connect to the physical environment had lost both of its network adapters, and was therefore not servicing some of the VLANs that the VMs were on.

Has anyone seen this behavior before? I did find a VMware knowledge base article that talks about how adding network adapters can cause the existing adapters to be renamed, and therefore become unlinked (un-associated) from a vSwitch. I am wondering if this is what happened: adding memory caused the reconfiguration, and something in the existing network configuration got reconfigured as well, causing a rename or similar? It looks like the NICs may in fact have been renamed, so that knowledge base article would seem to apply. Looking more in depth at the network configs across the 3 servers in question, the assigned NIC names are not consistent with each other, nor are they consistent with the initial server. Is there a way to "lock" the names I want to the NICs and make sure they aren't changed by a reconfiguration process? Or is this something I am just going to have to live with and document as something to check?

Any thoughts or advice would be appreciated.

Thanks

Kevin

9 Replies
admin
Immortal

The NIC names are "locked" to the PCI bus, slot and function assigned to the PCI devices by the system hardware. The PCI specs do not dictate how those bus/slot/function addresses are assigned to specific PCI devices, which means it's up to each vendor to decide how those addresses are assigned. For certain motherboard vendors, adding or removing hardware will cause the renumbering of the PCI slots. You seem to have hit that issue by adding more RAM.

Since there is no way to actually name the devices persistently beyond tying them to the bus/slot/function, the problem will probably remain for the foreseeable future.

Kevin_Brown
Contributor

Yeah, that is what I figured has happened at the hardware level.

Clearly, though, the visual names of the NICs (shown in the networking configuration portion of the VI Client) are not related to the actual device name/ID: they are named VMNIC with a number appended, but that number seems to be arbitrary. The reason I say that is that these servers have embedded NICs as well as separate PCI cards. True, they are all on the PCI bus, and I am not sure that the add-in cards are in the same slots, but assuming there is some consistency with the embedded equipment, all of those should have the same device ID, and since all 4 of these servers started with the same amount of memory, they should have the same PCI device ID (again, talking about the embedded devices only). Therefore, if the number appended to the VMNIC label were derived from the device's PCI ID, the labels should all be the same, yet they are not. So I am wondering if there is a way to set it such that there is some consistency?

From some of the other info I have found, it almost seems that the association between the vSwitch and the physical NIC is as simple as a link to the vmnicXX name and not to an underlying device ID. So I understand that if the physical hardware device ID changes, ESX will recognize it as a "new" NIC... I am just trying to find a solution such that a less knowledgeable person can look and say vSwitch0 = VMNIC0, vSwitch1 = VMNIC1, vSwitch2 = VMNIC2/VMNIC3, etc. Currently it is all over the place, and straightening it out would help with long-term consistency and support.

admin
Immortal

The visual names, as you called them, are definitely tied to the hardware address. It's the only key we have to identify a NIC across reboots.

When the system installs, we take a snapshot of the hardware and assign vmnicX or vmhbaX names based on the bus/slot/function for each device. If those addresses change across a reboot (usually because of a hardware change), what we see is new hardware appearing and old hardware having been removed. In that case we give the new hardware a new name and reserve the old name in case the old hardware comes back. A concrete example:

Address   Name
0:4.0     vmnic0
0:5.0     vmnic1
0:10.0    vmnic2
0:12.0    vmnic3

Now, in your case, you inserted hardware and the bus ordering changed. An example of that would be:

0:3.0     vmnic4 (was vmnic2)
0:7.0     vmnic5 (was vmnic0)
0:9.0     vmnic6 (was vmnic3)
0:10.0    vmnic2 (new NIC, but it's found at the same address vmnic2 used to have)
0:11.0    vmnic7 (was vmnic1)

In this case we have both new names and the name of a previous NIC being used by the newly added hardware, because it was bound to the same PCI address.
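The behavior in that example can be mimicked with a few lines of Python. This is only a rough model of what is described in this post, not the actual ESX code: the function name `assign_names` and the "lowest unused number" rule for new names are assumptions made for illustration.

```python
def assign_names(previous, detected, prefix="vmnic"):
    """Model the naming rule described above (illustrative only).

    previous: dict of PCI address -> name remembered from earlier boots.
    detected: list of PCI addresses found on this boot, in scan order.
    """
    # Every remembered name stays reserved, even if its address is gone.
    used = {int(name[len(prefix):]) for name in previous.values()}
    names = {}
    next_index = 0
    for address in detected:
        if address in previous:
            # Same address as before: reuse the old name, even if the
            # physical NIC sitting at that address is a different one.
            names[address] = previous[address]
        else:
            # Unknown address: treat as new hardware, take the next free number.
            while next_index in used:
                next_index += 1
            used.add(next_index)
            names[address] = f"{prefix}{next_index}"
    return names


before = {"0:4.0": "vmnic0", "0:5.0": "vmnic1",
          "0:10.0": "vmnic2", "0:12.0": "vmnic3"}
after = assign_names(before, ["0:3.0", "0:7.0", "0:9.0", "0:10.0", "0:11.0"])
for address, name in after.items():
    print(address, name)
```

Run against the addresses from the example, this gives vmnic4 through vmnic7 to the four addresses it has never seen and hands vmnic2 to whatever now sits at 0:10.0.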

Kevin_Brown
Contributor

I can understand tying the visual name to the PCI address so that it survives a restart, but I don't buy that the names are applied sequentially to the physical devices, since I have concrete proof that 2 identical servers, built at the exact same time and with no changes made to them, list the embedded NICs on the motherboard as completely different VMNIC numbers. If the VMNICxx name were assigned based on the physical PCI hardware address, then 2 identical servers (same manufacturer, motherboard, processors, memory, built with sequential mfg serial numbers, etc.) would have to have the same physical PCI hardware address for the embedded NICs. By that reasoning, embedded NIC0 would get VMNIC0 and embedded NIC1 would get VMNIC1, and that is just not the case. On 1 server the names were assigned that way, but on the server that was built exactly the same, an add-in dual-port network card (both servers have these too) gets VMNIC0 and VMNIC1 and the embedded ones get VMNIC2 and VMNIC3.

Regardless of that, though, there has to be a relational table mapping the visual name to the hardware address, and it should be modifiable since, as you said, the host OS reserves the visual name in case the physical address returns. So is there a way of editing that table such that the VMNICxx adapter to vSwitch mapping can be maintained consistently across my server pools? This will become more important to me over time, first as a quick visual way for junior people to ensure that a "network" problem is isolated either to the ESX host (one group of support people) or to the physical network (another group of people). Additionally, from a monitoring standpoint, if one of the NICs in the vSwitch that serves my production network goes belly up, that isn't such an issue since it is teamed; on the other hand, the NIC that vMotion uses isn't teamed and is important. Trying to train someone who watches a monitoring screen 12 hours per day that on this server VMNIC3 isn't important, but on that one it is... yeah, that is not going to happen. On top of that, those people don't have the ability to look at the Virtual Center console and dive into the network configuration, nor should they.
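Until the names can be made consistent, one stopgap for the monitoring side would be a per-host lookup that translates a raw VMNIC name into its role, so the person watching the screen never has to know which number matters on which server. A minimal sketch, where the host names, the NIC-to-role assignments and the severity rule are all made up for illustration:

```python
# Hypothetical inventory: (host, nic) -> (role, is_teamed).
# Host names and assignments are invented; fill in from the real configs.
ROLE_BY_HOST_AND_NIC = {
    ("esx-a", "vmnic0"): ("production", True),
    ("esx-a", "vmnic3"): ("vMotion", False),
    ("esx-b", "vmnic1"): ("vMotion", False),
    ("esx-b", "vmnic3"): ("production", True),
}


def describe_link_down(host, nic):
    """Turn a raw 'NIC down' event into a message keyed on role, not number."""
    # Unknown NICs are treated as unteamed so they are never under-reported.
    role, teamed = ROLE_BY_HOST_AND_NIC.get((host, nic), ("unknown", False))
    severity = "warning" if teamed else "critical"
    teaming = "teamed" if teamed else "not teamed"
    return f"{severity}: {host} {nic} down ({role} uplink, {teaming})"


print(describe_link_down("esx-a", "vmnic3"))
print(describe_link_down("esx-b", "vmnic3"))
```

The same VMNIC3 then shows up as critical on one host and only a warning on the other, which is the distinction the operators actually need.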

Thanks again.

admin
Immortal

Since the renaming followed the memory upgrade, it appears that the motherboards in your systems take the memory layout into account when assigning the PCI bus/slot/function to each device. That's something I've not seen before, but it's not particularly surprising.

As I pointed out above, sequential naming is used, but with a number of caveats.

*) On initial install the PCI devices will be numbered sequentially by bus, then slot, then function.

*) Sequential refers to the addressing of the PCI devices, not their physical location. I've seen lots of cases where slots 3 and 4 had lower PCI addresses than slots 1 and 2.

*) If the system detects a hardware renumbering it will attempt to keep old device names if possible. That means if it used to have a NIC device at 4:2:0 with the name vmnic3 and on reboot it again finds a NIC at that address (even if it is not the same physical NIC) it will assign that name to the device.
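The first caveat, ordering by bus, then slot, then function at install time, can be sketched as below. This is an illustration and not the installer's actual logic: the helper names are invented, and the addresses are assumed to be written as bus:slot.function with decimal numbers.

```python
def parse_address(address):
    """Split 'bus:slot.function' into a tuple of ints (decimal assumed)."""
    bus, rest = address.split(":")
    slot, function = rest.split(".")
    return int(bus), int(slot), int(function)


def initial_names(addresses, prefix="vmnic"):
    """Number devices in PCI address order, as on a fresh install."""
    # Sort numerically; a plain string sort would put 0:10.0 before 0:4.0.
    ordered = sorted(addresses, key=parse_address)
    return {address: f"{prefix}{index}" for index, address in enumerate(ordered)}


names = initial_names(["0:12.0", "0:4.0", "0:10.0", "0:5.0"])
for address in sorted(names, key=parse_address):
    print(address, names[address])
```

Note that the order depends only on the addresses the hardware hands out, which is why the physical slot numbering on the chassis does not predict the vmnic numbers.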

As for changing those names, there is no supported way of doing that.

Kevin_Brown
Contributor

Well, the scenario I described before was the situation prior to the memory upgrade. In other words, the 2 servers were not showing consistent VMNICxx naming before the memory was upgraded. All 3 servers did rename 2 NICs to VMNIC6 and VMNIC7 after the memory upgrade. What they were before I can only guess from the fact that, after the complete restart, vSwitch2 had no NICs assigned to it on the 3 servers, and the "missing" VMNIC numbers varied between 2/3, 1/2 and 0/1. Only 2 of those servers can I claim to be exact clones, though. 1 of them is an earlier purchase (same model, just purchased at another time), so I can't definitively say it was identical. The other 2 are identical. The problem I am having is with the first caveat you list: if I have 2 identical servers, with the same hardware installed in the same slots, then I would expect the assigned names to be identical. This is not what I am seeing, but I cannot confirm which slots the add-in cards were inserted in. So although I completely agree with the 2nd caveat, typically I have seen embedded devices have "lower" PCI addresses than the expansion slots. If that holds, I would expect the embedded NICs to always get VMNIC0 and VMNIC1, which is the case on only 1 of the 4 servers, and that is the first one, the one which had no issues.

Additionally, like I have said before, there are 4 vSwitches on each server, and only 1 vSwitch, with both of its associated physical network ports, was affected (I believe it is a dual-port Ethernet NIC, so I would assume logically that the PCI address is largely the same for both ports). I could understand the additional memory causing the reconfiguration, because the memory management table is obviously going to have to grow since a large amount of memory was added, but I would have been more OK with this if all of the NICs had been forced to new addresses as opposed to only 1 dual-port NIC.

Is there a supported way of cleaning out that relational table? Knowledge base article ID 2243 describes using the esxcfg-vswitch command to "remove any network adapters that have been renamed" and add network adapters "giving them the correct names". What is meant by that? What is that command effectively doing, re-linking the names to the device ID?

admin
Immortal

To your point about renaming everything vs. renaming only parts: it's impossible for us to tell the difference between someone adding or removing a NIC (where you just want to give 1 NIC a new name and keep all the others) and the case where the system renumbers the PCI bus entirely. It's a case of best effort here: we try to maintain as many of the names as we can, so that in the normal case, where a single hardware device is added or removed, the naming stays the same and renaming is minimized.

Like I said above, there is no supported way to change the name of NICs. The KB refers to using esxcfg-vswitch to remove the old names from the switch and relink the NICs with their new names.
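For several hosts, the relink step can be scripted by generating the command lines first and reviewing them before running anything on the service console. The sketch below only builds strings. It assumes the `-U` (unlink) and `-L` (link) options of esxcfg-vswitch, which should be checked against the KB article and the command's own help on your build. The old names vmnic2/vmnic3 are placeholders, since the names that went missing differed per server.

```python
def relink_commands(vswitch, stale_names, current_names):
    """Build esxcfg-vswitch command lines to swap a vSwitch's uplinks.

    stale_names: old vmnic names still listed on the vSwitch.
    current_names: names the same physical NICs carry after the rename.
    Nothing is executed here; the strings are meant to be reviewed first.
    """
    commands = [f"esxcfg-vswitch -U {name} {vswitch}" for name in stale_names]
    commands += [f"esxcfg-vswitch -L {name} {vswitch}" for name in current_names]
    return commands


# Placeholder old names; vmnic6/vmnic7 are the new names reported in this thread.
for line in relink_commands("vSwitch2", ["vmnic2", "vmnic3"], ["vmnic6", "vmnic7"]):
    print(line)
```

Running `esxcfg-nics -l` and `esxcfg-vswitch -l` before and after shows which names exist and which uplinks each vSwitch actually has.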

Kevin_Brown
Contributor

There should be a supported way of keeping NIC names consistent. Maintaining standards in large ESX server deployments is a critical necessity. It seems that VMware has put enough thought into this to try to keep NIC renaming to a minimum, but has provided no way to manipulate that table so that, if a change has to be made, naming can still remain consistent within the environment. I think this is a serious shortcoming of this solution.

admin
Immortal

I'd suggest bringing this up with your VMware support or sales rep. That's the best channel for this sort of feedback.
