grimsrue
Enthusiast

NSX-T Edge Node 3.1.3.0 upgrade to 3.2.0 Error message

Already have a crazy error message right from the very start when upgrading the first Edge node to 3.2.0.

This is happening with my Lab NSX-T environment and on the very first Edge Node. Upgrade gets to about 35% and then fails with this crazy error message below. 

I deleted and redeployed my Edge nodes from scratch. I also found that I had incorrect DNS IPs configured on my edges and managers. I corrected the DNS IPs. I am receiving the same exact error before and after the redeploy/DNS IP update. 

Any ideas out there? I'll probably open an SR with VMware support to see if they can figure it out, but wanted to see if anyone else has run into the same error.

My lab is running some older server hardware, but they are running a fairly updated version of ESXi 6.7. The last round of NSX-T patches from 3.1.2.1 to 3.1.3.3 had no issues. 

 

Error message:
Edge 3.2.0.0.0.19067070/Edge/nub/VMware-NSX-edge-3.2.0.0.0.19067089.nub switch OS task failed on edge TransportNode be263b3e-0610-497e-af01-7d994ee0443a: clientType EDGE , target edge fabric node id be263b3e-0610-497e-af01-7d994ee0443a, return status switch_os execution failed with msg: An unexpected exception occurred: CommandFailedError: Command ['chroot', '/os_bak', '/opt/vmware/nsx-edge/bin/config.py', '--update-only'] returned non-zero code 1: b"lspci:
Unable to load libkmod resources: error -12\nlspci:
Unable to load libkmod resources: error -12\nlspci:
Unable to load libkmod resources: error -12\nlspci:
Unable to load libkmod resources: error -12\nlspci:
Unable to load libkmod resources: error -12\nSystem has not been booted with systemd as init system (PID 1).
Can't operate.\nERROR: Unable to get maintenance mode information\nNsxRpcClient encountered an error: [Errno 2] No such file or directory\nWARNING: Exception reading InbandMgmtInterfaceMsg from nestdb, Command '['/opt/vmware/nsx-nestdb/bin/nestdb-cli', '--json', '--cmd', 'get', 'InbandMgmtInterfaceMsg']' returned non-zero exit status 1.\nERROR: NSX Edge configuration has failed. 1G hugepage support required\n" .
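The actual cause is easy to miss in that wall of text. Here is a minimal Python sketch that pulls out only the `ERROR:` lines from a pasted failure message. The helper name `error_lines` is made up for illustration, and it assumes the message is available as a plain string in which newlines may show up as the literal two characters `\n`, as they do above.

```python
def error_lines(message: str) -> list[str]:
    """Return the lines of a failure message that start with 'ERROR:'.

    Newlines may be real line breaks or the literal two-character
    sequence backslash + 'n', as in the message pasted above.
    """
    # Turn escaped newlines into real ones, then split into lines.
    normalised = message.replace("\\n", "\n")
    return [
        line.strip()
        for line in normalised.splitlines()
        if line.strip().startswith("ERROR:")
    ]


if __name__ == "__main__":
    # Shortened sample based on the message in this post.
    sample = (
        "lspci: Unable to load libkmod resources: error -12\\n"
        "ERROR: Unable to get maintenance mode information\\n"
        "ERROR: NSX Edge configuration has failed. "
        "1G hugepage support required\\n"
    )
    for line in error_lines(sample):
        print(line)
```

Run against the message above, the last line it reports is the one that matters here: the Edge configuration step wants 1G hugepage support.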

 

The only other idea I have is that the "VMware-NSX-upgrade-bundle-3.2.0.0.0.19067070.mub" file got messed up when I was uploading it to the Manager. I am going to re-upload it and give the upgrade another go. I will respond to this post to say whether it worked or not.

 

1 Solution

Accepted Solutions
yozzauk
Contributor

I'm getting the same problem in my lab, which uses Ivy Bridge CPUs (E5-2650 v2). Specifically, the 'problem' is that the Ivy Bridge EVC mode hides the PDPE1GB CPU feature (1GB hugepages) despite it being a supported feature since Nehalem. Haswell EVC enables it. I can only get it exposed to a VM by disabling EVC and forcing the feature on with the featMask advanced option. The EVC docs mention that not all features are supported within an EVC level, but that's a bit of a cop-out, as most of the enterprise CPUs from the previous three architectures supported PDPE1GB.

It's an interesting situation as Ivy Bridge chips are still on the supported list for ESXi 7.0 and NSX-T 3.2 is interoperable but there's not really a way to run the pair together because of this.
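For context, `pdpe1gb` is the Linux name for the x86 "1-GByte pages" capability, which the CPU reports in bit 26 of EDX for CPUID leaf 0x80000001. Below is a minimal Python sketch of that bit test. It assumes you already have the raw EDX value from some other tool; the function name and the sample values are illustrative only.

```python
PDPE1GB_BIT = 26  # CPUID leaf 0x80000001, register EDX, bit 26: 1-GByte pages


def has_pdpe1gb(edx: int) -> bool:
    """Return True if a CPUID.80000001H:EDX value advertises 1 GB pages."""
    return bool((edx >> PDPE1GB_BIT) & 1)


if __name__ == "__main__":
    # Illustrative values only, not read from real hardware.
    print(has_pdpe1gb(1 << 26))  # bit set
    print(has_pdpe1gb(0))        # bit clear
```

With EVC masking in play, the value a guest sees can differ from what the physical CPU reports, which is exactly the situation described in this post.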

26 Replies
seunghyunj
Contributor

I have the same problem.

Is there any solution? If you have solved it, please tell me how.

leechunk
Enthusiast

Exactly the same issue doing the upgrade.

I gave up and redeployed NSX-T 3.2 from scratch, but the Edge also has issues starting up and is not able to accept any configuration changes from the manager.

I narrowed it down to the hugepage support issue. I added featMask.vm.cpuid.pdpe1gb = "Val:1" in Advanced options on the Edge VM. The Edge datapath started, but the Edge is still not accepting any changes from the manager and reports the hugepage-not-supported error. I am using an Ivy Bridge CPU, so I am pretty sure that the pdpe1gb CPU flag is supported.

You might want to add the advanced options to your edge VM and retry the upgrade step again.
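For anyone scripting this, the option is a plain `key = "value"` pair in the VM's configuration. Below is a minimal Python sketch that adds or updates that pair in .vmx-style text. It assumes you are working on a copy of the configuration of a powered-off VM, and `set_vmx_option` is a made-up helper name, not a VMware tool. Setting the option through the VM's Advanced options in the vSphere Client, as described above, remains the route this thread actually used.

```python
def set_vmx_option(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with `key = "value"` added, or updated if present."""
    new_line = f'{key} = "{value}"'
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name == key:
            # Keep a single copy of the key, carrying the new value.
            if not replaced:
                out.append(new_line)
                replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical two-line config used purely as sample input.
    sample = 'displayName = "edge01"\nnumvcpus = "4"\n'
    print(set_vmx_option(sample, "featMask.vm.cpuid.pdpe1gb", "Val:1"), end="")
```

Running it twice on the same text leaves a single copy of the option, so it is safe to reapply.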

Cheers

grimsrue
Enthusiast

I would normally scrap the current NSX-T lab environment, but I need to see it upgrade successfully before I even attempt to upgrade my production environment.

 

Re-uploading the .mub file did not fix anything. If anything, it blew up one of my Edge nodes, so I need to redeploy the Edge node that the upgrade messed up.

I realized that my lab NSX-T environment was actually at 3.1.3.0 and NOT 3.1.3.3. I am going to try upgrading to 3.1.3.3 and see how that goes. If that goes well then I'll try 3.2.0 again. If I get a failure again then I'll try adding featMask.vm.cpuid.pdpe1gb = "Val:1" to the Advanced options on the Edge VMs and then try upgrading again.

I know that the hardware in my lab is getting up there in age, but I would not have thought there were so many changes in 3.2 that it would keep this new version of NSX-T from running on older hardware... but I could be wrong.

 

dstnr
Contributor

Hi, I can confirm this issue occurs with the upgrade from version 3.1.3.3 to 3.2.0.

leechunk
Enthusiast

I've made some progress with my fresh NSX-T 3.2 setup on Ivy Bridge ESXi hosts:

- Install NSX-T manager

- Do not deploy Edge from the Manager, instead deploy using the Edge ova

- Do not power up the Edge after deploying the Edge ova. Add the featMask.vm.cpuid.pdpe1gb = "Val:1" advanced setting to the Edge VM, then power up the VM. You can log in to the Edge VM to confirm that the extension is enabled by running "cat /proc/cpuinfo | grep pdpe1gb". If you do not perform this step, you will see a message on the console saying the Edge datapath cannot start.

- Use the join management-plane command to join the Edge to the manager

That's it. The Edge will be able to receive configuration changes without complaining the Hugepage is not supported.
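Because the flags line is very long, a single flag is easy to overlook by eye. Here is a minimal Python sketch of the same check as the grep in the steps above, matching `pdpe1gb` as a whole token on every `flags` line. `has_cpu_flag` is an illustrative name, and it assumes the usual Linux /proc/cpuinfo layout of `flags : a b c` with one such line per vCPU.

```python
def has_cpu_flag(cpuinfo_text: str, flag: str = "pdpe1gb") -> bool:
    """Return True if every 'flags' line in cpuinfo_text lists the flag."""
    flag_sets = [
        set(line.split(":", 1)[1].split())
        for line in cpuinfo_text.splitlines()
        if line.startswith("flags") and ":" in line
    ]
    # With no flags line at all, the flag cannot be confirmed.
    return bool(flag_sets) and all(flag in flags for flags in flag_sets)


if __name__ == "__main__":
    # Made-up sample text; on an Edge you would pass in the contents
    # of /proc/cpuinfo instead.
    sample = "processor\t: 0\nflags\t\t: fpu vme pse36 pdpe1gb rdtscp\n"
    print(has_cpu_flag(sample))
```

Checking all `flags` lines rather than just the first guards against a case where only some vCPUs show the flag, although that is not something reported in this thread.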

Unfortunately, I have destroyed my 3.1 environment so cannot test the upgrade steps. But I think the following should be the correct step (WARNING NOT TESTED)

- Deploy a new Edge if you have previously had a failed Edge upgrade

- Before upgrading the new Edge, ensure it has the featMask.vm.cpuid.pdpe1gb = "Val:1" advanced setting

- Proceed with the upgrade

Cheers

yozzauk
Contributor

I can upgrade in the same manner with the downside that I have to disable EVC on the cluster hosting the edge nodes (which I don't want to).

The other point to add there is that if you deploy any new edge nodes from the manager you'll need to shut them down and add the advanced setting for them to work, so I don't see it as a perfect solution.

I've held back on the upgrade for the time being, especially considering VMware's current update release/bug track record.

engyak
Enthusiast

What base CPUs are you running? What version of ESXi?

I've verified that `1G Huge Page` support is not showing up with EVC on with an Intel E5-2620 v2 with generic Linux machines (Fedora, Debian) on ESXi 7.0u3 with EVC set to Per-VM. Intel's docs indicate that 1G Huge Page is supported with my CPU.

So in this case it doesn't appear to be an NSX issue but living more in vSphere land. Have you seen any other performance issues?

My solution has been to build and run ETNs on a host with a newer CPU architecture for the interim.

grimsrue
Enthusiast

After adding featMask.vm.cpuid.pdpe1gb = "Val:1" to the Advanced config of the Edge VMs I am not seeing pdpe1gb in the Edge Node CPU flags when I run cat /proc/cpuinfo. Not sure if this is normal behavior or not. I have not had to deal with Huge Pages for the CPU before. 

I assume pdpe1gb must show up in the CPU flags for the upgrade to work correctly? If so, it seems that I may need to enable huge pages on the ESXi hosts as well. I am trying some Google-fu to get more info on huge page enablement. Most of what I am finding is about OpenStack and adding "sched.mem.lpage.enable1GPage = TRUE" to the VM Advanced config.

Lab hardware is one IBM x3650 running an Intel Xeon E5-2650 0 CPU and two Dell PowerEdge M620 blades running Intel Xeon E5-2650 0 CPUs.
The ESXi hosts are on 6.7.0, build 16773714.

I somehow managed to torpedo my NSX-T managers while trying to upgrade them to 3.1.3.3. I am having to restore from backups, which is going to take some time. The Edge node and transport host upgrades worked with no issue.

grimsrue
Enthusiast

Disregard my comment about not seeing the pdpe1gb in the CPU flags. There were so many flags that pdpe1gb got lost in the middle of all of the other flags.....or it showed up AFTER I upgraded the Edge nodes to 3.1.3.3. Not really sure at this point

After I get my Managers back online I'll give the 3.2.0 upgrade another go. 

I feel like I am jumping through hoops to get this upgrade to work correctly. 

leechunk
Enthusiast

Intel(R) Xeon(R) CPU E5-2695 v2

ESXi 7u3

My understanding is that if you enable EVC for Ivy Bridge, the flag will not show up.

I tried enabling EVC for Ivy Bridge after my Edge VM with the pdpe1gb advanced setting was booted up. Under host compatibility, no host was listed as compatible, and the message was that a running VM is using an unsupported CPU feature.

I shut down the Edge VM and was able to enable EVC for Ivy Bridge successfully. But when I went back to check the supported feature set, pdpe1gb was not in the list of 41 features. I was also not able to power up the Edge VM.

So apparently this feature is supported in Ivy Bridge, but VMware decided to turn it off with EVC. At least you can enable it with the advanced setting, but the downside is that you have to disable EVC.

Cheers 

engyak
Enthusiast

Do we have anyone who has this issue with 7.0u2? My inclination at this point may be biased because 7.0u3 was pulled, but we may have an additional software issue with EVC...

grimsrue
Enthusiast

Hello Everyone,

I was finally able to restore my NSX-T managers and finish the upgrade for them to 3.1.3.3 successfully.

I set the (featMask.vm.cpuid.pdpe1gb = "Val:1") option in the advanced config on all of my Edge nodes and rebooted them.

After the Edge nodes rebooted I started the 3.2.0 upgrade. The Edge nodes finally upgraded normally.

For ESXi hosts running Ivy Bridge processors, the current fix is to add (featMask.vm.cpuid.pdpe1gb = "Val:1") to the Edge nodes' Advanced config and reboot them.

Verify that pdpe1gb is listed in the CPU flags for each Edge node by running cat /proc/cpuinfo while logged in as "root".

Thanks @leechunk for the work around. 

mackov83
Enthusiast

I have the same issue with 7.0u2, though my CPUs are Sandy Bridge. I know they are not supported for vSphere 7.x, but this is a lab environment.

I saw the exact same behaviour, though: I had to disable cluster-level EVC for the pdpe1gb flag to show up in the cat /proc/cpuinfo output.

IvanZito
Contributor

featMask.vm.cpuid.pdpe1gb = Val:1 

---------------------------------------------------
Write easy and use the common language.
Not all people are native
Thanks
---------------------------------------------------
yozzauk
Contributor

Another point to be aware of for people applying the workaround is that deploying new Edge nodes is now a serious pain in the arse. I don't seem to be able to do it from the NSX-T manager UI anymore, as you can't modify the setting in time for it not to fail. I am testing now with the OVA method instead (which might actually be better, as I think I could get it to work with Terraform).

Edit: Correction to this. You can deploy through NSX-T, as you can use PowerCLI to set the advanced setting while the Edge is deploying. This appears to work, as 1GB pages are then in place for first boot.

leechunk
Enthusiast

Glad it helped someone.

One last point: if you go to the NSX Edge system requirements page (scroll down to the bottom), it clearly states that the Edge VM is supported on E5-xxxx CPUs, which means Sandy Bridge or later.

If this is written in the documentation, I would expect VMware to have tested the different combinations of supported CPUs, but sadly they have failed us in this department again.

mackov83
Enthusiast

Hi all, did anybody have any issues with the upgrade post the ETN issues? I was able to update them and the transport hosts, but when it came to NSX Managers the upgrade failed with very little useful information.

Unexpected error while upgrading upgrade unit: prepareUpgrade timed out

grimsrue
Enthusiast

@mackov83 

I have NSX-T in two different lab environments and had no issue with upgrading the Host Transport Nodes or the Managers. My only issue was the Edge node issue above. 

I have had some odd issues pop up from some of the simplest config mistakes. Check all the simple things like default GW, DNS, NTP, host/switch MTU, storage connectivity, dup IPs, etc. While troubleshooting the Edge node issue these past few days, I realized that all of my Edges, hosts, and managers were still configured with old DNS server IPs that were decommed last year. Fixing that solved a number of other issues I had been dealing with for months.
