VMware Networking Community
MKiefer
Contributor

NSX-T alarm "Application crashed"

I have just upgraded my lab from 3.2.1 to 4.0. It did not go as smoothly as I expected, but I got through.

Now I have a strange error (at least it is to me) where an alarm constantly comes back with "Application crashed".

[Attached screenshot, 2022-08-16 11.05.25]

When tailing /var/log/syslog I see these entries, but I am not sure whether they are related or a different error:

2022-08-15T11:53:57.155Z nsxt.lab.local NSX 32752 - [nsx@6876 comp="nsx-manager" s2comp="nsx-net" tid="32761" level="WARNING"] StreamConnection[8712 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:8712] Couldn't connect to 'unix:///var/run/vmware/nestdb/nestdb-server.sock' (error: 2-No such file or directory)

2022-08-15T11:53:57.156Z nsxt.lab.local NSX 32752 - [nsx@6876 comp="nsx-manager" s2comp="nsx-net" tid="32761" level="WARNING"] StreamConnection[8712 Error to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:-1] Error 2-No such file or directory

2022-08-15T11:53:57.156Z nsxt.lab.local NSX 32752 - [nsx@6876 comp="nsx-manager" s2comp="nsx-rpc" tid="32761" level="WARNING"] RpcConnection[8712 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock 0] Couldn't connect to unix:///var/run/vmware/nestdb/nestdb-server.sock (error: 2-No such file or directory)

2022-08-15T11:53:57.156Z nsxt.lab.local NSX 32752 - [nsx@6876 comp="nsx-manager" s2comp="nsx-rpc" tid="32761" level="WARNING"] RpcTransport[0] Unable to connect to unix:///var/run/vmware/nestdb/nestdb-server.sock: 2-No such file or directory

2022-08-15T11:53:57.156Z nsxt.lab.local NSX 32752 - [nsx@6876 comp="nsx-manager" s2comp="nestdb-client" tid="32752" level="WARNING"] NestDbClient: failed to get stub to unix:///var/run/vmware/nestdb/nestdb-server.sock, retrying in 5000 ms...

Looks like the service nestdb is missing?
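One way to test that hypothesis is to pull the socket paths out of the log lines and check whether they exist on the manager. Below is a minimal Python sketch, assuming the log format shown above. The names `extract_socket_paths` and `socket_status` are illustrative and not part of any NSX tooling, and a missing socket would not by itself prove that it causes the alarm.

```python
import os
import re
import stat

# Matches the path part of "unix:///some/path" as it appears in the log lines.
# Assumption: paths contain no whitespace, quotes, commas, or colons.
SOCKET_RE = re.compile(r"unix://(/[^\s',:]+)")


def extract_socket_paths(lines):
    """Return the sorted, de-duplicated Unix socket paths mentioned in the lines."""
    paths = set()
    for line in lines:
        paths.update(SOCKET_RE.findall(line))
    return sorted(paths)


def socket_status(path):
    """Classify a path as 'socket', 'not a socket', 'missing', or 'no permission'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "no permission"
    return "socket" if stat.S_ISSOCK(mode) else "not a socket"
```

Feeding it the lines from /var/log/syslog and calling `socket_status` on each returned path would show whether the nestdb socket is really absent at that moment or only was when the warnings were logged.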

I know this is a lab environment, but I am a bit hesitant to upgrade customer installations as long as I do not know what this error is about, and I would like to be certain before doing any ;)

Any idea on what this could be is highly appreciated! 

 

11 Replies
Mounir-Na
Contributor

Hello, I currently have the same error in my lab environment as well! May I ask whether the GENEVE tunnels come up when you have running VMs on an NSX-T segment? I am not sure whether this error has any effect on building GENEVE tunnels.

MKiefer
Contributor

I don't see any problems in the tunnels, all are up. And the VMs can communicate.

/Martin

Filin_K
VMware Employee

I have the same error as well...

svenknockaert1
Contributor

I also have the same error here after upgrading from 3.2 to 4.0.

leotaglietti
Enthusiast

I also have the same error with the same NSX version.

In my case, I observed that /var/core/ on the ESXi hosts contained an sfcb service core dump. All transport nodes had the same core dump with the same date and hour.

As it is a production environment, I opened a VMware support case to have the service core dump analyzed.

MKiefer
Contributor

Please let us know the outcome of the SR!

I tried to power down the NSX Manager and restore from a backup, but that did not work either. It just came back with an "internal server error" when attempting to restore the database. I would have thought the backup process did an integrity check of the backup before reporting "completed successfully"... but no.

Instead, I deleted the NSX Manager instance that gave the error and deployed a new instance, and the error is gone.

/Martin

chimmez
Contributor

Hi leotaglietti,

 

Are you able to share the result of your interaction with support (or PM me the ID of the service request)? I'm having the same or a very similar issue in my environment. I'm seeing the same errors in my logs, and I'm also seeing various components of the UI not displaying correctly (for instance, the Policy UI doesn't show DFW statistics, whereas the Manager UI does).

I've got a support call open at the moment as well, but the support agent is requesting ~10 GB of logs, which I can't provide due to the location of the environment.

leotaglietti
Enthusiast

Hello Team.

 

I recently got this problem solved with VMware GSS. I raised a support request with the NSX team, and the NSX support engineer told me that NSX itself was OK; he did not find any problem. Together with the NSX support engineer, we identified that the "Application crashed" error shown in the NSX Manager GUI was related to a core dump in /var/core/ on each transport node (ESXi host). Since it was a service core dump, it was not a human-readable file, so the VMware GSS ESXi team analyzed it and found that an HPE module called smx was causing the core dump.

We uninstalled this smx module (which we weren't using) and the problem went away.

PS: To clear the "Application crashed" message, you must move or delete the core dump file in /var/core/.

I observed an interesting thing about this core dump: it was generated only when the ESXi host was being prepared as an NSX transport node or when we chose to remove NSX from the host. After removing the smx module, the core dump no longer appeared during host preparation or during NSX removal :)

The problem in your environment may be caused by a different module, so I suggest checking /var/core/ on the ESXi hosts and seeing whether a core dump is generated when a host is prepared as an NSX transport node or when NSX is removed from it.

 

 

MKiefer
Contributor

I am using Cisco UCS servers and they do not have that module installed, so I am not sure that this was what caused the issue in my lab.

I have not seen the problem since I redeployed the NSX node.

 

 

romansf
Contributor

For me it was due to stale core dumps that had already been there for some time:

root@nsx-mgr3:~# ls -al /var/dump
total 1440
drwxrwx--- 3 root nsx 4096 Sep 23 06:12 .
drwxr-xr-x 15 root root 4096 Oct 9 05:22 ..
-rw-rw-rw- 1 root root 688915 Aug 7 2021 core.snmpd.1628373629.1091.0.11.gz
-rw-r--r-- 1 root root 757597 Sep 23 06:12 core.snmpd.1663913558.7701.0.11.gz
drwx------ 2 root root 16384 Feb 13 2020 lost+found

It seems that NSX 4.0 reports these as an alarm, which was not previously the case.

I deleted these .gz files and that resolved the alarm.
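The two file names in the listing suggest a pattern of `core.<process>.<epoch seconds>...`. That reading is an assumption inferred from the names shown (the epoch values line up with the listed dates), not a documented format. Under that assumption, a small Python sketch can inventory core dumps before anything is deleted; `parse_core_name` and `list_core_dumps` are hypothetical helper names.

```python
import os
from datetime import datetime, timezone


def parse_core_name(name):
    """Split 'core.<process>.<epoch>...' into (process, UTC datetime).

    Returns None for names that do not fit the assumed pattern.
    """
    parts = name.split(".")
    if len(parts) < 3 or parts[0] != "core" or not parts[2].isdigit():
        return None
    created = datetime.fromtimestamp(int(parts[2]), tz=timezone.utc)
    return parts[1], created


def list_core_dumps(directory="/var/dump"):
    """Return (file name, process, created, size in bytes) tuples, oldest first."""
    found = []
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue  # skip subdirectories such as lost+found
        parsed = parse_core_name(entry.name)
        if parsed is not None:
            process, created = parsed
            found.append((entry.name, process, created, entry.stat().st_size))
    return sorted(found, key=lambda item: item[2])
```

For the first file in the listing, `parse_core_name("core.snmpd.1628373629.1091.0.11.gz")` yields the process `snmpd` and a timestamp on 2021-08-07 (UTC). That makes it easy to tell old leftovers from dumps created after the upgrade, which may still be worth keeping for a support case.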

dev_anamani
Contributor

Run the following command on the NSX Manager that has this alarm:

del core-dump all
