VCSA/ESXi 6.7u3 - 503 Service Unavailable, vpxd-webserver-pipe, vmware-vpxd crashing

DanPaLewis · 12-09-2019

Hello all,

I have been racking my brain on this one but haven't been able to make any progress. I noticed that a while after upgrading to VCSA 6.7u3, I was unable to log in to the vSphere Manager. When attempting to access the link by IP address, I get the following:

503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x00007fa2d00e50a0] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)

One thing I noticed is that the vmware-vpxd service keeps crashing. I've looked at the vpxd.log file, but nothing really jumps out at me as to why it would be crashing.

I'm not the most experienced with attempting to navigate via command line, but I am more than willing to try and learn.

I have tried the following:

- Checked space usage using df -h. The closest item to 100% is /storage/seat at 95%.
- Restarted services. This works for a few minutes until vpxd crashes again.
- Reset the root password.
- Updated certificates (I have to try to get the information again. This may have been performed incorrectly).
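
The df -h check above can be narrowed to just the mounts that are close to full. A minimal sketch, assuming a POSIX shell with awk on the appliance; flag_full_mounts is a hypothetical helper name and the default limit of 90 is an arbitrary choice, not something from a KB:

```shell
# flag_full_mounts [LIMIT]: read `df -P` output on stdin and print every
# mount point whose usage is at or above LIMIT percent (default 90).
# The helper name and the default limit are illustrative only.
flag_full_mounts() {
  awk -v limit="${1:-90}" '
    NR > 1 {                      # skip the df header line
      use = $5                    # capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Typical use on the appliance:
#   df -P | flag_full_mounts 90
```

With the numbers reported in this post, /storage/seat at 95% would be listed.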

What can we do to help troubleshoot this?

sjesse · 12-09-2019

Check these logs

vSphere Client Logs

The vsphere_client_virgo.log might show something.

DanPaLewis · 12-09-2019

Thanks sjesse! Unfortunately this is a MASSIVE log file!!! About 31 MB. I'm going to poke through that to see if I'm able to find anything worthwhile in there.

sjesse · 12-09-2019

Log on to the VCSA with SSH, go to the log location, and run tail -f on the log file. That will show you the log entries as they are written while you log in, which should help you find things quicker.
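
For a log this size, filtering helps as much as tailing. A rough sketch, assuming a POSIX shell on the appliance; log_problems is a hypothetical helper, the ERROR/WARN keywords are a guess at what is worth reading first, and the log path in the comment is an assumption that may differ on your appliance:

```shell
# log_problems FILE [LINES]: show ERROR/WARN entries among the last LINES
# lines of FILE (default 500). Hypothetical helper; the keywords are an
# assumption about which entries matter.
log_problems() {
  tail -n "${2:-500}" "$1" | grep -E 'ERROR|WARN'
}

# Live view while reproducing the login failure (path is an assumption):
#   tail -f /var/log/vmware/vsphere-ui/logs/vsphere_client_virgo.log | grep --line-buffered -E 'ERROR|WARN'
```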

PatrickDLong · 12-09-2019

DanPaLewis There is a bug in 6.7u3 that causes a flood of host hardware health errors. The reason you get the "503 Service unavailable" message is that these event messages have grown your SEAT disk to 95% full. You will need to perform the remediation steps in KB74607, which describe how to truncate event tables on your VCSA in order to free space on your SEAT disk, and follow the other recommendations in that KB. Needless to say, this is a delicate operation. Engage VMware support if you have questions about how to proceed.

And then after you resolve the SEAT disk space issue, update your vCenter to latest as well as updating your ESXi 6.7U3 hosts to either:

ESXi 6.7 U3a November 2019 Patch ESXi670-201911001 2019-11-12 build 15018017

ESXi 6.7 U3b December 2019 Patch ESXi670-201912001 2019-12-05 build 15160138

Good Luck!

DanPaLewis · 12-10-2019

Patrick:

Thanks for the link. I am going to be reviewing this information and applying fixes.

Question: Is there a way to tell which host is having the issue? Or is it a safe assumption that since every one of my hosts is on build 14320388, I will have to follow this remediation on every single one?

PatrickDLong · 12-10-2019

You will need to follow the steps in the KB to recover space on your SEAT disk on VCSA. This will allow vpxd service to start and stay running. As I said, this process involves directly truncating tables in the Postgres db on VCSA so proceed with extreme caution and follow the KB instructions explicitly or get VMware support to help you with this if you're not comfortable. The usual "have a backup", etc. caveats apply.

Then you will need to upgrade ALL of your 14320388 hosts to either 15018017 or 15160138 to resolve the issue of them generating spurious health alert messages to your VCSA - which is what is filling up the SEAT disk and causing vpxd to crash. vpxd will not run if the SEAT disk is >= 95% full. Incidentally, don't worry about the /storage/archive mount showing 100% full - that is an expected and desired status.

You can easily see the spurious messages on each host if you look in the vSphere client at <select a host> and in the right pane select Monitor >> Events. You will see a large number of health events happening continuously on every 14320388 host.
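
If SSH is enabled on the hosts, the build number can also be checked from the command line. A small sketch, assuming `vmware -v` prints a line such as "VMware ESXi 6.7.0 build-14320388"; is_affected_build is a hypothetical helper name, and it only knows about the one build named in this thread:

```shell
# is_affected_build: read `vmware -v` style output on stdin and succeed
# (exit 0) when the build number is 14320388, the 6.7 U3 build discussed
# in this thread. Hypothetical helper name.
is_affected_build() {
  build=$(sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p')
  [ "$build" = "14320388" ]
}

# On a host:
#   vmware -v | is_affected_build && echo "needs patching"
```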

DanPaLewis · 12-10-2019

Patrick,

Can the truncation of the DB be done after the fact? Because our VCSA runs as a VM under a different VCSA, I was able to increase the disk space of the SEAT partition, which has left us at 65% full (and it can be increased more), and vpxd has been running successfully. If we're able to maintain stability and upgrade first, then it doesn't matter what order this is handled in, right?

I am taking full precautions when it comes to this as I do not believe I have a way to contact VMware Support right now... I'm left in quite the sticky situation.

PatrickDLong · 12-10-2019

If you have already extended the SEAT disk on the VCSA, my opinion would be that you can just start upgrading hosts and the volume of events coming into your VCSA SEAT disk will decrease proportionally. It really depends on the size of your environment as to how fast that SEAT disk capacity gets chewed up. I'm not 100% sure but I would assume that after your hosts are upgraded that the spurious events that are already taking up space in your SEAT disk right now would eventually age out and the space would be reclaimed, obviating the need for manual truncation.

I ran into this issue in early September, then extended my SEAT disk once by 30 GB to get more free space, hoping to ride it out until a patch was released. I still ended up having to truncate the tables every 4-6 days for weeks, because the 15018017 patch with the fix was not released until 11/12/2019 and the daily volume of SEAT data coming in chewed through the newly available space rather quickly. So the truncation was basically a weekly task for me while waiting for the patch release.

DanPaLewis · 12-10-2019

Patrick,

Thanks for confirming my suspicions regarding being able to go ahead and update... I tried to do this. And originally updating the VCSA was successful. However, that has since changed and we are having other issues.

Update Manager is nowhere to be seen on the latest version of vSphere (6.7.0.42000-15132721). I have since attempted to reboot the VCSA in hopes that would resolve it, and now I am having issue number 2.
Since rebooting, I have been having problems bringing the VCSA back online. The fortunate thing here is that this VCSA is a VM guest on another VCSA, so I can at least control it via that method. However, since rebooting the VCSA I can't ever log back into the web console. No error, no nothing; just timeouts. I can't SSH into the device either.

We're getting dangerously close to me having to contact VMware, but with how siloed my corporation is, we don't even know who to contact who can add my account to our VMware Support Contract (which I at least have the ELN of).

One step forward... two steps back.

Edit: For some reason our vSphere setup disconnected the network of the VCSA that we're working on here, so this is why I couldn't navigate to the web console, SSH, or ping... that has been resolved. I am able to SSH onto the box, but I am back to getting the same error message that caused this issue before. I have confirmed using df -h that SEAT is not full, and is only at 70%. When attempting to run service-control --start --all, I get the following error message:

service-control --start --all
Operation not cancellable. Please wait for it to finish...
Performing start operation on service lwsmd...
Successfully started service lwsmd
Performing start operation on service vmafdd...
Successfully started service vmafdd
Performing start operation on service vmdird...
Successfully started service vmdird
Performing start operation on service vmcad...
Successfully started service vmcad
Performing start operation on service vmware-sts-idmd...
Successfully started service vmware-sts-idmd
Performing start operation on service vmware-stsd...
Successfully started service vmware-stsd
Performing start operation on service vmdnsd...
Successfully started service vmdnsd
Performing start operation on profile: ALL...
Service-control failed. Error: Failed to start services in profile ALL. RC=1, stderr=Failed to start sca, vapi-endpoint, vmonapi, vpxd-svcs services. Error: Operation timed out
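
To keep track of which services timed out across retries, the names can be pulled out of that last error line. A minimal sketch; failed_services is a hypothetical helper that only parses text shaped like the message above, and the short names it prints may not match the names that service-control --status --all reports:

```shell
# failed_services: read service-control output on stdin and print the
# services named in the "Failed to start ... services." message, one per
# line. Hypothetical helper; it parses text only and starts nothing.
failed_services() {
  sed -n 's/.*Failed to start \(.*\) services\..*/\1/p' | tr -d ' ' | tr ',' '\n'
}

# Example:
#   service-control --start --all 2>&1 | failed_services
#   service-control --status --all
```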

Question: Is there a way I can update the 10 ESXi hosts that I have without using the VCSA Update Manager, in a somewhat easy way? If I can resolve this issue by updating the ESXi hosts, I can then worry about the VUM later.

Thanks

PatrickDLong · 12-11-2019

We're starting to get into "when you're digging yourself into a hole, quit digging" territory. Of course there are CLI methods to update your hosts via ZIP files, but without a running VCSA you have no easy way to move the VMs off your hosts to perform the host upgrade. I would strongly recommend contacting VMware Support and working through the VCSA issues to get it back online.
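
For reference, the CLI route mentioned above usually looks something like the following. This is only a sketch under stated assumptions: the offline bundle ZIP has already been copied to a datastore on the host, every VM on that host has been shut down or moved by hand, and print_patch_plan is a hypothetical helper that prints the esxcli steps for review rather than running them. The depot path and profile name are placeholders; the real profile name has to be read from the bundle with the first command.

```shell
# print_patch_plan DEPOT_ZIP PROFILE: print (not run) the esxcli steps for
# patching one standalone host from an offline bundle. Both arguments are
# placeholders supplied by the caller; nothing is executed on the host.
print_patch_plan() {
  depot=$1
  profile=$2
  cat <<EOF
esxcli software sources profile list -d $depot
esxcli system maintenanceMode set --enable true
esxcli software profile update -d $depot -p $profile
reboot
esxcli system maintenanceMode set --enable false
EOF
}

# Example (placeholder values):
#   print_patch_plan /vmfs/volumes/datastore1/bundle.zip EXAMPLE-PROFILE
```

Printing the plan first makes it easier to check each step per host before anything is changed, which matters when there is no working vCenter to fall back on.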

I'm sorry this did not go smoothly for you. I also wish there was better QA happening for the ESXi releases. IMO an exponential increase in health event logging is something that should have been caught prior to publicly releasing 6.7U3.
