vCenter 6.7.0.21000 Build 11726888
vCenter vPostgres keeps eating all of the vCenter memory.
CPU and memory get pegged, vCenter goes unresponsive, and eventually the web services crash.
Sounds like you need to open an SR here.
Tell me about it. LOL
Is the appliance swapping, or does it still show more than 500% IOWAIT? At first glance I would assume the storage is too slow, and the vPostgres database therefore has problems with queries that cannot be processed fast enough. But it's just a guess. I would also recommend opening an SR: https://my.vmware.com/group/vmware/get-help
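If you want to rule that out quickly, a couple of checks from the appliance shell (a minimal sketch; free and top ship with the vCSA's Photon OS base, vmstat availability may vary by build):

free -h        --> any swap actually in use?
vmstat 5 3     --> the 'wa' column is CPU time spent waiting on I/O
top            --> press '1' to see per-CPU %wa (iowait)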
I like your thinking. That is what I thought too, so I moved the vCenter from a vSAN datastore to a Datrium datastore, but the problem keeps happening. It is really weird. I am thinking of moving the vCenter to some tried-and-true FC storage running on Kaminario. The Datrium has been super fast, unless there is some networking issue I am not seeing. I have vRealize Log Insight and vROps, and nothing really shows up from the networking side. It seems to be an out-of-control SQL process, like a bad DB schema or something.
Can you attach the latest postgresql log from /var/log/vmware/vpostgres?
Which monitoring tool are you using?
There are hundreds of connections to the DB:
2019-04-05 17:08:31.617 UTC 5ca78b8f.9110 0 VCDB vc FATAL: remaining connection slots are reserved for non-replication superuser connections
2019-04-05 17:08:31.618 UTC 5ca78b8f.9111 0 VCDB vc FATAL: remaining connection slots are reserved for non-replication superuser connections
2019-04-05 17:08:31.618 UTC 5ca78b8f.9112 0 VCDB vc FATAL: remaining connection slots are reserved for non-replication superuser connections
2019-04-05 17:08:31.619 UTC 5ca78b8f.9113 0 VCDB vc FATAL: remaining connection slots are reserved for non-replication superuser connections
2019-04-05 17:08:31.677 UTC 5ca78b8f.9114 0 VCDB vc FATAL: remaining connection slots are reserved for non-replication superuser connections
Please share the output of the commands below:
cat /storage/db/vpostgres/postgresql.conf | grep -i "max_connections"
netstat -tulnap | grep -i 443 --> and check if there are several connections from a specific IP
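You can also count how many connections each local process is holding to vPostgres with something like this (a sketch built on the same netstat output; field 7 is the PID/program column):

netstat -tanp | grep ':5432' | awk '{print $7}' | sort | uniq -c | sort -rn | head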
cat /storage/db/vpostgres/postgresql.conf | grep -i "max_connections"
max_connections = 100 # (change requires restart)
root@vc-irv [ ~ ]# netstat -tulnap | grep -i 443
tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 2390/rhttpproxy
tcp 0 0 0.0.0.0:5443 0.0.0.0:* LISTEN 2548/vsphere-ui.lau
tcp 0 0 0.0.0.0:9443 0.0.0.0:* LISTEN 2547/vsphere-client
tcp 1 0 127.0.0.1:33700 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 0 0 127.0.0.1:47412 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:42442 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:5432 127.0.0.1:44330 CLOSE_WAIT 35050/postgres: vc
tcp 0 0 10.10.98.47:44434 10.10.67.16:9543 ESTABLISHED 1439/liagent
tcp 0 0 127.0.0.1:443 127.0.0.1:47584 ESTABLISHED 2390/rhttpproxy
tcp 0 0 127.0.0.1:443 127.0.0.1:47838 TIME_WAIT -
tcp 0 0 127.0.0.1:47578 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:47032 127.0.0.1:443 CLOSE_WAIT 3844/python
tcp 1 0 127.0.0.1:5432 127.0.0.1:44302 CLOSE_WAIT 35036/postgres: vc
tcp 0 0 127.0.0.1:46702 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:33692 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:33694 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:52374 127.0.0.1:443 CLOSE_WAIT 2548/vsphere-ui.lau
tcp 0 0 127.0.0.1:47460 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:33698 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 0 0 10.10.98.47:443 10.10.98.124:50626 TIME_WAIT -
tcp 1 0 127.0.0.1:60624 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 0 0 127.0.0.1:47628 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:46542 127.0.0.1:443 CLOSE_WAIT 2548/vsphere-ui.lau
tcp 1 0 127.0.0.1:5432 127.0.0.1:44308 CLOSE_WAIT 35042/postgres: vc
tcp 0 0 127.0.0.1:47282 127.0.0.1:443 ESTABLISHED 5366/vmware-sps.lau
tcp 0 0 127.0.0.1:47752 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:5432 127.0.0.1:44306 CLOSE_WAIT 35039/postgres: vc
tcp 1 0 127.0.0.1:55964 127.0.0.1:443 CLOSE_WAIT 3844/python
tcp 1 0 127.0.0.1:60628 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:33716 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:33650 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:57548 127.0.0.1:443 CLOSE_WAIT 5320/python
tcp 0 0 127.0.0.1:443 127.0.0.1:47526 TIME_WAIT -
tcp 0 0 127.0.0.1:47656 127.0.0.1:443 TIME_WAIT -
tcp 0 0 127.0.0.1:47582 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:5432 127.0.0.1:44314 CLOSE_WAIT 35044/postgres: vc
tcp 0 0 127.0.0.1:443 127.0.0.1:47832 TIME_WAIT -
tcp 0 0 127.0.0.1:47734 127.0.0.1:443 TIME_WAIT -
tcp 0 0 127.0.0.1:443 127.0.0.1:47448 TIME_WAIT -
tcp 1 0 127.0.0.1:5432 127.0.0.1:44340 CLOSE_WAIT 35057/postgres: vc
tcp 1 0 127.0.0.1:33712 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:5432 127.0.0.1:44338 CLOSE_WAIT 35056/postgres: vc
tcp 0 0 127.0.0.1:443 127.0.0.1:41990 TIME_WAIT -
tcp 1 0 127.0.0.1:41142 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:5432 127.0.0.1:44348 CLOSE_WAIT 35060/postgres: vc
tcp 0 0 127.0.0.1:47648 127.0.0.1:443 TIME_WAIT -
tcp 0 0 10.10.98.47:443 10.10.98.124:50622 TIME_WAIT -
tcp 0 0 127.0.0.1:47746 127.0.0.1:443 TIME_WAIT -
tcp 0 0 127.0.0.1:47830 127.0.0.1:443 ESTABLISHED 5366/vmware-sps.lau
tcp 0 0 127.0.0.1:47622 127.0.0.1:443 TIME_WAIT -
tcp 0 0 127.0.0.1:443 127.0.0.1:47282 ESTABLISHED 2390/rhttpproxy
tcp 0 0 127.0.0.1:47668 127.0.0.1:443 TIME_WAIT -
tcp 0 0 127.0.0.1:47674 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:5432 127.0.0.1:44350 CLOSE_WAIT 35063/postgres: vc
tcp 1 0 127.0.0.1:5432 127.0.0.1:44310 CLOSE_WAIT 35043/postgres: vc
tcp 1 0 127.0.0.1:33690 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:5432 127.0.0.1:44334 CLOSE_WAIT 35053/postgres: vc
tcp 0 0 127.0.0.1:47634 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:5432 127.0.0.1:44303 CLOSE_WAIT 52434/postgres: vc
tcp 1 0 127.0.0.1:60406 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:33710 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 0 0 127.0.0.1:443 127.0.0.1:47830 ESTABLISHED 2390/rhttpproxy
tcp 1 0 127.0.0.1:33706 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:42612 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 32 0 10.10.98.47:55568 184.27.114.65:443 CLOSE_WAIT 5319/updatemgr
tcp 0 0 10.10.98.47:443 10.10.98.124:50631 TIME_WAIT -
tcp 1 0 10.10.98.47:34874 208.91.0.89:443 CLOSE_WAIT 2547/vsphere-client
tcp 1 0 127.0.0.1:33714 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 0 0 127.0.0.1:443 127.0.0.1:47722 TIME_WAIT -
tcp 0 0 127.0.0.1:47584 127.0.0.1:443 ESTABLISHED 5320/python
tcp 1 0 127.0.0.1:5432 127.0.0.1:44346 CLOSE_WAIT 35058/postgres: vc
tcp 1 0 127.0.0.1:56164 127.0.0.1:443 CLOSE_WAIT 5348/vmware-vsm.lau
tcp 0 0 127.0.0.1:443 127.0.0.1:47922 TIME_WAIT -
tcp 0 0 127.0.0.1:47740 127.0.0.1:443 TIME_WAIT -
tcp 0 0 127.0.0.1:47610 127.0.0.1:443 TIME_WAIT -
tcp 0 0 127.0.0.1:47662 127.0.0.1:443 TIME_WAIT -
tcp 1 0 127.0.0.1:33696 127.0.0.1:443 CLOSE_WAIT 5400/vmware-content
tcp 1 0 127.0.0.1:5432 127.0.0.1:44324 CLOSE_WAIT 35049/postgres: vc
tcp6 0 0 :::443 :::* LISTEN
I am not familiar with this. Thanks again for all the help.
One way to address this issue is to increase max_connections to 250, restart the vCSA, and monitor.
If this doesn't help, contact VMware support.
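For reference, a sketch of that change (the sed pattern assumes the line reads exactly as in your grep output above, so back up the file and verify first; restarting only vPostgres can upset dependent services, so a full appliance reboot is the safer route):

cp /storage/db/vpostgres/postgresql.conf /storage/db/vpostgres/postgresql.conf.bak
sed -i 's/^max_connections = 100/max_connections = 250/' /storage/db/vpostgres/postgresql.conf
service-control --stop vmware-vpostgres && service-control --start vmware-vpostgres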
Can you educate me on what these connections are for? I know my team does a lot of automation, as this is a lab environment.
Usually this comes from the application level: one or more services connecting to PostgreSQL are evidently not releasing connections, causing the pool of available connection slots to run out.
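To see who is actually holding the slots, you can query pg_stat_activity directly (a sketch; the psql path is the usual one on a 6.7 vCSA, and VCDB is the database name taken from your log above, so adjust if yours differ):

/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "SELECT usename, application_name, client_addr, state, count(*) FROM pg_stat_activity GROUP BY 1, 2, 3, 4 ORDER BY 5 DESC;"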
Attach the vpxd and vpxd-profiler logs; we can track the client IP from them.
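For example, a rough first pass (log locations are the vCSA defaults; the exact entries worth grepping for vary by build, so treat this as a starting point, not the definitive method):

grep -i "SessionStats" /var/log/vmware/vpxd/vpxd-profiler.log | tail -20
grep -iE "session|client" /var/log/vmware/vpxd/vpxd.log | tail -20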
Very cool, finally something that makes sense. I will give it a go and let you know what happens. Cheers, making the changes now.
Attach vpxd.log as well.
I've sent you a DM.
Did you ever get this fixed? Mine isn't crashing, but it's been nailing the CPU.
No, we were not able to fix it, and I do not have VMware support. I would suggest opening a case with VMware support.