NSX-T 3rd node deployment always fails

thills · 10-01-2021

Hi there,

I'm deploying an NSX-T cluster in my homelab to gain some experience. Nodes 1 and 2 deployed without a problem, but whenever I attempt to deploy node 3, it fails with "Failed to start LSB: Puts a log file pager on virtual consoles. See 'systemctl status console-log.service' for details." That status output contains:

root@l-nsx-02:~# systemctl status console-log.service
* console-log.service - LSB: Puts a logfile pager on virtual consoles
   Loaded: loaded (/etc/init.d/console-log; enabled; vendor preset: enabled)
   Active: activating (start-pre) since Fri 2021-10-01 19:00:34 UTC; 1min 31s ago
     Docs: man:systemd-sysv-generator(8)
Cntrl PID: 19094 (wait_for_corfu_)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/console-log.service
           |-19094 /bin/bash /opt/vmware/bin/wait_for_corfu_layout.sh
           `-19656 sleep 5

Oct 01 19:01:19 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:24 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:29 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:34 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:39 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:44 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:49 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:54 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:59 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:02:04 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...

There are also "Failed to start" errors for Proxy, nsx-cluster-boot-manager, and phonehome-coordinator, each with a similar "See 'systemctl status <service>' for details" message.

All of them show output like the following:

root@l-nsx-02:~# systemctl status proxy.service
* proxy.service - proxy: VMware NSX reverse-proxy API server
   Loaded: loaded (/etc/init.d/proxy; enabled; vendor preset: enabled)
   Active: activating (start-pre) since Fri 2021-10-01 19:02:04 UTC; 2min 28s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 19451 ExecStopPost=/etc/init.d/proxy poststop (code=exited, status=0/SUCCESS)
Cntrl PID: 19639 (proxy)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/proxy.service
           |-19639 /bin/sh /etc/init.d/proxy prestart
           `-20554 sleep 5

Oct 01 19:03:45 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:03:50 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:03:55 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:04:00 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:04:05 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:04:10 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:04:15 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:04:20 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:04:25 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
Oct 01 19:04:30 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
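Since each status dump repeats the same wait message every 5 seconds, it can help to collapse them into one line per process to see what each service is blocked on. Below is a minimal sketch of that, assuming the log lines follow the syslog-style format shown above; the `summarize` function and the regular expression are my own illustration, not an NSX-T tool.

```python
import re
from collections import OrderedDict

# Matches journal lines as printed by `systemctl status`, e.g.
# "Oct 01 19:03:45 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} (?P<host>\S+) "
    r"(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def summarize(log_text):
    """Return {(process, message): count}, keeping first-seen order."""
    counts = OrderedDict()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers, blank lines, and anything not in journal format
        key = (match.group("proc"), match.group("msg"))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Lines copied from the status output in this post.
sample = """\
Oct 01 19:01:19 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:01:24 l-nsx-02 wait_for_corfu_layout.sh[19094]: Waiting for corfu layout file...
Oct 01 19:03:45 l-nsx-02 proxy[19639]: Waiting for CBM to lift the barrier file...
"""

for (proc, msg), n in summarize(sample).items():
    print(f"{proc}: {msg} (x{n})")
```

On the lines above this reduces the output to two entries: console-log is waiting on the corfu layout file, and proxy is waiting on CBM to lift its barrier file.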

This only affects whichever node is deployed third. If I delete node 2 and redeploy it using node 3's information (IP, name, etc.), it deploys fine, but when I then attempt to deploy the third node with node 2's information, I get exactly the same issue.

This is on a 7.0.1 U2 stack, using NSX-T 3.0.1.

At first I thought the OVA I downloaded might have been corrupted, as it was downloaded over wireless. I've since removed the deployment completely, including the registration on the vCSA, and started over with a fresh OVA downloaded over a wired connection, only to end up with the same result.

What logs am I missing?

Also, because the deployment wizard says to sync NTP, I've checked with watch "ntpq -p 127.0.0.1" on the vCSA, the ESXi hosts, and the 2 NSX management nodes. All devices are within a few ms of each other, using 2 on-prem NTP sources as well as 3 off-prem 1.north-america.pool.ntp.org servers.
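For anyone wanting to repeat the NTP check without eyeballing `watch`, here is a rough sketch that parses the `ntpq -p` peer table and reports the largest absolute offset (the offset column is in milliseconds). The function names are hypothetical, and the sample table is made up for illustration using documentation addresses. It is not output from my lab.

```python
def parse_ntpq_peers(text):
    """Parse the peer table printed by `ntpq -p` into a list of dicts."""
    peers = []
    for line in text.splitlines():
        # Skip blank lines, the column header, and the "====" separator.
        if not line.strip() or line.lstrip().startswith("remote") or line.startswith("="):
            continue
        # The first character is the tally code ('*' = selected peer, '+' = candidate, ...).
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:
            continue  # not a peer row in the expected 10-column layout
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "offset_ms": float(fields[8]),  # columns: ... delay, offset, jitter
        })
    return peers


def max_abs_offset_ms(peers):
    """Largest absolute offset across all parsed peers, in milliseconds."""
    return max(abs(p["offset_ms"]) for p in peers)


# Illustrative sample only (invented values, RFC 5737 documentation addresses).
sample = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    0.345    0.812   0.050
+192.0.2.11      192.0.2.10       2 u   12   64  377    0.512   -1.204   0.091
"""

peers = parse_ntpq_peers(sample)
print(f"{len(peers)} peers, max |offset| = {max_abs_offset_ms(peers):.3f} ms")
```

Feeding it the captured output from each device (vCSA, ESXi hosts, NSX managers) would give one number per device to compare.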

Thoughts and suggestions appreciated.