robert_23
Contributor

VCSA HA: eth0 and services don't start after reboot

Hi,

I configured a VCSA 6.5 HA deployment in our test environment. All three nodes were up and healthy, and vCenter showed them green in the HA status. First problem: after a simultaneous reboot of all three machines, the eth0 interface of the active and passive nodes stays down. Second problem: even after configuring eth0 manually and regaining IP connectivity on the public interface, the VCSA services can't be started:

Command> service-control --status

Running:

vmware-statsmonitor vmware-vcha vmware-vmon

Stopped:

applmgmt lwsmd pschealth vmafdd vmcad vmcam vmdird vmdnsd vmonapi vmware-cis-license vmware-cm vmware-content-library vmware-eam vmware-imagebuilder vmware-mbcs vmware-netdumper vmware-perfcharts vmware-psc-client vmware-rbd-watchdog vmware-rhttpproxy vmware-sca vmware-sps vmware-sts-idmd vmware-stsd vmware-updatemgr vmware-vapi-endpoint vmware-vpostgres vmware-vpxd vmware-vpxd-svcs vmware-vsan-health vmware-vsm vsphere-client vsphere-ui

Command> service-control --start --all

Perform start operation. vmon_profile=HACore, svc_names=None, include_coreossvcs=True, include_leafossvcs=False

2017-01-13T13:30:25.695Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'lwsmd']

2017-01-13T13:30:25.698Z   Done running command

Service lwsmd startup type is not automatic. Skip

2017-01-13T13:30:25.701Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'vmafdd']

2017-01-13T13:30:25.703Z   Done running command

Service vmafdd startup type is not automatic. Skip

2017-01-13T13:30:25.705Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'vmdird']

2017-01-13T13:30:25.707Z   Done running command

Service vmdird startup type is not automatic. Skip

2017-01-13T13:30:25.710Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'vmcad']

2017-01-13T13:30:25.712Z   Done running command

Service vmcad startup type is not automatic. Skip

2017-01-13T13:30:25.714Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'vmware-sts-idmd']

2017-01-13T13:30:25.716Z   Done running command

Service vmware-sts-idmd startup type is not automatic. Skip

2017-01-13T13:30:25.719Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'vmware-stsd']

2017-01-13T13:30:25.721Z   Done running command

Service vmware-stsd startup type is not automatic. Skip

2017-01-13T13:30:25.723Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'vmdnsd']

2017-01-13T13:30:25.726Z   Done running command

Service vmdnsd startup type is not automatic. Skip

2017-01-13T13:30:25.728Z   Running command: ['/usr/bin/systemctl', 'is-enabled', u'vmware-psc-client']

2017-01-13T13:30:25.730Z   Done running command

Service vmware-psc-client startup type is not automatic. Skip

Successfully started vmon services. Profile HACore.

/etc/systemd/network/10-eth0.network.manual:

[Match]

Name=eth0

[Network]

Gateway=10.45.128.1

Address=10.45.128.32/24

DHCP=no

[DHCP]

UseDNS=false

/etc/systemd/network/10-eth1.network:

[Match]

Name=eth1

[Network]

Address=192.168.64.204/23

DHCP=no

[DHCP]

UseDNS=false

I have now shut the passive node down to try to get the active node and the witness node up again, but the problems persist. Pinging between the HA interfaces of the active and witness nodes works. Why is eth0 down on both nodes after boot, and why can't I start the services?
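Side note for anyone reading the `service-control --start --all` output above: every core service is skipped with "startup type is not automatic". Here is a minimal sketch for pulling the skipped service names out of such a log. It only parses the text shown above; the function name `skipped_services` is made up for illustration and is not a VCSA tool.

```shell
# Print the names of services that service-control skipped because their
# startup type is not automatic. Reads the log text from stdin.
# (skipped_services is a hypothetical helper name, not a VCSA tool.)
skipped_services() {
    sed -n 's/^Service \([^ ]*\) startup type is not automatic\..*$/\1/p'
}
```

Possible usage, assuming the log is piped in directly: `service-control --start --all 2>&1 | skipped_services`.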

edit:

networkctl status eth0

● 2: eth0

       Link File: /usr/lib/systemd/network/99-default.link

    Network File: n/a

            Type: ether

           State: routable (unmanaged)

            Path: pci-0000:03:00.0

          Driver: vmxnet3

          Vendor: VMware

           Model: VMXNET3 Ethernet Controller

      HW Address: 00:0c:29:7e:7e:08 (VMware, Inc.)

             MTU: 1500

         Address: 10.45.128.32

         Gateway: 10.45.128.1 (ICANN, IANA Department)

Network File: n/a and State: unmanaged? But I do have the file /etc/systemd/network/10-eth0.network.manual, which was created by vCenter. How can I fix this?
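A note on the "Network File: n/a" line: systemd-networkd only loads configuration files whose names end in `.network`, so a file named `10-eth0.network.manual` is not applied to eth0. The following is a rough sketch for spotting such files, assuming the default directory `/etc/systemd/network`; the function name `suffixed_network_files` is hypothetical.

```shell
# List files that look like networkd configs but carry an extra suffix
# (e.g. 10-eth0.network.manual). systemd-networkd only loads files whose
# names end in ".network", so these are not applied.
# (suffixed_network_files is a hypothetical helper name.)
suffixed_network_files() {
    dir="${1:-/etc/systemd/network}"
    for f in "$dir"/*.network.*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        printf '%s\n' "$f"
    done
}
```

Running `suffixed_network_files` with no argument checks the default directory. This only reports the files; it does not explain what renamed them.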

3 Replies
sscoodd
Contributor

Exactly the same thing happened here after a failed HA cluster installation.

I destroyed the failed cluster, deleted the passive and witness nodes, restarted the active node, and lost the management IP.

I did some experiments with /etc/systemd/network/10-eth0.network.manual:

root@vcsa1 [ /etc/systemd/network ]# mv 10-eth0.network.manual 10-eth0.network

root@vcsa1 [ /etc/systemd/network ]# systemctl restart systemd-networkd

root@vcsa1 [ /etc/systemd/network ]# networkctl

Got this:

IDX LINK            TYPE              OPERATIONAL SETUP

  1 lo              loopback          carrier    unmanaged

  2 eth0            ether              routable    configured


eth0 is up and pings fine.

After a reboot, the network is down again and 10-eth0.network has been renamed back to 10-eth0.network.manual.

Fine.

root@vcsa1 [ /etc/systemd/network ]# cp 10-eth0.network.manual 20-eth0.network

root@vcsa1 [ /etc/systemd/network ]# systemctl restart systemd-networkd


Network is up.

I rebooted the VCSA while pinging it from outside. It starts up, I see a couple of ping replies, and then nothing again.

I checked that the copied 20-eth0.network is still there, but something brings the network down during boot.

root@vcsa1 [ /etc/systemd/network ]# systemctl restart systemd-networkd

Makes it work again.

Just out of curiosity:

root@vcsa1 [ /etc/systemd/network ]# /etc/rc.d/init.d/network start

Starting network (via systemctl):  Job for network.service failed because the control process exited with error code. See "systemctl status network.service" and "journalctl -xe" for details.

                                                          [FAILED]


root@vcsa1 [ /etc/systemd/network ]# systemctl status network.service


● network.service - LSB: Bring up/down networking

  Loaded: loaded (/etc/rc.d/init.d/network; bad; vendor preset: enabled)

  Active: failed (Result: exit-code) since Wed 2017-02-08 10:28:37 UTC; 10s ago

    Docs: man:systemd-sysv-generator(8)

  Process: 2345 ExecStart=/etc/rc.d/init.d/network start (code=exited, status=6)

Feb 08 10:28:37 vcsa1.lab.local systemd[1]: Starting LSB: Bring up/down networking...

Feb 08 10:28:37 vcsa1.lab.local systemd[1]: network.service: Control process exited, code=exited status=6

Feb 08 10:28:37 vcsa1.lab.local systemd[1]: Failed to start LSB: Bring up/down networking.

Feb 08 10:28:37 vcsa1.lab.local systemd[1]: network.service: Unit entered failed state.

Feb 08 10:28:37 vcsa1.lab.local systemd[1]: network.service: Failed with result 'exit-code'.

I'd really like to fix this, since this is the second time I've redeployed the VCSA and run into the same problem.

Anyone?

UPDATE.

It seems the vCenter HA cluster had not been destroyed correctly.

Destroying it manually on the active node did the trick:

root@vcsa1 [ ~ ]# destroy-vcha

Caution: This will remove all vCenter HA related configuration from the current node and it cannot be reused to form a vCenter HA cluster unless this is the Active node.

Confirm to proceed? (y/n): y

logs available at: /var/log/vmware/vcha

2017-02-08T13:31:44.935Z   Successfully updated starttype: DISABLED for service vcha

2017-02-08T13:31:50.644Z   Running command: ['/usr/lib/applmgmt/networking/bin/firewall-reload']

2017-02-08T13:31:50.778Z   Done running command

Skip not found service - vmware-stsd

Skip not found service - vmware-sts-idmd

Skip not found service - vmdnsd

Skip not found service - vmdird

Skip not found service - vmcad

Skip not found service - vmware-psc-client

Reboot and you're done.
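To confirm after the reboot that eth0 really came back under networkd's control, the `networkctl` listing shown earlier can be checked for SETUP "configured". This is a minimal sketch that only parses that listing; the function name `link_is_configured` is hypothetical.

```shell
# Succeed (exit 0) if the named link shows SETUP "configured" in the
# output of `networkctl`, which is read from stdin.
# (link_is_configured is a hypothetical helper name.)
link_is_configured() {
    awk -v link="$1" '
        $2 == link && $NF == "configured" { ok = 1 }
        END { exit ok ? 0 : 1 }
    '
}
```

Possible usage: `networkctl | link_is_configured eth0 && echo "eth0 is managed"`.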

billyjoebob12
Contributor

+1 for

# destroy-vcha 

iforbes
Hot Shot

THANK YOU!!!!! I thought I'd be proactive and deploy VCSA HA. What a nightmare! VMware needs to pull this feature until it's fixed. I lost eth0 like you guys and spent the last couple of hours troubleshooting. I was about to rebuild the VCSA when I came across this post. Saved the day :)
