3 Replies Latest reply on Jan 20, 2018 4:40 PM by TheBobkin

    Latest vSAN 6 periodically fails the vSAN Health service. WTF?

    malefik Enthusiast

      Hello colleagues!

      ESXi-6.5.0-7388607, VCSA-6.5.0-7515524

      We have 4 servers in a vSAN cluster. Periodically (and it is not clear why) one or another of them develops a problem:

      [screenshot attached]

      At the same time, the virtual machines on the affected host keep working normally. We try to put the host into maintenance mode, but it fails:

      [screenshot attached]

      The only thing that helps is rebooting the host without maintenance mode, then entering maintenance mode and rebooting again.

      We also see a certain "vSAN SCSI Target" task, even though this option is disabled in our cluster!

      [screenshot attached]

      Any help?

        • 1. Re: Latest vSAN 6 periodically fails the vSAN Health service. WTF?
          TheBobkin Virtuoso
          vExpert | VMware Employee

          Hello malefik,

           

           

          That's a really strange one. I would advise taking a closer look at the host while it is in this state: potentially clomd is not running, which would explain the inability to detect the host version (and thus the assumption that it is prior to 6.0 U2), or another service necessary for full use of vSAN has stopped.

           

          Is it always the same host that experiences this issue?

          Has this issue occurred on other builds?

           

          Have you tried putting the host in Maintenance Mode (MM) from the CLI? (So as to rule out any potential vCenter communication issues):

          # localcli system maintenanceMode set -e 1 -m ensureObjectAccessibility
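          For context, a useful property of localcli is that it talks to the host directly, bypassing hostd, whereas esxcli goes through hostd. Comparing the two can therefore tell you whether hostd itself is the stuck component. A sketch, assuming a standard ESXi 6.5 shell (command names as on a stock host):

          ```shell
          # Check the current maintenance mode state without involving hostd
          localcli system maintenanceMode get

          # The esxcli equivalent goes through hostd; if this hangs or errors
          # while the localcli call above works, suspect hostd
          esxcli system maintenanceMode get
          ```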

           

          Do those 'Edit vSAN SCSI Target service' jobs ever complete? A lot of 'tasks' run on vSAN as checks of host/cluster capabilities and configuration, regardless of whether the feature is in use. These could be hung due to communication issues with that host; in that case I would advise restarting hostd, or otherwise killing the tasks, and then trying to enter MM again.
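          Restarting the management agents is generally safe (running VMs are not affected). A sketch of the restart, assuming the standard ESXi 6.5 init scripts:

          ```shell
          # Restart hostd, the host management agent; running VMs keep running
          /etc/init.d/hostd restart

          # If tasks are still stuck afterwards, restart vpxa (the vCenter agent) too
          /etc/init.d/vpxa restart
          ```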

           

          Do you have any host logs (clomd.log, vmkernel.log, vobd.log, vmkwarning.log) from the last time this occurred that you could attach here?

           

           

          Bob

          • 2. Re: Latest vSAN 6 periodically fails the vSAN Health service. WTF?
            malefik Enthusiast

            Well, Bob, we had the problem again yesterday, with a different host. Please help!

             

            1. "potentially clomd is not running"

            No. I ran /etc/init.d/clomd status and it reports "clomd is running"
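            The other management agents can be checked the same way. A sketch, assuming the standard ESXi 6.5 init scripts are present on the host:

            ```shell
            /etc/init.d/clomd status    # vSAN cluster-level object manager
            /etc/init.d/hostd status    # host management agent
            /etc/init.d/vpxa status     # vCenter agent
            /etc/init.d/vsanmgmtd status  # vSAN management daemon (health service)
            ```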

             

            2. "Is it always the same host that experiences this issue?"

            No, different hosts have this problem at different times.

             

            3. "Have you tried putting the host in Maintenance Mode (MM) from the CLI"

            Same problem: the message "A general system error occurred: HTTP error response: Service Unavailable"

             

            4. "Do those 'Edit vSAN SCSI Target service' jobs ever complete?"

            No, they never complete. Progress freezes at 50%.

             

            5. "Do you have any host logs"

            Yes, attached.

            • 3. Re: Latest vSAN 6 periodically fails the vSAN Health service. WTF?
              TheBobkin Virtuoso
              VMware Employee | vExpert

              Hello malefik,

               

               

              It appears the hostd, vpxa and netcpa services are having issues, as otherwise we shouldn't be seeing them core dumping. The former two would explain the hung tasks, the hosts appearing unresponsive from the vSphere management perspective, and the health check not working. I am not familiar enough with NSX to say whether netcpa having issues is a symptom, a side-effect or a possible cause:

               

              2018-01-19T21:05:56.930Z cpu21:69685 opID=ca5e77d3)World: 12235: VC opID 8ae7ebca-fd5c-11e7-8d03 maps to vmkernel opID ca5e77d3

              2018-01-19T21:05:56.930Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.931Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.933Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.934Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.936Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.937Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.940Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.941Z cpu21:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:05:56.943Z cpu22:69685 opID=ca5e77d3)WARNING: LinuxThread: 381: hostd-worker: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.227Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.228Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.229Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.230Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.230Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.231Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.232Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.233Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:13.233Z cpu26:67952)WARNING: LinuxThread: 381: python: Error cloning thread: -12 (bad0014)

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eb6aa8c43de2 repeated 6 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eaf7a1e6b4f1 repeated 8 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eb93ab39bfbf repeated 1 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eb11a3759fe4 repeated 9 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eb23a4945e90 repeated 9 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eb41a6561112 repeated 3 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eacb9f4afc94 repeated 5 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eb55a7835483 repeated 3 times

              2018-01-19T21:06:33.098Z cpu36:66381)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.600304801bf13b011fd1eae2a0b10467 repeated 8 times

              2018-01-19T21:06:34.393Z cpu19:9371604)ALERT: Warning - NETCPA getting netcpa status failed!

              2018-01-19T21:08:04.426Z cpu20:9371631)ALERT: Warning - NETCPA getting netcpa status failed!

              2018-01-19T21:09:34.460Z cpu20:9371797)ALERT: Warning - NETCPA getting netcpa status failed!

               

              2018-01-19T21:16:57.633Z cpu5:6418654)User: 3089: vpxa-worker: wantCoreDump:vpxa-worker signal:6 exitCode:0 coredump:enabled

              2018-01-19T21:16:57.793Z cpu5:6418654)UserDump: 3024: vpxa-worker: Dumping cartel 69492 (from world 6418654) to file /var/core/vpxa-zdump.000 ...

              2018-01-19T21:17:04.634Z cpu20:9372016)ALERT: Warning - NETCPA getting netcpa status failed!

              2018-01-19T21:17:06.687Z cpu12:6418654)UserDump: 3172: vpxa-worker: Userworld(vpxa-worker) coredump complete.

              2018-01-19T21:17:07.268Z cpu20:9372082)WARNING: LinuxThread: 381: sh: Error cloning thread: -12 (bad0014)

              2018-01-19T21:17:08.651Z cpu38:69655)User: 3089: hostd-worker: wantCoreDump:hostd-worker signal:6 exitCode:0 coredump:enabled

              2018-01-19T21:17:08.820Z cpu38:69655)UserDump: 3024: hostd-worker: Dumping cartel 68947 (from world 69655) to file /var/core/hostd-worker-zdump.000 ...

              2018-01-19T21:17:15.355Z cpu21:9372084)WARNING: LinuxThread: 381: sh: Error cloning thread: -12 (bad0014)

              2018-01-19T21:17:17.275Z cpu20:9372086)WARNING: LinuxThread: 381: sh: Error cloning thread: -12 (bad0014)

              2018-01-19T21:17:27.281Z cpu3:9372088)WARNING: LinuxThread: 381: sh: Error cloning thread: -12 (bad0014)

              2018-01-19T21:17:31.870Z cpu29:69655)UserDump: 3172: hostd-worker: Userworld(hostd-worker) coredump complete.

               

              The only external reference to those initial netcpa warning messages is the document below, which advises restarting netcpad:

              https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.troubleshooting.doc/GUID-8DEA451F-F74F-446E-82B6-1C3BED99BF87.html
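              The restart that the linked document describes boils down to the following, assuming the standard NSX-for-vSphere user world agent init script on the host:

              ```shell
              # Restart the NSX control-plane agent on the ESXi host
              /etc/init.d/netcpad restart

              # Confirm it came back up
              /etc/init.d/netcpad status
              ```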

              There are thousands of lines of what looks like NSX reconfiguring something after the services core dump. I don't know whether this is expected behaviour or not.

               

              Are these HP servers?

              When you tried entering MM via the CLI did you use localcli or esxcli?

               

              I would advise opening an SR with VMware GSS so they can analyse the generated dump files and start working out which element of the configuration is having issues here. I don't see anything alarming in the clomd.log (e.g. the node becoming partitioned, or changes to Object/component availability), so I wouldn't suspect a vSAN issue.
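              For the SR, GSS will want a full support bundle from the affected host, which includes the zdump files under /var/core. A sketch, assuming the standard ESXi 6.5 vm-support tool:

              ```shell
              # Generate a host support bundle (a .tgz); the output path is
              # printed when the collection finishes
              vm-support
              ```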

               

               

              Bob
