Re: Helper process consuming CPU on all vSAN hosts...

JimPhreak · ‎11-14-2016

I've recently set up a 3-node + witness (2 nodes contributing storage) vSAN cluster and I'm noticing that on all 3 physical hosts, the 'helper' process is at 100+ %RUN. What could be causing this and how can I troubleshoot to determine the cause?

EDIT: The one alert I see in the vSAN monitoring tab is 'VSAN Disk Balance' but when I manually kick off a rebalance nothing happens (just hangs at 5%).

zdickinson · ‎11-15-2016

Good morning, what version of ESXi are you running? The rebalance hanging at 5% has been discussed in the forums. Mostly the answer is to wait; it will finish. If it runs for more than 24 hours it will stop itself, and that is usually related to some underlying issue, which may be the same issue that is causing the high CPU. Thank you, Zach.

JimPhreak · ‎11-15-2016

I'm running 6.0U2 on all ESXi hosts. I kicked off a rebalance 3 hours ago and it's still running. Don't see why it should take this long considering there isn't THAT much data to be moved.

*Note* When I first started the rebalance the Disk Usage Above Threshold percentages were 13% and 15% respectively, so there was some initial movement but nothing since then.

zdickinson · ‎11-15-2016

This might be a problem with vCenter 6.0 U2: "Rebalance Virtual SAN Cluster task stuck at 5%"

Here is a link explaining the rebalance operation. https://greatwhitetec.com/2016/10/12/vsan-proactive-rebalance/

Thank you, Zach.

JimPhreak · ‎11-16-2016

So the rebalance did finish at 24 hours as stated in those links. However, I still have the helper process using a lot of CPU on all 3 vSAN hosts. On the host outside the vSAN cluster that hosts the vSAN Witness Appliance, the helper process is running normally (very low CPU usage).

Not sure where to go from here to troubleshoot this further.

JimPhreak · ‎11-16-2016

Found these entries over and over in my vmkernel.log. VLAN 20 is used for vMotion on all hosts. Investigating this now.

2016-11-16T18:18:09.266Z cpu7:11087552)NetHealthcheck: L2EchoSendVlan:1358: Build and send ticket 10829169, VLAN 20, seq 10829167 echo Eth pkt failed.

2016-11-16T18:18:09.266Z cpu7:11087552)NetHealthcheck: L2EchoTicketSend:1446: Eth send seq error: Not found

2016-11-16T18:18:09.267Z cpu10:11087553)NetHealthcheck: L2EchoSendVlan:1358: Build and send ticket 10829170, VLAN 20, seq 10829168 echo Eth pkt failed.

2016-11-16T18:18:09.267Z cpu10:11087553)NetHealthcheck: L2EchoTicketSend:1446: Eth send seq error: Not found

2016-11-16T18:18:09.267Z cpu6:11087554)NetHealthcheck: L2EchoSendVlan:1358: Build and send ticket 10829171, VLAN 20, seq 10829169 echo Eth pkt failed.

2016-11-16T18:18:09.267Z cpu6:11087554)NetHealthcheck: L2EchoTicketSend:1446: Eth send seq error: Not found

2016-11-16T18:18:09.267Z cpu10:11087555)NetHealthcheck: L2EchoSendVlan:1358: Build and send ticket 10829172, VLAN 20, seq 10829170 echo Eth pkt failed.

2016-11-16T18:18:09.267Z cpu10:11087555)NetHealthcheck: L2EchoTicketSend:1446: Eth send seq error: Not found

2016-11-16T18:18:09.267Z cpu12:11087556)NetHealthcheck: L2EchoSendVlan:1358: Build and send ticket 10829173, VLAN 20, seq 10829171 echo Eth pkt failed.

2016-11-16T18:18:09.267Z cpu12:11087556)NetHealthcheck: L2EchoTicketSend:1446: Eth send seq error: Not found

2016-11-16T18:18:09.267Z cpu5:11087557)NetHealthcheck: L2EchoSendVlan:1358: Build and send ticket 10829174, VLAN 20, seq 10829172 echo Eth pkt failed.

EDIT: Turns out the 'Teaming and Failover' health check on my vDSes was causing the issue. Once I disabled it, the CPU usage returned to normal. This is apparently a known issue.
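
For anyone seeing the same log spam, here is a rough Python sketch for tallying the failed NetHealthcheck echo sends per VLAN, which can help confirm which VLANs the health check is struggling with. The regular expression is built only from the lines quoted above, so other ESXi builds may format these messages differently; the function name count_failed_echoes is my own and not part of any VMware tool.

```python
import re
from collections import Counter

# Assumption: the pattern mirrors the NetHealthcheck lines quoted in this
# thread (ESXi 6.0 U2). Other builds may log a different format.
FAILED_ECHO = re.compile(
    r"NetHealthcheck: L2EchoSendVlan:\d+: Build and send ticket \d+, "
    r"VLAN (\d+), seq \d+ echo Eth pkt failed"
)


def count_failed_echoes(lines):
    """Return a Counter mapping VLAN ID to the number of failed echo sends."""
    counts = Counter()
    for line in lines:
        match = FAILED_ECHO.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

To use it, pass any iterable of log lines, for example an open file handle on a copy of vmkernel.log, and print the resulting counts. A large, steadily growing count for a single VLAN would be consistent with the health check repeatedly failing on that VLAN, as it did here on VLAN 20.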
