vSAN Health Check Issues

Alex88 · 08-29-2018

Hello. We have a customer running vCenter Server Appliance 6.5 U1, twelve nodes running ESXi 6.5 Patch 2 (build 7388607), and vSAN 6.6.1 Patch 02.
This is an all-flash stretched cluster with the witness appliance in the cloud. I have configured static routes from the data nodes at both sites toward the witness, and routes from the witness back to the data nodes exist too. The cluster is functioning normally and vSAN is OK, BUT
we have warnings and errors in the vSAN health check:
vSAN: Basic (unicast) connectivity check
vSAN: MTU check (ping with large packet size)
These errors are reported from the witness to all 12 data hosts. Sometimes the warnings go away, and sometimes they appear for only 6 hosts instead of 12, for example, but the MTU error always persists. We have restarted vCenter and the witness appliance, with no result.
The customer has a distributed switch configuration with LACP and Nexus switches.

I built a lab configuration with a witness appliance, and there are no errors there.

In the vsanmgmt.log file in my lab the ping tests are OK. Here is a fragment of the vsanmgmt log file from my lab witness.

2018-08-29T11:34:20Z VSANMGMTSVC: INFO vsanperfsvc[782ef7de-ab7f-11e8] [VsanHealthPing::Ping] Run ping test for the hosts ['192.168.10.72', '192.168.10.73', '192.168.10.71', '192.168.10.76', '192.168.10.74', '192.168.10.75', '192.168.10.82', '192.168.10.81'] from local 172.17.2.52

2018-08-29T11:34:20Z VSANMGMTSVC: INFO vsanperfsvc[782ef7de-ab7f-11e8] [VsanHealthPing::PingTest] Pinger: all host response come back, ping done Seq:1, size:9000

2018-08-29T11:34:20Z VSANMGMTSVC: INFO vsanperfsvc[782ef7de-ab7f-11e8] [VsanHealthPing::Ping] Run ping test for the hosts ['192.168.10.72', '192.168.10.73', '192.168.10.71', '192.168.10.76', '192.168.10.74', '192.168.10.75', '192.168.10.82', '192.168.10.81'] from local 172.17.2.52

2018-08-29T11:34:20Z VSANMGMTSVC: INFO vsanperfsvc[782ef7de-ab7f-11e8] [VsanHealthPing::PingTest] Pinger: all host response come back, ping done Seq:2, size:9000

But on the customer side we see the following warnings in the vsanmgmt.log file on the witness, which is in the cloud with a complex network environment.

2018-08-29T10:35:21Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Run ping test for the hosts ['172.16.160.94', '172.16.160.191', '172.16.160.96', '172.16.160.193', '172.16.160.91', '172.16.160.196', '172.16.160.95', '172.16.160.192', '172.16.160.194', '172.16.160.195', '172.16.160.92', '172.16.160.93'] from local 172.16.252.100

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::PingTest] Pinger: select time out after waiting for 0.416111

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.191, size:9000, pingSeq:1

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.193, size:9000, pingSeq:1

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.91, size:9000, pingSeq:1

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.196, size:9000, pingSeq:1

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.95, size:9000, pingSeq:1

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.192, size:9000, pingSeq:1

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.194, size:9000, pingSeq:1

and

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Run ping test for the hosts ['172.16.160.94', '172.16.160.191', '172.16.160.96', '172.16.160.193', '172.16.160.91', '172.16.160.196', '172.16.160.95', '172.16.160.192', '172.16.160.194', '172.16.160.195', '172.16.160.92', '172.16.160.93'] from local 172.16.252.100

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::PingTest] Pinger: select time out after waiting for 0.417736

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.191, size:9000, pingSeq:2

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.193, size:9000, pingSeq:2

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.91, size:9000, pingSeq:2

2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] [VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.196, size:9000, pingSeq:2

So please help us resolve these errors and warnings. I would very much appreciate it.

Thanks.
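
Since the number of failing hosts changes between runs, it can help to tally the timeout lines per target rather than read them by eye. The following is a minimal sketch, assuming the log lines keep exactly the "ping timeout: target:..., size:..., pingSeq:..." shape shown in the fragments above; the function name and the sample list are illustrative only and not part of any VMware tooling.

```python
import re
from collections import Counter

# Pattern for the timeout lines as they appear in the fragments above.
# Assumption: the real log keeps this exact field order and spelling.
TIMEOUT_RE = re.compile(
    r"ping timeout: target:(?P<target>\d+\.\d+\.\d+\.\d+), "
    r"size:(?P<size>\d+), pingSeq:(?P<seq>\d+)"
)


def tally_timeouts(lines):
    """Count ping timeouts per target IP across all ping sequences."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("target")] += 1
    return counts


# A few lines copied from the customer log fragment above.
sample = [
    "2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] "
    "[VsanHealthPing::PingTest] Pinger: select time out after waiting for 0.416111",
    "2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] "
    "[VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.191, size:9000, pingSeq:1",
    "2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] "
    "[VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.193, size:9000, pingSeq:1",
    "2018-08-29T10:35:22Z VSANMGMTSVC: INFO vsanperfsvc[3aa4dc88-ab77-11e8] "
    "[VsanHealthPing::Ping] Pinger: ping timeout: target:172.16.160.191, size:9000, pingSeq:2",
]

for target, count in sorted(tally_timeouts(sample).items()):
    print(target, count)
```

Running the same tally over several health-check runs would show whether the same hosts fail every time or whether the failing set moves around, which is a useful hint when deciding between a configuration problem and intermittent packet loss.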

Alex88 · 09-02-2018

Can anyone help? Any thoughts?

GreatWhiteTec · 09-04-2018

Hi Alex88,

Before going any further, I would suggest fixing the version mismatch between vCenter and ESXi. This vCenter (6.5 U1) was released on 2017-07-27, and ESXi build 7388607 was released on 2017-12-19, so vCenter is a couple of builds behind. It is always recommended that vCenter be at the same or a higher build version in order to prevent issues similar to this. I'm not saying this is 100% the problem, but I have experienced "cosmetic" alerts/issues due to version mismatch in the past.

I'd recommend doing a quick vCenter upgrade to 6.5 U1d (or higher).

Alex88 · 09-05-2018

Hello. Thank you for your answer.

We redeployed the witness appliance at the local site and the problem disappeared completely. The vSAN health check now shows all items green.

Then we redeployed the witness in the cloud, which is hosted in a datacenter in France (we are located in Georgia), and the problem came back.

So it's not a vCenter issue. It seems that something is being blocked between the witness and the data nodes. :(

GreatWhiteTec · 09-08-2018

Hi Alex88,

I don't know how your network is configured, but it sounds like you may have a routing issue here. The interfaces you selected for witness traffic cannot reach your witness even when the static routes are in place. Although the vSphere side may be configured correctly, your network does not know how to get there.

In this case you could use WTS (Witness Traffic Separation) and send your witness traffic through a different interface, other than the vSAN interface(s), that has L3 capabilities.

Witness Traffic Separation (WTS)

Setup Step 5: Validate Networking
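
One quick sanity check on the routing side is to confirm that every data-node vSAN address is covered by a static route on the witness. The sketch below is only an illustration: the target addresses come from the log fragment earlier in the thread, but the /25 prefix is invented purely to show what partial coverage looks like, since the thread does not state the real route prefixes. The function name is hypothetical as well.

```python
import ipaddress


def uncovered_targets(targets, routed_networks):
    """Return the target IPs that fall outside every routed network."""
    networks = [ipaddress.ip_network(net) for net in routed_networks]
    return [
        target
        for target in targets
        if not any(ipaddress.ip_address(target) in net for net in networks)
    ]


# Addresses taken from the customer log fragment in this thread.
data_nodes = ["172.16.160.94", "172.16.160.191", "172.16.160.96", "172.16.160.193"]

# Hypothetical route on the witness; the real prefixes are not given in the thread.
witness_routes = ["172.16.160.0/25"]

print(uncovered_targets(data_nodes, witness_routes))
# -> ['172.16.160.191', '172.16.160.193']
```

If the list comes back empty for the real routes on both the witness and the data nodes, the static routes themselves are probably not the gap, and the path between the sites is the next thing to look at.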

Sreejesh_D · 09-10-2018

Hi Alex,

It may be an issue caused by an MTU mismatch. Please have a look at the following KB.

VMware Knowledge Base
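
To make the MTU idea concrete: a ping payload only passes unfragmented if payload plus the ICMP header (8 bytes) and the IPv4 header (20 bytes, no options) fits within the smallest MTU on the path. The sketch below works through that arithmetic. The 60-byte tunnel overhead is an assumed placeholder, since the thread does not say what kind of VPN is in use, and the thread also does not state whether the size:9000 in the log is the payload or the whole packet.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_icmp_payload(path_mtu, tunnel_overhead=0):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return path_mtu - tunnel_overhead - IPV4_HEADER - ICMP_HEADER


def fits_without_fragmentation(payload, path_mtu, tunnel_overhead=0):
    """True if a ping with this payload size needs no fragmentation."""
    return payload <= max_icmp_payload(path_mtu, tunnel_overhead)


print(max_icmp_payload(1500))                      # 1472
print(max_icmp_payload(9000))                      # 8972
print(max_icmp_payload(1500, tunnel_overhead=60))  # 1412 (overhead value is assumed)
print(fits_without_fragmentation(8972, 9000))      # True
print(fits_without_fragmentation(8972, 1500))      # False
```

So even when every vmknic is set to the same MTU, a single hop in the middle with a smaller effective MTU (a VPN tunnel, for example) is enough for large pings to fail while small ones succeed.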

Alex88 · 09-11-2018

Hello. Thank you for your answer.

I have tested routing connectivity from esxcli, and it's OK.

Every interface on the data nodes can ping every interface on the witness and vice versa, BUT the health check shows that they are not reachable.

If this is a routing issue, why are there sometimes fewer than 12 warnings, sometimes 6 or 5? Sometimes the warnings disappear completely.

MTU is consistent across all the nodes.

I think that, because of some connectivity issues in the cloud network (which resides in France and is reached via VPN), there are some limitations and some pings may get lost sometimes? I don't know.

We have captured packets with tcpdump and I am waiting for my network guys to analyse them.

Thanks.
