VMware Cloud Community
wishihad
Contributor

VSAN - intermittent network issues

We have a newly built cluster of Dell 730s and 740s, 14 hosts total. Three distributed switches are configured: data for VMs, VSAN, and vMotion. Each has a primary 10G and a secondary 10G link, with jumbo MTU on all but the data switch. So my issue is are hosts are continually throwing alarms for ping health and big MTU, which then resolve and then come back on another host or on multiple hosts, then go away. Also seeing portion errors every so often. Typically it seems stable when nothing is happening and no users are on the VMs. We are also having failures when publishing: it seems to get to the last step and then fails, with logs showing something about an error cloning.

So a couple of questions: obviously something’s up with my network, but what? We’ve verified MTU size and can ping from each host to the next with -s 9000 without issue. Here’s the rest of the environment: we don’t have EVC on, and can’t enable it without jumping through some hoops, as VC is on our hosts. So I was wondering if different hardware sets could cause network and publishing issues? Our hosts have different Intel chips and greater storage. When we tried to vMotion a running VM from a 730 to a 740 it failed, so I know we need to get EVC turned on, but does anyone think that will cure our other issues?

4 Replies
wishihad
Contributor

Sorry, auto correct changed a few things:

is are = is our

Portion = partition

TolgaAsik
Enthusiast

Hello,

Did you check the physical switch ports that your hosts are connected to for discarded packets?

Also, MTU 9000 must be set on both the VMware side and the physical switch side. Did you verify it?
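
One way to make that check stricter: a 9000-byte payload sent without the don't-fragment flag can be fragmented and still get a reply, so it does not by itself prove the path carries jumbo frames. A minimal sketch of the arithmetic, assuming IPv4 with no header options; the `vmk1` interface name and the target address are placeholders, not values from this thread:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return payload


def vmkping_command(vmk: str, target_ip: str, mtu: int) -> str:
    """Build a vmkping command with don't-fragment (-d) and payload size (-s)."""
    return f"vmkping -I {vmk} -d -s {max_icmp_payload(mtu)} {target_ip}"


if __name__ == "__main__":
    # vmk1 and 192.168.10.12 are hypothetical placeholders.
    print(vmkping_command("vmk1", "192.168.10.12", 9000))
```

For an MTU of 9000 this gives a payload of 8972 bytes. If that ping fails with -d set while a smaller one succeeds, some hop in the path is not passing jumbo frames.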

Last question: are you sure the NIC firmware and drivers are up to date?

In my last case, ESXi hosts were partitioned from the network (host isolation) and then came back automatically after a few minutes. The issue was traced to the network switches, because a lot of packet discards were observed there.

wishihad
Contributor

Thanks for the response!

Yes, we verified there are no dropped packets, and the MTU on the switch is set to 9216 to account for overhead.

But we did try something, and it seems to have worked. We removed all the 740s from the cluster, leaving 9 identical hosts, and everything stabilized and we published without error. However, HA is not on, so hopefully turning that back on doesn’t break anything. Our current plan is to leave the 730s in one cluster and put the 740s in a new cluster with EVC turned on for that cluster. The 730s are EOL, so we don’t expect to have to add any more.

Thoughts?

TolgaAsik
Enthusiast

Good plan.

I am always wary of enabling EVC on a VSAN-enabled cluster. But in practice you are right: if there is no budget, you should do it.

I think you should not have any problems. EVC is a vSphere cluster feature, not a VSAN feature. The important thing is that the correct EVC mode is defined when configuring the cluster.
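
If hosts of different CPU generations ever share one EVC cluster, the mode has to be one that every host supports, which means the oldest generation present. A rough sketch of that selection follows. The host names and their CPU generations are assumptions for illustration (the thread does not state them), and the mode labels are shorthand, not vSphere API identifiers:

```python
from typing import Mapping

# Intel EVC baselines, oldest to newest (partial list, for illustration only).
INTEL_EVC_ORDER = [
    "merom", "penryn", "nehalem", "westmere", "sandybridge",
    "ivybridge", "haswell", "broadwell", "skylake",
]


def common_evc_baseline(host_max_modes: Mapping[str, str]) -> str:
    """Return the newest baseline that every host in the cluster supports."""
    if not host_max_modes:
        raise ValueError("no hosts given")
    # The oldest per-host maximum is the highest mode all hosts share.
    # list.index raises ValueError for a label not in INTEL_EVC_ORDER.
    return min(host_max_modes.values(), key=INTEL_EVC_ORDER.index)


if __name__ == "__main__":
    # Hypothetical inventory: host names and generations are assumed.
    hosts = {"r730-01": "broadwell", "r730-02": "haswell", "r740-01": "skylake"}
    print(common_evc_baseline(hosts))
```

With the hypothetical inventory above, the result is `haswell`, since that is the oldest generation in the set.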
