Vel_VMware
Enthusiast

vSAN health alarm 'Basic (unicast) connectivity check' and 'MTU check (ping with large packet size)'

Hi,

Can anyone help me with this?

We keep receiving alerts like vSAN health alarm 'Basic (unicast) connectivity check' and vSAN health alarm 'MTU check (ping with large packet size)'. VMware says these alerts can be ignored and that they are just for network testing purposes, but can someone please help me fix this permanently?

Thanks in advance. 

5 Replies
mprazeres183
Enthusiast

Hi Vel_VMware

I find it a bit funny that they told you that you can ignore it...

We had the same issue and we didn't ignore it. In our case it was caused by a VMkernel port that wasn't working correctly. I recreated the VMkernel port, activated the vSAN service on it, and the issue disappeared.

It can also be that your MTU setting is below 1500, which will trigger these issues (check it on the networking side); if the MTU is under 1500, correct that and the error will disappear as well.
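To see quickly whether the MTU path between two hosts is really clean, you can run the same kind of large-packet ping the health check uses yourself. Below is a minimal sketch, assuming it runs in the ESXi shell (where Python and vmkping are available); the interface name vmk1 and the peer IP are placeholders you need to replace with your own vSAN vmkernel port and another node's vSAN IP:

import subprocess

# Placeholders - replace with your vSAN vmkernel interface and a peer node's vSAN IP.
VSAN_VMK = "vmk1"
PEER_IP = "192.168.10.12"

# 1472 bytes of ICMP payload + 28 bytes of headers = 1500, the standard MTU.
# Use 8972 instead if your vSAN network is configured for jumbo frames (MTU 9000).
PAYLOAD_SIZE = "1472"

# -d sets "don't fragment", so the ping fails if any hop in the path has a smaller MTU,
# which is essentially what the "MTU check (ping with large packet size)" health check does.
result = subprocess.run(
    ["vmkping", "-I", VSAN_VMK, "-d", "-s", PAYLOAD_SIZE, "-c", "3", PEER_IP],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
print(result.stdout)
print("MTU path looks fine" if result.returncode == 0
      else "Large-frame ping failed - check the vmk and switch MTU settings")

If the large-packet ping fails while a normal vmkping to the same address succeeds, something along the path (vmk port, vSwitch or physical switch) is configured with a smaller MTU.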


However, if this is not critical for you, you can disable the alarm; just follow the steps here:

https://www.ivobeerens.nl/2016/09/07/disable-virtual-san-health-check-alarms/
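If you would rather script it than click through the alarm definitions, a pyVmomi sketch along these lines should also work. This is only a sketch under a few assumptions: pyVmomi is installed, you accept the unverified-certificate workaround, and the alarm names below (taken from the alerts in this thread) match exactly what your vCenter shows under Alarm Definitions. The vCenter address and credentials are placeholders:

import ssl
from pyVim.connect import SmartConnect, Disconnect

# Placeholders - replace with your vCenter and credentials.
context = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=context)
content = si.RetrieveContent()

# Alarm names as they appear in this thread - verify them against your Alarm Definitions list.
TARGET_ALARMS = [
    "vSAN health alarm 'Basic (unicast) connectivity check'",
    "vSAN health alarm 'MTU check (ping with large packet size)'",
]

for alarm in content.alarmManager.GetAlarm(entity=content.rootFolder):
    if alarm.info.name in TARGET_ALARMS:
        spec = alarm.info          # AlarmInfo extends AlarmSpec, so it can be reused as the spec
        spec.enabled = False       # disable (not delete), so it can be switched back on later
        alarm.ReconfigureAlarm(spec)
        print("Disabled alarm:", alarm.info.name)

Disconnect(si)

Keep in mind this only silences the vCenter alarm; the underlying health check still runs and will still show up as yellow in the vSAN health view.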

Best regards,
Marco Frias

Check my blog, and if my answer resolved the issue, please provide feedback. Marco Frias - VMware is my World www.vmtn.blog
jameseydoyle
VMware Employee

Are your vCenter and hosts all upgraded to vSphere 6.5 U1? When you look at the vSAN cluster settings, does it state that the cluster is operating in Unicast or Multicast mode?
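For what it's worth, a quick way to check from one of the hosts whether the cluster is already communicating over unicast (vSAN 6.6 and later) is to list the unicast agents. A minimal sketch, assuming it runs in the ESXi shell:

import subprocess

# In unicast mode each host keeps a list of the other nodes' vSAN IPs as "unicast agents".
result = subprocess.run(["esxcli", "vsan", "cluster", "unicastagent", "list"],
                        stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout)

If the list shows entries for the other nodes, the cluster is talking unicast; an empty list on a formed cluster should mean it is still using multicast (pre-6.6 behaviour).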

srodenburg
Expert

You cannot fix this permanently.

Ignore it.

You can re-create vmkernel ports until you see green; it WILL come back. It happens completely at random. It can stay away for months (with you declaring victory because it seems solved), only to pop up again. Or never. Or anything in between.

I have built and manage many vSAN sites and it happens at many of them. And some never ever see this error. Oh, and before anyone starts: the network, hosts and vSAN are just fine.

Just disable the alarm.

In fact, you can disable several vSAN alarms, as VMware has not shown any interest in solving these glitches... The same goes for those crappy proactive tests. They often don't work either, have never worked reliably in all these years, and VMware just does not care.

TheBobkin
Champion

Hello srodenburg,

"You cannot fix this permanently." "And some never ever see this error. "

Bit contradictory there - maybe you should compare their set-ups.

"the network, hosts and vSAN are just fine."

For some reason everyone seems to get VERY defensive when their network configuration is even hinted at as being a potential cause (of likely network-related issues...).

It is also apparent that most networks are in reality not configured like the X > Y > Z that is stated (and/or perceived/documented); usually, when you get granular, it is more like:

X (band-aid here so F works) > (additional 'harmless' config so A functions) Y (something minor that *shouldn't* cause issues) > (we'll just set this like this for now and will surely document this and remember later) Z

As you have access to multiple sites that have/haven't hit this alert constantly, I would imagine you have built up extensive knowledge of what each set-up has in common at a granular network configuration level; it might be more helpful to share any such findings and info rather than saying 'ignore it - bugs'.

"In fact, you can disable several VSAN alarms as VMware has not show any interest in solving these glitches..."

Far from a fact, and a terribly dangerous blanket recommendation to give without context/elaboration (let's just turn off these Disk + Data alerts, they are so irritatingly red!). A ridiculously large amount of VMware engineering time has gone into progressively extending and improving the scope of the health checks, so I am really not sure where you get this notion.

If you can provide some other examples of health checks you consider broken, this might be more helpful.

"The same with those crappy pro-active tests. They often don't work either"

Which tests are you referring to?

If you are referring to failures in the 'VM creation test', then that very likely indicates an unhealthy or misconfigured cluster - not an issue with the test itself; the same goes for the performance stress test.

I am only aware of one 'bug' regarding the multicast proactive test failing in 6.1 + early 6.2 and this was fixed within a reasonable time-frame.

Bob

srodenburg
Expert

Hello Bob,

I respect you taking the time to write an elaborate reply. I've been using vSAN for a long time and a lot of frustration has built up. It has gotten a lot better since 6.6, I must admit.

"As you have access to multiple sites that have/haven't hit this alert constantly I would imagine you have built up an extensive amount of knowledge with regard to what each set-up has in common at a granular network configuration level"

As we believe in standardisation, our network setups are the same: either Cisco UCS based (where vSAN traffic stays within the FI and does not go north to the core) or, if not, dedicated vSAN switches. Customers who "rolled their own network" are not taken into consideration here, as you never know what "lurks around on those nets".

So going back to standardised networks: these errors pop up randomly and we never see issues with vSAN in general on those sites. I was never able to find a pattern for why some customers see them sometimes and others don't. I also never see them in back-to-back ROBO setups, so it must have something to do with switches. When such an error is triggered and you dive into the web GUI, you see that node 2 could not talk to node 7, and node 5 not to node 3, and what not. Then press "Retest" and poof, all errors are gone. Everything talks to everything quite happily.

"In fact, you can disable several VSAN alarms as VMware has not show any interest in solving these glitches..."

I say that because it has been happening for years. For years. And so many installations are affected (and so many are not). And it's still happening. It gives the impression that these things do not have priority.

"Which tests are you referring to?"

The VM creation test. A perfectly running cluster, no issues, and it still fails on a regular basis. Again, for some it works, for others it doesn't. I never could find a pattern.

The multicast test has been a dog too. Some customers are on older 6.x releases (for various reasons; VxRail only recently became available with 6.5), and even if I tell them to ignore a failed multicast test (I tell them not to even run it), they will still open a ticket saying "we have a problem with multicast, the test does not complete without errors", and then I explain again that the test is broken. Then they want me to open an SR to get the test fixed, which I know will not happen. Try explaining that... Such a drag.

"If you can provide some other examples of health checks you consider broken"

- MTU check / ping with large packet size (has always been broken; MTUs don't suddenly change, and after a manual retest all is suddenly good again)

- Disk balance (clusters with a lot going on are often imbalanced to some degree, so raising the warning percentage considerably would help)

- Site latency health (never figured out why this alert trips sometimes; it's quite rare, but it does happen even though latency is fine and the cluster hums along nicely)

And not broken, but simply annoying:

- Customer Experience Improvement Program (CEIP)

- Build Recommendation Engine

These five checks we disable by default, as they are just a pain in the butt.

And in general, the checks concerning controllers and their firmware. We have customers using controllers that were certified for 5.5.x and, since 6.x, suddenly are not any more. Take the LSI 9207-8i, for example. Since 6.x it's off the HCL, so vCenter complains all the time. But its identical twin, the HP H220, is still on the HCL. It's the same friggin card. There are Dell and Fujitsu cards that are OEM clones of this card, and they are on the HCL too.

I had a discussion about this with Duncan a while back. The people involved, mostly HW vendors like LSI, simply don't bother testing existing validated hardware against newer vSAN versions. Do you honestly think customers will either "not upgrade" or "rip out and replace all their cards in all nodes"? Hell no. And the cards work fine. If an H220 works fine, a 9207-8i does too. And customers give me heat over it.

And the worst thing: it will happen again with the next major release of vSAN. Some hardware will not be on the HCL any more, or will take a while to appear on the HCL (with a complaining vCenter in the meantime), and existing customers are out of luck...

vSAN major versions appear(ed) faster than hardware lifecycles. Hardware is supposed to last 3 to 5 years at most customers, so they will run into this issue sooner or later. That is why we turn the HW compatibility alerts off and tell customers that "that yellow triangle" is OK.
