Network connectivity lost & Network uplink redunda...

msiem · ‎06-24-2020

Good morning to all of you,

We're finishing the configuration of our new production cluster based on VMware 6.7, but we're running into a rather strange issue with alarms and their reset rules. We don't get email notifications when the 'Network connectivity lost' alarm clears and turns Green after we bring an ESXi host back online. The same applies when we shut down one port in a particular port-channel group (so that network uplink redundancy is lost). We do get email notifications telling us that network connectivity or uplink redundancy is lost, but our reset rules don't seem to work.

1. We checked whether any emails were being blocked on the mail server, but according to the people responsible for that service, that's not the case.

2. I found this article and thought maybe there's an issue with the notification service in general.

3. Alarm rules work just fine - we get email notifications (as written above).

4. We didn't change anything in the 'Network connectivity lost' reset rules except enabling the SNMP traps and 'Send email notifications' options and setting the proper email address.

When we go to a particular host and look at the Events we have the following set of events:

So there's no email notification after 'Network uplink redundancy lost' changes from Red to Green.

We're wondering if this might have to do with the notification service itself.

[UPDATE, July 7th]

Today I carried out another test, and something really weird happened.

1. I shut down one port belonging to a port-channel to trigger the 'redundancy lost' alarm, and I got the notification - great!

2. I restored the previous state of that switch port so that redundancy was recovered. The alarm cleared in vCenter, but there was no email notification that the status had changed from Red to Green.

3. I decided to display hostd.log in 'live' mode to see what was (and is) happening.

4. At that point both vmnic0 and vmnic4 were online (port-channel was fully operational).

5. Then I noticed in hostd.log that ESXi was alternately generating the following events: esx.problem.net.redundancy.lost and esx.clear.net.redundancy.restored. It looked like an infinite loop: those log entries were generated over and over again until I restarted the management agents. Along the way we got two additional email notifications saying that redundancy was lost, even though the port-channel was already up and running.
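The alternating pattern from step 5 can also be checked offline against a saved copy of hostd.log instead of watching it live. Below is a minimal sketch, assuming the two event IDs appear verbatim in the log lines and that the lines are in chronological order. The function names `extract_events` and `count_flaps` and the sample lines are hypothetical, and the real hostd.log line layout may differ.

```python
import re
from typing import Iterable, List

# Event IDs quoted in the post; everything else here is illustrative.
LOST = "esx.problem.net.redundancy.lost"
RESTORED = "esx.clear.net.redundancy.restored"
EVENT_RE = re.compile(re.escape(LOST) + "|" + re.escape(RESTORED))


def extract_events(lines: Iterable[str]) -> List[str]:
    """Return the redundancy event IDs found in the lines, in order."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events.append(match.group(0))
    return events


def count_flaps(events: List[str]) -> int:
    """Count lost -> restored transitions (one per completed flap cycle)."""
    flaps = 0
    for previous, current in zip(events, events[1:]):
        if previous == LOST and current == RESTORED:
            flaps += 1
    return flaps


if __name__ == "__main__":
    # Hypothetical sample lines; the real hostd.log format may differ.
    sample = [
        "2020-07-07T08:00:01Z info hostd Event 101 : esx.problem.net.redundancy.lost",
        "2020-07-07T08:00:05Z info hostd Event 102 : esx.clear.net.redundancy.restored",
        "2020-07-07T08:00:09Z info hostd Event 103 : esx.problem.net.redundancy.lost",
        "2020-07-07T08:00:13Z info hostd Event 104 : esx.clear.net.redundancy.restored",
    ]
    events = extract_events(sample)
    print(len(events), "redundancy events,", count_flaps(events), "lost/restored cycles")
```

To run it against a real log, pass an open file handle to `extract_events` in place of the sample list. A high cycle count while both uplinks are physically up would match the looping behaviour described above and could be attached to a support request alongside the raw log.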

I'm pretty sure it should not work like that.

[UPDATE, July 9th]

Today I ran another test against one of our ESXi hosts, and when I shut down one port in the port-channel, not a single alarm was sent by email - no 'lost' event, no 'restored' event (only Dell-related events appeared in vCenter: link down / link up). The other ESXi host behaved better in this respect, but there was still no email message after redundancy was restored. I've already collected some log files and made some screenshots, and I think this post is clear enough that it doesn't require further explanation. I think the Support team should look into this.

Message was edited by: Marcin Siemion

The VMware cluster environment comprises ESXi 6.7.0 build 16075168 hosts, vCenter 6.7.0 build 16046713, and vSphere Client version 6.7.0.44000.

daphnissov · ‎07-07-2020

I'd open an SR, submit your logs and this post, and ask GSS to look into it for you.

------------------
How to Ask for Help on Tech Forums
https://neonmirrors.net

Jialm · ‎01-19-2021

Hello, did you get the solution?

Network connectivity lost & Network uplink redundancy lost & Reset Rules