5.5 U1a - NetPort: 1632 disabled port errors in vmkernel.log

cesprov · 06-24-2014

Ever since upgrading to 5.5 U1a (from 5.1 U2), we have been seeing a lot of issues, particularly with SQL clusters. The problem isn't specific to SQL clusters, but it is more visible there because the outages cause SQL nodes to fail when they get isolated and lose quorum. In the WSFC cluster log, we see it humming along until, out of nowhere, the following errors appear:

<snip>

[NETFTAPI] Signaled NetftRemoteUnreachable event, local address 192.168.0.10:003853 remote address 192.168.0.11:003853

[NETFTAPI] Signaled NetftRemoteUnreachable event, local address 192.168.0.10:003853 remote address 10.0.0.10:003853

[IM] got event: Remote endpoint 192.168.0.11:~3343~ unreachable from 192.168.0.10:~3343~

<snip>

which indicate that the node can no longer reach any of the other nodes in the cluster. At exactly the same time as this event in the WSFC cluster log, we see the following events in vmkernel.log:

NetPort: 1632: disabled port 0x3000081

NetPort: 1426: enabled port 0x3000081 with mac 00:50:56:88:2e:ba

where the MAC corresponds to the MAC address of the node that dropped from the cluster. At this point I am not sure whether the errors in vmkernel.log are the problem or just a symptom of it. We are getting a lot of these NetPort 1632 errors across 8 hosts, and not just for Windows VMs but for Red Hat VMs as well. The only similar issue I can find is KB2055853, but that says it was fixed in 5.5 U1. I did confirm that the 5.5 U1a VMware Tools are installed and that the VMXNET3 driver in use is from the 5.5 U1a Tools package. That may not be relevant, however, as the Red Hat VMs are using E1000 drivers and I still see the NetPort 1632 errors for them. I was told by VMware support that the "1632" in the error is a process ID or something, which is why it differs from the 1424 in the article, but I can't believe that: all 8 hosts generate the same 1632 "disabled" message followed by the 1426 "enabled" message, so it can't be the same process ID across all hosts. This is not occurring for all VMs at once, so I don't think it's an underlying issue with the host.
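For anyone who wants to tally these events per VM across hosts, here is a rough Python sketch. It assumes each vmkernel.log line contains the message text in the form quoted above; any timestamp or CPU prefix in front of it is ignored. The function name `count_port_flaps` and the sample list are made up for illustration and are not part of any VMware tool.

```python
import re
from collections import Counter

# Patterns based only on the two message shapes quoted in this post.
# Anything before "NetPort:" on the line is ignored.
DISABLED = re.compile(r"NetPort: (\d+): disabled port (0x[0-9a-fA-F]+)")
ENABLED = re.compile(
    r"NetPort: (\d+): enabled port (0x[0-9a-fA-F]+) with mac ([0-9a-fA-F:]{17})"
)


def count_port_flaps(lines):
    """Count disable -> enable cycles per MAC address (hypothetical helper)."""
    pending = set()    # port IDs that were disabled and not yet re-enabled
    flaps = Counter()  # MAC address -> number of completed cycles
    for line in lines:
        match = DISABLED.search(line)
        if match:
            pending.add(match.group(2))
            continue
        match = ENABLED.search(line)
        if match and match.group(2) in pending:
            pending.discard(match.group(2))
            flaps[match.group(3).lower()] += 1
    return flaps


# Sample input: the two lines from this post.
sample = [
    "NetPort: 1632: disabled port 0x3000081",
    "NetPort: 1426: enabled port 0x3000081 with mac 00:50:56:88:2e:ba",
]
flaps = count_port_flaps(sample)
for mac, count in flaps.items():
    print(mac, count)
```

To run it against a real log, pass an open file object instead of `sample`. An "enabled" line with no preceding "disabled" line for the same port is deliberately not counted, since a normal VM power-on would presumably also log one.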

Any ideas? Any chance anyone knows what the "1632" error code means?
