VMware Cloud Community
AdamW201110141
Contributor

Windows Cluster Fails on Tools Upgrade

Hey there,

Very odd situation. We needed to upgrade the tools and apply a security patch on one of our Windows 2008 Clusters. Both nodes are virtualized. We moved all the cluster resources to one node. On the inactive node we installed the security patch, rebooted. No issue. All services for the cluster remained available.

We then installed the updated VMware Tools on the same inactive node and rebooted. The Windows cluster resource went offline as the passive node rebooted. The event logs show loss of connectivity to the quorum but nothing else. Once the inactive node came back online, the cluster was fine.

Moved everything from the active node to the inactive node and repeated the process. Same result.

I'm kind of at a loss. How can a Tools upgrade on a passive node cause the entire Windows cluster to fail?

Any ideas would be most helpful.

Thanks,

-Adam

7 Replies
AndreTheGiant
Immortal

Was the VMware Tools upgrade from a previous release (for example, 4.0 to 4.1)?

Are you sure the problem was with the quorum, or maybe with the heartbeat? (Do you use two different networks for the heartbeat?)
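For reference, one way to check this is to list the cluster networks and their roles from PowerShell. This is only a rough sketch, assuming the FailoverClusters module is available (Windows 2008 R2 and later; on 2008 RTM, cluster.exe gives similar information):

Import-Module FailoverClusters

# Role 1 = cluster (heartbeat) traffic only, 3 = cluster and client, 0 = none
Get-ClusterNetwork | Format-Table Name, Role, Address, State -AutoSize

# Which NIC on each node sits on which cluster network
Get-ClusterNetworkInterface | Format-Table Node, Network, Name, State -AutoSize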

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
AdamW201110141
Contributor

The tools were pretty old, from the 3.5 installs we had here. The upgrade was from that to 4.1.

From the logs, it appears that the cluster resources never fully moved from node 01 to node 02, or at least node 01 thought it was still the owner when it came back up. The last event log entry shows a duplicate IP on the network. I know there are quite a few, but I've posted the logs below in sequence if you'd like to see them.

We are using a heartbeat network.
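For anyone hitting this later, a quick way to confirm that ownership really moved before the reboot is to dump the group and resource owners. This is just a sketch, assuming the FailoverClusters PowerShell module; the names are whatever your cluster uses:

Import-Module FailoverClusters

# Which node owns each clustered group (including the core 'Cluster Group')
Get-ClusterGroup | Format-Table Name, OwnerNode, State -AutoSize

# Per-resource owners and states, e.g. the file share witness and IP resources
Get-ClusterResource | Format-Table Name, OwnerGroup, OwnerNode, State -AutoSize

# Current quorum configuration (disk witness vs. file share witness)
Get-ClusterQuorum | Format-List *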

-----------------------------------------------------------------------

Event ID: 1129        Source:  FailoverClustering    Node: ClusterNode02
Cluster network 'Cluster Network 1' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-----------------------------------------------------------------------


Event ID: 1126        Source:  FailoverClustering    Node: ClusterNode02
Cluster network interface 'ClusterNode 02 - Local Area Connection' for cluster node 'ClusterNode 02' on network 'Cluster Network 1' is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-----------------------------------------------------------------------

Event ID: 1135        Source:  FailoverClustering    Node: ClusterNode02
Cluster node 'ClusterNode 01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-----------------------------------------------------------------------

Event ID: 1177        Source:  FailoverClustering    Node: ClusterNode02
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-----------------------------------------------------------------------

Event ID: 1564        Source:  FailoverClustering    Node: ClusterNode02
File share witness resource 'File Share Witness (\\shareSvr\share$)' failed to arbitrate for the file share '\\shareSvr\share$'. Please ensure that file share '\\shareSvr\share$' exists and is accessible by the cluster.

-----------------------------------------------------------------------

Event ID: 1561        Source:  FailoverClustering    Node: ClusterNode02
The cluster service has determined that this node does not have the latest copy of cluster configuration data. Therefore, the cluster service has prevented itself from starting on this node.
Try starting the cluster service on all nodes in the cluster. If the cluster service can be started on other nodes with the latest copy of the cluster configuration data, this node will be able to subsequently join the started cluster successfully.

If there are no nodes available with the latest copy of the cluster configuration data, please consult the documentation for 'Force Cluster Start' in the failover cluster management snapin, or the 'forcequorum' startup option. Note that this action of forcing quorum should be considered a last resort, since some cluster configuration changes may well be lost.

-----------------------------------------------------------------------
Event ID: 1126        Source:  FailoverClustering    Node: ClusterNode01
Cluster network interface 'ClusterNode02 - Local Area Connection' for cluster node 'ClusterNode02' on network 'Cluster Network 1' is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-----------------------------------------------------------------------

Event ID: 1129        Source:  FailoverClustering    Node: ClusterNode01
Cluster network 'Cluster Network 1' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-----------------------------------------------------------------------

Event ID: 1135        Source:  FailoverClustering    Node: ClusterNode01
Cluster node 'ClusterNode02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

-----------------------------------------------------------------------

Event ID: 1069        Source:  FailoverClustering    Node: ClusterNode01
Cluster resource 'File Share Witness (\\shareSvr\share$)' in clustered service or application 'Cluster Group' failed.

-----------------------------------------------------------------------

Event ID: 1564        Source:  FailoverClustering    Node: ClusterNode01
File share witness resource 'File Share Witness (\\shareSvr\share$)' failed to arbitrate for the file share '\\shareSvr\share$'. Please ensure that file share '\\shareSvr\share$' exists and is accessible by the cluster.

-----------------------------------------------------------------------

Event ID: 1069        Source:  FailoverClustering    Node: ClusterNode01
Cluster resource 'Cluster IP Address' in clustered service or application 'Cluster Group' failed.

-----------------------------------------------------------------------

Event ID: 1205        Source:  FailoverClustering    Node: ClusterNode01
The Cluster service failed to bring clustered service or application 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

-----------------------------------------------------------------------

Event ID: 1069        Source:  FailoverClustering    Node: ClusterNode01
Cluster resource 'IPv4 Static Address 1 (ClusterResource01)' in clustered service or application 'ClusterResource01' failed.

-----------------------------------------------------------------------

Event ID: 1049        Source:  FailoverClustering    Node: ClusterNode01
Cluster IP address resource 'BAN FrontEnd' cannot be brought online because a duplicate IP address '192.168.8.27' was detected on the network.  Please ensure all IP addresses are unique.

-----------------------------------------------------------------------

Event ID: 1205        Source:  FailoverClustering    Node: ClusterNode01
The Cluster service failed to bring clustered service or application 'ClusterResource01' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

AndreTheGiant
Immortal

Very old VMware Tools need to be removed first... That is probably why you lose both the heartbeat and the storage connection.

IMHO, I suggest freeing the node of every resource and only then doing the upgrades.
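Something like the following is a rough sketch of that approach, assuming the FailoverClusters PowerShell module and made-up node names; on 2012 and later, Suspend-ClusterNode -Drain can do the moves for you:

Import-Module FailoverClusters

$nodeToPatch = 'ClusterNode01'   # hypothetical names, adjust for your cluster
$otherNode   = 'ClusterNode02'

# Move every clustered group (including the core 'Cluster Group') off the node
Get-ClusterGroup | Where-Object { $_.OwnerNode -eq $nodeToPatch } |
    Move-ClusterGroup -Node $otherNode

# Pause the node so nothing fails back to it in the middle of the upgrade
Suspend-ClusterNode -Name $nodeToPatch

# ... upgrade VMware Tools and reboot $nodeToPatch here ...

Resume-ClusterNode -Name $nodeToPatch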

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
AdamW201110141
Contributor

OK, just to clarify: are you saying that because the tools on the guests were old, upgrading the tools on the inactive node (01) caused the active node (02) to lose both the heartbeat and the storage connection?

We did move all the cluster services and resources from node 01 to node 02 before rebooting it, so it seems weird that it would have caused an issue. We also did the first reboot with no issue. Only when we upgraded the tools and rebooted did the cluster have a problem.

Is there a place to set the cluster to check multiple networks before failing over to another node in the event of a failure?
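From what I can tell, the closest thing is the cluster heartbeat tolerance settings: the cluster already heartbeats over every network whose role allows cluster traffic, and what you can tune is how tolerant it is before declaring a node dead. A sketch of checking them (the property names are standard cluster common properties; the example value is purely illustrative, not a recommendation):

Import-Module FailoverClusters

# Delay (ms) between heartbeats and number of missed heartbeats tolerated
# before a node is removed from membership
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold,
                          CrossSubnetDelay, CrossSubnetThreshold

# Example only: tolerate more missed heartbeats before dropping a node
# (Get-Cluster).SameSubnetThreshold = 10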

AndreTheGiant
Immortal

If you have a problem on the other node (not the one you are upgrading), then that is very strange...

I remember there was a problem where Windows 2008 disks could go offline during the VMware Tools upgrade (this happens only in the 3.5 -> 4.0 upgrade)... But it's strange that an offline disk would also take the disk offline on the other node.
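If anyone wants to check for that, the disk SAN policy is the usual suspect on 2008-era guests (Enterprise and Datacenter editions default it to "Offline Shared"). A rough sketch; diskpart's bare "san" command only displays the current policy and changes nothing:

# Write a one-line diskpart script and run it to show the current SAN policy
$script = Join-Path $env:TEMP 'san-policy.txt'
Set-Content -Path $script -Value 'san'
diskpart /s $script

# On 2012 and later you could also list disks that ended up offline with:
# Get-Disk | Where-Object IsOffline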

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
drheim
Enthusiast

Eight years later, I inherited an older 2008 R2 cluster and had exactly the same problem. Offline nodes can run updates, reboot, etc., but when updating VMware Tools the entire cluster goes down with the same errors you described. Hopefully it goes away when we upgrade.

srinivas0781
Contributor

We had the same issue as well; below is my scenario:

1. My servers are running Windows 2012 R2 Standard.

2. WSFC with a SQL AG cluster and no shared disks.

3. Node 1 is active and Node 2 is passive.

4. All my cluster core resources and the Cluster Group are on Node 1, which is active and has been running fine with no issues for the past several days.

5. Both my nodes are running outdated VMware Tools.

6. As Node 2 is passive, we updated the VMware Tools on Node 2.

7. After the VMware Tools update completed successfully on Node 2, the Cluster service on Node 1 went down, which resulted in failover of the resources to Node 2.

Please help me understand why updating VMware Tools on the passive node impacts the cluster service on the active node and causes the failover. The same never happens if we just reboot the passive node.
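For anyone staging this in the future, here is a rough sketch of checking the Tools status and pushing the update without an automatic reboot from the vSphere side, so the guest reboot can be done only after the node is drained. It uses VMware PowerCLI; the cmdlets are real, but the VM names are made up:

# Requires: Install-Module VMware.PowerCLI ; Connect-VIServer <your vCenter>

# Tools version/status on both cluster nodes
Get-VM 'SQLNODE1', 'SQLNODE2' |
    Select-Object Name,
                  @{N = 'ToolsVersion'; E = { $_.Guest.ToolsVersion }},
                  @{N = 'ToolsStatus';  E = { $_.ExtensionData.Guest.ToolsVersionStatus2 }}

# Update Tools on the passive node only, without letting the guest reboot itself
Get-VM 'SQLNODE2' | Update-Tools -NoReboot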

Any help is much appreciated.

Regards,

Srinivas
