w3nd377
Contributor

MSCS on ESX4.0 - cluster build times out

I am attempting to build a Microsoft cluster of two VMs running Windows 2008 R2 64-bit. The non-shared drives are on a Dell MD3000i iSCSI SAN and the shared drives (10 total) reside on a Dell AX4-5 FC SAN. System1 has the mappings for the RDMs, which are used by System2. There are four physical nodes in the ESX cluster; each node has 2 quad-core CPUs, 48 GB RAM, 12 NIC ports (4 for the production network (2 active/2 standby), 2 for FT/HA, 2 for iSCSI, 2 for a private isolated network, and 2 unused), and a single HBA card with dual connections (port0 connects to Controller A of the FC SAN, port1 connects to Controller B of the FC SAN). There is no FC switch between the cluster nodes and the SAN. DRS is enabled by default but disabled for System1 and System2 so that I can put them both on a single node if necessary. System1 and System2 presently reside on Node#1 in an attempt to reduce any lag. Both systems pass the validation test for MSCS. When I attempt to build the cluster, the process stalls at "building the cluster" (the last step in the wizard) and then ultimately times out, leading to failure. I have run the command "cluster log /g /level:5", which returns error 1753, but the resulting log file does not show much that I can decipher or understand as a reason for the failure.

Is there some kind of lag that others have experienced that prevents MSCS on Win2K8 R2 64bit from working on ESX 4.0 or is there some kind of build issue? Do I need to dump the cluster log file on here and let someone review it? Does the ESX node have to be bumped to 4.1 or 4.2 for MSCS to work? Is Windows 2008 R2 64bit still experimental and experiencing issues with MSCS and other "features" within Microsoft? Is this a Microsoft issue and ESX is working like it should?

Any assistance would be greatly appreciated.

w3nd377
Contributor

After a call with Microsoft Support, it turned out that a Microsoft "feature" was the issue. On Windows 2008, a checksum is performed on all traffic on each interface. To resolve the issue I was having, I had to disable RSS, edit the registry, disable checksum offloading on all interfaces, and temporarily uninstall Symantec Anti-virus. Below are the four things that were done to resolve the issue. Both sides of the cluster are now functioning and in the cluster.

http://support.microsoft.com/default.aspx?scid=kb;EN-US;951037

1. Disable TCP RSS feature:

netsh int tcp set global rss=disabled

netsh int tcp set global chimney=disabled

2. Disable NetDMA feature:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Value Name: EnableTCPA

Value: 0

3. Disable adapter level checksum offloading in NIC properties.

  • Go to Network and Sharing Center => Change adapter settings => select an adapter and right-click for Properties => click Configure for the adapter => this shows the properties of the interface; go to Advanced => in the properties list, disable flow control and anything listed as doing checksum (there were four in total for me).

4. Uninstall Symantec Anti-virus
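
For anyone who would rather script steps 1 and 2 than click through them, here is a rough sketch to run from an elevated command prompt on each cluster VM. The "reg add" and "reg query" lines are an assumed command-line equivalent of the manual registry edit in step 2 (they assume EnableTCPA is a DWORD value) and were not part of the steps given above; a reboot is probably needed before the registry change takes effect. Steps 3 and 4 are still done by hand.

```bat
rem Step 1: disable RSS and TCP Chimney offload, then display the global TCP settings to confirm
netsh int tcp set global rss=disabled
netsh int tcp set global chimney=disabled
netsh int tcp show global

rem Step 2 (assumed equivalent of the manual registry edit): set EnableTCPA to 0 to disable NetDMA
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v EnableTCPA /t REG_DWORD /d 0 /f

rem Read the value back to confirm it was written
reg query "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v EnableTCPA
```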
