Intermittent vSAN issue

TryllZ · ‎03-12-2019

Hi,

I have a lab deployment and am facing an issue with the first vSAN (4 nodes) creation (it works some times, and other times it doesn't). I'm trying to troubleshoot what is going on when it doesn't work.

Once I have added the nodes to the cluster, I add to each node a VMKernel Adapter and attach the VMKernel adapter to new switch to which I attach a physical port, then I add vSAN and Management in VMKernel settings, then I attach the cache and capacity disks and create a vSAN. I can only see 100GB of SSD in total when I have 4 x 100GB SSDs.

The issue is that sometimes this works (not sure if this is due to the sequence or something else) and sometimes it doesn't. Of late I have tried it 5 times, of which it worked 2 times and failed the other 3. This is an All-Flash vSAN.

I have noted that the vSAN Health check reports failed pings on all nodes, and I'm not sure why.

Any hints as to what I might be doing wrong?

Thank You

TryllZ · ‎03-12-2019

So I did a fresh setup with iSCSI disk.

Now I can vmkping on all interfaces, but the disk still shows only 100GB instead of 400GB in vSAN.

I received an HCL error, so I updated the HCL list using the JSON file from VMware, and the error is gone.

I created a new VMKernel adapter, added it to a new switch, added a new physical interface to it, and enabled the vSAN option on it. However, I am receiving "Host cannot communicate with one or more nodes in the vSAN cluster" for all the nodes.

Is there anything else I should have done but didn't?

Thanks all.

TheBobkin · ‎03-12-2019

Hello TryllZ,

"it works some times, and other times it doesn't"

Do you mean sometimes configuring the cluster will work (and if so, does it remain stable) or that sometimes it will not form cluster at all?

"I add to each node a VMKernel Adapter and attach the VMKernel adapter to new switch to which I attach a physical port, then I add vSAN and Management in VMKernel settings,"

Do you have the Management (and/or any other) traffic on the same vmkernel, uplinks and/or subnet? vSAN and Management should not be configured on the same vmk (other than Witness Traffic Separation (WTS) to the witness on a stretched cluster), and these traffic types should also not be in the same subnet/VLAN. If they share links, consider setting nic0/nic1 as Active/Standby in opposing order for each traffic type on the port-groups. Are you using 1Gbps or 10Gbps link(s) and switch(es)? Is MTU set consistently end-to-end, and have you tested this properly in both directions between hosts on the vSAN-enabled interfaces?

e.g. from Host1:

# vmkping -I vmkX <vSAN-IP_Host2> -s 1472 -d -c 100

Or if using 9000 MTU:

# vmkping -I vmkX <vSAN-IP_Host2> -s 8972 -d -c 100
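For reference, the -s values above follow from the MTU: an IPv4 ICMP echo carries a 20-byte IP header and an 8-byte ICMP header, so with the don't-fragment flag (-d) the largest payload that fits is MTU minus 28. A minimal sketch of that arithmetic (the function name is illustrative, not part of any VMware tool):

```python
# Header sizes for an ICMP echo request over IPv4 (no IP options).
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def max_ping_payload(mtu: int) -> int:
    """Return the largest ICMP payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return mtu - overhead


# Print the payload size to pass to vmkping -s for the two common MTUs.
for mtu in (1500, 9000):
    print(f"MTU {mtu}: vmkping -s {max_ping_payload(mtu)} -d")
```

If a ping with that size and -d fails while a smaller one succeeds, something on the path is using a smaller MTU than the vmk.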

"No I can vmkping on all interfaces of but the disk still shows only 100GB instead of 400GB in vSAN."

If you are seeing 1/4 of your capacity then the cluster is most likely partitioned, e.g. 'esxcli vsan cluster list' shows a membership of 1.

The above should help ascertain whether there is a stable connection between nodes. If there is, the next step is to check ports and traffic (12321 for Unicast, 12345 and 23451 for Multicast) and the state of 'esxcli vsan cluster unicastagent list' on each node. Potentially you are not completely unconfiguring the hosts each time you redo this, and that is causing issues. What build number are your ESXi hosts and vCenter here?

Bob

TryllZ · ‎03-12-2019

Thanks Bob for the clarification.

What I mean by sometimes is that sometimes the vSAN gets created and works fine, and I can add files/folders to it; other times I receive a "Cannot complete file operation" error.

I tried it both ways, with vSAN and Management on the same vmkernel and on separate ones, and the end result is the same. When the above occurs I never get to see 400GB of disks, I always see 100GB only.

The following is from one of the hosts.

[root@esxi1:~] esxcli vsan cluster get

Cluster Information

Enabled: true

Current Local Time: 2019-03-12T18:23:05Z

Local Node UUID: 5c69089b-aa4a-1496-3cd9-000c29287b12

Local Node Type: NORMAL

Local Node State: MASTER

Local Node Health State: HEALTHY

Sub-Cluster Master UUID: 5c69089b-aa4a-1496-3cd9-000c29287b12

Sub-Cluster Backup UUID:

Sub-Cluster UUID: 52a26b42-7714-b306-54ff-729833563cfe

Sub-Cluster Membership Entry Revision: 0

Sub-Cluster Member Count: 1

Sub-Cluster Member UUIDs: 5c69089b-aa4a-1496-3cd9-000c29287b12

Sub-Cluster Member HostNames: esxi1.localdomain

Sub-Cluster Membership UUID: 1cf7875c-eabf-e640-f564-000c293ef8c1

Unicast Mode Enabled: true

Maintenance Mode State: OFF

Config Generation: c0f3ddef-6c0d-432a-802d-e98c94d557a5 3 2019-03-12T18:14:25.335

[root@esxi1:~] esxcli vsan network list

Interface

VmkNic Name: vmk1

IP Protocol: IP

Interface UUID: c0f5875c-e0c9-2a35-becb-000c293ef8c1

Agent Group Multicast Address: 224.2.3.4

Agent Group IPv6 Multicast Address: ff19::2:3:4

Agent Group Multicast Port: 23451

Master Group Multicast Address: 224.1.2.3

Master Group IPv6 Multicast Address: ff19::1:2:3

Master Group Multicast Port: 12345

Host Unicast Channel Bound Port: 12321

Multicast TTL: 5

Traffic Type: vsan

This is ESXi 6.7 U1, and vCenter 6.7 U1.

This is in Workstation, so I did the basic settings and took a snapshot, and each attempt starts after reverting to that basic configuration.

I tried the command esxcli vsan cluster unicastagent list but it returns nothing.

TheBobkin · ‎03-12-2019

Hello TryllZ,

As I said, you are only seeing 1/4 of the total storage because only the configured capacity of one node is visible, which means you have a partitioned or never-formed cluster:

Sub-Cluster Member Count: 1

So either you have a network configuration/communication issue or a clustering issue, e.g. you are trying to add nodes to a cluster configuration but never removed some/all of the old cluster configurations. Once you have validated the networking aspect as I described in my last comment, look at the latter (you can even take vCenter out of the loop by doing it all via CLI, in case there is an issue at that layer).
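As a rough illustration of that check, the 'Sub-Cluster Member Count' field can be pulled out of saved 'esxcli vsan cluster get' output and compared with the number of hosts you expect. This is only a sketch: the sample text is a subset of the output posted above, and the helper names are made up for the example.

```python
# A subset of the 'esxcli vsan cluster get' output posted earlier in the thread.
SAMPLE = """\
Cluster Information
   Enabled: true
   Local Node State: MASTER
   Sub-Cluster Member Count: 1
   Sub-Cluster Member HostNames: esxi1.localdomain
"""


def parse_fields(text: str) -> dict:
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_partitioned(text: str, expected_hosts: int) -> bool:
    """True when this host sees fewer cluster members than expected."""
    return int(parse_fields(text)["Sub-Cluster Member Count"]) < expected_hosts


print(is_partitioned(SAMPLE, expected_hosts=4))
```

Running the same check against the output from every host shows whether each node is sitting in its own single-member partition.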

Bob

TryllZ · ‎03-12-2019

That snapshot was taken before creating a cluster, not after, so I doubt that any previous configuration exists.

Also, sorry but I'm kind of lost here. All the pings are working, so I'm not sure where to look for the network configuration/communication issue.

I have added all the screenshots of the network configuration at Imgur, if you could look through them.

Thank You

TheBobkin · ‎03-12-2019

Hello TryllZ

Are you reverting from snapshot on the vCenter as well or just the nodes?

I ask because the vSAN cluster Object 'New Cluster' resides in the vCenter database, not on the hosts. Yes, the hosts have their own details regarding what their cluster is, but if there is a disparity between these you could hit issues.

Note that moving nodes into a vSAN-enabled cluster Object will attempt to join them to the cluster once a reachable vSAN-enabled vmk is configured (note that you don't show this traffic type as 'enabled' there; move the slider and check the vmks on all nodes).

Likely not the cause but should be ruled out (and is not good practice) - you are multihoming the traffic types there in the same subnet unless that is a /29 range (e.g. 255.255.255.248):

VMware Knowledge Base
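To make the subnet point concrete, here is a small sketch using Python's standard ipaddress module that reports which vmk interfaces on a host fall in overlapping networks. The vmk names and addresses are hypothetical placeholders, not the values from the screenshots.

```python
import ipaddress
from itertools import combinations


def overlapping_vmks(vmks: dict) -> list:
    """vmks maps a vmk name to 'ip/netmask'; return the pairs whose networks overlap."""
    nets = {name: ipaddress.ip_interface(addr).network for name, addr in vmks.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical addresses: same two IPs, first with a /24 mask, then with a /29 mask.
same_24 = {"vmk0": "192.168.1.11/255.255.255.0", "vmk1": "192.168.1.21/255.255.255.0"}
split_29 = {"vmk0": "192.168.1.11/255.255.255.248", "vmk1": "192.168.1.21/255.255.255.248"}

print(overlapping_vmks(same_24))   # both vmks land in 192.168.1.0/24
print(overlapping_vmks(split_29))  # 192.168.1.8/29 and 192.168.1.16/29 do not overlap
```

With the /24 mask both interfaces are multihomed in one subnet; with the /29 mask the same two addresses end up in separate networks.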

Most of the other elements/alerts there are just symptoms of having a one node cluster (e.g. can't create test or performance Objects due to having a single node available for component placement).

Place all nodes in Maintenance Mode (No Action), delete any Disk-groups that exist (validate that no partitions remain on any previously used devices), move all nodes out of 'New Cluster' to the Datacenter level, validate that every node reports vSAN as not enabled (esxcli vsan cluster get), create a new vSAN cluster Object, enable vSAN on it, move all the nodes into it and try to configure it again.

If the above doesn't work, you could rule out vCenter (and whether vCenter is being snapshotted/reverted) by performing all the removal steps above, creating a cluster Object with vSAN not enabled on it, tagging the vSAN vmks, and then configuring the rest via CLI: 'esxcli vsan cluster new', then 'esxcli vsan cluster get' and note the cluster UUID, then on the other nodes 'esxcli vsan cluster join -u <ClusterUUID>'. After that, populate the unicast lists on all hosts using 'esxcli vsan cluster unicastagent add -a <vSAN-IP> -u <member-UUID>' (hosts should NOT have their own addresses in the list, just the other members). You might have to set /VSAN/IgnoreClusterMemberListUpdates on each host to prevent vCenter from pushing a blank list (as vSAN is not enabled there), but I am mainly suggesting this approach to validate/rule out the points noted above.
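A sketch of how the unicast lists could be laid out, assuming four hosts: each host gets an 'add' entry for every other member and none for itself. The host names, IPs and UUID strings are hypothetical placeholders, and the command uses only the -a and -u options quoted above; it also refuses to build a plan if two hosts report the same UUID, since the entries would then be ambiguous.

```python
def unicast_plan(members: dict) -> dict:
    """members maps hostname -> (vsan_ip, node_uuid); return hostname -> commands to run there."""
    uuids = [uuid for _, uuid in members.values()]
    if len(set(uuids)) != len(uuids):
        raise ValueError("duplicate node UUIDs: each host needs a unique UUID first")
    plan = {}
    for host in members:
        # Every peer except the host itself goes into that host's unicast list.
        plan[host] = [
            f"esxcli vsan cluster unicastagent add -a {ip} -u {uuid}"
            for peer, (ip, uuid) in members.items()
            if peer != host
        ]
    return plan


# Hypothetical lab values; replace with the real vSAN IPs and 'Local Node UUID' values.
MEMBERS = {
    "esxi1": ("10.10.10.11", "uuid-of-esxi1"),
    "esxi2": ("10.10.10.12", "uuid-of-esxi2"),
    "esxi3": ("10.10.10.13", "uuid-of-esxi3"),
    "esxi4": ("10.10.10.14", "uuid-of-esxi4"),
}

for host, commands in unicast_plan(MEMBERS).items():
    print(f"# run on {host}")
    for command in commands:
        print(command)
```

In a four-node cluster each host therefore ends up with three entries.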

Bob

TryllZ · ‎03-12-2019

Appreciate the detailed reply.

I will go through all of it slowly and understand it.

The snapshots are created with IP configuration ONLY.

When I revert, all VMs are reverted, so I have to create everything from scratch: networks, cluster, vSAN.

Thanks again.

TryllZ · ‎03-13-2019

Hi again,

So I attempted the following.

I tried to join the nodes manually using the unicastagent add command, followed by setting /VSAN/IgnoreClusterMemberListUpdates, but that did not work.

My unicastagent list still remains empty.

Next I'm going to start again from joining the nodes to vSAN and get the logs from each node to see what the last error is.

Hope to find something..

TryllZ · ‎03-13-2019

Currently I'm out of ideas. I'm not sure what is left to try, and nothing is working.

I got the logs and have added them as an attachment (for 2 nodes only). If you get some time, please go through them; I appreciate the help.

Of note is that some errors are repeating, for example the VMKernel shuts down and some disks are getting added into vSAN.

Thanks..

TryllZ · ‎03-13-2019

Thanks a lot Bob, I very much appreciate your assistance and you taking the time out of your schedule.

I found out what the issue is.

Since I'm using Workstation and running clones, a new system UUID is not being generated, so the same UUID is being used on all the ESXi servers (which should not be the case when a clone is created).

So I found the solution here : http://www.vmwarearena.com/vsan-cluster-partition/

I followed the instructions there, and now I have the full 400GB of disks, can access the disk, and can create folders on it as well.

Thank you once again for all the help provided.

TheBobkin · ‎03-13-2019

Hello TryllZ

Happy to help troubleshoot it. Too bad you didn't add any cluster get info from multiple hosts or logs before, as that would have saved you some time: it is fairly clear both hosts have the same UUID (and they would also have been the same UUIDs when adding the unicast entries):

esxi1 - vmkernel.log

2019-03-13T12:55:58.014Z cpu1:2102510)CMMDS: DiscoverySendHeartbeat:1561: Electing (local node) 5c69089b-aa4a-1496-3cd9-000c29287b12 to be master

2019-03-13T12:55:58.014Z cpu1:2102510)CMMDS: CMMDSLogStateTransition:1423: Transitioning(5c69089b-aa4a-1496-3cd9-000c29287b12) from Discovery to Master: (Reason: The local node was elected as master)

esxi2 - vmkernel.log

2019-03-13T12:55:59.973Z cpu0:2102496)CMMDS: DiscoverySendHeartbeat:1561: Electing (local node) 5c69089b-aa4a-1496-3cd9-000c29287b12 to be master

2019-03-13T12:55:59.973Z cpu0:2102496)CMMDS: CMMDSLogStateTransition:1423: Transitioning(5c69089b-aa4a-1496-3cd9-000c29287b12) from Discovery to Master: (Reason: The local node was elected as master)
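A quick way to spot this across hosts is to pull the local node UUID out of each host's election message and look for repeats. A minimal sketch, using the two 'Electing (local node)' lines quoted above as input (the function name is illustrative, and the regular expression assumes that exact log wording):

```python
import re
from collections import defaultdict

# Matches the election line format shown in the vmkernel.log excerpts above.
ELECTION_RE = re.compile(r"Electing \(local node\) ([0-9a-f-]{36}) to be master")

LOGS = {
    "esxi1": "2019-03-13T12:55:58.014Z cpu1:2102510)CMMDS: DiscoverySendHeartbeat:1561: "
             "Electing (local node) 5c69089b-aa4a-1496-3cd9-000c29287b12 to be master",
    "esxi2": "2019-03-13T12:55:59.973Z cpu0:2102496)CMMDS: DiscoverySendHeartbeat:1561: "
             "Electing (local node) 5c69089b-aa4a-1496-3cd9-000c29287b12 to be master",
}


def duplicated_uuids(logs: dict) -> dict:
    """Return {uuid: [hosts]} for every local node UUID claimed by more than one host."""
    hosts_by_uuid = defaultdict(set)
    for host, text in logs.items():
        for uuid in ELECTION_RE.findall(text):
            hosts_by_uuid[uuid].add(host)
    return {u: sorted(h) for u, h in hosts_by_uuid.items() if len(h) > 1}


print(duplicated_uuids(LOGS))
```

Any UUID that shows up under more than one host name points to cloned hosts sharing an identity.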

I have never bumped into this myself, as I always do individual deploys for hosts in WS. Sure, it takes longer, but there are fewer potential issues with vCenter and other products as well. I recall cloning a Witness recently and wrecking a (test) cluster.

Bob