VMware Cloud Community
TomekBlaut
Contributor

vSAN - vSphere 7.0.2

Hello everyone, 

I have a question about vSAN.

We are building a new environment for View, ultimately 1000 VDI desktops, based on vSAN and consisting of 8 hosts (Dell R740XD vSAN Ready Nodes) located in 2 server rooms in 2 locations, 4 hosts per site. The distance between the server rooms is small, about 400 meters. They are connected through Cisco Nexus 9000 switches at 40 Gbit/s, and the hosts themselves are connected at 25 Gbit/s for vSAN traffic.

We have a vSAN Advanced for Desktop license.

I know that the best solution would be a stretched cluster, but the license does not allow it.

At the moment we have one vSAN cluster configured with 8 hosts.

I wanted to ask which fault domain layout should be used to make the cluster resistant to the failure of one site, assuming, for example, a loss of power in one of the server rooms. E.g.:
1. 2 fault domains with 2 hosts each and 4 hosts as standalone hosts - with this configuration we have FTT at level 2.
2. 4 fault domains with 2 hosts each - with this configuration we have FTT at level 1.
3. 2 fault domains with 4 hosts each - with this configuration we have FTT at level 0.
How does such a configuration affect the load on the environment? For example, with option 1, do the servers in domains 1 and 2 carry the production load while the rest wait for a failure?

I also wanted to ask what the best storage policy would be in our case. I assume, for example, "1 failure - RAID-5 (Erasure Coding)" for the connection servers and "1 failure - RAID-1 (Mirroring)" for the VDI desktops.
I would appreciate any suggestions or a proposal for an appropriate solution for our environment.

Have a nice day. 

a_p_
Leadership

Moderator note: Moved to VMware vSAN Discussions

TheBobkin
Champion

@TomekBlaut, with the topology you have suggested, there is no way of protecting against a 'room' or site failure as the requirement for that to be possible is 3 sites (SiteA+SiteB+Witness).
Provided the network between the sites is solid and not heavily contended (e.g. shared with other heavy traffic users) and it is <1ms RTT then this will be okay as a standard cluster (as opposed to stretched).
If site failure tolerance is a requirement then you should look into the costs and possibility of upgrading the licensing level.
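
The point about needing a third site can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: it uses the general vSAN rule that RAID-1 with FTT=n needs 2n+1 fault domains, treats each standalone host as its own fault domain, and assumes the three layouts from the question are split evenly across the two rooms. The `Layout` class and function names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    name: str
    room_a: int  # fault domains located entirely in room A (assumed split)
    room_b: int  # fault domains located entirely in room B (assumed split)


def max_ftt_mirroring(fault_domains: int) -> int:
    """Largest FTT that RAID-1 can satisfy: FTT=n needs 2n+1 fault domains."""
    return max((fault_domains - 1) // 2, 0)


def survives_room_loss(layout: Layout) -> bool:
    """A room outage fails every fault domain in that room at once.

    The data stays available only if the larger room holds no more
    fault domains than the FTT the layout can support.
    """
    ftt = max_ftt_mirroring(layout.room_a + layout.room_b)
    return max(layout.room_a, layout.room_b) <= ftt


# The three options from the question, assuming an even split per room.
LAYOUTS = [
    Layout("1: 2 domains of 2 hosts + 4 standalone hosts", 3, 3),
    Layout("2: 4 domains of 2 hosts", 2, 2),
    Layout("3: 2 domains of 4 hosts", 1, 1),
]

if __name__ == "__main__":
    for layout in LAYOUTS:
        total = layout.room_a + layout.room_b
        print(
            f"option {layout.name}: max FTT={max_ftt_mirroring(total)}, "
            f"survives room loss={survives_room_loss(layout)}"
        )
```

With only two rooms, one room always holds at least half of the fault domains, while FTT=n tolerates strictly fewer than half of them, so no split across two rooms passes this check. A third location holding a witness is what breaks the tie in a stretched cluster.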

With such a configuration you need to carefully consider aspects such as the difference in inter-node network usage and traffic pattern of RAID5/6 vs RAID1 data. It would likely be prudent to go with RAID1 unless you are sure the network between the server rooms can 100% handle this (and that there isn't a huge difference between 2 nodes communicating in the same room vs across rooms). If there is a big difference in the latency and/or reliability of 'local' (same room) vs 'remote' (different rooms) communication, then you could see very inconsistent performance (e.g. one or some vmdks on a VM performing poorly while others on the same VM are okay), depending on where the VM is running and where the backing data components reside (and whether they are split across rooms or all 'local').
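
To make the RAID5/6 vs RAID1 trade-off a bit more concrete, here is a rough sketch. The figures are the usual textbook values for mirroring and for 3+1 / 4+2 erasure coding (capacity multiplier, and backend reads and writes for one small guest write that does not fill a stripe). They ignore witness components, caching and resync traffic, so treat them as an approximation and not as measured vSAN behaviour. The table and function names are invented for this example.

```python
# policy name: (min fault domains, capacity multiplier,
#               backend reads per small write, backend writes per small write)
POLICIES = {
    "RAID-1 FTT=1": (3, 2.0, 0, 2),    # two full mirrors
    "RAID-5 FTT=1": (4, 4 / 3, 2, 2),  # 3 data + 1 parity, read-modify-write
    "RAID-1 FTT=2": (5, 3.0, 0, 3),    # three full mirrors
    "RAID-6 FTT=2": (6, 1.5, 3, 3),    # 4 data + 2 parity, read-modify-write
}


def backend_ios(policy: str, guest_writes: int) -> int:
    """Approximate backend I/Os generated by a number of small guest writes."""
    _, _, reads, writes = POLICIES[policy]
    return guest_writes * (reads + writes)


def raw_capacity_gb(policy: str, usable_gb: float) -> float:
    """Raw capacity consumed to store usable_gb under the given policy."""
    _, multiplier, _, _ = POLICIES[policy]
    return usable_gb * multiplier


if __name__ == "__main__":
    for name in POLICIES:
        print(
            f"{name}: {backend_ios(name, 1000)} backend I/Os per 1000 writes, "
            f"{raw_capacity_gb(name, 100):.0f} GB raw per 100 GB usable"
        )
```

Because components are placed on hosts in both rooms, some share of these backend I/Os crosses the inter-room link. Erasure coding saves capacity but generates more I/Os per write, which is why the quality of that link matters more for RAID5/6 than for RAID1.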

TomekBlaut
Contributor

Thank you very much for your precise answer.
At the moment, we are not able to purchase a license. Maybe next year, but I am not sure about that either. The plan was different; unfortunately, the VMware licensing changed and it turned out the way it did. Previously, the stretched cluster was available under the Advanced license.
Each server is connected for vSAN with 2 links of 25 Gbit/s to the Nexus switches, and there is 40 Gbit/s between them, so I think a performance problem is unlikely to occur. I'm afraid of a power failure in one of the server rooms. I don't know what might happen then. I understand that the machines that happen to be on those hosts will be inaccessible, but will they start up again on the hosts that are still running? The second thing I fear is a network failure that leaves the cluster out of sync, and how that will affect the machines and the environment.
I know a stretched cluster would be the best solution, but I have to manage somehow without it. Or maybe 2 separate clusters?

TheBobkin
Champion

@TomekBlaut You are very welcome.

If there was a power failure in one of the server rooms while a standard cluster is configured across the rooms, this would count as 4 failures. E.g. if you have FTT=1 data, then the vast majority of it would be inaccessible until power and data availability were restored. This wouldn't just impact the VMs running on the nodes in that room but all VMs running in both rooms (as all data would be stored as components distributed across all nodes regardless of location). This isn't a stretched cluster and there isn't any feasible way to configure it to act like one, so site/room protection isn't going to be possible here.
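
A small sketch of why a room outage also hits VMs running in the surviving room. It is a simplified model, not vSAN's actual placement logic: each FTT=1 RAID-1 object is assumed to have two replicas and one witness with one vote each, and the object is treated as accessible only if more than half of the votes and at least one full replica remain. The host names (A1..A4, B1..B4) and the example placements are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    kind: str       # "replica" or "witness"
    host: str       # hypothetical host name, e.g. "A1" = room A, host 1
    votes: int = 1  # assumption: every component carries one vote


def is_accessible(components: list[Component], failed_hosts: set[str]) -> bool:
    """Object is usable if it keeps quorum (>50% of votes) and a full replica."""
    alive = [c for c in components if c.host not in failed_hosts]
    total_votes = sum(c.votes for c in components)
    alive_votes = sum(c.votes for c in alive)
    has_replica = any(c.kind == "replica" for c in alive)
    return has_replica and 2 * alive_votes > total_votes


ROOM_A = {"A1", "A2", "A3", "A4"}

# Example placements; with no room-aware fault domains any of these can occur.
mostly_in_a = [Component("replica", "A1"), Component("replica", "A3"),
               Component("witness", "B2")]
mostly_in_b = [Component("replica", "A2"), Component("replica", "B1"),
               Component("witness", "B4")]

if __name__ == "__main__":
    print("mostly in room A:", is_accessible(mostly_in_a, ROOM_A))
    print("mostly in room B:", is_accessible(mostly_in_b, ROOM_A))
```

In this model a VM running on a host in room B still loses its disk if two of the three components of that object happen to sit in room A, since placement does not take the room into account.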

If you have significant doubts about the infrastructure of one server room vs the other (e.g. older/cheaper/less reliable supporting power equipment, a less reliable power grid etc.), then running this as 2 clusters, with more critical workloads on the 'good' site and less critical ones on the other site, might be a valid idea (possibly combined with replication or some form of DR from the 'good' site cluster to the other in case that one fails).
