VMware Cloud Community
jlrobinson2171
Contributor

4-node hybrid vSAN cluster with horrible performance

This is a lab environment that was just built. It is a 4-node hybrid vSAN cluster with one 2TB SSD and five capacity disks per host. I have two 1Gb NICs dedicated to vSAN traffic, and jumbo frames are enabled.
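For reference, the standard way to confirm jumbo frames actually pass end to end on the vSAN network is a large, non-fragmenting ping between hosts from the vSAN vmkernel interface (vmk2 here). A rough sketch from an ESXi shell; the peer address is just a placeholder for another host's vSAN IP:

# 8972 bytes = 9000 minus 28 bytes of IP/ICMP header; -d means don't fragment
vmkping -I vmk2 -d -s 8972 192.168.50.12

# Plain packet-loss check with default-size pings
vmkping -I vmk2 -c 100 192.168.50.12

If the large ping fails while the small one works, something in the path (host vmk, vSwitch, or the physical switch) is not set to MTU 9000.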

I am currently trying to build a Windows Server 2019 VM, and it has been running for over an hour and is at 23% of "Getting files ready for installation". The ISO is on a separate datastore.

I wasn't expecting great performance, but this is ridiculous, and I'm at a loss as to how to fix or improve it. Any suggestions or ideas on what I can look at?

Latency is the only metric I see in the red, but good grief, this is painful.


4 Replies
depping
Leadership

Latency of what? You are not giving a whole lot of info for people to help. More details would be useful.

TheBobkin
Champion

@jlrobinson2171, as Duncan said, you haven't really provided any reasonable information about your findings or troubleshooting steps here, so I'm really not sure what you are hoping to achieve.

 

Step 1 should be to check whether you have excessively high latency on just the frontend (probably a network issue) or on both frontend and backend (more likely a storage or system issue). These can be checked at the cluster level (assuming you have the performance service enabled) via:

Cluster > Monitor > vSAN > Performance
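If you want the same breakdown per host while a heavy write is running, the ESXi-side tools show it as well. A quick sketch, assuming a reasonably recent build (vsantop only ships with 6.7 U3 and later):

# Live per-host vSAN performance view, similar to esxtop (ESXi 6.7 U3+)
vsantop

# Summary of the built-in vSAN health checks, including the network section
esxcli vsan health cluster list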

jlrobinson2171
Contributor

Right now the only VMs running on the cluster are the vCLS machines.

I attached a couple of screenshots. All I am doing at the time of these screenshots is uploading a Windows Server 2019 ISO to a content library that uses vSAN as its storage, and that upload has been running for 30 minutes and is only at 11%.

Running a continuous ping to a vSAN IP does drop the occasional ping.

As I said, I have two 1Gb NICs dedicated to vSAN on vmk2, and jumbo frames are enabled.
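Since the continuous ping is dropping packets, it is probably worth checking the error and drop counters on the physical uplinks behind vmk2 as well. A sketch from an ESXi shell; the vmnic numbers are placeholders and will differ per host:

# List physical NICs with link state, speed and driver
esxcli network nic list

# Per-NIC counters; look for receive/transmit errors and dropped packets
esxcli network nic stats get -n vmnic2
esxcli network nic stats get -n vmnic3

# Confirm vmk2 really has MTU 9000 (the switch ports must match too)
esxcli network ip interface list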

The network switch is a UniFi 48-port layer-2 switch.

Each host has 3 disk groups of 8 disks each (one 500GB SSD and seven 600GB 10K SATA drives).
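If it helps to double-check how those disks were claimed, the disk-group layout is visible from an ESXi shell. Just a sketch, no placeholders needed:

# Lists every disk vSAN has claimed on the host, whether it is the cache SSD or
# a capacity disk, and which disk group it belongs to
esxcli vsan storage list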

These are refurbished drives and servers, except for the SSDs, which are Samsung EVOs. I know they are not enterprise-grade, but they are on the HCL.

Just trying to get this into a usable state, which it currently isn't.

jlrobinson2171
Contributor

I got it sorted out, but it's weird how I had to do it.

 

These servers have two quad-port 1Gb NICs, and I originally had it set up as 2 ports for management, 2 ports for vMotion, 2 ports for vSAN, and 2 ports for VM traffic. That seemed like a logical way to do it, but vSAN performance was absolutely abysmal. At some point I had the idea to give vSAN 4 ports, so I made adjustments and added two more ports to the vSAN vmk. That did not make things any better. So then I had the idea to break each NIC port out into its own Distributed vSwitch with a dedicated one-port vSAN vmk. This seemed to be the magic bullet: latency dropped and I was able to successfully deploy a VM in a reasonable time.
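For anyone following along, the per-host uplink layout of those distributed switches can be confirmed from an ESXi shell. A quick sketch, nothing specific to this environment:

# Shows each distributed switch as the host sees it, including its uplinks and MTU
esxcli network vswitch dvs vmware list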

 

Since I had made so many changes and things looked like a mess, I decided to rebuild the cluster with my new network setup. I had it set up like this: 2 ports for management and vMotion, 2 ports for VM traffic, and the 4 vSAN ports broken out into their own one-port vSwitches. For some odd reason, this did not work. Latency went through the roof again, and I was dumbfounded, as it had worked the previous night. I beat my head against my server rack for hours trying to figure out why it wasn't working, and then I realized the difference between the two setups: I didn't have a dedicated vMotion portgroup; I was sharing it with the management portgroup. I made that change and it started working.
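A quick way to spot that kind of service overlap per host is to check which services each vmkernel interface is tagged for. A sketch; vmk0 and vmk1 are placeholders for whichever vmks carry management and vMotion here:

# Shows the services (Management, VMotion, VSAN, ...) enabled on a vmkernel port
esxcli network ip interface tag get -i vmk0
esxcli network ip interface tag get -i vmk1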

 

The final outcome is that I have it set up like this: 2 ports for management and VM traffic, 2 ports for vMotion, and 4 ports for vSAN, and everything is working pretty well. Eventually I will move to 10Gb networking, but that is not in the budget, so for now this weird setup will work.
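And to confirm the final layout on each host, the vSAN vmkernel interfaces themselves can be listed from an ESXi shell; just a sketch:

# Shows which vmkernel interfaces are tagged for vSAN traffic on this host
esxcli vsan network list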