This is a freshly built lab environment: a 4-node hybrid vSAN cluster with one 2TB SSD and 5 capacity disks per host. I have two 1GbE NICs dedicated to vSAN traffic, and jumbo frames are enabled.
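Not from the thread, just a sanity check worth keeping in mind: a rough Python sketch of what 1GbE caps you at, assuming (this is an assumption about the default teaming behavior, not something stated above) that a single vmkernel port rides one active uplink at a time. The efficiency figure is an illustrative guess at protocol overhead.

```python
# Back-of-the-envelope throughput for vSAN traffic on 1GbE links.
# Assumption: with default teaming, one vmkernel port effectively
# uses one active uplink at a time, so one link's rate is the cap.

LINE_RATE_BPS = 1_000_000_000   # 1GbE line rate, bits per second
PROTOCOL_EFFICIENCY = 0.94      # rough allowance for Ethernet/IP/TCP overhead

def usable_mb_per_s(line_rate_bps: float = LINE_RATE_BPS,
                    efficiency: float = PROTOCOL_EFFICIENCY) -> float:
    """Approximate usable payload throughput in MB/s on one link."""
    return line_rate_bps * efficiency / 8 / 1_000_000

def transfer_time_minutes(size_gb: float, mb_per_s: float) -> float:
    """How long a transfer of size_gb should take at mb_per_s."""
    return size_gb * 1000 / mb_per_s / 60

# A single 1GbE link tops out around ~117 MB/s of payload, so even a
# ~5GB ISO should copy in well under a minute at line rate:
print(f"{usable_mb_per_s():.1f} MB/s")
print(f"{transfer_time_minutes(5, usable_mb_per_s()):.2f} minutes")
```

The takeaway: an hour-long install or a 30-minute ISO upload stuck at 11% is far below even a single 1GbE link's capability, which points at latency or packet loss rather than raw bandwidth.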
I am currently trying to build a Windows Server 2019 VM; the install has been running for over an hour and is stuck at 23% of "Getting files ready for installation". The ISO is on a separate datastore.
I wasn't expecting stellar performance, but this is ridiculous, and I'm at a loss for how to fix or improve it. Any suggestions or ideas on what I can look at?
Latency is the only metric I see in the red, but good grief, this is painful.
I got it sorted out, but it's weird how I had to do it.
These servers have two quad-port 1GbE NICs, and I originally set it up with 2 ports for management, 2 for vMotion, 2 for vSAN, and 2 for VM traffic. That seemed like a logical way to do it, but vSAN performance was absolutely abysmal. At some point I had the idea to give vSAN 4 ports, so I made adjustments and added two more ports to the vSAN vmk. That did not make things any better. Then I tried breaking each NIC out into its own Distributed vSwitch with a dedicated one-port vSAN vmk, and that seemed to be the magic bullet: latency dropped, and I was able to successfully deploy a VM in a reasonable time.
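One possible explanation for why adding uplinks to a single vSAN vmk didn't help, while separate one-port vmks did: with the default "route based on originating virtual port" teaming policy, each vmkernel port is pinned to one active uplink at a time. A toy Python model of that behavior (the function and its best-case pinning assumption are mine, not anything from the thread):

```python
# Toy model: under originating-virtual-port teaming, extra uplinks add
# failover capacity but do not widen the pipe for a single vmk.
# Multiple vmks can each land on a distinct uplink (best case).

def vmk_bandwidth_gbps(uplinks: int, per_link_gbps: float = 1.0,
                       separate_vmks: int = 1) -> float:
    """Aggregate bandwidth usable by vSAN, assuming each vmk is
    pinned to a distinct uplink when enough uplinks exist."""
    return min(separate_vmks, uplinks) * per_link_gbps

print(vmk_bandwidth_gbps(uplinks=4, separate_vmks=1))  # one vmk, 4 uplinks -> 1.0
print(vmk_bandwidth_gbps(uplinks=4, separate_vmks=4))  # four one-port vmks -> 4.0
```

Under this model, going from one vSAN vmk on four uplinks to four one-port vmks is what actually multiplied the usable bandwidth, which lines up with the observed improvement.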
Since I had made so many changes and things looked like a mess, I decided to rebuild the cluster with the new network setup: 2 ports for management and vMotion, 2 ports for VM traffic, and the 4 vSAN ports broken out into their own one-port vSwitches. For some odd reason, this did not work. Latency went through the roof again, and I was dumbfounded, since it had worked the night before. I beat my head against my server rack for hours trying to figure out why it wasn't working, and then I realized the difference between the two setups: I didn't have a dedicated vMotion portgroup, I was sharing it with the management portgroup. I made that change and it started working.
The final outcome is this setup: 2 ports for management and VM traffic, 2 ports for vMotion, and 4 ports for vSAN, and everything is working pretty well. Eventually I will move to 10GbE networking, but that is not in the budget, so for now this weird setup will work.
Latency of what? You are not giving a whole lot of info for people to help; more details would be useful.
@jlrobinson2171, as Duncan said, you haven't really provided any reasonable information about your findings or troubleshooting steps here, so I'm really not sure what you are hoping to achieve.
Step 1 should be to check whether you have excessively high latency on just the frontend (probably a network issue) or on both frontend and backend (more likely a storage or system issue). These can be checked at the cluster level (assuming you have the performance service enabled) via:
Cluster > Monitor > vSAN > Performance
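The triage step above can be sketched as a small helper: classify a latency reading by whether it is high on the frontend only (VM-facing, pointing at the network) or on both frontend and backend (pointing at storage or the system). The 30 ms threshold here is purely an illustrative assumption, not a vSAN default.

```python
# Hypothetical triage helper mirroring the frontend-vs-backend check.
# threshold_ms is an illustrative cutoff, not an official vSAN value.

def triage(frontend_ms: float, backend_ms: float,
           threshold_ms: float = 30.0) -> str:
    front_high = frontend_ms > threshold_ms
    back_high = backend_ms > threshold_ms
    if front_high and back_high:
        return "frontend+backend high: likely storage or system issue"
    if front_high:
        return "frontend only high: likely network issue"
    return "latency within threshold"

print(triage(frontend_ms=120, backend_ms=4))
print(triage(frontend_ms=120, backend_ms=95))
```

Given the later findings in this thread (dropped pings on the vSAN network, fixed by network reconfiguration), the "frontend only high" branch is the one that applied here.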
Right now the only VMs running on the cluster are the vCLS machines.
I attached a couple of screenshots; all I am doing at the time of these screenshots is uploading a Windows Server 2019 ISO to a content library that uses vSAN as its storage. That upload has been running for 30 minutes and is only at 11%.
Running a continuous ping to a vSAN IP does drop the occasional packet.
As I said, I have two 1GbE NICs dedicated to vSAN on vmk2, and jumbo frames are enabled.
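Since jumbo frames plus occasional dropped pings are in play, one useful check is verifying MTU 9000 end to end with a don't-fragment ping sized to fill the frame: the payload is the MTU minus the 20-byte IPv4 header and 8-byte ICMP header. A small sketch that derives the size and builds the standard `vmkping` invocation (the vmk name matches the thread; the target IP is a made-up placeholder):

```python
# Jumbo-frame path check: derive the max don't-fragment ping payload
# for a given MTU and build the vmkping command line.
# The target IP below is a placeholder, not from the thread.

IPV4_HEADER = 20   # bytes
ICMP_HEADER = 8    # bytes

def jumbo_payload(mtu: int = 9000) -> int:
    """Largest ICMP payload that fits in one frame at this MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER

def vmkping_cmd(target_ip: str, vmk: str = "vmk2", mtu: int = 9000) -> str:
    # -d sets don't-fragment, -s sets payload size, -I picks the vmkernel port
    return f"vmkping -I {vmk} -d -s {jumbo_payload(mtu)} {target_ip}"

print(jumbo_payload())               # 8972
print(vmkping_cmd("192.168.50.12"))  # vmkping -I vmk2 -d -s 8972 192.168.50.12
```

If an 8972-byte don't-fragment ping fails between vSAN vmks while a small ping succeeds, some hop (host vSwitch, physical switch, or vmk) is not actually passing MTU 9000, which would produce exactly this kind of latency and drop behavior.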
The network switch is a UniFi 48-port layer 2 switch.
Each host has 3 disk groups with 8 disks each (one 500GB SSD and seven 600GB 10k SATA drives).
These are refurbished drives and servers, except for the SSDs, which are Samsung EVOs. I know they are not enterprise grade, but they are on the HCL.
Just trying to get this into a usable state, which it currently isn't.