We are having a lot of performance issues with our VMs, and a lot of staff are complaining that applications are slow and freeze (for example, restarting a Windows server used to take 5-10 seconds and now takes over 5 minutes; opening a new window seems to take half a second between click and action, etc.).
I'm having difficulty troubleshooting these issues, so I'm trying to start with the basics. I inherited this setup a while back and have been adapting to changes as needed, and not everything that was done in the past makes sense to me.
For our setup we run 2 datacenters that are almost identical; both have a vCenter appliance and a replication appliance.
Our infrastructure is a Dell VRTX chassis with 4 blades and 40 TB of shared internal storage on RAID 60, formatted as VMFS. We run VMware 6.7 U3.
Each blade has 4 physical NICs: 2x 1 Gb and 2x 10 Gb.
We have a standard switch on each host with the 2x 1 Gb NICs, connected to VMkernel vmk0 on the default TCP/IP stack.
We also have a distributed switch using the 2x 10 Gb NICs on each host, where everything else is connected.
We went through a major re-IP effort recently; 172.18.4.0/24 is now our management network at the main datacenter, and 10.4.0.0/16 is a remnant of the old network, local only.
So my questions would be :
Am I correct to think that all replication traffic goes through the 1 Gb management interface?
Do we have enough VMkernel adapters, and does the VMkernel setup look fine?
Where do I look for performance issues after that?
I hope that makes sense, and I appreciate any help.
Replication traffic will always go over the Management VMkernel unless you tag another VMkernel for "vSphere Replication" (I recommend creating a new one). Anyway, a Windows VM restart going from 5-10 seconds to over 5 minutes seems drastic to me.
I think you should look at disk latency. You can monitor that from the VM's Performance tab (Virtual Disk and Datastore), and you can also check whether the ESXi host where the VM is located is experiencing latency accessing the datastore.
Also, a major point: you said it used to work fine. What change was made for all your applications and OSes to start performing that badly?
And to monitor network usage, latency, packets delivered/dropped, etc., you can look at the ESXi host and check the usage of the vmnics.
I will look into making another VMkernel. Does it matter if it's in the same IP range as the management interface?
It wasn't overnight; it may have been a progressive change.
Looking at the stats, I can see some VMs have random latency spikes of 500-1000 ms, but they are very infrequent.
From the datastore point of view:
we do have over 130 VMs on that single VMFS datastore, but we don't really have much other choice.
Hi there! For replication traffic it's always recommended to have a dedicated VMkernel and a dedicated VLAN. If you use two VMkernels for management and replication traffic over the same VLAN, performance won't improve that much.
So to be clear: everything is local storage, nothing like vSAN, just local disks?
Have you tried throttling the replication to see if the latency improves?
I'm wondering if you're just overloading the local storage. I'd monitor with esxtop and watch the different latency metrics during that time frame.
Redesigning your VMkernels and isolating traffic in different VLANs will definitely help, but only depending on which vmnics you configure them on. VLAN isolation is used from a security perspective and to reduce broadcast domains; either way, the physical NICs are still shared by all the VMkernels.
You may also consider Network I/O Control to give the virtual machines more priority for network usage, although that may not be your issue, since you are seeing some odd latency on the virtual disks of your VMs.
The dashboard where you show the top 10 VMs is not very clear; you should review some of the VMs that are facing issues to see if they have latency on their virtual disks. Also, you say you have a single VMFS: what is it backed by (FCoE, FC, iSCSI)? Give us a little more insight into your architecture.
Hey, hope you are doing fine
There were some excellent replies, all worth considering.
Also, if you have large VMs, I'd suggest using the 10 GbE NICs for replication and vMotion, with dedicated VLANs.
In addition, here are some things I would check:
Are you entirely sure this is a VMkernel/networking issue? Can you provide some performance charts?
Are you dropping packets? If you are, check the MTU configuration on the NICs and switches.
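A quick way to test MTU end-to-end on ESXi is `vmkping` with the don't-fragment flag, e.g. `vmkping -d -s 8972 <peer-vmk-ip>` for jumbo frames. The payload size is the MTU minus the IPv4 and ICMP headers; here is a minimal sketch of that arithmetic (assuming plain IPv4 with no VLAN or overlay overhead):

```python
# Payload size to pass to `vmkping -d -s` for a given MTU.
# Assumption: standard 20-byte IPv4 header + 8-byte ICMP header.
IPV4_HEADER = 20
ICMP_HEADER = 8

def max_ping_payload(mtu):
    return mtu - IPV4_HEADER - ICMP_HEADER

print(max_ping_payload(9000))  # 8972 -- jumbo frames
print(max_ping_payload(1500))  # 1472 -- standard MTU
```

If `vmkping -d -s 8972` fails while `-s 1472` succeeds, some device in the path is not passing jumbo frames.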
Did you check your resource overcommitment (Memory/CPU)?
A very easy thing to do to check for resource contention would be this:
Log in to an ESXi host via SSH and run esxtop (see ESXTOP - Yellow Bricks).
By default it will load the CPU metrics, so check for high %RDY values.
Then hit 'm' to look only at the memory state of the host. If it shows anything other than the High state, you are running into memory starvation.
Did you check your storage latency?
You can get that from vCenter performance charts
Are your VMs running on snapshots?
Do all your VMs have VMware tools installed?
Which virtual NICs are your VMs using?
For Windows servers, E1000/E1000E adapters are not recommended. Try swapping to VMXNET3 (which requires VMware Tools).
Do you have vROps available? It is a great tool for monitoring performance and resources, and it should also give you plenty of tools to start troubleshooting.
Did you run RVTools against your vCenter? It has a tab called vHealth which should flag a lot of errors.
Hope this works for you
Let me know if you need any help
Thanks all for checking into this.
So, to answer all the questions:
I'm not sure this is a VMkernel issue, but I'm worried our setup isn't done properly, as things have changed over the years, especially since we added replication.
Our internal VMFS, and our only VMFS, is DAS inside the VRTX chassis shared between the 4 blades. Each blade also has a small local VMFS, but we don't use them since VMs on them couldn't be migrated between hosts automatically.
Replication goes over our MPLS, so we are technically limited to 100 Mbps max, shared with normal network traffic.
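For a sense of what a 100 Mbps shared link means for replication, here is some back-of-the-envelope math (the 50 GB delta size and the link-share fractions are illustrative assumptions, not measurements from this environment):

```python
# Rough replication-window math for a 100 Mbps MPLS link.
# Assumption: decimal units (1 GB = 10^9 bytes); real throughput will be
# lower since the link is shared with normal traffic.
def transfer_hours(delta_gb, link_mbps=100, link_share=1.0):
    bits = delta_gb * 8 * 1000**3                 # GB -> bits
    seconds = bits / (link_mbps * 1e6 * link_share)
    return seconds / 3600

print(round(transfer_hours(50), 1))                   # 1.1 h for a 50 GB delta at full line rate
print(round(transfer_hours(50, link_share=0.5), 1))   # 2.2 h if replication only gets half the link
```

If replication deltas regularly exceed what the link can ship inside the RPO window, replication will run continuously and compete with normal traffic, which is one more reason to consider throttling it.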
I cannot see any disk latency issues; there are the odd 500 ms+ spikes, but they are very scarce and don't feel related to the performance issues.
Disk queues within Windows never go over 1 either.
I believe our environment is rather clean; I use RVTools to keep an eye on its status.
All VMs are on VMXNET3 besides an old Windows Server 2008 and an SBS box.
All VMs have VMware Tools installed and up to date.
I will have to look at esxtop when I get time this week.
Hey, I see that there is a lot of CPU ready.
Can you use the advanced performance charts for this?
VM --> Monitor --> Performance --> Advanced --> Chart Options --> order by units and select Used (ms) / Ready (ms)
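To interpret the Ready (ms) values from those charts, the commonly cited conversion divides the ready time by the chart's sample interval (20 seconds for the real-time chart). A minimal sketch, assuming the real-time chart and one vCPU per value:

```python
# Convert vCenter "Ready (ms)" chart values to CPU ready %.
# Assumption: real-time chart with a 20 s sample interval;
# ready% = ready_ms / (interval_s * 1000) * 100.
def cpu_ready_pct(ready_ms, interval_s=20):
    return ready_ms / (interval_s * 1000) * 100

print(cpu_ready_pct(2000))   # 10.0 -- 2000 ms of ready time in a 20 s sample = 10%
print(cpu_ready_pct(1000))   # 5.0
```

As a rough rule of thumb, sustained ready above roughly 5% per vCPU is worth investigating, and 10%+ is usually noticeable inside the guest.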
Do you have vROps available?
Can you please let us know how many vCPUs these VMs have configured?
Given that you have high CPU ready (contention), you might consider two options:
1. Downsize the VMs' vCPU counts (vROps would be very useful for this).
2. Set a CPU reservation, although that would really just shift the contention to other VMs.