VMware Cloud Community
cmcrci
Contributor

Migration Issue

We have a cluster of 6 ESX 3.5 hosts with one resource pool (16,525 MHz CPU and 132,213 MB memory, no reservations, shares set to Normal). Around 10 VMs sit outside the resource pool.

ESX host configuration: HP ProLiant ML370 G5

Processors: 8 × 2.499 GHz

Hyper-Threading: Inactive

Issue: the cluster has HA and DRS enabled, with DRS configured as fully automated. We have plenty of resources available on the ESX 3.5 hosts, but a lot of VM migrations keep happening inside the cluster, and I have no clue why. We also see many "resource allocation changed" entries in the logs for the VMs, and because of these migrations the performance of the VMs is very slow.

All the mounted datastores we use to host the VMs are NFS.

vCenter 2.5

Please help me in resolving this. Thanks in advance.

CMCRCI

23 Replies
barcex
Enthusiast

Hi

It is hard to say what is happening from what you are telling us. However, I will try to give you some clues that could be useful for troubleshooting the issue.

First of all, this is a DRS issue, so let's set HA aside.

The migrations happen because DRS is in fully automated mode. In this mode DRS makes the moves it thinks will benefit the performance of the whole cluster while honoring the configured resource priorities (resource pools, shares, etc.).

There is a slider in the DRS configuration panel that sets the aggressiveness of the fully automated mode. By tweaking that value you will get more or fewer moves; set it to a less aggressive value and you will probably end up with fewer migrations.

However, it is unlikely that the root cause of your performance issues is the automatic vMotion moves. Usually vMotion does not affect VM performance in a noticeable way. It could saturate your vMotion network and cause networking performance issues if those interfaces also carry virtual machine traffic.

The most probable cause is that the automatic migrations are just a symptom of your real performance problem, not the problem itself. Disable the automatic mode for a couple of hours and you will see how the whole system performs without automatic moves; there is no risk in this.

I suspect the root cause is a shortage of CPU or memory resources. See whether the hosts are highly loaded, check the cluster imbalance value in the "Summary" tab of your cluster, and check the amount of ballooning and swapping on your hosts; these can be indications of memory problems.

Finally, having virtual machines that are siblings of resource pools is a very bad practice. The few virtual machines outside the resource pool could be impacting the performance of the machines inside it; this is a well-known problem. In the Resource Allocation tab, compare the shares and CPU percentages assigned to the machines outside the pool with those of the pool itself. You could end up in a situation like this:

RPOOL1 (25%)
    ----> VMRPA (25%)
    ----> VMRPB (25%)
    ----> VMRPC (25%)
    ----> VMRPD (25%)

VM1 (25%)
VM2 (25%)
VM3 (25%)

Notice that in the example, VMRPA is entitled to 25% of the resources of RPOOL1, which in turn is entitled to 25% of the resources of the cluster. This means VMRPA is entitled to only about 6% of the resources of the whole cluster, while VM1, for instance, is entitled to 25%. If you had 20 virtual machines in RPOOL1, each would be entitled to just 1.25% of the cluster's resources, 20 times less than VM1.
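The share arithmetic above can be sketched in a few lines of Python (the pool and VM names are just the ones from the example; this is an illustration of equal-share division, not the real DRS entitlement algorithm):

```python
# Illustration of the example above: a resource pool and three sibling VMs,
# all with equal shares at the cluster root, no reservations or limits.

def split_equally(total_pct, children):
    """Divide a parent's entitlement equally among its children."""
    return {child: total_pct / len(children) for child in children}

# Four equal-share siblings at the root: the pool and VM1..VM3 get 25% each.
root = split_equally(100.0, ["RPOOL1", "VM1", "VM2", "VM3"])

# The pool's 25% is split again among the four VMs inside it.
inside = split_equally(root["RPOOL1"], ["VMRPA", "VMRPB", "VMRPC", "VMRPD"])

print(inside["VMRPA"])     # 6.25 -> ~6% of the cluster for a pool VM
print(root["VM1"])         # 25.0 -> 25% of the cluster for a sibling VM
print(root["RPOOL1"] / 20) # 1.25 -> with 20 VMs in the pool, 20x less than VM1
```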

   I hope these clues help you to troubleshoot the problem.

Best,

---- VCP4 VCP5 VCAP5-DCA / CCNA CCNP CCIP / MCTS MCITP
cmcrci
Contributor

Thanks for your reply, it's very helpful. But my question is why the migrations are happening so frequently. I can see that the ESX hosts have almost half of their resources available, and I am sure there is no resource contention as far as the hosts are concerned. Is there any other way to check ESX performance? If I am missing something, please guide me.

If any other information is needed please let me know.

Thanks for your support.

depping
Leadership

You will need to check cluster imbalance. If the cluster is imbalanced, for whatever reason, DRS will try to balance it. You can find some details about this procedure here:

http://www.yellow-bricks.com/drs-deepdive/
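As a rough sketch of the kind of imbalance metric that article describes, DRS compares the standard deviation of each host's normalized load against a target threshold. The host numbers below are invented for illustration; real DRS uses its own entitlement calculation:

```python
# Toy imbalance calculation: normalized load per host = sum of VM
# entitlements on the host divided by host capacity, then take the
# standard deviation across hosts. All MHz figures here are made up.
import statistics

hosts = [
    (12000, 19992),  # (sum of VM entitlements in MHz, host capacity in MHz)
    (6000, 19992),
    (15000, 19992),
]

loads = [entitled / capacity for entitled, capacity in hosts]
imbalance = statistics.pstdev(loads)  # DRS migrates VMs while this exceeds its target

print([round(l, 2) for l in loads], round(imbalance, 3))
```

A perfectly balanced cluster would give a standard deviation of 0; the further the hosts drift apart, the more aggressively DRS recommends migrations.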

cmcrci
Contributor

But my ESX version is 3.5 and vCenter is 2.5, and the article you suggested is for 4.1. Please help me figure out how to calculate the load. Using esxtop, I see swapping and memctl on 2 to 3 of the 6 hosts in the cluster. This does not show up in the ESX resource usage in the GUI, but swapping and ballooning do show in the performance graphs.

In 3.5 we can't see the standard deviation and cluster balancing. How can I check this through the command line or another tool? Please suggest.

Thanks in advance.

nightnicon
Enthusiast

You can still see the DRS distribution in VC 2.5 and ESX 3.5.

Click on the cluster and then the Summary tab: you will see a VMware DRS Resource Distribution chart.

This chart will give you an idea of whether the cluster is able to provide the resources the VMs are requesting, and it is a good starting point.

cmcrci
Contributor

In the esxtop output the memory state is high, but I can still see ballooning and swapping on 2 to 3 ESX hosts. I am confused: how is this possible when the resources are not overcommitted? How do I find the exact issue?

up 490 days 13:52, 136 worlds; MEM overcommit avg: 0.00, 0.00, 0.00
PMEM  /MB: 64510   total:   800     cos,   584 vmk,   32987 other,  30138 free
VMKMEM/MB: 62961 managed:  3777 minfree,  8625 rsvd,  54106 ursvd, high state
COSMEM/MB:   133    free:  1600  swap_t,  1249 swap_f:   0.00 r/s,   0.00 w/s
PSHARE/MB: 24524  shared,  4975  common: 19549 saving
SWAP  /MB:    42    curr,    28  target:                 0.00 r/s,   0.00 w/s
MEMCTL/MB:  1014    curr,  1014  target, 32882 max

Please suggest.

Thanks in advance.

barcex
Enthusiast

Hi

I do not see a big problem in your esxtop output. (The quoted paragraph below comes from http://communities.vmware.com/docs/DOC-9279.)

"state" : the free memory state. Possible values are  high, soft, hard and low. The memory "state" is "high", if the free  memory is greater than or equal to 6% of "total" - "cos". If is "soft"  at 4%, "hard" at 2%, and "low" at 1%. So, high implies that the machine  memory is not under any pressure and low implies that the machine memory  is under pressure.

Your state is high; that is good.

The swap value is negligible: 42 MB out of a total of 64 GB, about 0.07% of total memory.

The balloon value is not high at all, about 1.6% of total memory. Unless the entire balloon figure comes from a single virtual machine, this should not be a real problem. It could be that you have set memory limits on a resource pool or virtual machine (which is unlikely), but I'm just guessing. In any case, it does not seem to be bad.
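As a quick sanity check, here is the arithmetic above as a throwaway Python sketch, plugging in the MB figures from the posted esxtop output:

```python
# Values in MB, taken from the PMEM, SWAP, and MEMCTL lines of the
# esxtop output posted earlier in the thread.
total = 64510        # PMEM total
cos = 800            # console OS memory
swap_curr = 42       # SWAP curr
balloon_curr = 1014  # MEMCTL curr (balloon)

swap_pct = 100 * swap_curr / total
balloon_pct = 100 * balloon_curr / total
print(round(swap_pct, 2), round(balloon_pct, 2))  # tiny fractions of total RAM

# The "high" state threshold quoted above: free memory >= 6% of (total - cos).
high_threshold_mb = 0.06 * (total - cos)
print(round(high_threshold_mb))  # free MB needed to stay in the "high" state
```

With roughly 30 GB free against a threshold of under 4 GB, the host sits comfortably in the "high" state, which matches the conclusion above.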

Best,

---- VCP4 VCP5 VCAP5-DCA / CCNA CCNP CCIP / MCTS MCITP
vGuy
Expert

Do you have any users complaining about performance issues? You may want to drill down to the VM(s) causing these spikes (using esxtop's VM view) and look at their configuration for current usage, limits, and so on.

On a side note, good to see savings of up to 19 GB from TPS. :)

cmcrci
Contributor

Yes, production servers are hosted there, and many of the VMs respond slowly from time to time. When I open an RDP session there is lag logging in to the server; sometimes I cannot log in at all and the session gets stuck. Please help me out. We have 75 VMs, and so far approximately 20,000 DRS migrations have happened. Fingers crossed; I cannot figure out why this is happening.

Thanks in advance.

vGuy
Expert

Can you post the output of esxtop's VM view?

vGuy
Expert

Below are my observations after reviewing your esxtop screenshots:

Host 1:
CPU: no apparent issues; %RDY looks busy but not alarming.
Mem: ballooning and swapping on the second-to-last VM, but SWR/s and SWW/s are 0.0 and the host is in the high state, which means it may have swapped/ballooned at some point due to overcommitment but is not doing so anymore.

Host 2:
CPU: high CPU ready time on the second-to-last VM; see if you can reduce its vCPU count.
Mem: no apparent issues.

Host 3:
CPU: again, high CPU ready time on a couple of VMs (30.17 and 14.66); try reducing the number of vCPUs.
Mem: no apparent issues.

Host 4:
CPU: no apparent issues; %RDY looks busy but not alarming.
Mem: no apparent issues.

Host 5:
CPU: no apparent issues; %RDY looks busy but not alarming.
Mem: MCTL is "N" on the second-to-last VM; ensure VMware Tools is installed and running. Other than that, no apparent issues.

Host 6:
CPU: again, high CPU ready time on a couple of VMs (11.58, 21.05 and 11.52); try reducing the number of vCPUs, or migrate the VMs to a host with more cores to cater to the workload.
Mem: ballooning and swapping on the first VM, but SWR/s and SWW/s are 0.0 and the host is in the high state, so it may have swapped/ballooned at some point but is not doing so anymore. Also, MCTL is "N" on the third-to-last VM; ensure VMware Tools is installed and running.

In addition:

--> Ensure you are running the latest patch and update level; I remember someone reporting similar issues on 3.5 U4 that were fixed after updating to U5.

--> Ensure all the VMs are running up-to-date VMware Tools.

--> Try disabling and re-enabling DRS on the cluster as well.

--> Ensure no memory limits are configured on the VMs.

--> Just in case, have a look at the storage counters for any possible latency issues.

HTH, and let us know how it goes!

cmcrci
Contributor

Could you please guide me on how to obtain storage latency values? We are using NFS storage, and in the logs we sometimes see file-locking and connection-reset messages.

Thanks in advance

vGuy
Expert

I am not sure whether there are NFS-specific esxtop stats in ESX 3.5; you may check your network stats for any dropped packets.

If not already implemented, try to separate your NFS traffic from other traffic types, ensure your guest OS filesystems are aligned, and so on. I am attaching NetApp's NFS best-practices guide for VI3, which can help you verify your config on the ESX side; some of the array recommendations should still be valid for non-NetApp arrays.


depping
Leadership

NFS esxtop metrics were not introduced until vSphere 4.1.

cmcrci
Contributor

FYI, the "N" entries in the esxtop memory view are Citrix servers that have memctl disabled intentionally. The other two servers showing ballooning are configured with 4 GB of RAM but have a limit applied at 2 GB; maybe that is why the ballooning and swapping are going on.

So there are no issues with resources. But again my question: if there is no issue, why are so many migrations happening (19,000)?

Thanks in advance.

barcex
Enthusiast

A couple of things:

- Disabling ballooning (memctl) is a very bad idea. Instead of ballooning you can now get swapping, which is much worse. If you want to guarantee a certain amount of physical memory to a VM while avoiding ballooning, use memory reservations.

- Setting memory limits is also an awful idea; never do it other than for testing purposes. If you want your VM limited to 2 GB, just change its configured memory from 4 GB to 2 GB.

Best,

---- VCP4 VCP5 VCAP5-DCA / CCNA CCNP CCIP / MCTS MCITP
cmcrci
Contributor

@barcex: Ballooning is not disabled for any particular (memory) reason. In general it is said not to be good practice to enable memctl on terminal servers and database servers.

http://www.slideshare.net/vmug/vmware-performance-tuning-by-virtera-jan-2009-presentation#btnNext

It is mentioned in slide 26. Regarding memory limits, I agree with you, and I am discussing it with my team lead.

I have just over a year of experience with VMware, so please correct me if I am wrong.

I am very thankful to everyone who is helping.

Regards

barcex
Enthusiast

Hi

I'm sorry to tell you that slide #26 is wrong. It talks about the sync driver (vmsync) but mentions the balloon driver (vmmemctl). They are different things, and it is not hard to find people suggesting that the sync driver be turned off for databases.

However, you should never turn off the balloon driver (vmmemctl). It is one of the memory reclamation techniques, and the smartest one at that. In periods of memory contention your virtual machine is entitled to a certain amount of memory; all the memory above that entitlement may be reclaimed by the host, either by ballooning or by swapping.

If memory has to be reclaimed from a VM, you want it reclaimed by the balloon driver, because the balloon driver interacts with the guest OS and reclaims free or not-recently-used memory pages. If you turn the balloon driver off, that memory will be reclaimed by swapping instead, and since swapping does not cooperate with the guest OS it will pick random memory pages, potentially swapping out very important non-free pages. Swapping is therefore always worse than ballooning; there are benchmarks out there that clearly show this.

If you want to guarantee that a certain amount of memory is always available to a VM, set a memory reservation; if you don't need that guarantee, do not use reservations. But never disable the balloon driver, for the reasons above.
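A toy model of that preference order: balloon first, and swap only what ballooning cannot cover. The 65% per-VM balloon ceiling below is the commonly cited default for sched.mem.maxmemctl and is an assumption for this sketch; disabling the balloon driver is modeled as a ceiling of 0:

```python
# Toy model of memory reclamation: balloon up to a per-VM ceiling,
# then swap the remainder. Not the actual ESX reclamation code; the
# 65% ceiling (sched.mem.maxmemctl default) is an assumption.

def reclaim(needed_mb, configured_mb, balloon_ceiling_frac=0.65):
    balloon_max = configured_mb * balloon_ceiling_frac
    ballooned = min(needed_mb, balloon_max)
    swapped = needed_mb - ballooned
    return ballooned, swapped

# Balloon driver on: a 1 GB reclaim from a 4 GB VM is covered entirely
# by ballooning, so nothing is swapped.
print(reclaim(1024, 4096))

# Balloon driver off (ceiling 0): the full 1 GB falls through to swap,
# which picks pages without the guest OS's cooperation.
print(reclaim(1024, 4096, 0.0))
```

The point of the model: disabling the balloon driver does not reduce memory pressure, it just forces the hypervisor to meet the same reclaim target with the blunter mechanism.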

Best,

---- VCP4 VCP5 VCAP5-DCA / CCNA CCNP CCIP / MCTS MCITP
depping
Leadership

cmcrci wrote:

@barcex: Ballooning is not disabled for any particular (memory) reason. In general it is said not to be good practice to enable memctl on terminal servers and database servers.

http://www.slideshare.net/vmug/vmware-performance-tuning-by-virtera-jan-2009-presentation#btnNext

It is mentioned in slide 26.

I think the person who created that slide didn't really understand how ESX(i) memory management works. I suggest you read the following blog article, as it gives a far better, and accurate, explanation:

http://frankdenneman.nl/memory/disable-ballooning/

PS: Frank is a VMware employee on the Technical Marketing team, responsible for resource management, so you can count on this information being spot on.
