VMware Cloud Community
mariodez
Contributor

Best practices for a large cluster

I am in the process of making a recommendation to a service provider that works for us on how our VMware environment can be better managed to protect high-value application servers. The environment is a single cluster of 5 hosts running a couple hundred VMs. Right now HA is enabled, but there is no differentiation in which servers are prioritized, and no grouping to show environments or even to sort by application. We have had one instance where a rogue development server overconsumed resources on a host and took down all VMs on that host, including production servers. My concern is that this will happen again and that I need to proactively recommend ways to protect our servers. So far I have recommended that they consider grouping VMs into resource pools by application and then by environment, so that priority and reservations can be set and organized efficiently. Is this a good approach, or are there other best practices to consider?

7 Replies
Anjani_Kumar
Commander

I believe that if I were in your place, I would also prioritize the VMs by application using resource pools.

This is the best first step in configuring the VM infrastructure.

Please consider marking this answer "correct" or "helpful" if you found it useful. Anjani Kumar | VMware vExpert 2014-2015-2016 | Infrastructure Specialist Twitter : @anjaniyadav85 Website : http://www.Vmwareminds.com
JPM300
Commander

I think resource pools are a good start, as you can set limits and reservations. Just make sure that when you set up resource pools for the dev VMs, or for the VMs whose runaway processes crashed the system last time, you don't leave the reservation expandable; otherwise the pool will always pull resources from its parent pool if it can.
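To illustrate the borrowing behavior described above (vSphere calls the pool setting "Expandable Reservation"), here is a rough Python sketch. The `Pool` class, the MHz figures, and the `demo` helper are all hypothetical and only model the idea of a child pool drawing on its parent; this is not how vCenter implements admission control.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Pool:
    """Hypothetical model of a resource pool's CPU reservation, in MHz."""
    name: str
    reservation_mhz: int
    expandable: bool
    parent: Optional["Pool"] = None
    used_mhz: int = 0  # reservation already handed out to children

    def unreserved(self) -> int:
        return self.reservation_mhz - self.used_mhz

    def try_reserve(self, mhz: int) -> bool:
        """Reserve mhz from this pool, borrowing from the parent if allowed."""
        local = min(mhz, self.unreserved())
        shortfall = mhz - local
        if shortfall > 0:
            # Only an expandable pool may ask its parent for the remainder.
            if not (self.expandable and self.parent is not None):
                return False
            if not self.parent.try_reserve(shortfall):
                return False
        self.used_mhz += local
        return True


def demo(expandable: bool) -> Tuple[bool, int]:
    """Power on a dev VM reserving 3000 MHz in a dev pool that reserves 2000."""
    cluster = Pool("cluster", 10_000, expandable=False)
    cluster.try_reserve(2_000)  # the dev pool's own reservation
    dev = Pool("dev", 2_000, expandable=expandable, parent=cluster)
    admitted = dev.try_reserve(3_000)
    return admitted, cluster.unreserved()
```

With a fixed reservation, `demo(False)` returns `(False, 8000)`: the oversized dev VM is refused and the rest of the cluster is untouched. With an expandable one, `demo(True)` returns `(True, 7000)`: the VM is admitted by taking 1000 MHz that production could otherwise have reserved.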

What type of resource pools are you thinking of? Can you give us a breakdown?

Another option, if you want, is to group servers into vApps and limit resources that way. Just another option to look at.

kluken
Contributor

I would be interested in how one app took down the VMs and host. Do you have DRS and SIOC enabled? More details on how your hosts are configured (NICs, VLANs, storage and such) might be helpful. I have had guests that drove the host CPU to 100% without impacting anything else, because DRS, HBA tuning, etc. prevented that from happening. In the one cluster where we do mix DEV and PROD, we set the CPU and RAM shares to Low for DEV boxes, High for business-critical VMs, and Normal for everything else. I am not a fan of resource pools, as they create as many problems as they fix.
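For anyone unsure what Low/Normal/High shares actually do, here is a small Python sketch of the arithmetic under full contention. The per-vCPU values (500/1000/2000) are the vSphere presets for VM CPU shares; the VM names, vCPU counts, and the 28,000 MHz host are made up for illustration, and the model ignores reservations, limits, and idle VMs.

```python
# Per-vCPU CPU share values behind the Low / Normal / High presets in vSphere.
SHARES_PER_VCPU = {"low": 500, "normal": 1000, "high": 2000}


def cpu_split(vms, host_mhz):
    """Split host_mhz among VMs in proportion to their shares.

    vms maps a VM name to (share_level, vcpu_count). This only models the
    fully contended case where every VM wants more CPU than it can get.
    """
    weights = {
        name: SHARES_PER_VCPU[level] * vcpus
        for name, (level, vcpus) in vms.items()
    }
    total = sum(weights.values())
    return {name: host_mhz * w / total for name, w in weights.items()}


# Hypothetical VMs: one dev box, one ordinary app server, one critical database.
example = cpu_split(
    {"dev-01": ("low", 2), "app-01": ("normal", 2), "sql-01": ("high", 2)},
    host_mhz=28_000,
)
```

Here `example` works out to 4000 MHz for `dev-01`, 8000 MHz for `app-01`, and 16000 MHz for `sql-01`. Shares only matter when the host is contended; with spare capacity, every VM gets what it asks for.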

Jeroenix
Contributor

I would also restrict the memory/processor/IOPS usage of the development machines by using resource pools. What you also might want to look at are "VM rules". You could configure things so that certain groups of VMs run only on certain designated hosts, thereby forcing all the dev/test machines together on a certain host and reducing the risk to production.

To make your production machines more robust, you could use those same VM rules to spread certain cluster-aware application servers across the ESXi cluster, so that, for example, your SQL mirrors or Domain Controllers won't be on one physical machine. This can of course only be applied if you have such servers.

VM rules are found under DRS in the cluster properties.
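As a rough picture of what those two kinds of rules enforce, here is a Python sketch that checks a placement against them. The function, the data layout, and the VM/host names are all hypothetical; DRS evaluates its rules internally and this does not call any VMware API.

```python
from collections import defaultdict


def rule_violations(placement, anti_affinity, vm_to_hosts):
    """Return human-readable rule violations for a VM -> host placement.

    placement:      {"vm": "host"}
    anti_affinity:  list of VM groups that must not share a host
    vm_to_hosts:    {"vm": {allowed hosts}} for VMs pinned to designated hosts
    """
    problems = []
    for group in anti_affinity:
        by_host = defaultdict(list)
        for vm in group:
            by_host[placement[vm]].append(vm)
        for host, vms in sorted(by_host.items()):
            if len(vms) > 1:
                problems.append(f"{', '.join(sorted(vms))} share {host}")
    for vm, allowed in sorted(vm_to_hosts.items()):
        if placement[vm] not in allowed:
            problems.append(f"{vm} is on {placement[vm]}, outside its host group")
    return problems


# Hypothetical layout: two domain controllers and two dev VMs on a 5-host cluster.
placement = {"dc-01": "esx1", "dc-02": "esx1", "dev-01": "esx3", "dev-02": "esx5"}
problems = rule_violations(
    placement,
    anti_affinity=[["dc-01", "dc-02"]],
    vm_to_hosts={"dev-01": {"esx5"}, "dev-02": {"esx5"}},
)
```

In this example `problems` lists two issues: both domain controllers sit on `esx1`, and `dev-01` is running outside the host set aside for dev machines.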

mariodez
Contributor

Here is a little more background on the issues we are facing. I have limited visibility into the vCenter, as it is hosted by another vendor, and they have only given us limited walk-throughs to calm concerns, which only ended up raising additional concerns. Due to the limited visibility, I don't have any of the specific configuration information at this time. What I can tell you is that our company is split into three IT departments for three major but separate business functions. In vCenter, all application servers across the enterprise are lumped into one cluster. As a result, we do not have information on the other departments' VMs or what functions/environments they serve. We do know our own department's VMs inside and out, along with their importance and the resources they require. I have been tasked by the executives on my side of the business with making sure that our environment is protected, and I need to find out what the best practice is for doing that. My initial thought was to recommend that the vendor create a container at the top level called "My Department"; within that pool we would create resource pools for Prod, Test, and Dev and assign the appropriate controls and priorities to each pool. I saw in one reply that this may not be the best approach, as resource pools can hurt just as much as help. Can you give me some of the pros and cons of this? I am sorry things are kind of vague, but I have limited information to go on myself. As far as I know, DRS and SIOC are not enabled. With our limited visibility and the vendor's lack of transparency, we were unable to tell which server caused the outage and why. All we can do is look to protect our VMs as best we can. Any best practices documentation that can be directed my way for this type of scenario would be much appreciated.

Jeroenix
Contributor

This is beginning to sound more like an inter-departmental issue than a technical one. Since the information is somewhat sketchy, my answer will be too :)

If you have 'a couple of hundred VMs' on 5 hosts, maybe there are too many VMs on a single host, so that too much goes down when something goes wrong. Probably not something you can do anything about, but you could advise a different configuration (more hosts, forcing the dev machines together on a single host, limiting them, etc.).
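To put rough numbers on that, here is a quick Python sketch. The figure of 200 VMs is only an assumption standing in for "a couple hundred", and the helper name is made up; it spreads VMs evenly and ignores their actual sizes.

```python
import math


def vms_per_host(total_vms, hosts, failed_hosts=0):
    """Average VMs per surviving host, rounded up."""
    surviving = hosts - failed_hosts
    if surviving <= 0:
        raise ValueError("no hosts left to run VMs")
    return math.ceil(total_vms / surviving)


normal = vms_per_host(200, 5)        # all five hosts healthy
degraded = vms_per_host(200, 5, 1)   # one host down, HA restarts its VMs elsewhere
```

That gives about 40 VMs per host normally and about 50 per host after a single host failure, so one bad host affects roughly a fifth of the environment and pushes a noticeable extra load onto the rest.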

You could ask for monitoring privileges on your own VMs and monitor them yourself, making it easier to advise. VMware has excellent graphing and reporting for that. You could ask to be notified when a VM's CPU or memory usage hits a certain threshold (this is an automated process that is easy to enable).

You could ask your execs to have the issue of a VM causing all VMs to crash (?) thoroughly investigated. If you have too few privileges to do that, your vendor (who is clearly in control of the entire cluster) should provide an answer to such a strange (and destructive) event.

On best practices, simply google the words: vmware best practices cluster

.. and you'll get some excellent resources, some from VMware themselves. It would kind of waste forum space if you don't read those first.

On resource pools, here's a good read about them, and how they should be maintained:

When Bad Resource Pools Happen to Good People - Journey to the Cloud
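One commonly described pitfall with resource pools (it may or may not be the exact one the article above covers) is that shares set on a pool are divided among all the VMs inside it. A small Python sketch of the arithmetic, with assumed pool-level CPU share presets and made-up VM counts:

```python
# Assumed CPU share values for the pool-level Low / Normal / High presets.
POOL_SHARES = {"low": 2000, "normal": 4000, "high": 8000}


def per_vm_fraction(pools):
    """Fraction of contended cluster CPU each VM gets, per pool.

    pools maps a pool name to (share_level, vm_count). Sibling pools split
    capacity by their shares; VMs inside a pool are assumed to split the
    pool's portion evenly.
    """
    total = sum(POOL_SHARES[level] for level, _ in pools.values())
    return {
        name: POOL_SHARES[level] / total / count
        for name, (level, count) in pools.items()
    }


# Hypothetical split: a large Prod pool set to High, a small Dev pool set to Low.
fractions = per_vm_fraction({"Prod": ("high", 80), "Dev": ("low", 10)})
```

In this example each Prod VM ends up with about 1% of the contended CPU while each Dev VM gets about 2%, even though Prod is the "High" pool. Pool shares therefore have to be sized with the number of VMs in each pool in mind, and revisited whenever VMs are added or moved.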

Looks like you've been given a task to complete with one hand tied behind your back, and partly in the dark. I don't envy you. :)

vfk
Expert

It is quite important that you get access to the environment, even read-only privileges, if you are to make progress with this. It is quite difficult to make recommendations without insight into the environment. If you have a dedicated vCenter for your setup, then this request should not be a problem if you go about it the right way. But if you are sharing vCenter with other customers, which is common practice in hosted environments, then I can see why the vendor might be a little reluctant. Also consider getting the other IT departments involved to resolve the issues.

--- If you found this or any other answer helpful, please consider the use of the Helpful or Correct buttons to award points. vfk Systems Manager / Technical Architect VCP5-DCV, VCAP5-DCA, vExpert, ITILv3, CCNA, MCP