VMware Cloud Community
lvaibhavt
Hot Shot

Check for risks in VMware environment

Hi All,


My manager has asked me to check for risks that may exist in our VMware environment.

Can you please suggest what to check and how to proceed?




Thanks

Wh33ly
Hot Shot

Check for incompetent users/colleagues :)

Most incidents are caused by human interaction (changes, implementations, sizing, etc.).

When you don't make many changes at the infrastructure level, it mostly keeps running for a while, unless hardware breaks, power is interrupted, or a third-party backbone crashes.

But I mainly see things go wrong when things change (sometimes several things change at once without teams knowing the impact on others). These can cause serious headaches before you figure out what happened, and at the end of the line it wasn't even your fault but the storage/network/hardware team's fault :)

Make sure you have good connections with the teams and people who can disrupt your environment; this gives you a heads-up when things are changing or changes are being planned. You can then do your own impact analysis, and when things go wrong you know where to look instead of starting somewhere and working through your whole troubleshooting list.

Some other things on my mind which can be risks...

- Backup & recovery / failover plan (in place and tested)

- Redundant network/storage/power connections

- Redundant hosts/network/storage etc. (try to keep the environment as clean and simple as possible: use a good naming convention for ESX hosts, network, and storage; keep them patched; prefer clusters with the same hardware and the same installation images, etc. Avoid making a mess.)

- Performance check for over-/under-subscribed VMs: try to size them right for their usage, and size the rest of your clusters/hosts to fit your VM environment. A lot of performance issues come from this part. Also check HA capacity: if a host crashes, can you still guarantee performance, or isn't that necessary?

- Hardware contracts? What if hardware breaks? Can you call your vendor, or do you have to go to the nearest hardware shop to get some parts? Pros/cons vs. money.

- Capacity planning for your own environment, with a side note: sometimes our storage guys tell us they're almost out of storage space (capacity planning? right...) and they need to order a few storage boxes, which costs money and takes delivery/installation time. Make sure you get informed early enough that you can anticipate this kind of thing. It's not your fault, until you run out of datastore space and customers come complaining that they can't extend disks or deploy servers any more. Then it's your fault (from the customer's view)! So I try to keep some storage in my pocket (Gnagna), so that in an emergency I can pop it in somewhere or use it as leverage...

- Check SLAs or customer expectations. If you can recover a site in 2 days but the customer SLA says it has to be done in 4 hours, you have a problem.
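
A few of the points above (HA capacity, storage runway, SLA vs. actual recovery time) can be turned into quick number checks. The following is only a rough Python sketch, not a VMware tool: the `Host` class, the function names, and all figures are made up for illustration, it only looks at memory, and it assumes linear datastore growth. You would feed it numbers from your own inventory and monitoring.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical record for one ESX host; not a VMware API object."""
    name: str
    memory_gb: float  # physical RAM in the host


def survives_host_failure(hosts, vm_memory_demand_gb):
    """N+1 check: can the remaining hosts carry the VM memory demand
    if the largest host in the cluster fails?"""
    if len(hosts) < 2:
        return False  # a single host has no failover target
    largest = max(h.memory_gb for h in hosts)
    remaining = sum(h.memory_gb for h in hosts) - largest
    return remaining >= vm_memory_demand_gb


def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Linear estimate of how many days of datastore space are left."""
    if growth_gb_per_day <= 0:
        return float("inf")  # not growing, so no deadline
    return max(capacity_gb - used_gb, 0) / growth_gb_per_day


def meets_sla(tested_recovery_hours, sla_hours):
    """Compare the recovery time you actually measured with the SLA target."""
    return tested_recovery_hours <= sla_hours


if __name__ == "__main__":
    # Example figures are invented for illustration only.
    cluster = [Host("esx01", 256), Host("esx02", 256), Host("esx03", 256)]
    print("Survives one host failure:", survives_host_failure(cluster, 600))
    print("Days until datastore is full:", days_until_full(10000, 8500, 25))
    print("Meets SLA:", meets_sla(tested_recovery_hours=48, sla_hours=4))
```

If the runway in days is shorter than the time it takes your storage team to order and install new boxes, that is a risk worth raising now rather than when the datastores are full.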

As you can see, these are not only technical risks. Like I noted before, if you don't change much, it's stable as hell.

So I hope this gets you going... somewhere... with fewer risks...

markdjones82
Expert

You can also run something like the vCheck script, which checks for a lot of best practices:

http://www.virtu-al.net/vcheck-pluginsheaders/vcheck/

@markdjones82 | http://nutzandbolts.wordpress.com
