This is beginning to sound more like an inter-departmental issue than a technical one. Since the information is somewhat sketchy, my answer will be too.
If you have 'a couple of hundred VMs' on 5 hosts (roughly 40 per host), maybe there are too many VMs on a single host, so that one host going wrong takes down too many of them at once. That's probably not something you can change yourself, but you could advise a different configuration (more hosts, grouping the devs together on a single host, limiting them, etc.).
You could ask for monitoring privileges on your own VMs and monitor them yourself, which would make it easier to give advice. VMware has excellent graphing and reporting for that. You could also ask to be notified when a VM's CPU or memory usage hits a certain threshold (this is an automated process that is easy to enable).
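To make the threshold idea concrete, here is a minimal sketch in plain Python. It is not VMware's alarm mechanism and calls no VMware API; the `VmSample` type, the VM names, the readings and the 90% limits are all made up for illustration. It only shows the kind of check such a notification performs on usage figures you have already collected.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """One usage reading for a VM (hypothetical structure, not a VMware type)."""
    name: str
    cpu_pct: float
    mem_pct: float


def find_breaches(samples, cpu_limit=90.0, mem_limit=90.0):
    """Return (vm_name, metric, value) for every reading at or above its limit."""
    breaches = []
    for s in samples:
        if s.cpu_pct >= cpu_limit:
            breaches.append((s.name, "cpu", s.cpu_pct))
        if s.mem_pct >= mem_limit:
            breaches.append((s.name, "mem", s.mem_pct))
    return breaches


# Made-up readings; in practice these would come from your monitoring export.
samples = [
    VmSample("dev-01", 35.0, 60.0),
    VmSample("dev-02", 97.5, 72.0),
    VmSample("dev-03", 50.0, 93.0),
]

for name, metric, value in find_breaches(samples):
    print(f"ALERT {name}: {metric} at {value:.1f}%")
```

With a list like that in hand, it is much easier to show which VMs were busy around the time of an outage.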
You could ask your execs to have the issue of one VM causing all VMs to crash (?) thoroughly investigated. If you have too few privileges to do that yourself, your vendor (who is clearly in control of the entire cluster) should provide an answer to such a strange (and destructive) event.
On best practices, simply google the words: vmware best practices cluster
... and you'll get some excellent resources, some from VMware themselves. It would be a waste of forum space to repeat them here, so read those first.
On resource pools, here's a good read about them, and how they should be maintained:
When Bad Resource Pools Happen to Good People - Journey to the Cloud
Looks like you've been given a task to complete with one hand tied behind your back, and partly in the dark. I don't envy you.