We're currently seeing some strange performance issues with 2 of the nodes in our 8-node cluster. I think the weirdness might be related to a processor incompatibility, but the cluster is running on older hardware that doesn't support EVC. We have 6 nodes with Intel Xeon 7040s and 2 nodes with 7140s.
There have been instances where a VM will vMotion onto the 7140 machines and performance will drop so much that the VM basically stops responding, even if the physical node load is low. The only way to bring the VM back to life is to vMotion it back onto the 7040 machines. The strange thing is that anything already running on the 7140s is unaffected, and it doesn't seem to affect all of our VMs.
In the short term, while we troubleshoot this issue, we want to exclude the two 7140 machines from participating in DRS decisions. However, I can't find any documentation that says whether this is possible.
It's likely to be only a short-term problem. We are investigating cycling out the troublesome hardware as it's pretty old now, but we also need to keep our critical apps up until we can get this done.
I may be able to achieve the desired outcome by using DRS groups and creating rules to ensure our critical apps remain on the 7040s and aren't allowed to vMotion onto the 7140 machines. Just wondering if anyone else has any other suggestions?
Welcome to the VMware Communities forums.
Are you still wanting to run VMs on the problem hosts for testing purposes? If not, you can just place the hosts in Maintenance mode and they won't be able to run any VMs. If you need to run VMs for testing, you could use a DRS VM-Host rule of the "Should not run on hosts in group" type. Alternatively, if you have a simple VM port group setup, you could delete those port groups on the problem hosts.
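To make the VM-Host rule idea a bit more concrete, here is a minimal sketch in plain Python that only models the logic; it does not talk to vCenter. The host names (esx01 to esx08), VM names, group names, and the VmHostRule class are all made up for illustration. The mandatory flag stands in for the difference between a "should not" (soft) and a "must not" (hard) rule.

```python
from dataclasses import dataclass


@dataclass
class VmHostRule:
    """Hypothetical stand-in for a DRS VM-Host anti-affinity rule."""
    name: str
    vm_group: str
    anti_affine_host_group: str
    mandatory: bool = False  # False models "should not", True models "must not"
    enabled: bool = True


def group_hosts_by_cpu(hosts):
    """Map each CPU model to the sorted list of host names that have it."""
    groups = {}
    for host, cpu in hosts.items():
        groups.setdefault(cpu, []).append(host)
    return {cpu: sorted(names) for cpu, names in groups.items()}


def allowed_hosts(vm, hosts, vm_groups, host_groups, rule):
    """Hosts the rule would steer this VM towards.

    VMs outside the rule's VM group are not restricted at all.
    """
    if not rule.enabled or vm not in vm_groups.get(rule.vm_group, []):
        return sorted(hosts)
    excluded = set(host_groups.get(rule.anti_affine_host_group, []))
    return sorted(h for h in hosts if h not in excluded)


# Assumed inventory: six 7040 hosts and two 7140 hosts, as in the question.
hosts = {f"esx{i:02d}": "Xeon 7040" for i in range(1, 7)}
hosts.update({"esx07": "Xeon 7140", "esx08": "Xeon 7140"})

host_groups = {"hosts-7140": group_hosts_by_cpu(hosts)["Xeon 7140"]}
vm_groups = {"critical-apps": ["exchange01", "ocs01", "sharepoint01"]}
rule = VmHostRule("critical-off-7140", "critical-apps", "hosts-7140")

print(allowed_hosts("exchange01", hosts, vm_groups, host_groups, rule))
print(allowed_hosts("test-vm01", hosts, vm_groups, host_groups, rule))
```

With a soft rule, DRS treats this as a preference rather than a guarantee, so the sketch shows where the critical VMs would be steered, not a hard limit. Test VMs outside the group can still land on any of the 8 hosts.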
Yeah, we still want to run VMs on there. Our cluster doesn't have the capacity to drop the 2 nodes that aren't playing nice. The problem only seems to arise when vMotion drops a machine onto these nodes, and even then we can't consistently replicate the problem.
My plan was to keep our critical apps like Exchange, OCS, SharePoint etc. running on the 6 nodes that we know are good. If a test machine happens to bomb out after a vMotion, then I'm not too bothered. Plus, anything already running on the troublesome nodes seems fine; it only seems to be a vMotion event that triggers it.
Thanks for the tip, I'll try the DRS rule to see if we can get around the problem.
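Since a "should not" rule is only a preference, it may be worth checking now and then that none of the critical apps have ended up on the 7140 nodes anyway. A rough sketch of that check in plain Python, assuming you have already exported the current VM-to-host placement by some other means; the host and VM names here are hypothetical:

```python
PROBLEM_CPU = "Xeon 7140"  # CPU model of the two troublesome nodes


def misplaced_critical_vms(placement, host_cpu, critical_vms):
    """Return sorted (vm, host) pairs for critical VMs on problem hosts."""
    return sorted(
        (vm, host)
        for vm, host in placement.items()
        if vm in critical_vms and host_cpu.get(host) == PROBLEM_CPU
    )


# Assumed sample data, not taken from a real cluster.
host_cpu = {"esx01": "Xeon 7040", "esx07": "Xeon 7140"}
placement = {
    "exchange01": "esx01",
    "sharepoint01": "esx07",
    "test-vm01": "esx07",
}
critical_vms = {"exchange01", "ocs01", "sharepoint01"}

for vm, host in misplaced_critical_vms(placement, host_cpu, critical_vms):
    print(f"{vm} is on {host}; consider moving it back to a 7040 node")
```

Test machines on the 7140 nodes are ignored on purpose, matching the plan of only protecting the critical apps.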