Which CX array are you using?
Can you also check the logs in /var/log/messages and see if you are getting a large number of NMP lost-sensor errors?
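A rough way to do that check is sketched below. The real log only exists on the ESXi host, so this sketch builds a small sample log in /tmp so the commands can be tried anywhere; on the host you would point it at /var/log/messages instead.

```shell
# Sketch: scan the host's log for NMP path messages.
# On a real ESXi host the file is /var/log/messages; here we create a
# small sample log in /tmp so the commands are runnable anywhere.
LOG=/tmp/messages.sample
cat > "$LOG" <<'EOF'
vmkernel: cpu1: NMP: nmp_DeviceUpdatePathStates: path vmhba1:C0:T0:L3 changed state
vmkernel: cpu2: NMP: path vmhba1:C0:T0:L3 is down
vmkernel: cpu0: ScsiDevice: normal I/O completed
EOF
# Count NMP-related lines; a large count here points at path problems.
grep -c 'NMP' "$LOG"
```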
We have a CX3 and a CX4.
I'm having to wait for another testing window before I can check the logs again.
Are you using ESXi or ESX? You can check the logs at any time; you don't need a testing window for that. I think I have seen an issue much like this before. The CX is an active/passive array, and LUN trespassing sometimes causes issues like this.
We use ESXi on the hosts.
The logs don't go back far enough to cover the last time I was able to test.
Did you check the logs during normal BAU operation, and did you see any vmkernel lost-sensor errors?
A couple of things to check:
1: A virtual VC needs 2 CPUs; ensure you have that. I've seen it work with just one, but it hangs.
2: Are you connecting to both CXs using the same HBAs? That should work, but make sure you have the right path policy enabled for the LUNs, and that the zoning on your switches is single initiator (only one HBA seeing multiple SP ports). If the FLARE on the CXs supports ALUA, then you should be set to either Fixed or Round Robin. Otherwise it should be set to MRU.
3: Are your HBAs, switches and SPs configured to autonegotiate at the same speed?
4: How is your networking configured? VMs should be on a separate network from the management traffic.
5: How are you testing connectivity to the VMs?
When they lose connectivity, can they still ping other VMs on the same vSwitch?
Does connectivity ever come back on its own, or do you have to do something?
6: It might be an idea to use vCheck (from http://www.virtu-al.net/2010/03/26/vcheck-v5/) or some other equivalent utility to check the overall status of your environment.
7: It should not cause problems, but if your LUNs are on the same HBAs for both CXs, it might be an idea to check that the host IDs are not the same for the presented LUNs (by the way, this can only be changed when adding the LUNs).
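For the path-policy point above, a quick way to see what each LUN is using is the esxcli NMP namespace on the host itself. This is a sketch using the ESXi 4.x command syntax (5.x moved these under `esxcli storage nmp`), and the device ID is a placeholder, not a real LUN:

```shell
# Run on the ESXi host (4.x esxcli namespace assumed).
esxcfg-mpath -l                 # list all paths and their current states
esxcli nmp device list          # show the path selection policy per device
# Change a device's policy to MRU (device ID below is a placeholder):
esxcli nmp device setpolicy --device naa.60060160xxxxxxxx --psp VMW_PSP_MRU
```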
1. vCenter has 2 CPUs and 4GB RAM. I can't see why the spec of this VM would have an impact when it's rebooted; I'd have expected to see issues during BAU.
2. We do connect to both SANs using the same HBAs. EMC configured the zoning as one HBA to multiple SPs, and we use MRU.
3. Everything is 4Gb.
4. Need to double-check, but traffic should be on a separate network.
5. Connectivity comes back without us having to do anything once the vCenter server has restarted and the HBAs have been rescanned, but you lose remote connectivity to the VMs on a host during that time. They can still communicate on the same virtual switch on the host.
7. Need to double-check the host IDs, as they were set up by EMC.
Thanks for all comments so far.
The EMC CX is an active/passive array, so MRU is the policy to use.
Your VMs will freeze or hang as soon as you initiate LUN trespassing on the CX.
The zoning is all right, but you have to manually adjust the paths on the ESXi hosts.
Check the logs on the CX and you might find that a lot of trespassing is happening. Check and confirm that the current LUN owner and the default LUN owner SP are the same; if not, adjust them.
Then adjust the paths on ESXi accordingly to balance the load, but make sure that if SPA is the owner, hba0 is the preferred path, and if SPB is the owner, hba1 is the preferred path.
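One way to compare the current and default owner from the command line is with Navisphere CLI. This is a sketch only: the SP address and LUN number are placeholders, and the exact `getlun` flags can vary between FLARE releases, so check your own CLI's help output.

```shell
# From a management station with Navisphere CLI installed.
# <SPA_address> and the LUN number (12) are placeholders.
naviseccli -h <SPA_address> getlun 12 -owner -default
# Compare "Current Owner" with "Default Owner"; if they differ, the LUN
# has trespassed and should be moved back to its default SP.
```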
I've checked the LUNs, and a couple have had a trespass, but others haven't trespassed at all.
You may consider turning on ALUA and setting it to active/active mode, following EMC's recommendations.
CX LUNs trespass only as a last resort. A LUN will only trespass if there is attempted traffic and no path to the owning SP can be found.
I would suggest you check the storage adapters and paths for your HBAs.
It might be the case that you have issues with one or more paths to a LUN. I suggest checking on the server that hosts vCenter, but you might need to check all your ESX servers. This is also visible from Navisphere or from your switches.
Basically, a failed path on one of your ESX servers can cause the LUNs to trespass.
The main thing here is that MRU will not cause LUN flapping; it is reliable. Once you sort this issue out, you might want to change it to Round Robin if your FLARE supports ALUA. If it does not, MRU will work fine, but RR gives you better performance.
Check your system logs on the ESX servers (do not reboot them, as this will clear the logs on ESXi unless you have syslog configured); they might give you more details on LUN/path failures. The ones to look at are the messages and vmkernel logs.
Sounds like the APD (All Paths Down) bug to me.
Are you running 4.1 Update 1? It was only fixed in Update 1.
SSH to one of the ESX hosts, go to /vmfs/volumes/
and do an ls -l.
If this takes some time to complete, it's more than likely the APD issue. It should list all the volumes; if any show up in red, it's the APD bug. These will not show up in vCenter (everything may look fine there), but it's the ESX host holding onto a volume that was removed incorrectly at some point.
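The check above can be scripted roughly as below. Since /vmfs/volumes only exists on the host, the sketch defaults to /tmp so it can be tried anywhere, and the 5-second threshold is an arbitrary assumption, not a VMware value.

```shell
# Time the volume listing; a long stall suggests the APD hang.
# VOLDIR would be /vmfs/volumes on a real ESXi host; /tmp keeps this runnable.
VOLDIR="${VOLDIR:-/tmp}"
START=$(date +%s)
ls -l "$VOLDIR" > /dev/null
ELAPSED=$(( $(date +%s) - START ))
# The 5-second threshold is an arbitrary assumption for illustration.
if [ "$ELAPSED" -gt 5 ]; then
  echo "SLOW: listing took ${ELAPSED}s, possible APD hang"
else
  echo "OK: listing returned in ${ELAPSED}s"
fi
```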
Checked, and I don't have any volumes showing up as red.