I'm trying to log a support call with VMware. I've looked through the forums and the knowledge base, but we've come up against an issue that we've not seen before.
If we reboot our vCenter server, or add datastores that force the hosts to do a rescan, we briefly lose connectivity to our VMs. VMs on the same host and the same virtual switch can still reach each other, but connectivity between VMs on different hosts is lost.
vCenter is 4.1 and the hosts are now all upgraded to ESXi 4.1. The vCenter server is virtual. Storage is fibre attached to an EMC CLARiiON.
Initially not all the hosts were upgraded from 4.0, and I found a number of dead paths on the HBAs across all hosts. We found a few articles on dead paths causing issues like the one we had encountered, but the dead paths have now all been removed and the remaining hosts upgraded.
The issue appears to have started only when we began work on replacing our old SAN and presented the storage from the new SAN to the hosts.
Does anyone have any ideas?
We run 24x7, so we can't easily make changes and test without advance warning to the users, which makes it harder to troubleshoot.
Are you using ESXi or ESX? You can check the logs at any time; you don't need a testing window for that. I think I have seen an issue almost like this before. The CX is an active/passive array, and LUN trespassing sometimes causes issues like this.
1: A virtual vCenter needs 2 CPUs, so make sure it has that. I have seen it work but hang with just one.
2: Are you connecting to both CXs using the same HBAs? That should work; just make sure you have the right policy enabled for the LUNs and that the zoning on your switches is single initiator (only 1 HBA seeing multiple SP ports). If the FLARE on your CXs is ALUA-enabled, the policy should be set to either Fixed or Round Robin. Otherwise it should be set to MRU.
3: Are your HBAs, switches and SPs configured to autonegotiate at the same speed?
4: How is your networking configured? VMs should be on a separate network from management.
5: How are you testing connectivity to the VMs?
When they lose connectivity, can they still ping other VMs on the same vSwitch?
Does connectivity ever come back, or do you have to do something?
5: It might be an idea to use vCheck (from http://www.virtu-al.net/2010/03/26/vcheck-v5/) or an equivalent utility to check the overall status of your environment.
6: This should not cause problems, but if the LUNs from both CXs are on the same HBAs, it might be an idea to check that the host IDs are not the same for the LUNs presented (by the way, this can only be changed when adding LUNs).
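To make point 2 a bit more concrete, here is a rough Python sketch of the two checks: which path policy fits depending on whether ALUA is enabled, and whether each zone really has a single initiator. The zone names, member labels and the "hba" prefix convention are made up for illustration; nothing here talks to a real switch or array.

```python
from typing import Dict, List


def recommended_policies(alua_enabled: bool) -> List[str]:
    """Path policies suggested in this thread for a CX LUN."""
    # With ALUA: Fixed or Round Robin. Without ALUA: MRU.
    return ["Fixed", "Round Robin"] if alua_enabled else ["MRU"]


def zones_not_single_initiator(zones: Dict[str, List[str]]) -> List[str]:
    """Return the names of zones that do not contain exactly one HBA.

    Assumption: HBA members are labelled with an "hba" prefix and
    SP ports are not. Real zoning exports will look different.
    """
    bad = []
    for name, members in zones.items():
        hbas = [m for m in members if m.startswith("hba")]
        if len(hbas) != 1:
            bad.append(name)
    return sorted(bad)


# Hypothetical zoning layout, for illustration only.
example_zones = {
    "zone_host1_hba0": ["hba0", "cx1_spa_port0", "cx2_spa_port0"],
    "zone_host1_hba1": ["hba1", "cx1_spb_port0", "cx2_spb_port0"],
    "zone_mixed": ["hba0", "hba1", "cx1_spa_port1"],
}

print(recommended_policies(alua_enabled=False))
print(zones_not_single_initiator(example_zones))
```

Any zone the second function reports has more than one HBA (or none) in it and would be worth a second look on the switch.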
1. vCenter has 2 CPUs and 4GB RAM. I can't see why the spec of this VM would have an impact when it's rebooted. I'd have expected to see issues during BAU.
2. We do connect to both SANs using the same HBAs. EMC configured the zoning to be 1 HBA to multiple SPs, and we use MRU.
3. Everything is 4Gb.
4. Need to double check, but traffic should be on a separate network.
5. Connectivity comes back without us having to do anything once the vCenter server has restarted or the HBAs have been rescanned, but you lose remote connectivity to the VMs on a host during that time. They can still communicate on the same virtual switch on a host.
6. Need to double check as they were set up by EMC.
Thanks for all comments so far.
EMC CX is an active/passive array so MRU is the policy to use.
Your VMs will freeze or hang as soon as you initiate LUN trespassing on the CX.
The zoning is all right, but you have to manually adjust the paths on the ESXi hosts.
Check the log on the CX and you might find that lots of trespassing is happening. Check and confirm that the current LUN owner and the original LUN owner SP are the same. If not, adjust them.
Adjust the paths on ESXi accordingly to balance the load, but make sure that if SP A is the owner, hba0 is the preferred path, and if SP B is the owner, hba1 is the preferred path.
You may want to consider turning on ALUA and setting it to active/active mode; follow EMC's recommendation.
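The ownership check and the SP-to-HBA mapping from this post can be sketched in a few lines of Python. This is only an illustration: the LUN names and the dictionary layout are hypothetical, and you would have to fill in the owners by hand from what the array reports.

```python
from typing import Dict, List

# Mapping taken from the post above: SP A -> hba0, SP B -> hba1.
PREFERRED_HBA = {"SPA": "hba0", "SPB": "hba1"}


def trespassed_luns(luns: List[Dict[str, str]]) -> List[str]:
    """Return the names of LUNs whose current owner differs from the original owner."""
    return [
        lun["name"]
        for lun in luns
        if lun["current_owner"] != lun["original_owner"]
    ]


def preferred_hba(owner: str) -> str:
    """Return the HBA that should carry the preferred path for a given owning SP."""
    return PREFERRED_HBA[owner]


# Hypothetical LUN ownership data, for illustration only.
example_luns = [
    {"name": "LUN_10", "original_owner": "SPA", "current_owner": "SPA"},
    {"name": "LUN_11", "original_owner": "SPB", "current_owner": "SPA"},
]

print("Trespassed:", trespassed_luns(example_luns))
for lun in example_luns:
    print(lun["name"], "-> preferred path via", preferred_hba(lun["original_owner"]))
```

Any LUN listed as trespassed is one where the current and original owner no longer match, which is the case the post says to correct.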
CX LUNs trespass only as a last resort. A LUN will only trespass if there is attempted traffic and it cannot find any path to the same SP.
I would suggest you check the storage adapters and paths for your HBAs.
It might be the case that you have issues with one or more paths to a LUN. I suggest checking on the server that hosts vCenter, but you might need to check all your ESX servers. This is also visible from Navisphere or from your switches.
Basically, a failed path on one of your ESX servers can cause the LUNs to trespass.
The main thing here is that MRU will not cause LUN flapping; it is reliable. Once you have sorted this issue you might want to change it to Round Robin if your FLARE supports ALUA. If it does not, MRU will work fine, but RR gives you better performance.
Check the system logs on your ESX servers (do not reboot them, as this will clear your logs in ESXi unless you have syslog configured); they might give you more details about LUN/path failures. The ones to look at are the /messages and vmkernel logs.
Sounds like the APD bug to me.
Are you running 4.1 Update 1? It was only fixed in Update 1.
SSH to one of the ESX hosts, go to /vmfs/volumes/
and do an ls -l
If this takes some time to think, it's more than likely the APD issue. It should list all the volumes; if any show up red, then it's the APD bug. These will not show up in vCenter, where everything may look fine, but the ESX host is holding onto a volume that was removed incorrectly at some point.
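If you want a number rather than a feeling for "takes some time", the same check can be roughed out in Python 3. This is only a sketch: it assumes a Python 3 interpreter is available wherever /vmfs/volumes is visible, which may not be true on your hosts. The plain `ls -l` above (or `time ls -l`) is the simpler route. It lists the directory, stats every entry the way `ls -l` has to, and reports how long that took and which entries could not be read.

```python
import os
import time
from typing import List, Tuple


def time_listing(path: str) -> Tuple[float, List[str], List[str]]:
    """List a directory and stat each entry, returning
    (elapsed seconds, entry names, names that could not be stat'ed)."""
    start = time.monotonic()
    names = sorted(os.listdir(path))
    unreadable = []
    for name in names:
        try:
            # Follows symlinks, so a link to a missing volume fails here.
            os.stat(os.path.join(path, name))
        except OSError:
            unreadable.append(name)
    return time.monotonic() - start, names, unreadable


VOLUMES = "/vmfs/volumes"  # path from the post above

# Only run the check where the path actually exists.
if os.path.isdir(VOLUMES):
    elapsed, names, unreadable = time_listing(VOLUMES)
    print("Listed %d entries in %.1f seconds" % (len(names), elapsed))
    if unreadable:
        print("Could not stat:", ", ".join(unreadable))
```

A long elapsed time, or entries that cannot be stat'ed, would point the same way as a slow `ls -l` with red entries.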