VM's losing connectivity.

JASA1976 · ‎04-20-2011

I'm trying to log a support call with VMware, and I've looked through the forums and the knowledge base, as we've come up against an issue that we've not seen before.

If we reboot our vCenter server, or add datastores that force the hosts to do a rescan, we briefly lose connectivity to our VMs. VMs on the same host and the same virtual switch can still talk to each other, but VMs on different hosts lose connectivity to one another as well.

vCenter is 4.1 and the hosts are now all upgraded to ESXi 4.1. The vCenter server is virtual. Storage is fibre attached to an EMC Clariion.

Initially not all the hosts were upgraded from 4.0, and I found a number of dead paths on the HBAs across all hosts. We found a few articles about dead paths causing issues like the one we had encountered, but the dead paths have now all been removed and the remaining hosts upgraded.

The issue appears to have started only when we began work on replacing our old SAN and presented the storage from the new SAN to the hosts.

Anyone have any ideas?

We run 24x7, so we can't easily make changes and test without advance warning to the users, which makes it harder to troubleshoot.

HUNGAMA · ‎04-20-2011

Which CX array are you using?

Can you also check the logs in /var/log/messages and see if you are getting a large number of NMP lost sensor issues?
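If it helps, here is a rough sketch for counting those lines once the log has been copied off the host to a machine with Python 3. The keyword list and the function names are assumptions for illustration, not an official list of NMP messages, so adjust them to whatever your hosts actually log.

```python
from collections import Counter

# Assumed keywords to look for; not an official list of NMP messages.
DEFAULT_KEYWORDS = ("NMP",)


def count_keyword_lines(lines, keywords=DEFAULT_KEYWORDS):
    """Count, per keyword, how many lines contain it (case-insensitive)."""
    counts = Counter({keyword: 0 for keyword in keywords})
    for line in lines:
        lowered = line.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


def count_in_file(path, keywords=DEFAULT_KEYWORDS):
    """Same count, reading a copy of the log from disk."""
    with open(path, errors="replace") as handle:
        return count_keyword_lines(handle, keywords)
```

Calling `count_in_file("messages")` on the copied file gives a quick feel for whether these entries are rare or constant during normal running.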

JASA1976 · ‎04-20-2011

We have a CX3 and a CX4.

I'm having to wait for another testing window before I can check the logs again.

HUNGAMA · ‎04-20-2011

Are you using ESXi or ESX? You can check the logs anytime and you don't need a testing time for that. I think I have seen an issue almost like this before. CX is an active / passive array and LUN trespassing sometimes causes the issues.

JASA1976 · ‎04-20-2011

We use ESXi on the hosts.

The logs don't go far enough back to the last time I was able to test.

HUNGAMA · ‎04-20-2011

All right.

Did you check the logs during normal BAU and see any vmkernel lost sensor errors?

opbz · ‎04-20-2011

Couple of things to check.

1: A virtual vCenter needs 2 CPUs, so make sure you have that. I've seen it work but hang with just one.

2: Are you connecting to both CXs using the same HBAs? That should work; just make sure you have the right policy enabled for the LUNs and that the zoning on your switches is single initiator (only 1 HBA seeing multiple SP ports). If the FLARE on your CXs is ALUA, the policy should be set to either Fixed or Round Robin. Otherwise it should be set to MRU.

3: Are your HBAs, switches and SPs configured to autonegotiate at the same speed?

4: How is your networking configured? VMs should be on a separate network from the management.

5: How are you testing connectivity to the VMs?

When they lose connectivity, can they still ping other VMs on the same vSwitch?

Does connectivity ever come back, or do you have to do something?

5: It might be an idea to use vCheck (from http://www.virtu-al.net/2010/03/26/vcheck-v5/) or some other equivalent utility to check the overall status of your environment.

6: It should not cause problems, but if your LUNs are on the same HBAs for both CXs, it might be an idea to check that the host IDs are not similar for the LUNs presented (by the way, this can only be changed when adding LUNs).
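The policy rule of thumb in point 2 can be written down as a tiny lookup. This is only a sketch of that rule; the function name and the `prefer_round_robin` switch are made up for illustration, so check EMC's and VMware's guidance for your FLARE release before changing anything.

```python
def suggested_path_policy(alua_enabled, prefer_round_robin=False):
    """Return the path selection policy suggested by the rule in point 2."""
    if alua_enabled:
        # With ALUA, point 2 suggests either Fixed or Round Robin.
        return "Round Robin" if prefer_round_robin else "Fixed"
    # Without ALUA, point 2 suggests MRU.
    return "MRU"
```

For example, `suggested_path_policy(alua_enabled=False)` returns `"MRU"`.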

good luck

JASA1976 · ‎04-20-2011

1. vCenter has 2 CPUs and 4GB RAM. I can't see why the spec of this VM would have an impact when it's rebooted. I'd have expected to see issues during BAU.

2. We do connect to both SANs using the same HBAs. EMC configured the zoning to be 1 HBA to multiple SPs, and we use MRU.

3. Everything is 4gb.

4. Need to double check, but traffic should be on a separate network.

5. Connectivity comes back without us having to do anything once the vCenter server has restarted or the HBAs have been scanned, but you lose remote connectivity to the VMs on a host during that time. They can still communicate on the same virtual switch on a host.

6. Need to double check as they were set up by EMC.

Thanks for all comments so far.

HUNGAMA · ‎04-20-2011

EMC CX is an active/passive array so MRU is the policy to use.

Your VMs will freeze or hang as soon as you initiate LUN trespassing on the CX.

The zoning is all right, but you have to manually adjust the paths on the ESXi hosts.

Check the log on the CX and you might find that lots of trespassing is happening. Check and confirm that the current LUN owner and the original LUN owner are the same SP. If not, adjust them.

Adjust the paths on ESXi accordingly to balance the load, but make sure that if SPA is the owner, hba0 is the preferred path, and if SPB is the owner, hba1 is the preferred path.
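A rough sketch of that ownership check, assuming the owner details for each LUN have been typed in by hand from the array's management view. The field names, the sample LUN numbers and the function names are all hypothetical, and the hba0/hba1 mapping is simply the rule above, keyed on the original (default) owner.

```python
# Rule from the post: SPA-owned LUNs prefer hba0, SPB-owned LUNs prefer hba1.
PREFERRED_HBA = {"SPA": "hba0", "SPB": "hba1"}


def trespassed_luns(luns):
    """Return the LUNs whose current owner differs from the default owner."""
    return [lun for lun in luns if lun["current_owner"] != lun["default_owner"]]


def preferred_hba(lun):
    """Pick the preferred HBA from the LUN's default owner (an assumption)."""
    return PREFERRED_HBA[lun["default_owner"]]


# Hypothetical, hand-entered data; not taken from a real array.
luns = [
    {"lun": 10, "default_owner": "SPA", "current_owner": "SPA"},
    {"lun": 11, "default_owner": "SPB", "current_owner": "SPA"},
]

for lun in trespassed_luns(luns):
    print(f"LUN {lun['lun']}: owned by {lun['current_owner']}, "
          f"default is {lun['default_owner']}, "
          f"preferred path via {preferred_hba(lun)}")
```

Only LUNs that are currently sitting on the wrong SP are printed, which keeps the list short enough to work through by hand.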

JASA1976 · ‎04-20-2011

I've checked the LUNs and a couple have had a trespass, but others haven't trespassed at all.

malaysiavm · ‎04-20-2011

You may want to consider turning on ALUA and setting it to active/active mode, following EMC's recommendation.

Craig vExpert 2009 & 2010 Netapp NCIE, NCDA 8.0.1 Malaysia VMware Communities - http://www.malaysiavm.com

opbz · ‎04-20-2011

CX LUNs trespass only as a last resort. A LUN will only trespass if there is attempted traffic and it cannot find any path to the same SP.

I would suggest you check the storage adaptors and paths for your HBAs.

It might be the case that you have issues with one or more paths to a LUN. I suggest checking on the server that hosts vCenter, but you might need to check all your ESX servers. This is also visible from Navisphere or from your switches.

Basically, a failed path on one of your ESX servers can cause the LUNs to trespass.

The main thing here is that MRU will not cause LUN flapping. It is reliable. Once you sort this issue out, you might want to change it to Round Robin if your FLARE supports ALUA. If it does not, MRU will work fine, but RR gives you better performance.

Check the system logs on your ESX servers (do not reboot them, as this will clear your logs in ESXi unless you have syslog configured). They might give you more details about LUN/path failures. The ones to look at are the /messages and vmkernel logs.

NuggetGTR · ‎04-20-2011

Sounds like the APD (all paths down) bug to me.

Are you running 4.1 Update 1? It was only fixed in Update 1.

SSH to one of the ESX hosts, go to /vmfs/volumes/

and do a ls -l

If this takes some time to respond, it's more than likely the APD issue. It should list all the volumes; if any show up red, then it's the APD bug. These will not show up in vCenter, where everything may look fine, but the ESX host is holding onto a volume that was removed incorrectly at some point.

________________________________________ Blog: http://virtualiseme.net.au VCDX #201 Author of Mastering vRealize Operations Manager

JASA1976 · ‎04-21-2011

Checked and I don't have any volumes showing up as red.