I have two clusters in my vSphere datacenter. One is local (two ESXi hosts running 5.1.0,1065491) and the other is remote (also two ESXi hosts, running 126.96.36.1993097). All four hosts are Dell PowerEdge R620s. The intermittent problem is that when I vMotion or clone a VM from the local environment to the remote environment, the destination host loses connectivity to vCenter quite frequently (about one time in three). I have to restart the management network service and/or remove the host from the cluster and rejoin it to get it connected again. I see this happening on both destination hosts. Any ideas?
I'm wondering if others have experienced the same and how they have resolved it.
Can you explain a little more about what you mean by 'local' and 'remote' clusters? Are the clusters in different physical locations? Do you have shared storage connectivity between the clusters? Is it SAN or NAS, spread across clusters in different geographical locations? Do you have high latency (what is the DAVG value?) between the remote ESX cluster hosts and the SAN/NAS storage (which I assume is in a different geographical location)? Are you doing a live vMotion of the VMs? If so, are the vMotion networks of your local and remote ESX hosts on the same subnet?
Yes, the two clusters are in different physical locations: one in the server room (local office) and the other in a colocation facility (remote). Yes, they have shared storage connectivity between them, with SAN storage at both locations. I'm not sure what you mean by DAVG value. I've seen this happen when migrating a powered-off VM, a live VM, and even when cloning a live VM. The vMotion networks of the two clusters are on different subnets.
Please note that vMotion across the two clusters definitely works, except that the destination (remote) host sometimes loses its connection to vCenter. No existing VMs are affected, and if left alone until the vMotion completes, the connection loss resolves itself.
All connections are 10G Fiber on both sides and between the two locations.
Okay, that explains a lot. Since you don't have common shared storage between the two sites, when you do a vMotion, vCenter moves the VM to the remote host and also to the remote storage. So both a vMotion and a Storage vMotion happen, one after the other. This is a network copy of the VM's memory and the VM's files over the WAN. It will not use the vMotion network or the Fibre Channel network; it uses only the WAN between the two sites. In this case packet drops can happen and there will be retries. Moving a live VM this way is not recommended.
You said both sites have SAN connectivity, but can you tell me whether the ESX hosts at the remote site have connectivity to the SAN storage at the local site, and vice versa?
DAVG (device average latency) is the average disk latency between the ESX host and the storage. You can see it in esxtop on the host.
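If you would rather capture it than watch it live, here is a minimal sketch for summarizing DAVG from an esxtop batch capture. It assumes the capture was taken with something like `esxtop -b -d 5 -n 12 > esxtop.csv`, and that the DAVG columns are the ones whose header contains "Average Driver MilliSec/Command" (that label is my assumption, so check it against your own CSV header). The function name `davg_summary` and the sample data are made up for illustration.

```python
import csv
import io

# Assumed column label for DAVG in esxtop batch-mode CSV output; verify in your own header row.
DAVG_LABEL = "Average Driver MilliSec/Command"


def davg_summary(csv_text, label=DAVG_LABEL):
    """Return {column_name: (average, maximum)} for every column whose header contains label."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if label in name]
    samples = {i: [] for i in wanted}
    for row in reader:
        for i in wanted:
            try:
                samples[i].append(float(row[i]))
            except (IndexError, ValueError):
                continue  # skip short rows or non-numeric cells
    return {
        header[i]: (sum(values) / len(values), max(values))
        for i, values in samples.items()
        if values
    }


if __name__ == "__main__":
    # Hypothetical two-sample capture, not real measurements.
    sample = (
        '"Time","\\\\esx1\\Physical Disk Adapter(vmhba1)\\Average Driver MilliSec/Command"\n'
        '"01/01/2015 10:00:00","2.5"\n'
        '"01/01/2015 10:00:05","7.5"\n'
    )
    for name, (avg, peak) in davg_summary(sample).items():
        print(f"{name}: avg {avg:.2f} ms, max {peak:.2f} ms")
```

Running it against a capture taken during a migration and another taken while the host is idle should show whether storage latency climbs while the copy is in progress.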
Did you check the vmkernel logs on the destination host? They should report why it loses connectivity to vCenter. Also, as I said, the WAN copy you are performing uses the ESX host's management network, which is also used for communication with vCenter. With that much traffic on the management network, the host is quite likely to lose its connection to vCenter. Once the copy is over, network utilization drops, things return to normal, and the host reconnects to vCenter. That is most likely what is happening in your case.
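To get a feel for how long the management network might stay busy during such a copy, here is a back-of-the-envelope sketch. Only the 10G link speed comes from this thread; the VM size and the usable fraction of the link are assumptions chosen for illustration, and the function name `copy_time_seconds` is hypothetical.

```python
def copy_time_seconds(vm_size_gb, link_gbps, usable_fraction=1.0):
    """Estimate the seconds needed to copy vm_size_gb gigabytes over a link_gbps link."""
    if vm_size_gb < 0 or link_gbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("size must be >= 0, link speed > 0, usable_fraction in (0, 1]")
    size_gigabits = vm_size_gb * 8  # decimal units: 1 GB = 8 gigabits
    return size_gigabits / (link_gbps * usable_fraction)


if __name__ == "__main__":
    # Hypothetical 200 GB VM on the 10G link, assuming only half the link is usable.
    seconds = copy_time_seconds(200, 10, usable_fraction=0.5)
    print(f"about {seconds / 60:.1f} minutes")
```

This ignores protocol overhead, retries, and storage throughput, so treat the result as a lower bound on how long vCenter traffic would be competing with the copy.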