VMware Cloud Community
Dreadman
Contributor

Cloning Server 2003/2008 VMs to NTFS datastore (haneWIN NFS) fails and VMhost is shown as disconnected in vCenter

Hi,

I'm facing a strange issue that has been driving me nuts for months. Our setup:

- Four vCenter 4.1 installations on Win 7 64 Bit (we can't upgrade to 4.2 before Christmas)

- Many ESXi 4.1 hosts

- For backup purposes: a Win 7 64 Bit machine with several 16TB NTFS volumes (4k cluster size, file compression off), exported with haneWIN NFS Server (NFS v3, TCP only, max transfer size 64k). The backup script is written in PowerShell with PowerCLI 5.

The problem:

Backups (copy and clone) run fine for most of our hundreds of VMs, but cloning some Server 2003/2008 VMs to the NTFS datastore fails. This happens both when cloning by script and when cloning manually with the vSphere client.

- The error does not occur at the beginning of the clone operation; it tends to fail near the end (after some hours).

- In vCenter the error message is "Error while communicating with the remote host".

- The source VMhost and all VMs on this host are shown as "disconnected" in vCenter and come back online after some minutes.

- The disconnected host itself is fine: I can connect to it with the vSphere client and all VMs are running, so only the connection to vCenter gets lost.

- In the VMhost's vpxa log I can see this:

[netfs://10.10.1.7//export/VMprod4/ebs2/ebs2-000001.vmdk -> netfs://193.29.43.17//esxbackup-e/ebs2-11211-0020/ebs2-11211-0020_1.vmdk] failed:
[2011-12-04 06:44:52.397 FFEBCB20 verbose 'App' opID=83ddb387-c0] [VpxNfcClient] Closing NFC connection to server

...

[2011-12-04 06:44:52.399 FFEBCB20 info 'App' opID=83ddb387-c0] [VpxLRO] -- ERROR task-392968 --  -- vpxapi.VpxaService.nfcCopy: vmodl.fault.RequestCanceled:
Result:
(vmodl.fault.RequestCanceled) {
   dynamicType = <unset>,
   faultCause = (vmodl.MethodFault) null,
   msg = "",
}

- In the vCenter vpxd log:

[VpxdVmomi] Got vmacore exception: Operation was canceled
[2011-12-04 07:51:26.232 03608 error 'App' opID=e1542319] [clone] (ebs2) Unexpected exception (vmodl.fault.HostCommunication) during clone. Aborting.
[2011-12-04 07:51:26.232 03088 warning 'vmomi.soapStub[3]'] Canceling invocation: server=TCP:tpvmhost2.tesis.de:443, moref=vpxapi.VpxaService:vpxa, method=querySummaryStatistics
[2011-12-04 07:51:26.232 02508 warning 'QuickStats'] Error returned from calling FetchQuickStats: class Vmacore::CanceledException(Operation was canceled)

...

[2011-12-04 07:51:26.251 03600 error 'App' opID=7de204c] vmodl.fault.HostCommunication

...
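To line up the vpxa and vpxd entries, the excerpts above could be filtered with a small script. This is only a rough sketch in Python (not part of our PowerShell backup script): the regular expression is derived from the few lines quoted here, and the names `parse_line` and `problems_by_opid` are made up for illustration. Lines without the bracketed timestamp header are skipped.

```python
import re
from collections import defaultdict
from datetime import datetime

# Header format assumed from the excerpts in this post, e.g.
# [2011-12-04 06:44:52.397 FFEBCB20 verbose 'App' opID=83ddb387-c0] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<thread>\S+) (?P<level>\w+) '(?P<component>[^']*)'"
    r"(?: opID=(?P<opid>[^\]\s]+))?\]\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a log line with a timestamp header, else None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return entry


def problems_by_opid(lines, levels=("error", "warning")):
    """Group entries of the given levels by opID (None if the line has no opID)."""
    grouped = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["level"] in levels:
            grouped[entry["opid"]].append(entry)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "[2011-12-04 07:51:26.232 03608 error 'App' opID=e1542319] "
        "[clone] (ebs2) Unexpected exception (vmodl.fault.HostCommunication) "
        "during clone. Aborting.",
        "[2011-12-04 07:51:26.251 03600 error 'App' opID=7de204c] "
        "vmodl.fault.HostCommunication",
    ]
    for opid, entries in problems_by_opid(sample).items():
        for e in entries:
            print(opid, e["ts"].isoformat(), e["level"], e["msg"])
```

Grouping by opID makes it easier to follow one clone task through the log, while the timestamps show which entries on the host and on vCenter belong to the same moment.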

What we tried so far:

- Checked DNS and network, no problem here

- Cloning the affected VMs to a local or Linux NFS datastore works fine, although it's horribly slow: the transfer rate starts at 80 MB/s, then drops to 20 MB/s for hours

- Moving the VM (cold copy or using Converter) to a different vSphere environment has no effect, cloning still fails

- Replacing the switch

- Replacing network cards

- Testing with several ESXi hosts and vCenter servers

- Upgrading the network drivers (Broadcom) on the vCenter and backup machines

- We traced with Wireshark on the backup machine and found lots of errors related to TCP offload (Chimney), so we disabled it in the VM, on the vCenter server and on the backup machine (in the driver and via registry keys). This improved the situation: I can now clone some VMs which failed before, but the problem persists for others
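
To double-check that Chimney offload really is off on each Windows machine, the output of `netsh int tcp show global` can be inspected. The following is a minimal sketch in Python, not what we actually ran: the function names are hypothetical, and the sample text in the usage note only imitates the "Name : value" layout that netsh prints. The parser matches any setting whose name contains "chimney" instead of relying on an exact label.

```python
import subprocess


def parse_netsh_globals(text):
    """Parse 'Name : value' lines from netsh output into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() and value.strip():
            settings[name.strip()] = value.strip()
    return settings


def chimney_state(text):
    """Return the lower-cased value of the Chimney setting, or None if absent."""
    for name, value in parse_netsh_globals(text).items():
        if "chimney" in name.lower():
            return value.lower()
    return None


def query_chimney_state():
    """Run netsh (Windows only) and return the reported Chimney state."""
    result = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True, text=True, check=True,
    )
    return chimney_state(result.stdout)
```

For example, `chimney_state("Chimney Offload State : disabled")` returns `"disabled"`. On the OS side the setting can be switched off with `netsh int tcp set global chimney=disabled`; the driver-level offload options still have to be checked separately in the NIC properties.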

I would blame the haneWIN NFS server, but since cloning works fine for 99% of our VMs I'm not sure about that. And a failed clone ending in a disconnected VMhost is unexpected behaviour that should not occur in any case.

Any hint is highly appreciated.

Thanks,

Peter
