I believe that this is NFS related. Here's why.
Got a brand new server today, lots of CPU, lots of memory. Loaded ESXi 4.1 Update 1 on it. Populated it with a single VM that resides on NFS. Ran the backup and had the same issues.
I then copied the VM to the local datastore on the new server and repeated the test. Bingo! I lost only 2 pings when the snapshot was removed, and the 2 were separated by 3 successful pings.
So the question I have is, does ESXi 4.1 Update 1 suffer from similar NFS issues that 3.5 does?
I tried adding the "NFSLocks" line to the /etc/vmware/config file, but that had no effect.
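For anyone who wants to compare ping loss between the NFS and local datastore runs with numbers rather than by eye, here is a minimal sketch in Python. It assumes the ping output was captured from a Linux/BSD-style ping that prints "bytes from" and "icmp_seq=N" on each reply line (Windows ping does not print sequence numbers, so it would not work there). The function names are my own and hypothetical, not part of any script discussed in this thread. Losses after the last received reply cannot be detected this way.

```python
import re

# Sequence number as printed by typical Linux/BSD ping ("icmp_seq=12" or "icmp_seq 12").
SEQ_RE = re.compile(r"icmp_seq[=\s](\d+)")


def lost_sequences(ping_output):
    """Return the sorted sequence numbers that never received a reply."""
    seen = set()
    for line in ping_output.splitlines():
        # Only count real replies; skip lines such as "no answer yet for icmp_seq=5".
        if "bytes from" not in line:
            continue
        match = SEQ_RE.search(line)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


def loss_bursts(lost):
    """Group consecutive lost sequence numbers into (first, last) bursts."""
    bursts = []
    for n in lost:
        if bursts and n == bursts[-1][1] + 1:
            # Extend the current burst.
            bursts[-1] = (bursts[-1][0], n)
        else:
            # Start a new burst.
            bursts.append((n, n))
    return bursts
```

Feeding the captured output of each test run through lost_sequences and then loss_bursts gives the number of dropped pings and how they cluster, which makes the two runs easier to compare.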
Have a look at the NFS section in the documentation found here - http://communities.vmware.com/docs/DOC-8760
In the past this has been an issue with the actual NFS server and not the release of vSphere. It sounds like you're hitting that issue.
Thanks for the link. I'm using 2 NFS servers. One is a NetApp 2050, and the other is a Dell server running RHES 5. I have 3 ESXi 3.5 hosts backing up just fine. It's just 4.1 that seems to be the issue.
I'm not aware of any specific NFS issues. I assume you're following NetApp's best practices for configuring your ESX host? Does this occur on both NFS servers? As mentioned in our documentation, we have no issues in our configuration, and we're also on 4.1.
You may want to contact VMware regarding snapshot removal via the CLI and whether it has any known impact on NFS-based datastores.
It happens on both the NetApp and the Dell/Linux machine. It does not happen with the older ghetto script on ESXi 3.5.
It also does not occur when manually creating/removing snapshots.
No, it does not use CBT, and the base snapshot code has not changed a whole lot. You also mentioned this script works on older releases of ESX, which should also confirm it has nothing to do with the script. I suspect it's probably something with the version of ESX.
I would recommend you take a look at the hostd, vmkernel and vmware.log logs to see if there are any issues during the period the script is running. You may also need to increase the verbosity of the logs.
I have not tried the new script on the older VMs. I am using the script last updated 11/14/2009 for those.
Will try the new one and see if it works or not.
Just tested the new script on my ESXi 3.5 clients and it works fine. So the question I have is, why does it work on 3.5 and not on 4.1?
Will take a look at the logs.
I'm really beating this dead horse, but it's driving me nuts.
Since my last post I did the following:
1) I walked through the script, then commented out the step where vmkfstools creates the backup clone. Snapshot create & removal works fine. So what does the vmkfstools command do that would create the problem?
2) Confirmed that this only happens on NFS storage. When the VM is on the local machine, it works fine.
3) Reviewed the logs. Nothing stands out, but I could be missing something.
4) Ran the script against VMs running on ESXi 3.5 using the same NFS servers, and there are no issues.
I'm at a loss. If I can't get this to work I'll have to blow money on a commercial package which I really don't want to do.
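One way to narrow down which stage lines up with the dropped pings is to timestamp each stage and compare against the times the pings were lost. Below is a rough sketch in Python, not the backup script itself: the step names and callables are hypothetical placeholders for whatever each stage actually runs (snapshot create, vmkfstools clone, snapshot remove), and it assumes it runs on a machine that has Python and whose clock matches the one the pings are timed against.

```python
import time


def run_steps(steps, clock=time.time):
    """Run each (name, callable) step in order and record (name, start, end)."""
    timeline = []
    for name, action in steps:
        start = clock()
        action()
        timeline.append((name, start, clock()))
    return timeline


def step_at(timeline, timestamp):
    """Return the name of the first step whose window contains timestamp, or None."""
    for name, start, end in timeline:
        if start <= timestamp <= end:
            return name
    return None
```

With a timeline recorded, calling step_at for the timestamp of each lost ping shows whether the losses fall inside the clone stage or during snapshot removal.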
Looks like something changed between 3.5 and 4.1 with NFS datastores. I suspect something is causing it to not ACK back, or causing it to take slightly longer, which in turn causes additional ping losses on the VM while we're copying the VMDK.
VMware would be the best to give you more details, but I highly doubt they'll troubleshoot the script.
Sorry I can't provide any additional information.
OK, thanks for all the help.