VMware Cloud Community
turnbulm
Contributor

Failed to deploy template: Operation timed out

I have a VI3 installation with ESX on a Dell 2850 server connected to a NetApp iSCSI datastore. There is a template in the datastore that I'm trying to deploy.

Any time I try to deploy the template I get an error (after about 15 minutes) saying "Failed to deploy template: Operation timed out". There is still activity on the storage at the time this error occurs.

I know the deployment will take time on the storage side as it creates the disk, but surely this shouldn't cause a timeout?

Is there any way to either increase the timeout, or do something else to make this work?

Thanks!

49 Replies
sdreher
Contributor

I have a VI3 installation on an IBM BladeCenter using Lefthand iSCSI datastores, and I'm experiencing EXACTLY the circumstances described. I also have DRS enabled. I don't think it has anything whatsoever to do with DNS, hardware, etc. I think it is the way VMFS communicates over iSCSI (among the many OTHER shortcomings in that environment). Even if one creates a template on a local datastore and tries to clone it to the iSCSI datastore, one gets the same results. I'm experimenting with NIC teaming (or the lack of it) as well as bonded pairs on the iSCSI architecture. Everything in Microsoft and EXT3 filesystem land communicates perfectly with iSCSI, yet VMFS has tremendous difficulty. I am currently trying the timeout method described and will keep everyone posted. I've been told by VMware engineers, salesmen, and even an exec that the 3.0.1 release will fix MUCH of the iSCSI trash they have created since RC, so hopefully it is a matter of a week or so.

sdreher
Contributor

This solution worked for my environment with a 20GB clone template:

Connect to each ESX Server through VI Client --> go to Configuration menu -->

Advanced Settings --> go to SCSI -->

First option SCSI reaborttimeout - change this value from 5000 to 50000

Second option SCSI scantimeout - change this value from 1000 to 10000
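
A minimal sketch for keeping track of the two changes above, so the original values are on hand if the settings ever need to be put back. The labels are copied from this post as written and are not verified option keys; the class and function names are hypothetical, and the code only prints a change list, it does not touch an ESX host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutChange:
    label: str    # label as written in the post, not a verified option key
    default: int  # value before the change
    raised: int   # value after the change


# The two settings and values quoted in the post above.
CHANGES = [
    TimeoutChange("SCSI reaborttimeout", 5000, 50000),
    TimeoutChange("SCSI scantimeout", 1000, 10000),
]


def describe(changes, revert=False):
    """Return one 'label: old -> new' line per setting.

    With revert=True the direction is flipped, i.e. back to the defaults.
    """
    lines = []
    for change in changes:
        if revert:
            old, new = change.raised, change.default
        else:
            old, new = change.default, change.raised
        lines.append(f"{change.label}: {old} -> {new}")
    return lines


if __name__ == "__main__":
    for line in describe(CHANGES):
        print(line)
```

Calling `describe(CHANGES, revert=True)` lists the same settings in the opposite direction, which is handy as a checklist when applying the change by hand on each host.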

Thanks everyone.

Dr_Hofmann
Contributor

I found it! In our case there was a bad battery pack on one of the controllers. I changed the battery and after that it worked.

RTF_PM
Contributor

I'm a new VM user with ESX 3.x on an HP DL380G5 and HP MSA1500.

I had this very problem trying to deploy 30GB Win2000 Server VMs via a template. It continually failed with the Operation Timed Out error (after approx 15 mins). Modifying the SCSI timeout settings as shown above didn't help.

I realised the issue was affected by the following factors (which is why there is still no definite explanation or fix yet):

1) the OS of the template (Win2003 seems to work better than 2000)

2) the SCSI type used (Win2000 default of BusLogic vs LSI Logic)

3) using deployment customisation or not

4) the VI Client (yes, getting fiddly here!)

5) and the number of deployment attempts per VI Client session (don't know if just a coincidence or a factor).

After much testing (and frustration), I have managed to now consistently create 30GB Windows 2000 Server SP4 VMs using a template via the following:

1. Create a base Windows 2000 VM using LSI Logic SCSI drivers (V1.09.11) - the drivers need to be applied using F6 at Windows setup.

2. Run Windows sysprep on the base VM and power off (this eliminates need to run customisation during VM deployment).

3. Clone to template. Close VI Client.

4. Run VI Client on VC server itself.

5. Deploy template to new VM without Customization.

6. When finished, close VI Client.

7. To deploy further VMs from same template, run VI Client on VC server itself BUT only do one deployment each time i.e. logoff/re-logon for each one.

Yes, it is a strange solution but it seems to work.

If anyone could put more qualitative information to this process, it would be appreciated. So far our case with HP/VMware tech support hasn't found a solution.

hansuleberg
Contributor

Hi, and thank you all for the great tips on troubleshooting.

Changing the advanced settings helped me. I have 2 DL 385 AMD 64 DC Opterons with standard HP hardware. I think some of the replies here are wrong. Copying a file between RAID 1, RAID 5 or RAID 10 shouldn't be any problem at all. If it were, that would be scary. I worked out the timing after several days of cloning and of making new VMs from many templates (XP, 2000, 2003), both between RAID 1 and RAID 5 arrays and not. It took approximately the same time every time to get the timeout error: around 15 minutes, or 85-87% progress. So yeah, the timeout value must be IT.

It fixed the timeout problem for me.

Hans, Norway

sdreher
Contributor

I currently have five Lefthand NSM-260s and was recently told that Lefthand is recommending that one create a separate LUN for every virtual machine. This workaround goes along the lines of LUN multi-paths and whatnot, but it is the most recent fix one can try. Perhaps it might work with other iSCSI vendors as well.

Dr_Hofmann, I was told by an EMC solution architect that even not plugging all of the power cords into some iSCSI solutions can cause write-back cache to be disabled, and I'm wondering if the same is true of a bad battery.

dmarotta
Contributor

The timeout is the easy fix; what I'm longing for is a way to reduce the time it takes to deploy a template.

In Virtual Center 2, we could deploy the same system template on the same CX700 to the same disks, using the same HBAs and the same Dell 6850s in 10 minutes or less.

Now that we are fully upgraded to Virtual Center 2.0.1 and ESX 3.0.1, it takes closer to 40 minutes to deploy a template using the same physical infrastructure.

How can we make the deployment run as fast as it used to? If I had known it was going to take four times as long, I would still be using 2.5.2.

sdreher
Contributor

LATEST UPDATE: I have something new to add regarding the following:

Advanced Settings --> go to SCSI -->

First option SCSI reaborttimeout - change this value from 5000 to 50000

Second option SCSI scantimeout - change this value from 1000 to 10000

After applying Lefthand's 6.6 update AND the VMware patch, one needs to set these values BACK to 5000 and 1000, as I was actually getting the "operation timed out" issues with the higher values.

RTF_PM
Contributor

May have finally found the cause of our particular case...

The HP Insight Management Agent (7.5 and 7.6) running on ESX 3.0 (and 3.0.1 when later upgraded) was causing a number of problems including Automatic System Resets (ASRs) on an HP DL380 G5. To troubleshoot, the agent was removed and since then ESX 3.0.1 has been stable.

Subsequently template deployment has also worked each time with no timeout errors.

I was getting some template deployment success (see above) before the agent removal by using a 30GB Win2000 Server with LSI Logic SCSI. However, since the HP IM agent removal, I can deploy a 30GB Win2000 Server with Bus Logic SCSI (the preferred default option) successfully each time. This did not happen when the IM agent was installed; it always gave the 15-minute timeout error.

The question is: how well tested is the HP Insight Management Agent for VMware on G5 servers? HP already know it to be an issue:

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c00748635

The current latest version 7.6.0 didn't improve our situation either. See

http://www.vmware.com/support/esx25/doc/sys_mgmt_links.html

Obviously we need proactive software like IM agents on our HP servers, but they can't be so disruptive.

Any ideas anyone?

jkubik
Contributor

RTF_PM:

You may want to experiment with the different components of the Insight Manager agents. As I understand it, the agents for Fiber storage, local SCSI, and network are separate pieces.

pmackay
Contributor

I have a strange one... At one point deploying machines via templates was working fine, but now it no longer works. As far as I am concerned, nothing has changed. One thing I have not done is upgrade to the latest version of ESX 3. Does anyone know if this makes a significant difference to the problem?

One thing I will add (not sure if this has been mentioned): I do not actually get any progress %. All that consistently happens is that after 15 mins it times out...

jkubik
Contributor

Start by looking at the vpxd logs on the VC server.

Then, on the host, look at:

/var/log/vmware/vpx/vpxa.log

/var/log/vmware/hostd.log

/var/log/vmkernel.log

There is most likely some useful error message in those files (look only at the time when the template deployment failed).
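
A minimal sketch of narrowing a log down to the minutes around the failure, as suggested above. It assumes each log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match the real format of these logs, so adjust `TIMESTAMP_FORMAT` and the slice length accordingly. The function name and the sample lines are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each line; adjust to the real logs.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # length of "YYYY-MM-DD HH:MM:SS"


def lines_near(lines, failure_time, window_minutes=5):
    """Keep only the lines stamped within window_minutes of failure_time."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # no parseable timestamp at the start of this line
        if abs(stamp - failure_time) <= window:
            selected.append(line)
    return selected


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "2007-01-10 14:00:00 clone started",
        "2007-01-10 14:15:02 Operation timed out",
        "continuation line without a timestamp",
        "2007-01-10 16:00:00 unrelated entry",
    ]
    for line in lines_near(sample, datetime(2007, 1, 10, 14, 15, 0)):
        print(line)
```

To use it on a real file, pass the open file object as `lines` (for example `open("/var/log/vmware/hostd.log")`) together with the time the deployment failed.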

ICT-Freak
Enthusiast

I have the same problem in my test environment. I use Openfiler as an iSCSI target. This worked like a charm until last weekend. Now I am having the same problem as you guys.

So far I think it is a problem with VirtualCenter 2.0.1 patch 1. Everything I try to do within VC stops with the error "timed out". When I manage my ESX host with the VI Client, it works normally.

I will go check the log files on the VC server and the ESX host.

PS. Do you have the VC in a VM or standalone?

mhanson
Contributor

Has anyone found a solution to this? We just had this start with VC 2.0.1 today.

We cannot create a new VM, from a template or not, cannot migrate, and pretty much can't do anything. No patches or anything else were applied recently.

Any help would be appreciated.

Steven_Clementi
Contributor

I am having the same issues...

My Configuration is as follows:

HP Proliant c-Class Blade Enclosure with 6 BL465 servers.

2 Servers running Windows for Virtual Center and SIM and other management software and 4 ESX 3.0.1 servers fully patched.

Solution is using an MSA1500 for Shared Access to the Storage for VMotion.

Setting the advanced options did not help, as the job aborts after 15 minutes no matter what.

What I noticed in the log files is that the server seems unable to read the template file.

Anyone know why this would be?

Steven

ICT-Freak
Enthusiast

What I did to resolve the issue was a fresh install of VMware ESX 3.0.1 and reconfiguring the VMware License Server with freshly downloaded license files from the VMware website.

I hope there will be a better solution for this problem but for now, I'm happy that my environment is working again.

squidfishes
Contributor

Are there any other solutions to this problem? I can do the re-install, but somehow I hoped for something more ... elegant.

Cheers!

Knobee
Contributor

Having just lost an ESX 3.0.1 server with locally attached storage (that included my license server), I can say that re-installing VirtualCenter and ditzing around with license keys does not (in my circumstance) solve anything.

I'm seeing timeouts on locally attached (4x 173GB in a RAID-5 config) storage when creating clones from a template with no customization.

The first one timed out before it even started the copy.

williambishop
Expert

Been through all the hoops and found a possible fix: I edited the hosts file and added every ESX host to the list, and it seemed to resolve the problem for the most part. It still happens occasionally, but not like it was (nearly every time).
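
A minimal sketch of building hosts-file lines for a set of ESX hosts. The post does not say which machine's hosts file was edited or what the entries looked like, so the addresses, hostnames, and function name here are all hypothetical placeholders; the code only produces text and writes nothing to disk.

```python
def hosts_entries(hosts):
    """Build one hosts-file line per ESX host: IP, FQDN, then short name.

    `hosts` maps an IP address string to a fully qualified hostname.
    """
    lines = []
    for ip, fqdn in sorted(hosts.items()):
        short_name = fqdn.split(".")[0]
        lines.append(f"{ip}\t{fqdn}\t{short_name}")
    return lines


if __name__ == "__main__":
    # Placeholder addresses and names, not taken from the thread.
    esx_hosts = {
        "192.0.2.11": "esx01.example.com",
        "192.0.2.12": "esx02.example.com",
    }
    print("\n".join(hosts_entries(esx_hosts)))
```

The printed lines can then be reviewed and pasted into the hosts file by hand.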

--"Non Temetis Messor."

Weatherman
Contributor

I have experienced this problem in both my lab and production environments. I was able to fix it in both by simply stopping, starting, and re-reading the license file. Sounds pretty stupid but it solved my issue in two separate VI3 environments.
