I was on the phone with VMware for 2 hours about this problem and no one seems able to figure it out.
We want to upgrade our existing ESX clusters from 3.0.2 to 4. In the test environment we had no problems; however, when we tried this in production, we hit a huge snag.
We have already upgraded Virtual Center from 2.0.2 to 4 with no real hiccups. It took a little longer because of some manual changes we had to make to the database, but overall it was pretty straightforward.
Today, we tried to upgrade one of the ESX servers. I evacuated all the running VMs on the server and manually placed it in Maintenance mode. I ran the Host Update Utility instead of the Update Manager because I didn't want to upload the ESX DVD ISO across our WAN. (The Virtual Center Server is in another city, with 5 ESX servers there, 3 in my site and 2 in another city's site.) I started the Host Update Utility, and everything seemed to be going well until I got an error saying that the upgrade failed. Looking at the console on the ESX server, I saw the following:
Driver's Successfully Loaded
Cannot find device with UUID: xxxxxxxxxxx
Error: Cannot find device with UUID: xxxxxxxxxx
Press <return> to reboot
When I reboot the server, it starts the upgrade again and fails at the same point. I managed to get the server to start back up in ESX 3.0.2 successfully, and I found out what the UUID was for: it belongs to the /boot partition. I looked in /boot/grub/grub.conf and everything matches up fine.
I then modified the grub.conf so that it booted ESX Server instead of ESX Upgrade and contacted VMware support. We tried multiple things, including an upgrade script that you run directly from the ESX Server, but all it does is what the Host Update Utility does.
The technician then recommended that I upgrade to ESX 3.5 and try it again, so I did that and I still get the same error.
If I remove the bootpart=xxxxxxxxx and reboot, it then says it can't find the UUID xxxxxxxx which is my root partition. I haven't tried removing both lines, but I assume if I do that, then the upgrade will have no idea what to upgrade.
We can't do a clean install on these servers because of the configurations on the ESX Servers for some of the VMs. We have a VM cluster that is tied specifically to two of my ESX servers and communicates with a SAN directly, not through the ESX Server (we had to do this for the company to support the software, not my call. I would have done it right.) If it wasn't for that one restriction, I would have no problem rebuilding the entire ESX server and installing all my SAN software and drivers so that it can communicate with the LUNs I have carved out for it.
My hardware for my ESX Servers is the same across all the sites. They are Dell PowerEdge 2950s with 2 quad-core Xeon processors and 16GB of RAM, with 6 72GB SAS hard drives attached to the PERC 5/i controller. I have two sets of mirrors and two hot spares in the servers. They are connected to my Dell CX-310c SAN via the QLogic FC card.
I am at a loss right now. Does anyone have any ideas?
I have a stupid question.
Why even upgrade?
What feature do you need?
I still have machines running GSX Server and VMware Server 1.0 and 2.0, for which I cannot justify an upgrade.
I know this is not answering your question.
I would just be scared of any upgrade without having the ability to roll back, especially since it is at a remote location.
shooting blanks in the dark, here.
I guess the issue is with the UUID of the target VMFS datastore for the COS vmdk.
Please try the following steps; they may help you.
In the problem description you mentioned that you tried the scripted upgrade too. This time, execute the script esxupgrade.sh with the switch -r, so the host will wait for a manual reboot to start the upgrade.
#./esxupgrade.sh -r -i <4.0 ISO file> <target VMFS datastore>
Now open the auto-generated kickstart file, /etc/vmware/ks-upgrade.cfg. You can see the autopart entry with the UUID of the "target VMFS datastore". Check that the UUID of the VMFS datastore and the UUID in the kickstart file are the same. You can get the UUID of the VMFS volume using vmkfstools.
vmkfstools -Ph /vmfs/volumes/FCSAN1/
If you see any difference in the UUID in the kickstart file, change it to the correct one, which you can see in the vmkfstools output, and reboot the host. The host will start up with ESX upgrade.
autopart supports only the UUID, not datastore names. The autopart entry in the kickstart file should look like the following.
Please update the thread with your findings.
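The kickstart check described above can be sketched as a small shell helper. This is only an illustration: the function names ks_vmfs_uuid and check_ks_uuid are made up for this example and are not part of any VMware tooling, and the datastore UUID is assumed to be copied by hand from the vmkfstools output.

```shell
#!/bin/sh
# Hypothetical helpers (names invented for this sketch).

# Print the UUID given to "autopart --onvmfs=" in kickstart text read on stdin.
ks_vmfs_uuid() {
    sed -n 's/^autopart .*--onvmfs=\([0-9a-fA-F-]*\).*/\1/p'
}

# Compare the kickstart UUID with the datastore UUID.
# $1 = kickstart file, $2 = UUID copied by hand from the vmkfstools output.
check_ks_uuid() {
    ks_uuid=$(ks_vmfs_uuid < "$1")
    if [ "$ks_uuid" = "$2" ]; then
        echo "match: $ks_uuid"
    else
        echo "MISMATCH: kickstart has '$ks_uuid', datastore has '$2'"
        return 1
    fi
}

# Example call (replace the placeholder with the UUID of your datastore):
#   check_ks_uuid /etc/vmware/ks-upgrade.cfg <UUID from vmkfstools>
```

A non-zero return status means the two UUIDs differ and the kickstart file would need to be corrected before the reboot.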
There are a couple of features that we want to use, the major one being Storage VMotion. We also want to be able to change the size of hard drives and memory on the VMs on the fly, which you can only do with VM Version 7.
There are no custom partitions for the ESX Server. It is a right-out-of-the-box install on the server. There are a couple of custom LUNs for one of the running virtual machines, but that should not be an issue since the ESX Server does not see those LUNs, just the VM. I had a copy of ESX 3.5 no Update (build 64607). The upgrade from 3.0.2 to 3.5 was seamless and worked perfectly fine. It was the upgrade from 3.5 to 4 that gave me the same error as the upgrade from 3.0.2 to 4.
I am currently downloading 3.5 Update 4 and will try to upgrade to it, and then try to upgrade to 4 again on the advice of VMware, but I don't think it will work either. It keeps failing, saying that it "Cannot find device with UUID: xxxxxxxxx", and the UUID is that of my /boot partition. I find it weird that it can find the kernel in the /boot partition and boot it up, but after it finishes loading all the drivers for the hardware, it says it can no longer find the /boot partition.
I looked at the kickstart file and the UUID is correct for the VMFS datastore.
Contents of /etc/vmware/ks-upgrade.cfg:
generated by VIU
autopart --onvmfs=46b9f2f6-1d38f3fe-d484-0019b9da2b15 --extraspace=1030
Output of vmkfstools -Ph /vmfs/volumes/storage1:
VMFS-3.21 file system spanning 1 partitions.
File system label (if any): storage1
Capacity 60G, 58G available, file block size 1.0M
Partitions spanned (on "lvm"):
The UUID in question is the one for my boot partition. After the reboot, it loads the ESX Server 4 kernel and goes through LOADDRIVERS. It loads all the drivers and then shows the following output on the screen:
Cannot find device with UUID: 868f93ae-5253-4077-aa93-2c2f5aa68716
error: Cannot find device with UUID: 868f93ae-5253-4077-aa93-2c2f5aa68716
press <return> to reboot
That UUID is mapped to my /boot partition according to /etc/fstab
In the grub.conf I have the following:
grub.conf generated by anaconda
Note that you do not have to rerun grub after making changes to this file
NOTICE: You have a /boot partition. This means that
all kernel and initrd paths are relative to /boot/, eg.
kernel /vmlinuz-version ro root=/dev/sda2
WEASEL -- timeout=10
WEASEL -- timeout 5
WEASEL -- default=0
WEASEL -- default 0
title Upgrade ESX
kernel /esx4-upgrade/vmlinuz mem=512M upgrade bootpart=868f93ae-5253-4077-aa93-2c2f5aa68716 rootpart=790c9cae-90c3-4cae-8ee0-c7b952eb767e ks=file:///etc
title VMware ESX Server
kernel --no-mem-option /vmlinuz-2.4.21-47.0.1.ELvmnix ro root=UUID=790c9cae-90c3-4cae-8ee0-c7b952eb767e mem=272M cpci=2:;5:;9:;10:;12:;14:;
title VMware ESX Server (debug mode)
kernel --no-mem-option /vmlinuz-2.4.21-47.0.1.ELvmnix ro root=UUID=790c9cae-90c3-4cae-8ee0-c7b952eb767e mem=272M cpci=2:;5:;9:;10:;12:;14:; console=ttyS0,115200 console=tty0 debug
title Service Console only (troubleshooting mode)
kernel --no-mem-option /vmlinuz-2.4.21-47.0.1.ELvmnix ro root=UUID=790c9cae-90c3-4cae-8ee0-c7b952eb767e mem=272M tblsht
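For anyone who wants to repeat the comparison between the Upgrade ESX kernel line and /etc/fstab without eyeballing the UUIDs, here is a rough sketch. The helper names grub_param and fstab_mount_for_uuid are invented for this example, and it assumes fstab identifies the partitions with UUID=... in the first column; adjust if yours uses labels or device names.

```shell
#!/bin/sh
# Hypothetical helpers (names invented for this sketch).

# Print the value of NAME=value from a kernel command line read on stdin.
# $1 = parameter name, e.g. bootpart or rootpart.
grub_param() {
    tr ' \t' '\n\n' | sed -n "s/^$1=//p" | head -n 1
}

# Print the mount point that fstab text (read on stdin) assigns to UUID $1.
# Assumption: the fstab entry starts with "UUID=<uuid>".
fstab_mount_for_uuid() {
    awk -v want="UUID=$1" '$1 == want { print $2 }'
}

# Example, using the kernel line of the Upgrade ESX entry joined onto one line:
#   boot_uuid=$(grep esx4-upgrade /boot/grub/grub.conf | grub_param bootpart)
#   fstab_mount_for_uuid "$boot_uuid" < /etc/fstab
```

If the second command prints /boot, the bootpart value in grub.conf and the fstab entry agree, which is the situation described in this post.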
I have changed the default from 0 to 1 so that if I have to reboot the server, it will start ESX Server instead of ESX Upgrade. When I am ready to do the upgrade, I will change it back to 0.
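Switching the default entry back and forth by hand is easy to get wrong, so here is a minimal sketch of doing it with sed. The function name set_grub_default is made up for this example, and it assumes the active setting is a line that starts with "default" followed by a space or "="; lines starting with anything else are left alone. Keep a copy of grub.conf before overwriting it.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch).

# Rewrite the "default N" (or "default=N") line of grub.conf text read on
# stdin so that it selects entry $1, and print the result on stdout.
set_grub_default() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: set_grub_default <entry number>" >&2; return 2 ;;
    esac
    sed "s/^default\([ =]\).*/default\1$1/"
}

# Example: boot ESX Server (entry 1) instead of Upgrade ESX (entry 0).
#   cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
#   set_grub_default 1 < /boot/grub/grub.conf.bak > /boot/grub/grub.conf
```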
Hope this helps.
I have and I still have the same error after the LOADDRIVERS.
With the Host Update Utility, the progress bar gets to 67% before the failure, and I can't access any logs because it doesn't write any to the hard drive since it can't access it.
When I run the script, it works up to the point of the reboot. I checked the kickstart file and everything matches up, and everything in grub.conf matches up as well. I can't figure out why this isn't working.
If I could at least get a command prompt instead of a reboot prompt, so I could see which UUIDs it can see and try to figure out which is which, I think I could get this thing up and running. I really think that LOADDRIVERS loads a funky PERC 5/i driver that somehow renumbers my UUIDs, and that is why the upgrade fails. If I could see what this driver is changing them to, I think I would be able to get this to install.
ESX 3.5 boots up and operates with no problems. The upgrade path from 3.0.2 to 3.5 had no problems. I am currently running ESX 3.5 on one of my ESX Servers and everything runs smoothly.
My problem is when I upgrade from 3.5 to 4. Everything starts fine, but after it loads the drivers from the kernel, it crashes saying it can't find the /boot partition. It did this when I tried to do an upgrade from 3.0.2 to 4 as well.
The server started out running 3.0.2. We tried to upgrade to 4 using the Host Update Utility, but it failed, saying it could not find UUID 868f93ae-5253-4077-aa93-2c2f5aa68716 and to press return to reboot. After the reboot it would try the upgrade again and fail at the same spot, making for a never-ending loop of failed upgrades. I finally regained control of my system and changed my /boot/grub/grub.conf so that it automatically starts ESX Server, so that I can at least continue to use my ESX Server. VMware support told me to try upgrading 3.0.2 to 3.5 and then try to upgrade to 4. I did that, and the upgrade from 3.0.2 to 3.5 worked. I am currently running 3.5 on my ESX Server. However, when I tried to upgrade 3.5 to 4, I got the same error as I did with the upgrade from 3.0.2 to 4.
VMware now wants me to upgrade to 3.5 update 4 and try to upgrade to 4 after that, but I have to wait until Monday before I can do that since we don't make major system changes on a Friday.
Booting ESX 3.5 into any mode works fine. It's booting into ESX Upgrade that fails.
Ok good news....
I tried the upgrade on another of my ESX servers and it went according to plan. The only difference between this ESX server and the one I am having trouble with is the SAN that is connected to it. I unplugged the SAN from the troublesome ESX Server and the upgrade went smoothly on it. Of course, after the reboot, I reattached the SAN and the server froze on the vsd-mount module during boot up. So I disconnected the SAN again and did a POR, and the server started ESX 4 with no problems. After the system was up and running, I attached the SAN again and I was able to browse to it on the server and on the vCenter Server.
The only thing I have left to test is the failover capability of my VM running on the ESX Server, to verify that it can still find the LUNs on this SAN. If it does, I'll have to make some changes to our POR procedures for this and another ESX Server.
Now that all your hosts are running 4.0, why don't you back up the server config, put the host into Maintenance Mode, do a clean installation, and then restore the config back to the box afterwards?
See if this resolves the issues instead of having different procedures for different hosts?
If you still see the same issues, get the logs over to VMware and get them to fix the bug.
I may just try that. The only problem that I foresee is my configuration for the 2 VMs that use RDM for their cluster storage. That is what has been giving me and VMware a headache recently. That, and having to boot up with the SAN disconnected and then reconnecting it after boot up is complete.
I was finally able today to get both nodes to start up, but I had to go into the Windows Cluster Administrator and disable the Windows cluster resources for the storage, then start the other node, and once it was online, bring the cluster storage back online. The only thing I have left to do is test the failover and hope it works.
Talking with VMware, the technician said that the HP MSA 1000 is no longer on the HCL, even though it does work with ESX.
I think now is a good time to buy more storage for my other SAN and see about migrating this to that. I wouldn't have had any problems if it wasn't for the MSA 1000.
Now, if only we could figure out this problem....but that is for another thread.
Thank you to everyone with their pointers.
If I may ask, what exactly did you do? I am running into the same problem on a server going from 3.5 to 4.0. It fails on a UUID that is also my boot partition. It boots fine into 3.5, but I cannot get the upgrade to go through successfully. I don't have a SAN connected, and in fact I have removed the iSCSI configuration completely.
Basically, we had a normal install of ESX 3.0.2 on two of our ESX servers. However, we set up RDM disks for the virtual cluster we run using Windows. For some reason no one can figure out, the MSA array was presenting the ESX Servers with the drives used by the cluster, and the upgrade was trying to boot up and install from those LUNs instead of the local storage on the server itself. When I unplugged the MSA, it could no longer present the storage to the server, and the server could load up off of the local storage and install properly. The reason it was only affecting my MSA and not my Dell SAN was that Navisphere was not loading, so the ESX server could not see that storage.
After talking with VMware, they said that the MSA 1000 is no longer on the HCL, which is a good enough reason for me to push to my supervisors that we need to get off of this array and get me more storage for my Dell SAN.
In your case, it seems like you may be running into a situation where your server is trying to install on another device, perhaps one on your iSCSI?
Have you tried using the local install script from VMware to install the software directly from the ESX Server? You can download it from:
http://kb.vmware.com/kb/1009440. This script basically runs the Host Update Utility from the ESX Server. Make sure you uncompress it on the ESX server using unzip, because if you use Windows Compressed Folders, it screws up the shell script. (I found that out the hard way.)
Let me know if that works for you.
Unfortunately that didn't help. The upgrade is still failing on the /boot partition for some reason. I am going to try opening a ticket with VMware next week and see what they say.
Thanks for the help.
I'm having a similar problem as well. I am upgrading from 3.5 to 4.0 using the Host Update Utility. After the machine reboots, I get the error "Cannot find device with uuid... Press <return> to reboot".
I also tried doing a fresh install from the DVD, but when I got to the step that selects which partition to install to, ESX did not see any. I have an Adaptec 3805 controller, which is on the HCL, and two drives in RAID 1 which I want to install to.
Has anyone discovered the cause of this error?