I was on the phone with VMware for two hours about this problem, and no one seems able to figure it out.
We want to upgrade our existing ESX clusters from 3.0.2 to 4. We had no problems in the test environment, but when we tried it in production we hit a huge snag.
We have already upgraded Virtual Center from 2.0.2 to 4 with no real hiccups. It took a little longer because of some manual changes we had to make to the database, but overall it was pretty straightforward.
Today, we tried to upgrade one of the ESX servers. I evacuated all the running VMs from the server and manually placed it in Maintenance mode. I ran the Host Update Utility instead of the Update Manager because I didn't want to upload the ESX DVD ISO across our WAN. (The Virtual Center server is in another city, with 5 ESX servers there, 3 at my site and 2 at another city's site.) I started the Host Update Utility, and everything seemed to be going well until I got an error saying that the upgrade failed. Looking at the console on the ESX server, I saw the following:
Driver's Successfully Loaded
Cannot find device with UUID: xxxxxxxxxxx
Error: Cannot find device with UUID: xxxxxxxxxx
Press <return> to reboot
When I reboot the server, it starts the upgrade again and fails at the same point. I managed to get the server to boot back into ESX 3.0.2 successfully, and I found out what the UUID refers to: it is the /boot partition. I checked /boot/grub/grub.conf and everything matches up fine.
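For anyone repeating that check, here is a rough sketch of pulling a partition reference out of the kernel line so it can be compared with the UUID in the error message. The function name kernel_arg and the sample grub.conf text are made up for illustration; only the bootpart= argument name comes from this thread, and a real file may be laid out differently.

```shell
# Print the value of a name=value argument found on "kernel" lines.
# Reads grub.conf-style text on stdin; $1 is the argument name.
kernel_arg() {
    grep '^[[:space:]]*kernel[[:space:]]' |
        tr ' \t' '\n\n' |
        sed -n "s/^$1=//p"
}

# Hypothetical sample; on a real host you would feed in /boot/grub/grub.conf.
kernel_arg bootpart <<'EOF'
title ESX Upgrade
    kernel /vmlinuz ro bootpart=0000-hypothetical-uuid
    initrd /initrd.img
EOF
```

On a host this would be run as kernel_arg bootpart < /boot/grub/grub.conf, and the printed value compared by eye with the UUID the upgrade says it cannot find.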
I then modified grub.conf so that it booted ESX Server instead of ESX Upgrade and contacted VMware support. We tried multiple things, including an upgrade script that you run directly on the ESX server, but it only does what the Host Update Utility does.
The technician then recommended that I upgrade to ESX 3.5 and try again, so I did that, and I still got the same error.
If I remove the bootpart=xxxxxxxxx and reboot, it then says it can't find the UUID xxxxxxxx which is my root partition. I haven't tried removing both lines, but I assume if I do that, then the upgrade will have no idea what to upgrade.
We can't do a clean install on these servers because of the configuration on the ESX servers for some of the VMs. We have a VM cluster that is tied specifically to two of my ESX servers and communicates with a SAN directly, not through the ESX server (we had to do this for the company to support the software; not my call, I would have done it right). If it weren't for that one restriction, I would have no problem rebuilding the entire ESX server and installing all my SAN software and drivers so that it can communicate with the LUNs I have carved out for it.
My hardware for my ESX servers is the same across all the sites. They are Dell PowerEdge 2950s with 2 quad-core Xeon processors and 16GB of RAM, with 6 72GB SAS hard drives attached to the PERC 5/i controller. I have two sets of mirrors and two hot spares in each server. They are connected to my Dell CX-310c SAN via the QLogic FC card.
I am at a loss right now. Does anyone have any ideas?
Just a quick verification: you don't have a SAN connected to the HBA, do you? One thing I ran into was having the SAN plugged in. For some reason, it kept wanting to use the MSA 1000 as primary storage and did not see the built-in storage on my PERC controller. I disconnected the MSA and had no issues.
Another thing you can check is to make sure that you don't have custom partitions. If you installed ESX 3.5 with the typical settings and did not make changes to the partition tables during setup, you should be fine.
No, the host is not connected to a SAN. I do have a second partition mounted from /etc/fstab so I can try removing that.
On a side note, updating the RAID card's firmware doesn't seem to have changed anything for me.
Upgrading my firmware did not help either. The only way I got it to finally work was to disconnect my HP MSA 1000 from the HBA and boot up that way.
It seemed like the MSA was trying to become the boot device during boot-up, even though I had disabled it as a boot device in the HBA's BIOS and had told my server's BIOS to use the PERC controller. I also verified that booting was enabled in the PERC controller's BIOS.
VMware told me that the major change between version 3.x and 4 is the way the system handles datastores. It was done so that you can use Storage VMotion, which is one of the main reasons we went for the upgrade.
You shouldn't have to remove the second partition from the /etc/fstab, just try commenting it out.
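As a rough sketch of what "commenting it out" could look like, the function below reads fstab-style text on stdin and prefixes the entry for one mount point with "#". The function name comment_out_mount, the /data mount point, and the sample entries are all made up for illustration; it writes to stdout rather than touching /etc/fstab, so the result can be inspected before anything is changed.

```shell
# Comment out the fstab entry whose second field (the mount point) is $1.
# Lines that are already commented are left alone.
comment_out_mount() {
    awk -v mp="$1" '$2 == mp && $1 !~ /^#/ { print "#" $0; next } { print }'
}

# Hypothetical sample entries; /data stands in for the extra partition.
comment_out_mount /data <<'EOF'
LABEL=/ / ext3 defaults 1 1
/dev/sdb1 /data ext3 defaults 1 2
EOF
```

Keeping a copy of the original file before replacing it makes the change easy to undo once the upgrade has been attempted.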
Also, how are you performing your upgrade? Are you using the Host Update Utility or the shell script from VMware KB or Update Manager from vSphere Client?
Commenting out the second partition in /etc/fstab did nothing. I have tried using both the Host Update Utility and the Update Manager in the vSphere client with no luck. I'll give the script a try.
Finally got ESX 4 installed! I was able to install from scratch using the DVD when I used the "noapic" kernel argument at boot. It is the same issue as in this thread: http://communities.vmware.com/thread/154852 (a Supermicro motherboard with the same Adaptec RAID controller).
Thanks for the help!
Can you send me the install command you used for the install so I can try it?
Sure, this worked when doing a fresh install from the DVD. When the disk first boots, press F2 for other options, then append "noapic" to the end of the boot arguments. I suppose I could have instead added that to grub.conf to get the upgrade to work.
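The grub.conf idea above was not tested in this thread, so treat the following only as a sketch of how the argument could be appended to the kernel lines. The function name add_kernel_arg and the sample grub.conf text are hypothetical; the function reads from stdin and prints the result, leaving the real file untouched.

```shell
# Append argument $1 to every "kernel" line that does not already have it.
# Reads grub.conf-style text on stdin and prints the modified text.
add_kernel_arg() {
    awk -v arg="$1" '
        $1 == "kernel" {
            found = 0
            for (i = 2; i <= NF; i++) if ($i == arg) found = 1
            if (!found) { sub(/[ \t]+$/, ""); $0 = $0 " " arg }
        }
        { print }'
}

# Hypothetical sample; a real /boot/grub/grub.conf will look different.
add_kernel_arg noapic <<'EOF'
title ESX Upgrade
    kernel /vmlinuz ro quiet
    initrd /initrd.img
EOF
```

Running it a second time on its own output changes nothing, so it is safe to re-run while experimenting.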
What hardware is your host running? If it's similar to mine, you may have found a fix for this issue, which would be easier than unplugging my SAN and plugging it back in after ESX has booted.
I'll try running the noapic boot parameter and leaving my SAN plugged in to see if that works.
Funny that you had to disable the APIC (Advanced Programmable Interrupt Controller) to get it to work. Seems like a bug that needs to be addressed...
Something tells me that it will.
I figured out my problem was related to how the VMFS share was named. Someone had defined it as esx01:storage1, and for some reason that colon seemed to mess up the whole install process. Once I fixed that, I was able to complete the install and bypass the error.
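A quick way to spot names like that before starting an upgrade might be something along these lines. The function name names_with_colon and the sample names are made up for illustration; on a host, the list of names would presumably come from listing /vmfs/volumes, which is an assumption and not something confirmed in this thread.

```shell
# Print only the names (one per line on stdin) that contain a colon.
# Always exits 0 so that "no matches" is not treated as a failure.
names_with_colon() {
    grep ':' || true
}

# Hypothetical datastore names; only esx01:storage1 appears in this thread.
printf '%s\n' 'esx01:storage1' 'storage2' 'backup lun' | names_with_colon
```

An empty result means none of the names contains a colon.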
Has anybody resolved this problem?
root@xac root# vmkfstools -Ph /vmfs/volumes/xacstorage1
VMFS-3.31 file system spanning 1 partitions.
File system label (if any): xacstorage1
Capacity 141G, 139G available, file block size 1.0M
Partitions spanned (on "lvm"):
Open /etc/vmware/ks-upgrade.cfg, change "xacstorage1" to UUID.
An error occurs...
Can you try after removing the "/" which is appended after the datastore name? ie, "/vmfs/volumes/xacstorage1"
When I remove the "/", it works fine!
The "/" was there because I used Tab completion, which appended it.
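Since Tab completion adds that slash silently, a small helper can strip it before the path is handed to anything else. This is only a sketch: the function name strip_trailing_slash is made up, and the path is the one used earlier in this thread.

```shell
# Remove trailing "/" characters from a path, but keep a bare "/" as is.
strip_trailing_slash() {
    printf '%s\n' "$1" | sed 's:\(.\)/*$:\1:'
}

strip_trailing_slash /vmfs/volumes/xacstorage1/
# prints /vmfs/volumes/xacstorage1
```

The sed expression keeps the last non-slash character and drops every slash after it, so a path without a trailing slash comes back unchanged.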
Thanks yezdi for giving me a great help again~~
That worked for me too... I also had esx03:storage1. I replaced the colon with a space and voilà, it worked.
This initial installation was done by Dell services.
Well, I tried another ESX host and it worked perfectly too. Then I went for the last ESX host and hit the same problem. I rebooted and attempted the upgrade several times, even changing the name once again. After several more reboots and attempts, it finally worked.
Very strange. I guess you need to keep trying until it goes through.
Here is the solution I found to a similar problem in my environment:
The error occurred during the grub update process which is the first part of the install.
Hope this helps someone,