We have an issue here with a server that won't VMotion or migrate properly to another host.
Our current configuration:
5 ESX 3.0.1 servers in an HA cluster, all running from one SAN; the guest is on shared fibre disk storage in a VMFS3 partition. All hosts are patched to the September-released patches.
The guest has two raw-attached disks from the SAN. The disk LUNs are presented to all ESX servers, and every ESX server in the cluster can see the raw LUNs and the shared VMFS storage space.
The guest is a highly critical Windows 2003 R2 / MS-SQL 2000 server in our environment, and arranging downtime for it is difficult.
The situation is as follows: when the guest last attempted a VMotion to another host, it failed with the error "Unable to resume on target" and the guest hung. The ESX server hosting the guest could not shut it down or power it off. Killing the server thread left the guest unable to start; it was marked in inventory as "invalid". After removing it from inventory and re-importing it, the guest would start. I have not tried a VMotion on this guest since. Other guests running on the same cluster VMotion fine to and from this host.
Last night I was able to shut the guest down and migrate it, while powered off, to another host in the cluster. When powered back on, it immediately started up again on the original host. Nothing in the logs indicated a problem with migrating it to its new target, and nothing in the event window showed it migrating back before powering on.
Attempted to back it up with esxRanger; esxRanger fails, indicating the guest hardware has not been upgraded.
Guest history: this guest is one of our first VM guests, originally running on ESX 2.5.x. We had two servers available, and the guest VMotioned fine at that time between those two hosts. When we installed the VI3 cluster, the guest was cloned to the VI3 system, the hardware was upgraded, it was powered on and VI3 tools installed, and the guest was rebooted again; it has run flawlessly on one host ever since. This was done about 3 months ago. The problem appeared when placing the host in maintenance mode, when the guest could not VMotion off.
I tried shutting down the guest to check whether the hardware could be upgraded again, but there is no such option in the VC console for this guest, so as far as I can tell the hardware upgrade was done.
Since this VM is using RDMs, are you sure the LUNs are being presented exactly the same to the other ESX servers? I had the same thing happen today when I tried to VMotion a VM from one server to another: it failed at 90%, and when I checked vmware.log it was complaining that it could not open one of the RDM disks.
Disk 1 is a VMDK file on shared storage, 20GB - OS and application.
Disk 2 is mapped raw LUN vmhba1:0:13:0, LUN ID 13, 50GB RAID1 - SQL log.
Disk 3 is mapped raw LUN vmhba1:0:14:0, LUN ID 14, 150GB RAID5 - SQL data.
From Virtual Centre, on each ESX server under Configuration - Storage Adapters, the same disks appear to be available under the same LUN IDs and canonical paths.
Now I have just found some more information on the order of events when trying to start the guest on another host: the VM was powered down and migrated, and this event shows success. The next event is the virtual machine power-on, and it has three related events where I can see it migrated back to the original host. There are no server or CPU affinity rules set for this guest, the VMotion resource map shows it can see all of the hosts, the VMFS storage is visible, and the VMotion verification says it is OK to migrate. Is there any other place to look for something that may force it onto this one host only?
As the other gentleman said, the raw LUNs are the issue. I had to detach the RDM disks, mask the LUNs, migrate the W2K3 server, unmask the LUNs, and re-attach the raw disks to the VM. (Masking the LUNs may not have been necessary, but I was annoyed.)
Try it. Yes, it means downtime.
Just to clarify though, VMotion should work just fine when using RDMs. It does require that the RDM LUNs are presented with the same LUN number to all the ESX servers, though. You can go into Configuration - Storage Adapters on each ESX host to make sure it sees the RDM LUNs. If not, do a rescan on the host to see if it picks them up.
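To spot a presentation mismatch quickly, the per-host LUN listings can be compared programmatically. A minimal sketch, assuming you've copied each host's canonical path -> LUN ID view into a dict (the host names and LUN data below are made up for illustration):

```python
# Hypothetical per-host view of RDM LUNs: canonical path -> LUN ID.
# In practice this data would be read off each host's storage adapter listing.
host_luns = {
    "esx1": {"vmhba1:0:13": 13, "vmhba1:0:14": 14},
    "esx2": {"vmhba1:0:13": 13, "vmhba1:0:14": 14},
    "esx3": {"vmhba1:0:13": 13},  # missing LUN 14 - VMotion to here would fail
}

def find_mismatches(host_luns):
    """Report LUNs that are not presented identically on every host."""
    all_paths = set().union(*host_luns.values())
    problems = []
    for path in sorted(all_paths):
        ids = {host: luns.get(path) for host, luns in host_luns.items()}
        if len(set(ids.values())) > 1:  # missing (None) or different LUN ID
            problems.append((path, ids))
    return problems

for path, ids in find_mismatches(host_luns):
    print(path, ids)
```

Any host showing `None` (LUN not presented) or a different LUN ID for the same path is a VMotion blocker for RDM-backed guests.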
The CD-ROM is set to not connect at startup and is not currently attached. VMotion normally flags this when it verifies whether the VMotion is allowed.
On Monday I'm going to deploy a test guest and try VMotion with and without some RDMs attached and see how it goes. But as indicated in another post here, the hosts are all able to see the disks, and the VM ID and LUN number appear (at least in Virtual Centre) the same on each host.
I don't know if it is relevant, but the hosts are Dell 2950s, dual-CPU quad-core Xeon, with dual-path fibre cards running the Navisphere agent. The SAN is a Dell/EMC CX500, and the switches are Brocade. We had no problems with VMotion of this guest when it was under ESX 2.5 attached to the same SAN. The issue has only arisen since its migration to 3.0.1.
We have a storage group for all of our ESX hosts; the LUN is bound and assigned to the storage group. I then rescan all the LUNs on each ESX host until they see the disk, then attach the disk to the guest via the "Edit Settings" dialog in VC2.
I've been able to successfully VMotion a test guest created from a template with an RDM attached. I'm arranging another test on the SQL server for tonight to see if the problem was a one-off or is persistent.
I opened a case with VMware, and I have some interesting news on this issue.
Support looked in the vmkernel log for a lock holder and found a few MAC addresses which had a lock on the LUNs, preventing the VM from VMotioning.
We talked about what could cause the lock, and one theory was the Dell OpenManage software which I have running on my hosts.
We thought a reboot would do the trick, but... a reboot and shutdown of the host with the lock did not release the lock!
I then stopped the Dell OpenManage services (dsm_om_connsvc and dsm_om_shrsvc) and rebooted the host, and that released the lock!
I will see if I can reproduce the lock in the coming weeks and test this again.
If you have the same issue with a VM and a mapped LUN failing a VMotion at 90% ("A general system error occurred: Unknown failure migrating from another host"),
have a look in the vmkernel log of the host your VM is running on and look for the line which says "lock holders":
Checking if lock holders are live for lock [type 10c00001 offset 43483136 v 10, hb offset 3744768
gen 46, mode 1, owner 47bd72f3-61d0a598-2bea-001aa00e839a mtime 97510]
The last segment of the owner field (001aa00e839a in the example above) is the MAC address of the NIC on the host you are doing a VMotion to, which holds the lock.
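If you want to turn that owner field into a readable MAC address, a minimal sketch (the log line is the one quoted above; the parsing itself is my own, not from VMware support):

```python
import re

# Sample "lock holders" line from /var/log/vmkernel (taken from this thread).
line = "gen 46, mode 1, owner 47bd72f3-61d0a598-2bea-001aa00e839a mtime 97510]"

# The owner UID ends with the MAC of the NIC on the lock-holding host;
# grab the final 12-hex-digit segment and reformat it with colons.
m = re.search(r"owner \S+-([0-9a-f]{12})", line)
mac = ":".join(m.group(1)[i:i + 2] for i in range(0, 12, 2))
print(mac)  # -> 00:1a:a0:0e:83:9a
```

You can then match that MAC against the NICs of your ESX hosts to find which one is holding the lock.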
We use Dell OpenManage, but maybe you are using the HP software to monitor your hardware, or maybe you are not using any of it and still have this problem.
Let me know so we can make this a good thread to help solve this strange lock.
Again, this only happens with a VM and a mapped LUN. If you remove the mapped LUN you can do a VMotion, but you can't add the mapped LUN back on the new host.
It will let you choose the LUN, but on Finish you get the error: "Operation failed due to concurrent modification by another operation."
I have the same problem. We are using the monitoring tool from Fujitsu Siemens called ServerView and have agents installed on all our ESX servers. I can see the process of the HDD agent (from the ServerView agents package) running, and I cannot stop this service (even with kill -9 <pid>). We have 2 ESX servers on which I cannot kill the HDD agent. All the other ESX hosts run fine.
Have you any response from VMware?