VMware Cloud Community
alekp
Contributor

Storage vMotion error with Debian / CentOS 6

Hey all,

I'm having a bit of an issue with Storage vMotion in our cluster.

We had three ESXi boxes (two ESXi 5 and one ESXi 4) running independently. We have started to see an increase in usage, and I finally convinced the bosses we needed some form of HA, so I've set up vSphere to move everything over. But I've hit a bit of an issue and was wondering if you could help.

I have set up three new ESXi 5 boxes and connected them all to our Openfiler SAN, which presents an iSCSI mount to all three of the new ESXi nodes. That's all fine and it works. The ESXi 5 installs are all identical, as they came from the same disk.

All the VMs on the ESXi 4 host have migrated over with no downtime, so everything is working fine. Now we are left with the two "old" ESXi 5 nodes.

I am trying to migrate the disks to the iSCSI mount so I can move the VMs over to the new nodes (they are on local storage at the moment).

I have moved the Red Hat 5 VMs over, and all the Windows ones. What I am left with is three Debian 6 VMs and one CentOS 6 VM, all 64-bit.

Now, when I try to move them (right-click > Migrate > Change datastore) and select the Openfiler SAN, it starts, gets to 76%, then fails with:

A general system error occurred: The migration has exceeded the maximum switchover time of 100
second(s).  ESX has preemptively failed the migration to allow the virtual machine to continue running
on the source. To avoid this failure, either increase the maximum allowable switchover time or wait
until the virtual machine is performing a less intensive workload.

I googled around and found that I can increase the timeout, but I don't think that alone will fix it, as some people have said they increased it and it didn't help. I also need to avoid downtime if I can; we have a one-week notice period.
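For reference, the setting people mention is fsr.maxSwitchoverSeconds in the VM's advanced options (the default is 100, matching the error above). If anyone wants to script it, a rough pyVmomi sketch is below. It's untested on my end, the vCenter hostname, credentials, and VM name are just placeholders, and I'm not certain whether the new value takes effect on a running VM or only after a power cycle:

```python
# Rough sketch: raise the svMotion switchover timeout on one VM via pyVmomi.
# Untested; the host, credentials, and VM name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skips certificate checks
si = SmartConnect(host='vcenter.example.local', user='administrator',
                  pwd='password', sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == 'debian6-vm1')
    view.DestroyView()

    # fsr.maxSwitchoverSeconds controls the "maximum switchover time"
    # from the error message (default 100 seconds).
    spec = vim.vm.ConfigSpec(extraConfig=[
        vim.option.OptionValue(key='fsr.maxSwitchoverSeconds', value='300')])
    WaitForTask(vm.ReconfigVM_Task(spec=spec))
finally:
    Disconnect(si)
```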

I brought up a temporary NFS server to test as well, and the same thing happens when I try to migrate to the NFS mount.

Network layout (old nodes)
NIC1, Switch 1 - untagged VLAN 1002 (network)
NIC2, Switch 2 - tagged VLAN 301 (SAN), tagged VLAN 2198 (private network) <- vMotion enabled on this NIC

Network layout (new nodes)

NIC1,2 - Switch 1 - tagged VLANs 1002 and 2198 (vMotion here)
NIC3,4 - Switch 2 - untagged VLAN 301 (LACP trunk)

Network labels line up on all hosts.

This is ONLY affecting the Debian 6 and CentOS 6 VMs; Windows and CentOS 5 work fine. So I believe my network setup is fine, as is all the storage, since everything else works. It's just storage vMotion for these four friggen VMs.

It gets weirder: if I right-click one of the VMs that won't migrate, clone it to a new VM, and then migrate the clone, it works fine (with the clone's network cables disconnected).

I know what you're thinking: reboot the thing and then migrate it. I want to avoid downtime, but that will be my next option. It's actually annoying me and playing on my mind why it won't work, so I'd like to figure it out.

All VMs run the latest version of VMware Tools, and the service is running.

If anyone can offer any ideas, that would be fantastic.

Thanks guys

alekp
Contributor

I've found out some more.

It's only the active state of the VM: the base storage can be moved OK; just the active state fails.

My guess is it has to do with the swap file.

Does anyone know if this can be deleted while the VM is running?

Thanks

zXi_Gamer
Virtuoso

If you are asking about deleting the swap file while the VM is running, then DO NOT do that; it could end in a drastic scenario. The main reason you might be facing this problem is that the current memory of the Debian/CentOS guests could not be committed to the swap file (.vswp) during the svMotion, and hence the timeout is happening.

It could be that the guests are swapping more. Try increasing the memory reservation for the VMs to reduce the swapping, do the svMotion, and after completion revert the memory reservation.
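If it helps to script that, a rough pyVmomi sketch of the reserve-then-revert approach is below. It is untested, the vCenter hostname, credentials, and VM name are placeholders, and the Storage vMotion itself would run between the two reconfigure calls:

```python
# Rough sketch: raise a VM's memory reservation before svMotion, then
# revert it afterwards. Untested; connection details are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skips certificate checks
si = SmartConnect(host='vcenter.example.local', user='administrator',
                  pwd='password', sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == 'centos6-vm1')
    view.DestroyView()

    old_reservation = vm.config.memoryAllocation.reservation  # in MB

    def set_mem_reservation(mb):
        alloc = vim.ResourceAllocationInfo(reservation=mb)
        spec = vim.vm.ConfigSpec(memoryAllocation=alloc)
        WaitForTask(vm.ReconfigVM_Task(spec=spec))

    # Reserve all configured memory so ESXi stops backing the guest
    # with the .vswp file during the switchover.
    set_mem_reservation(vm.config.hardware.memoryMB)

    # ... perform the Storage vMotion here, then revert:
    set_mem_reservation(old_reservation)
finally:
    Disconnect(si)
```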
