Contributor

Guest VM Issue post vmotion

We're experiencing an odd problem with a W2003 VM running a product called Pharos. If the VM is live migrated to another host, either manually or via DRS, after a few hours the hosted application grinds to a halt, requiring a server reboot or another vmotion to correct. Another vmotion corrects the issue only for a short time; a reboot is the only lasting fix. Memory and CPU stats inside the guest OS do not increase significantly post vmotion, and there's nothing indicative in the server logs.

Any ideas or any tips as to where I could look to troubleshoot?

12 Replies
Immortal

Does it happen only on a specific host?

It's a very strange problem. If you have an active SnS, use it to open a ticket with VMware.

Andre

**if you found this or any other answer useful please consider allocating points for helpful or correct answers

Andre | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
Expert

This is more out of curiosity and not a solution.

We use Pharos, but I don't have anything to do with it, so I will ask my colleague if he has seen weird behaviour.

Does this affect all your Pharos VMs or for example just the backend servers?

I know we have several servers for front and back end :)

Please consider marking my answer as "helpful" or "correct"
Contributor

Doesn't seem to be related to a specific host, which makes it even more difficult to diagnose! :(

Thanks both - sounds like we're in a similar position, as the team I work in looks after the virtual infrastructure and therefore hosts VMs for other teams in our organisation. I don't have anything to do with Pharos as a product, so I'll double check this tomorrow, but as far as I'm aware this is only affecting our primary instance of Pharos (?). We have other, sort of satellite, instances at other sites which contact this server.

I have noticed something previously which is probably totally unrelated, where a Windows guest will slow down ridiculously on boot (at the loading bar screen). If you vmotion the server, it speeds back up! Strange, but it doesn't happen that often!

Will definitely use support as next step if no answers from here.

Cheers

Contributor

Just checked, and this issue seems to be affecting our entire Pharos environment. As mentioned, we have a primary backend server and a secondary, both of which seem to be affected by vmotion. Any feedback, or hearing from anyone who is experiencing something similar, would be massively helpful.

Immortal

If you have a problem on the entire server, just verify that the RAM is OK.

Use RamTest or a similar program.

Andre

Contributor

Hi Andre,

We run a 72-hour memory test on all hosts as standard before adding them to the cluster (that's not to say that they haven't developed a fault since then). This only affects two VMs on two totally separate clusters; all other VMs are OK.

Thanks.

Immortal

Are the affected VMs configured in a different way?

OS? vRAM? vCPU?

Andre

Enthusiast

Picking up on your comment about the Windows loading bar taking ages and speeding up if you vmotion the server: I have experienced this on our old SAN when it was massively overloaded and having SCSI reservation errors. I presumed (though I never looked into it) that the vmotion will prioritize that machine and its SCSI locking on the LUN.

What is the hardware in this environment? How many VMs to a LUN, etc.?

Rich

Contributor

CPU, memory, etc - no different setup-wise to any other VM in our cluster.

We connect to an EVA 4000 at one site and an 8100 at the other. With regard to LUNs, it all depends; some VMs have their own volume attached to a single LUN. In this case there are two VMs sharing 1 LUN at one site, and the same at the second site.

At most we have around 4 or 5 VMs to a LUN, but 1 VM to a LUN is quite common.

Thanks again.

Expert

We are running our Pharos environment on physical hardware, but we are in the process of upgrading, and the new setup will be in a virtual environment. Our Pharos guy has already set up a test environment, and I have informed him about this, so he will check whether we run into similar problems.

Please consider marking my answer as helpful or correct even if it's completely wrong ;)

Enthusiast

Do the logs show anything for this VM right around the time it needs a reboot or vmotion?

"requiring a server reboot or another vmotion to correct the problem" - did you reboot or vmotion?

Some VM guests do not support vmotion or DRS, PGP (Linux-based) for example. Try disabling DRS for this VM, then shut down the VM and migrate it to a different host and a different datastore.

Or disable DRS for this VM and uninstall and reinstall VMware Tools.

This is how I resolved a similar problem with PGP. See if you get the message again.

> if you found this or any other answer useful please consider allocating points for helpful or correct answers <
Contributor

The only thing I can find of note in the logs is the following -

May 20 10:55:18.486: mks| SOCKET 2 recv error 5: Input/output error

May 20 10:55:18.487: mks| SOCKET 2 destroying VNC backend on socket error: 5

This is around 2 hours after a vmotion event and 2 minutes before we instigated one. Found this post - http://communities.vmware.com/message/559961

But our guests don't freeze. If you go to the console, you wouldn't think that anything was wrong with the server... :(
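
For anyone wanting to check the same correlation over a longer log, here is a minimal Python sketch that pulls out lines in the format quoted above and keeps the ones mentioning an error within a window after a vmotion. It is only an illustration: the function names, the year, and the vmotion time are assumptions (the log lines carry no year, and the exact vmotion time is not given in this thread), and the month parsing relies on an English locale.

```python
import re
from datetime import datetime, timedelta

# Matches lines shaped like the two quoted above, e.g.
#   May 20 10:55:18.486: mks| SOCKET 2 recv error 5: Input/output error
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d{3}):\s+"
    r"(?P<component>[^|]+)\|\s*(?P<message>.*)$"
)


def parse_line(line, year):
    """Return (timestamp, component, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # The log lines carry no year, so the caller has to supply one.
    stamp = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S.%f")
    return stamp, match.group("component").strip(), match.group("message")


def errors_after(lines, event_time, window, year):
    """Collect entries mentioning 'error' within `window` after `event_time`."""
    hits = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        stamp, _component, message = parsed
        if event_time <= stamp <= event_time + window and "error" in message.lower():
            hits.append(parsed)
    return hits


sample = [
    "May 20 10:55:18.486: mks| SOCKET 2 recv error 5: Input/output error",
    "May 20 10:55:18.487: mks| SOCKET 2 destroying VNC backend on socket error: 5",
]
# Hypothetical values: the thread gives neither the year nor the exact vmotion time.
vmotion_time = datetime(2011, 5, 20, 8, 55)
for stamp, component, message in errors_after(sample, vmotion_time, timedelta(hours=3), 2011):
    print(stamp.isoformat(), component, message)
```

To run it against a real log, read the file's lines in place of `sample` and set `vmotion_time` from the migration event's timestamp. Widening or narrowing the window should show whether these socket errors cluster after migrations or turn up at other times too.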
