VMware Cloud Community
kdon
Contributor

W2K3 BSOD after vMotion

I have a VM that blue-screens (BSOD) after a vMotion from an ESX 4.0 host to an ESXi 4.1 host. I have vMotioned 150+ other VMs without problems; only this one is giving me trouble. When I move it back to the 4.0 host, it boots normally.

Any suggestions?

Thanks

a_p_
Leadership

Which BSOD do you get?

Does the BSOD only occur with vMotion or also when you power on the VM on the 4.1 host?

Has the VM been created as a virtual machine or has it originally been P2V'd? In case of a P2V'd VM, there might still be some old hardware related drivers which may cause the BSOD.

Does the VM have any individual CPU mask settings? Maybe it helps to reset them.

André

kdon
Contributor

Thank you for the quick reply Andre!  Come Monday I will have answers to some of your questions.

Thanks!

kdon
Contributor

Which BSOD do you get?

- I am not sure because I have to wait for an outage window to reproduce it again.

Does the BSOD only occur with vMotion or also when you power on the VM on the 4.1 host?

-It occurs a few minutes after the guest has been vMotioned onto the 4.1 host. It stays up for a few minutes and then blue-screens. I was not able to capture the blue screen quickly enough. When I power-cycled the guest, it got stuck in a reboot loop.

Has the VM been created as a virtual machine or has it originally been P2V'd? In case of a P2V'd VM, there might still be some old hardware related drivers which may cause the BSOD.

-Looking into the guest some more, it does appear to have been a P2V. Under hidden devices I see a couple of leftovers. The ones that worry me most are the CPUs from the old system. I do not see old NICs or anything else that looks out of place. Any suggestions on what else to look for?

Does the VM have any individual CPU mask settings? Maybe it helps to reset them.

-What do you mean by this? The CPU/MMU Virtualization setting? If so, it is set to Automatic.

Thanks in advance, and I am working on getting an outage to be able to play with the guest some more.

Thanks again.

BenConrad
Expert
Accepted solution

Regarding André's comment about CPUID masks, that's a great place to investigate.

A) Is your cluster running an EVC baseline?

B) Check whether your VM has a CPUID mask defined: Edit > Options > CPUID Mask.

C) Do other VMs have a manual CPUID mask set? Are their masks different from this one VM's?

D) If you vMotion a VM whose mask has (for example) SSE4 enabled onto a host that doesn't support SSE4, you can definitely get a BSOD. An EVC baseline will prevent this from happening.
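
To make point D concrete, here is a minimal sketch in Python of the underlying check: a guest that was shown a CPU feature at power-on will keep using it, so any feature it sees that the destination host lacks is a crash risk. The function name and the feature sets are hypothetical and purely illustrative; real hosts report features through CPUID registers, not name sets.

```python
def missing_on_host(guest_features, host_features):
    """Return, sorted, the features the guest was shown that the host lacks."""
    return sorted(set(guest_features) - set(host_features))


# Hypothetical feature sets, for illustration only.
guest_sees = {"sse2", "sse3", "sse4.1"}
new_host = {"sse2", "sse3"}

print(missing_on_host(guest_sees, new_host))  # prints ['sse4.1']
```

An empty result would mean the destination host covers everything the guest was shown. This is roughly the guarantee an EVC baseline is meant to provide across a cluster.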

Ben

kdon
Contributor

Thank you for jumping in as well Ben.

A) is your cluster running an EVC baseline?

-No EVC is being used in the cluster.

B) If your VM has a CPUID mask defined Edit > Options > CPUID Mask

-I guess I can't tell, because the VM has to be powered off to see it?

I can answer C once I know B. As for D, we have no EVC.

Thanks Ben!

BenConrad
Expert

Without powering off, you can look at the .vmx files of two VMs and compare them. Or you can do this via PowerCLI:

# Connect to vCenter (replace with your server name)
Connect-VIServer -Server YourvCenterServerName

# Show the CPUID mask configured on each VM
(Get-VM VM1 | Get-View).Config.CpuFeatureMask

(Get-VM VM2 | Get-View).Config.CpuFeatureMask

Then compare the two outputs for differences (if any).
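
For the .vmx route, a rough sketch in Python of how the comparison could be automated. It assumes CPUID mask entries appear in the .vmx file as lines of the form cpuid.<leaf>.<register> = "<mask>"; the function names and the sample text are hypothetical, and the script only reports which entries differ without interpreting the mask characters.

```python
import re

# Assumed line shape: cpuid.1.ecx = "----:----:..." (key format is an assumption).
CPUID_LINE = re.compile(r'^\s*(cpuid\.[\w.]+)\s*=\s*"([^"]*)"\s*$', re.IGNORECASE)


def cpuid_entries(vmx_text):
    """Return {key: value} for every cpuid.* line in the text of a .vmx file."""
    entries = {}
    for line in vmx_text.splitlines():
        match = CPUID_LINE.match(line)
        if match:
            entries[match.group(1).lower()] = match.group(2)
    return entries


def diff_masks(good_vmx, bad_vmx):
    """Map each differing key to (value in good VM, value in bad VM).

    None means the key is absent from that VM's file.
    """
    good, bad = cpuid_entries(good_vmx), cpuid_entries(bad_vmx)
    return {
        key: (good.get(key), bad.get(key))
        for key in sorted(set(good) | set(bad))
        if good.get(key) != bad.get(key)
    }


if __name__ == "__main__":
    # Hypothetical file contents; in practice, read the two .vmx files from disk.
    working = 'displayName = "vm1"\n'
    failing = 'displayName = "vm2"\ncpuid.1.ecx = "----:----:----:0---:----:----:----:----"\n'
    for key, (good_value, bad_value) in diff_masks(working, failing).items():
        print(key, "working:", good_value, "failing:", bad_value)
```

An empty result means both files carry the same cpuid.* entries, which includes the case where neither VM has a manual mask.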

Ben

kdon
Contributor

Ben - Here is a screenshot of the two boxes compared. The second one is the one I am having issues with. I am not sure how to read the output, but it would seem it has some sort of mask set on it? Take a look when you have a chance, and thanks! I also piped the results out to Notepad because the console was truncating them, and the results were the same in Notepad.

BenConrad
Expert

Try that PowerShell statement again; you should not need to select any specific output:

[Attached image: 2011-09-26_143627.jpg]

kdon
Contributor

Ah, you know what my problem was? I was running the command on an XP machine and it was not returning any output. I ran it on the box in question itself and it came back with output. The first time around I had dug up another way of getting the data, sorry! What do you think of what I have below? Thanks Ben!

[Attached image: cpu jpg2.jpg]

kdon
Contributor

Rather, if it doesn't return any output, does that mean no mask is set?

BenConrad
Expert

The output in your previous post looks valid; there is definitely something set there. If you get no output, that means no mask is set.

Ben

kdon
Contributor

OK, thank you Ben. I am not familiar with CPU masks or with setting them. I have my window to work on the guest tonight. I will play around with it and see what I can come up with. Is the suggested setting "Expose the NX/XD flag to guest"? That is what the majority of my VMs are set to.

Thank you for all of your help!

BenConrad
Expert

Yes, try to match exactly the masks on the VMs that work well. You could look at the current mask in one of your templates, or in the .vmx file, and match it line for line.

Ben

kdon
Contributor

Thanks a lot Ben!  I will hammer away tonight.

Much appreciated!

kdon
Contributor

After working on the guest last night, I have it working. Here is everything I did.

Cloned the source guest so the original is not impacted and I have a rollback.

Everything I did to the clone:

Removed the serial port.
Removed time sync with the ESX host - the guest resides in a different domain, which is in a different time zone.
Removed hidden devices - the CPUs left over from the P2V.
Removed the VMware Converter agent.

Set the CPU mask to resemble that of a similar server with 2 CPUs.
Removed unneeded software left over from the physical server.

Installed VMware Tools once the guest was vMotioned and stable - waited 15 minutes before doing this.

Changed the SCSI controller from BusLogic to LSI Logic Parallel.

It has been up and running for 12+ hours now!

Thanks for all the help!
