VMware Cloud Community
Gabriel_Vieira
Contributor

disconnecting vmnic causes ESX4 crash

Dear all

I'm running a vSphere 4.0 cluster with two IBM 3550M2 machines configured to use HA.

The system configuration is as follows: 2 vmnics on subnetA (console) and 2 vmnics on subnetB.

I noticed that the following steps cause the host to crash and shut down:

1. start a vm on subnetB

2. Unplug one vmnic on subnetB

3. ESX crashes

4. vm starts on the other host (expected behaviour).

I first suspected a hardware problem, but together with IBM I updated all possible firmware (IMM, UEFI, ServeRAID, Broadcom).

Has anyone experienced this? Thank you for any comments.

Gabriel

12 Replies
jamesbowling
VMware Employee

Do you see anything in the logs on the host? Also, by crash, do you mean it reboots or do you experience a PSOD?

If you found this at all helpful please award points by using the correct or helpful buttons! Thanks!

James B. | Blog: http://www.vSential.com | Twitter: @vSential --- If you found this helpful then please award helpful or correct points accordingly. Thanks!
Gabriel_Vieira
Contributor

Hi Jamesbowling,

The server shuts down. I have to start it again by turning the power back on.

Regards,

Gabriel

jamesbowling
VMware Employee

Does this happen when you remove a cable from SubnetA's group of NICs?

Gabriel_Vieira
Contributor

One important detail I left out: the crash happens only if a VM is running on that subnet.

If no VM is running there, unplugging the network cable does not cause a crash.

jamesbowling
VMware Employee

Have you looked at the logs on the host to see if you see anything regarding any errors? That would be the first place to look.

Gabriel_Vieira
Contributor

In vCenter I see the following events for the host:

Host is not responding
error
04-11-2010 11:11:42

VMNIX: <0>Dazed and confused, but trying to continue (0:00:06:53.025 cpu0:4096)
warning
04-11-2010 11:09:14

VMNIX: <0>Do you have a strange power saving mode enabled? (0:00:06:53.025 cpu0:4096)
warning
04-11-2010 11:09:14

VMNIX: <0>Uhhuh. NMI received for unknown reason 3d. (0:00:06:53.024 cpu0:4096)
warning
04-11-2010 11:09:14

APIC: 1385: Lint1 interrupt on pcpu 0 (0:00:06:53.024 cpu0:4096)
warning
04-11-2010 11:09:14
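
For anyone comparing several such events, here is a minimal Python sketch that pulls the trailing stamp out of a message line. It assumes the stamp reads as days:hours:minutes:seconds of host uptime followed by the CPU number and a world ID. That reading is an assumption, not something confirmed in this thread, and `parse_stamp` is a hypothetical helper name.

```python
import re

# Assumed layout of the trailing stamp: (days:hh:mm:ss.fff cpuN:worldID)
STAMP = re.compile(r"\((\d+):(\d+):(\d+):(\d+\.\d+) cpu(\d+):(\d+)\)")


def parse_stamp(line):
    """Return uptime in seconds, CPU number and world ID, or None if no stamp is found."""
    match = STAMP.search(line)
    if match is None:
        return None
    days, hours, minutes = (int(match.group(i)) for i in (1, 2, 3))
    seconds = float(match.group(4))
    uptime = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return {
        "uptime_seconds": uptime,
        "cpu": int(match.group(5)),
        "world": int(match.group(6)),
    }


if __name__ == "__main__":
    event = "APIC: 1385: Lint1 interrupt on pcpu 0 (0:00:06:53.024 cpu0:4096)"
    print(parse_stamp(event))
```

Under that assumption, all four warnings fall within about a millisecond of each other, roughly seven minutes after the host booted.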

jamesbowling
VMware Employee

Just for general background:

- Are you using the defaults for NIC Teaming on your PortGroup for SubnetB?

- Are you able to view the host logs directly instead of through vCenter?

Gabriel_Vieira
Contributor

All default values

jamesbowling
VMware Employee

We would need to look further into the host logs, such as /var/log/vmware/hostd.log. We may be able to see something else happening. Does this happen when either NIC is removed from SubnetB?
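
As a rough illustration of that kind of log check, the Python sketch below scans a copy of a log file for the strings that showed up in the vCenter events. The keyword list and the helper names `find_suspect_lines` and `scan_log` are assumptions made for this example, and nothing in this thread establishes that hostd.log itself contains these messages.

```python
from pathlib import Path

# Strings taken from the vCenter events quoted earlier in the thread.
KEYWORDS = ("NMI", "Lint1", "Dazed and confused")


def find_suspect_lines(text, keywords=KEYWORDS):
    """Return (line_number, line) pairs for lines containing any keyword."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(keyword in line for keyword in keywords):
            hits.append((number, line.strip()))
    return hits


def scan_log(path):
    """Read a log file, tolerating undecodable bytes, and return the matching lines."""
    return find_suspect_lines(Path(path).read_text(errors="replace"))
```

It could be run against a copy of the log taken from the host, for example `scan_log("hostd.log")`, and the matching line numbers then read in context.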

Gabriel_Vieira
Contributor

Hello All

I'm sorry I didn't reply earlier, and thank you for all your help. It was a hardware problem after all.

IBM replaced both motherboard and NIC board at the same time and the problem never happened again.

Thank you for all the help once more.

Gabriel

idle-jam
Immortal

Glad to hear that. How did you know it was a hardware problem in the first place? Did you run any hardware diagnostic checks?

Gabriel_Vieira
Contributor

Hello idle

Together with IBM we ran a series of tests and upgraded the firmware to the latest levels. Although IBM's hardware tests always reported everything in good status, a crash happened while repeating the tests (ESX was down at the time). After this event, IBM Labs decided to replace both hardware parts at once; previously they had been replaced at different times.

Happy New Year.

Gabriel
