Solved: Re: ESX reboots when switch has a broadcast storm

Kevin_Gao · 11-01-2008

Hi all,

I have a strange issue:

- I was doing some work on the weekend and added some new fibre modules to our HP ProCurve (LAN) switch to connect it to another ProCurve.

- The modules seemed to be causing a broadcast storm on the switch, so I unplugged them and the switch went back to normal.

- I got a whack of alerts saying that a bunch of VMs were unreachable. It was then that I realized one of my VM hosts (3.5 build 110268) had rebooted itself.

- After the VMs had all restarted, I consoled in and gathered that the host had crashed.

- I went back to playing with the fibre module and caused another broadcast storm on the LAN switch; within about 1 minute the same ESX host rebooted itself again.

My question is: why is it doing that?

I run iSCSI, and I had thought that as long as the VMkernel on the host can still see a live gateway, it shouldn't induce a kernel panic in order to activate HA. I am 100% certain that both of our iSCSI switches are up and running, so the VMkernel should be happy with that. Also, all the other ESX hosts (3.5 / 3.5i) are configured the same way and none of them rebooted.

Does anyone have any ideas or should I give VMware a call on Monday?

Thanks a bunch.

RenaudL · 11-02-2008

This is strange, we made sure ESX 3.5 wouldn't crash in this situation. I remember doing the experiment by looping switches and observing ESX handle the storm without any major issue (we actually have a built-in mechanism to detect them).

Do not hesitate to contact VMware support about this.

Texiwill · 11-02-2008

Hello,

> I run iSCSI and I had thought that so long as the VMKernel on the host can still see a live gateway it shouldn't induce a kernel panic in order to activate HA. I know 100% certain that both of our iSCSI switches are up and running so the VMKernel should be happy with that. Also all the other ESX hosts 3.5 / 3.5i are configured the same way and none of them rebooted.

A slight misconception. VMware HA does not induce a panic, nor is a panic required to activate VMware HA. The host must be fully isolated to invoke VMware HA. I think something else is going on.

> Does anyone have any ideas or should I give VMware a call on Monday?

I would call, but start by gathering a vm-support file and reviewing /var/log/vmkernel. Also, if this is a two-node cluster and you are running VMware VI3.5 Update 2, you should upgrade VC 2.5 to Update 3 and ESX to the 10/3 patches; otherwise HA will not work.
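As a rough sketch of that log review, assuming you have copied /var/log/vmkernel off the host to a machine with Python 3: the snippet below just lists lines that mention a few keywords. The keyword list and the helper names (find_suspect_lines, scan_log) are hypothetical and are not a list of real VMkernel messages, so adjust them to whatever you actually see around the time of the reboot.

```python
import re
import sys

# Hypothetical keywords (an assumption): not actual VMkernel message strings.
KEYWORDS = ("panic", "exception", "watchdog", "link down", "link up")


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) pairs for lines that mention any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_log(path):
    """Scan a local copy of the log file; undecodable bytes are replaced."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_vmkernel.py /path/to/copied/vmkernel
    for number, text in scan_log(sys.argv[1]):
        print(number, text)
```

This only narrows down where to look; the full vm-support bundle is still what VMware support will ask for.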

Best regards,

Edward L. Haletky

VMware Communities User Moderator

Paul_Lalonde · 11-03-2008

Hi Kevin,

When I worked for VMware's networking support group, we uncovered several issues with the VMkernel network drivers during broadcast storms and switch loops. Which specific network adapters are you using? e1000, tg3, or bnx2?

Paul

Kevin_Gao · 11-03-2008

Thanks for all the responses so far. I apologize for my late reply, but I just got back into the office. Thanks for clearing up the kernel panic requirements, Texiwill.

Paul, I'm very curious about what you said. Here are my NICs:

LAN side: Broadcom NetXtreme II BCM5708 1000Base-T (BNX2), Intel 82571EB (E1000)

iSCSI side (which I don't think is causing this issue): Broadcom NetXtreme II BCM5706 (BNX2), Intel 82571EB (E1000)

NIC configuration:

- Route based on IP hash

- Beacon Probing is on

- Notify switch on failure is on

- Failback is on
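For anyone unsure what "Route based on IP hash" means for traffic placement, here is a simplified model, assuming the uplink is chosen by XOR-ing the source and destination IPv4 addresses and taking the result modulo the number of uplinks. The function name ip_hash_uplink is hypothetical, and the real VMkernel implementation may differ in detail; the point is only that a given source/destination pair always lands on the same uplink.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Pick an uplink index under a simplified XOR-and-modulo model (an assumption)."""
    if uplink_count < 1:
        raise ValueError("uplink_count must be at least 1")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    # Identical high-order bytes cancel out in the XOR, so hosts on the same
    # subnet are effectively distinguished by their low-order bits.
    return (src ^ dst) % uplink_count
```

With two uplinks in this model, ip_hash_uplink("10.0.0.1", "10.0.0.2", 2) and ip_hash_uplink("10.0.0.1", "10.0.0.3", 2) pick different uplinks, while swapping source and destination never changes the result.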

Keep in mind I have other hosts with the same setup and they appear to be unaffected. My next course of action is to gather the logs and give VMware a call. Also I will probably turn on spanning tree on that ProCurve.

Thanks everyone so far.
