VMware Cloud Community
qaz1
Contributor

ESX server hangs when trying to rescan SAN!!

Hi,

My environment is running VC 2.0.1 and ESX 3.0.1.

My ESX servers hang when I try to rescan the SAN to add a new LUN.

I found a post on virtrix.blogspot.com saying that this is a known issue (I didn't find anything about it in the VMware knowledge base) and that the workaround is to rescan for new storage devices and for new VMFS volumes separately, not at the same time:

http://virtrix.blogspot.com/2006/12/vmware-esx-freeze-on-san-rescan.html
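A minimal sketch of that two-step workaround, run from the service console instead of VC. It assumes a Python 3 interpreter is available (the ESX 3.x service console may only ship an older Python, in which case type the same commands by hand), that `esxcfg-rescan <adapter>` rescans one HBA for new devices, and that `vmkfstools -V` then rescans for VMFS volumes; check both against the help output on your build. The adapter names `vmhba1`/`vmhba2`, the `--execute` switch, and the helpers `staged_rescan`, `run_command`, and `dry_run` are all hypothetical.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# A runner takes a command (argv list) and returns its exit status.
Runner = Callable[[List[str]], int]


def run_command(cmd: List[str]) -> int:
    """Execute cmd on the service console and return its exit status."""
    return subprocess.call(cmd)


def staged_rescan(hbas: Sequence[str], runner: Runner = run_command) -> List[List[str]]:
    """Rescan each HBA for new devices, then rescan for VMFS as a separate step.

    Returns the commands issued, in order. Raises RuntimeError on the first
    non-zero exit status so a failed step is not followed by the next one.
    """
    issued: List[List[str]] = []
    # Step 1: device rescan only, one adapter at a time.
    for hba in hbas:
        cmd = ["esxcfg-rescan", hba]
        issued.append(cmd)
        if runner(cmd) != 0:
            raise RuntimeError(f"device rescan failed on {hba}")
    # Step 2: VMFS rescan (assumed to be vmkfstools -V), started only after
    # every device rescan has returned.
    cmd = ["vmkfstools", "-V"]
    issued.append(cmd)
    if runner(cmd) != 0:
        raise RuntimeError("VMFS rescan failed")
    return issued


def dry_run(cmd: List[str]) -> int:
    """Print the command instead of executing it."""
    print(" ".join(cmd))
    return 0


if __name__ == "__main__":
    # Hypothetical adapter names; pass --execute to actually run the commands.
    runner = run_command if "--execute" in sys.argv else dry_run
    staged_rescan(["vmhba1", "vmhba2"], runner=runner)
```

Keeping the runner injectable means the script can be checked in dry-run mode before it touches a production host.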

Has anyone else had this problem?

Do you know when VMware will release a patch?

Thanks

CWedge
Enthusiast

There was also a suggestion not to use the Rescan link in the right-hand corner, but to right-click and rescan that way instead.

Kindred_VMSuppo
Contributor

I am having a similar issue, but it is limited to my new HP DL360 G5s with PCI Express HBAs. I have an open case with VMware Support, but they have not called me back in the past 3 days (grrr).

Which HBA card are you running?

qaz1
Contributor

Hi,

I have the same hardware (HP DL360 G5s).

I am running the Emulex LP9802 HBA.

Please post an update if you get any info from VMware about this issue.

Thanks

mewing
Contributor

Same issue here with Dell 2850s and 2950s with QLogic QLA236X HBAs.

Doing separate rescans for the devices and for VMFS seems to work OK. I'm paranoid, so I put the hosts into maintenance mode first in case something bad happens.

qaz1
Contributor

Has anyone tried doing a rescan from the command line? Does it have the same effect on the ESX servers?

Kindred_VMSuppo
Contributor

It's been 72 hours without contact from Support. I'm wondering why I am paying for it to begin with. Let's see if my account manager gets a response.

Anyway.

In my testing of three DL360 G5s with two different HBAs (both PCI Express), I see the same problem with the system locking up.

If I only tick one checkbox at a time when rescanning, the system is fine. I consider this a short-term workaround until Support gives me a real answer, but at least you can keep working.

I will post an update when Support decides to call me back.

qaz1
Contributor

Great, looking forward to hearing from you.

jdaunt
Enthusiast

I almost always hit this problem when rescanning from VC. If I run esxcfg-rescan from the command line, I never see the issue.

So I no longer use the GUI to add storage to my servers.

youngcc
Contributor

We have experienced these problems for more than two months using DL360 G5 and DL380 G5 models in an ESX cluster. The HBAs are Emulex A8002 and A8003.

This problem first appeared after the number of LUNs presented to the ESX cluster grew past 20. We are currently presenting 41 LUNs to the cluster, roughly half VMFS and the rest RDMs.

As the number of presented LUNs grew, I also had to increase the VI client timeout. I currently have it set to 15 minutes so that commands can complete.

At this point, we have resorted to placing the systems into maintenance mode when I'm working on storage.

snowbird
Enthusiast

I have the same experience with HP BL480c or BL460c blades with the QLogic 2462.

hartza
Contributor

We have the same issue with HP BL480c blades with the QLogic 2462.

It would be great if someone came up with a fix for this one.

douglasvmtn
Contributor

Hi,

Maybe I can give a hint. We had the same problem: doing a rescan produced a PSOD (purple screen of death). We changed the connection option of our HBA (IBM FC2-133) from "Loop preferred, otherwise point to point" to "Point to point". Everything is OK now.

hartza
Contributor

Thank you for the tip.

We have changed the connection mode from "Loop preferred, otherwise point to point" to "Point to point only". We have also updated the BL480c BIOS and the QLogic adapter BIOS. The freeze still occurs under these conditions.

Kindred_VMSuppo
Contributor

Is anyone having this issue where the server just freezes without a rescan being done?

admin
Immortal

I had this problem with 4Gb Fibre Channel cards. It may not be related, but one of the patches fixed it once I had applied them all.

Kindred_VMSuppo
Contributor

Update from Support: apparently HP has acknowledged that there is an IRQ conflict between, at least, the DL360 G5 and ESX 3.0.1. HP is working on a firmware update to fix this issue, but it is not available yet. More to come as Support enlightens me.

jftwp
Enthusiast

Yeah, they keep having me try different things related to IRQ conflicts (updating to the latest BIOS: done, same problem). Next on the list is to try disabling USB in the BIOS altogether...

Here's my related post over in another thread where so many others are seeing this issue: http://www.vmware.com/community/thread.jspa?threadID=67309&messageID=610086#610086

***************

Update: I upgraded to the latest BIOS, as requested by HP VMware support (who had escalated the case directly to VMware engineering). They hoped this would relieve IRQ sharing, which they think might be the root cause of HBA scans killing ESX.

I see that jiihoo74 took the same approach with BL460c G1s and at first thought it had fixed his HBA scanning, but ultimately it did not.

I upgraded and tried a scan. Boom. Killed it on the first try. So much for the 'try the latest BIOS' approach.

Support has since asked me to 'try' disabling USB entirely in the BIOS (which others have reported as mostly successful).

If this works (and believe me, I'll scan until the cows come home trying to reproduce the problem), then at least it definitively says: "Hey HP and VMware, get together and solve this thing once and for all" (because disabling USB on certain ProLiant/blade servers is not a 'solution' but a workaround for a pretty severe bug).

Will advise.

wobbly1
Expert

Just a thought, but have you tried reducing the number of LUNs scanned from the default of 256?
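On ESX 3.x the setting referred to here should be the `Disk.MaxLUN` advanced option, which can also be changed from the service console with `esxcfg-advcfg -s <value> /Disk/MaxLUN`. A minimal sketch that builds that command and refuses a value that would hide LUNs already in use. The 1 to 256 range and the rule that LUN IDs at or above `Disk.MaxLUN` are not scanned are assumptions to confirm in the SAN configuration guide for your build, and `maxlun_command` is a hypothetical helper.

```python
from typing import Iterable, List

# Assumed bounds for Disk.MaxLUN on ESX 3.x (default 256); verify on your build.
MAXLUN_MIN = 1
MAXLUN_MAX = 256


def maxlun_command(new_value: int, lun_ids_in_use: Iterable[int]) -> List[str]:
    """Build a (hypothetical helper) esxcfg-advcfg command that sets /Disk/MaxLUN.

    Assumes LUN IDs >= MaxLUN are not scanned, so the value must be greater
    than the highest LUN ID already presented, or those LUNs would disappear
    on the next rescan.
    """
    if not MAXLUN_MIN <= new_value <= MAXLUN_MAX:
        raise ValueError(f"MaxLUN must be between {MAXLUN_MIN} and {MAXLUN_MAX}")
    highest = max(lun_ids_in_use, default=-1)
    if new_value <= highest:
        raise ValueError(
            f"MaxLUN {new_value} would hide LUN {highest}; use at least {highest + 1}"
        )
    return ["esxcfg-advcfg", "-s", str(new_value), "/Disk/MaxLUN"]


if __name__ == "__main__":
    # Example: LUN IDs 0-19 presented, so 32 still covers all of them.
    print(" ".join(maxlun_command(32, range(20))))
```

With 41 LUNs numbered from 0, as in the cluster described earlier in the thread, a value of 32 would be rejected by this check.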

James_Murtagh
Contributor

Same issue with our DL585 G2s, all six of them, all with FC HBAs in PCI Express slots. We had to upgrade to the latest HP BIOS just to install ESX 3.0.1.

We disabled USB completely in the BIOS and disabled the iLO card: same problem. The workaround does seem to be rescanning only the HBAs, with the 'scan VMFS' box unchecked. I've run a rescan from the command line hundreds of times from a script and not once had a problem. We have also reduced the number of LUNs scanned to about 32, which didn't make any difference.
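A sketch of the kind of repeated command-line rescan loop described above, for anyone who wants to reproduce the test. It rescans the HBAs only (no VMFS step), records each command's exit status and duration, and stops at the first failure. Python 3 on the service console is assumed, and the adapter names, iteration count, pause, and the helpers `stress_rescan` and `run_command` are hypothetical. A host that hangs outright never returns an exit status, so the printed log is only useful up to the last completed pass.

```python
import subprocess
import time
from typing import Callable, List, Sequence, Tuple

# A runner takes a command (argv list) and returns its exit status.
Runner = Callable[[List[str]], int]


def run_command(cmd: List[str]) -> int:
    """Execute cmd and return its exit status."""
    return subprocess.call(cmd)


def stress_rescan(
    hbas: Sequence[str],
    iterations: int,
    runner: Runner = run_command,
    pause_s: float = 5.0,
) -> List[Tuple[int, str, int, float]]:
    """Rescan each HBA `iterations` times (device rescan only, no VMFS step).

    Returns (pass number, adapter, exit status, seconds taken) per command and
    stops at the first non-zero exit status.
    """
    results: List[Tuple[int, str, int, float]] = []
    for n in range(1, iterations + 1):
        for hba in hbas:
            start = time.monotonic()
            status = runner(["esxcfg-rescan", hba])
            elapsed = time.monotonic() - start
            results.append((n, hba, status, elapsed))
            # Print as we go so the last completed pass is visible if the host hangs.
            print(f"pass {n} {hba}: exit {status} in {elapsed:.1f}s", flush=True)
            if status != 0:
                return results
        time.sleep(pause_s)
    return results


if __name__ == "__main__":
    # Dry run with a stub runner; swap in run_command to rescan for real.
    stress_rescan(["vmhba1", "vmhba2"], iterations=3, runner=lambda cmd: 0, pause_s=0)
```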

We are also getting frequent random hangs with no log entries or dumps, just a connection failure followed by an ASR (Automatic Server Recovery) reboot. How are you meant to troubleshoot that? And what is so serious that the kernel is completely hung and can't log anything?
