VMware Cloud Community
jftwp
Enthusiast

ESX 3.0.1 with Qlogic HBA Rescan in VI Client = ESX Hangs then 'comes back'

I've opened up a ticket with support, but this one's VERY discomforting.

ESX 3.0.1 / VC 2.0.1 (all servers identical: HP ProLiant DL360G5 with dual-port HP StorageWorks (QLogic OEM) FC1242SR cards)

4-node ESX HA cluster (DRS enabled due to patch to be applied for known issue 2066306).

As I have routinely done before with the same steps, I presented 2 new 300GB Fibre Channel LUNs from our 3PAR SAN to the WWNs of all hosts, then went into the VirtualCenter/VI3 2.0.1 client.

Select a host, click the Configuration tab, Storage Adapters, Rescan... (select both checkboxes, as I usually do)... click OK - boom...! Instead of the usual/expected rescan, after which the new LUNs are seen, the server goes offline/disconnected in the VI client, the host does not respond to pings, and HA moves the running VMs to alternate hosts in the HA cluster.

Server meanwhile is completely unresponsive (loses VC connection, and doesn't respond over network either).

After 5 minutes(ish) it suddenly reconnects itself, but it never actually dumped/rebooted. Then, really weird stuff:

1. VC shows the VMs that WERE running on it at time of hang/crash, then alternates/toggles between that and what's REALLY there (nothing). This 'toggle' happens every 10 seconds, like clockwork.

2. Could not put server into Maint Mode. Would give 'Operation Timed Out' error within a few seconds.

I finally rebooted the thing, and now it appears to behave. Also, note that it DID recognize the 'rescan' and saw the new drives presented to its HBA's from the SAN, so at least I've got that going for me.

I wonder if it's just plain SAFER(?) to issue the cmd line rescan option instead of doing it from within VC/VI3 client? Doing stuff 'at' the servers/hosts themselves, even since 2.5x days, always seems to be more reliable than issuing commands via VI3/VC.
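For anyone who wants to try the command-line route, here is a minimal sketch. It assumes the ESX 3.x service console command `esxcfg-rescan <vmhba>` and assumes the two FC ports show up as `vmhba1` and `vmhba2` (check your own adapter names). It only builds and prints one rescan command per HBA, so they can be run one at a time over SSH; it does not execute anything itself. The function name is made up for illustration.

```python
def build_rescan_commands(adapters):
    """Return one 'esxcfg-rescan <adapter>' command string per HBA, in order."""
    commands = []
    for name in adapters:
        # Only accept names that look like vmhbaN (assumed naming scheme).
        if not (name.startswith("vmhba") and name[5:].isdigit()):
            raise ValueError("unexpected adapter name: %r" % (name,))
        commands.append("esxcfg-rescan " + name)
    return commands


if __name__ == "__main__":
    # vmhba1/vmhba2 are assumptions; substitute the adapters on your host.
    for cmd in build_rescan_commands(["vmhba1", "vmhba2"]):
        print(cmd)
```

Running one command per adapter mirrors the right-click-per-HBA approach rather than the single 'Rescan' link, so it may or may not avoid the hang.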

Anyone seen anything like this before? This also happened (same exact story) on another host in the cluster (same hardware) about 3 months ago, when I was running 3.0.0 / VC 2.0.

101 Replies
jftwp
Enthusiast

Update: Support looking at vmsupport/vc logs.

Problem can be reproduced on this host when using 'Rescan' in upper right of Storage Adapters/Configuration screen in VI3.

Workaround: Scan each Fibre Channel HBA independently by right-clicking on the given vmhba and selecting 'rescan' that way.

So, meanwhile, support is looking at logs. Also, I am wondering what EXACTLY the 'rescan' in upper right does, command-wise, with and without the checkboxes selected, VERSUS selecting each HBA independently then right-clicking and 'rescan' that way (which doesn't have any checkboxes/options to specifically rescan for storage devices and/or VMFS volumes AND which doesn't cause the hang/works fine).

Does the 'big' "Rescan" with checkboxes/options

gogogo5
Hot Shot

Yes, we had EXACTLY the same issue on our hosts (DL380G5) but using Emulex cards.

VM Support fixed it by setting the Disk.MaxLUN value from the default of 256 (which is apparently quite high) to a lower setting, in our case 50.

I ran 10 consecutive Rescans using the link in top right corner of VC and no hangs :)

Hope that helps.

jftwp
Enthusiast

Thanks, I thought it was just 'me'! I have yet to hear back from support, but I'm wary of that very thing. You see, we switched from VMware direct support to HP-branded support (24x7 platinum, so the high end flavor, etc.) mostly because we received a great deal upgrade-wise, from 2.5 to 3.0 Enterprise, etc. etc.

Anyway, I'm not dissing HP support because they're very good, but the fact is that they're *not* VMware support which, overall, in the 14 months or so I've been working with ESX 2.5/3, has been very, very good.

My current workaround, to be SAFE, is to treat HBA rescans as a maintenance mode kinda thing wherein I first MOVE VMs. There. Safe. Who cares if host crashes. Then, I scan 1 HBA at a time (via right-click on HBA itself). No crash with that approach.

I expect that by Monday HP's VMware support staff will have analyzed the logs and/or at LEAST searched VMware's KB, this forum, or otherwise, and will offer up a similar suggestion for you.

Is there a KB article for what support ended up having you tune to that '50' value? Thanks again.

jftwp
Enthusiast

Whoops, meant 'will offer up a similar suggestion such as yours' (not 'for you').

gogogo5
Hot Shot

Hi jftwp - just so you know, my support case had to be escalated to another engineer, that's how complex it got. The engineer said he would write a KB because of this, but since this fix was only found toward the end of last month I guess it could be a while until it is officially QA'd and released?

Can you try the setting to see if it helps you? Thanks for points too ;)

jftwp
Enthusiast

Sure, I'll try it... that would confirm either way whether that's the fix or not. Odd, however, that more users haven't reported this problem. Perhaps they're either scanning via cmd line/ssh direct on each host, or perhaps they're scanning via right-click on each HBA via VI3 client. Hmmm...

Anyway, 2 things please:

1. Do you still have the case number handy from your incident?

2. Where is the change made/edited? Steps?

Thanks!

gogogo5
Hot Shot

Other users have reported this issue:

http://www.vmware.com/community/thread.jspa?messageID=546684

1. I'll check my SR number at work tomorrow.

2. In VC, go to your host, select the Configuration tab, click Advanced, scroll down to Disk.MaxLUN and change this from the default of 256 to 50 (50 is purely an arbitrary lower value that works in our environment and may or may not be applicable to yours, but I know our hosts will never need visibility of anywhere near 50 LUNs for the foreseeable future).

No reboot is necessary.
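If you would rather make the same change from the service console, here is a rough sketch. It assumes `esxcfg-advcfg` is available on your ESX 3.x host with its `-g` (get) and `-s` (set) options and that the setting lives at `/Disk/MaxLUN`. The helper name and the 1 to 256 range check are my own assumptions for illustration. It only prints the commands (read current value, set new value, read back); it runs nothing.

```python
DEFAULT_MAX_LUN = 256  # default value quoted in this thread


def build_maxlun_commands(value):
    """Return the command strings to inspect, change, and re-check Disk.MaxLUN."""
    # Range check is an assumption: 1 up to the default of 256.
    if not isinstance(value, int) or not (1 <= value <= DEFAULT_MAX_LUN):
        raise ValueError("Disk.MaxLUN value out of assumed range: %r" % (value,))
    return [
        "esxcfg-advcfg -g /Disk/MaxLUN",                # show current value
        "esxcfg-advcfg -s %d /Disk/MaxLUN" % value,     # set new value
        "esxcfg-advcfg -g /Disk/MaxLUN",                # confirm the change
    ]


if __name__ == "__main__":
    # 50 is the arbitrary value from our environment, not a recommendation.
    for cmd in build_maxlun_commands(50):
        print(cmd)
```

As I understand it, the host only scans LUN IDs below this value, so pick a number above the highest LUN ID your SAN presents.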

Ivor
Contributor

I have the same issue. Please keep the topic up to date.

Ivor

jftwp
Enthusiast

Gogogo: Can you confirm your support number from when you worked with the engineer to decrease the LUN value from 256 to 50? I'm working with HP VMware support and they're talking to VMware support now, have sent them the logs, etc. Might be helpful for all concerned now that we can see many folks have reported this same issue with quite a lot of different hardware combos. Thanks.

gogogo5
Hot Shot

Hi jftwp - apologies, meant to do this yesterday but got sidetracked!

SR#327770 - good luck.

thickclouds
Enthusiast

Any updates on this????

Charlie Gautreaux vExpert http://www.thickclouds.com
gogogo5
Hot Shot

I have posted the fix (for our environment) above. What else is there to update?

jftwp
Enthusiast

Gogogo, thanks for the SR. I've notified our HP support rep, and he's also sent this thread's content to VMware directly.

Cpfcg, you can try the Disk.MaxLUN downward adjustment/value as Gogogo suggests, else try rescanning each vmhba instead of using the 'Rescan' option. Right-click on each vmhba% path under each/every HBA device you have installed, and select 'Rescan'. This 'workaround' seems to provide a stable approach. However, you might want to consider it an internal 'best practice' to FIRST MIGRATE ANY VMs TO OTHER HOSTS during such rescans, as a contingency.

Eric_Napa
Contributor

I am having the exact same problem. I have an SR open and forwarded this thread to my SE. I have all of the same servers and HBAs as well, so this might be an HP problem. (I have an EVA 4000 SAN.)

I am going to try the MAX LUN today, but I feel that is just a workaround and I want someone to fix this problem before I go into production.

Eric

Napa Superior Court

gogogo5
Hot Shot

I would be very interested to hear of your progress with this same issue.

jftwp - how is your SR going, any update since?

Eric_Napa
Contributor

They have me updating BIOS and drivers but really have been no help. I have a 3rd party SE who helped me design and install this system, and he will be on site today.

jftwp
Enthusiast

Nope... I had asked HP for an update on Thursday, but no response as of the time I left for vacation (out this week) on Friday. I haven't been too happy with their response/escalation, and your message reminded me to email them again, which I just did. I'm not about to call - hey, I'm on vacation! Heh. Checking email sporadically at best. Will advise when I hear something, but something tells me that the LUN value will end up being the 'fix'. Won't be able to test/confirm that until next week, however.

Eric_Napa
Contributor

We tried the MAXLUN and still have the problem. We can however run Scan LUN and Scan VMFS separately and have no problem.

PeteFry
Contributor

Got the same problem on BL460s. Tried setting the max LUNs to a low number and it still has the problem.

I think it's time to raise my own support call with VMware.
