gopinathan
Contributor

KB Article: 1016106 and vSphere ESXi 5

Has anyone experienced the issue described in this KB with ESXi 5? With the HBA disabled, the ESXi 5 host completes its boot in minutes. With the HBA enabled, which presents a few RDMs and some LUNs that are not defined, it takes hours to boot. I have a case open with VMware on this and am waiting to hear back. Please share your experience; any input will be appreciated.

NuggetGTR
VMware Employee

Yeah, that's been an issue forever; I've had this since ESX 3.

The resolution is in that KB: just lower the retries and timeout. It helps but is still painful. In reality you shouldn't be rebooting too often, so it doesn't matter much.

Being a large organisation, what I did was build a dedicated MSCS ESX cluster so that only a few hosts were affected; everything else sits on the main corporate cluster.

I don't know if having a support case open will accomplish anything for this problem.

________________________________________ Blog: http://virtualiseme.net.au VCDX #201 Author of Mastering vRealize Operations Manager

AARCO
Contributor

Hi,

I have the same problem. Previously, in ESXi 4.1, I changed the value of Scsi.CRTimeoutDuringBoot to 1 and that worked for me.

Now, in ESXi 5, I don't see this parameter.

Any ideas or a solution?

gopinathan
Contributor

Still waiting to hear from VMware Support.

I know we don't need to reboot ESX hosts often once they are loaded and in service. But think about how long loading takes if you have 500+ hosts to upgrade to ESXi 5.

I do have MSCS isolated to a few clusters only.

john23
Commander

Have you tried to set the parameter mentioned in the KB?

Thanks -A Read my blogs: www.openwriteup.com

gopinathan
Contributor

As AARCO mentioned above, those advanced options are not available in 5.0.

ashleyw
Contributor

I see the same hang during boot under ESXi 5 on Cisco UCS blades, HP DL360 G7s and nested ESXi 5 instances, all connecting to iSCSI devices. Some of our iSCSI SANs are no longer on the HCL for vSphere 5, so when I raised the issue with VMware support, the help was limited to confirming that my iSCSI configuration was correct. I can reproduce the issue on a nested ESXi 5 instance connected to a NexentaStor device. I suspect this is a generic issue with vSphere 5. VMware, can you please look at this? We are running these iSCSI devices: HP MSA2000, HP MSA 2012i, NexentaStor. Boot time varies between 10 minutes and one hour depending on the configuration.

kchowksey
VMware Employee

For ESX 5.0:

On the ESX hosts that run MSCS VMs, identify the LUNs exported as RDMs to those VMs, e.g. naa.<lunid>.

For each LUN identified above, run this from the ESX command line:

esxcli storage core device setconfig -d naa.<lunid> --perennially-reserved yes

The subsequent ESX reboot should no longer be slow. KB 1016106 will be updated ASAP with this information.
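With many RDM LUNs, the per-LUN step above can be scripted. The following is only a minimal sketch: the function name `mark_rdm_luns`, the `DRY_RUN` variable and the one-ID-per-line input are assumptions made for illustration, and the esxcli line is copied verbatim from the post above (check the updated KB for the exact spelling on your build). By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (or run) the per-LUN command from the post above
# for every naa.* ID read from stdin, one ID per line.
mark_rdm_luns() {
    while read -r lun || [ -n "$lun" ]; do
        # Skip blank lines, comments and anything that is not a naa.* identifier.
        case "$lun" in
            naa.*) ;;
            *) continue ;;
        esac
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): show the command instead of running it.
            echo "esxcli storage core device setconfig -d $lun --perennially-reserved yes"
        else
            # </dev/null keeps esxcli from consuming the list of IDs on stdin.
            esxcli storage core device setconfig -d "$lun" --perennially-reserved yes </dev/null
        fi
    done
}
```

Usage, assuming a hypothetical file rdm_luns.txt: `mark_rdm_luns < rdm_luns.txt` to review the commands, then `DRY_RUN=0 mark_rdm_luns < rdm_luns.txt` on the host to apply them.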

Thanks.

-Kapil

kchowksey
VMware Employee

@ashleyw: it doesn't look like you are running MS Failover Clustering, are you? VMware doesn't support MSCS over iSCSI, so your slow boot problem looks unrelated to MSCS.

Could you please run these on the ESX command shell:

~# cd /var/run/log

~# fgrep '0xb 0x24 0x' vmkernel.log

~# for i in vmkern*gz; do gzip -cd "$i" | fgrep '0xb 0x24 0x' ; done

If these turn up a bunch of matches, we know this issue exists with a bunch of iSCSI targets (a target bug, not ESX).

If not, please open an SR, or just give me the SR ID if you have already provided VMware with full support logs.
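The same check can be wrapped so that it returns a single count and copes with the case where no rotated logs exist yet. This is a sketch only; the function name `count_sense_hits` and its directory argument are made up for illustration, and it assumes `grep -F` behaves like the `fgrep` used above.

```shell
#!/bin/sh
# Count log lines carrying the sense data 0xb 0x24 0x.. in a log directory.
# Covers vmkernel.log plus any rotated vmkern*gz files.
count_sense_hits() {
    dir=${1:-/var/run/log}
    {
        [ -f "$dir/vmkernel.log" ] && cat "$dir/vmkernel.log"
        for f in "$dir"/vmkern*gz; do
            # If nothing has been rotated yet the glob stays unexpanded; skip it.
            [ -f "$f" ] && gzip -cd "$f"
        done
    } | grep -F -c '0xb 0x24 0x' || true   # grep exits non-zero on zero matches
}
```

Run as `count_sense_hits` on the host, or `count_sense_hits /path/to/extracted/logs` against an unpacked support bundle.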

-Kapil

gopinathan
Contributor

Thanks. I will try this out and post the results later. 

ashleyw
Contributor

@kchowksey: no I'm not running MS Failover Clustering.

When I run the fgrep command it doesn't find anything. The vmkernel logs have not been gzipped yet, so there are no vmkern*gz files.

The case number I attached the log files to is 11096075809.

I've attached the log file from our nested ESXi 5 host, which shows the same "hang" at boot time while connecting only to a NexentaStor box via iSCSI. The hang in this situation lasts around 4 minutes. Interestingly, I see a lot of "Network is unreachable" and "iscsid: Login Failed" errors even though there are no connectivity issues. I see the same type of messages on our production farm as well.

Update on 14/09/2011 18:45: I have removed the log file to avoid confusion; see below.

kchowksey
VMware Employee

Thanks ashley. Have forwarded your report to the right people. Suggest contacting Nexenta support too.

-Kapil

ashleyw
Contributor

Thanks for your help. To eliminate as much garbage as possible from the log files (I may have appended some incorrect information), I cleared all logs and then rebooted. It took around 6 minutes on the nested ESXi box, and the bulk of the time was spent in the iSCSI phase, after the "vmw_satp_alua loaded successfully" message on the console. On a UCS blade this process takes around 15 minutes, and on a DL360 G7 around 30 minutes; see

http://communities.vmware.com/thread/326077?tstart=0

I've summarised the logs as a single small attachment.

When I look closely at the vmkernel.log file, I see the bulk of the time is spent in this section:

<pre>

...

...

2011-09-14T04:50:40.503Z cpu0:2604)ScsiDevice: 3121: Successfully registered device "naa.600144f02aa50c0000004e640a430001" from plugin "NMP" of type 0
2011-09-14T04:50:40.524Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
2011-09-14T04:50:40.555Z cpu0:2604)VC: 1449: Device rescan time 20 msec (total number of devices 5)
2011-09-14T04:50:40.555Z cpu0:2604)VC: 1452: Filesystem probe time 29 msec (devices probed 5 of 5)
2011-09-14T04:50:43.471Z cpu0:2050)LVM: 13188: One or more LVM devices have been discovered.
2011-09-14T04:51:06.754Z cpu1:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
2011-09-14T04:51:06.775Z cpu1:2604)VC: 1449: Device rescan time 22 msec (total number of devices 5)
2011-09-14T04:51:06.775Z cpu1:2604)VC: 1452: Filesystem probe time 19 msec (devices probed 5 of 5)
2011-09-14T04:51:32.987Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported

...

</pre>

For some reason, it looks like it is repeatedly trying to access vmhba32, which appears to be the controller the CD-ROM device hangs off. Sigh.
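To put a number on how long each repeat costs, the gaps between the repeated messages can be computed from the timestamps. This is a rough sketch: `log_gaps` is a made-up name, it assumes every line starts with a timestamp like 2011-09-14T04:50:40.524Z, that the matches fall on the same day, and that awk is available in the shell you run it from. On the three "No FS driver claimed device" lines in the excerpt above it prints 26.230 and 26.233, i.e. roughly 26 seconds per pass.

```shell
#!/bin/sh
# Print the seconds elapsed between consecutive log lines (read from stdin)
# that contain a fixed string.
log_gaps() {
    pattern=$1
    grep -F -- "$pattern" | awk '{
        # Characters 12-23 of the timestamp hold HH:MM:SS.mmm.
        split(substr($1, 12, 12), t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) printf "%.3f\n", now - prev
        prev = now
    }'
}
```

Usage: `log_gaps 'No FS driver claimed device' < /var/run/log/vmkernel.log`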

I guess this is a bug in vSphere 5? Please advise.

MichaelW007
Enthusiast

I managed to make a little progress on this today, to the point where the host rescan times at least have come down to a minute. Thanks to @kchowksey for some good suggestions. I noticed that my QNAP was being picked up as an ALUA array. This was in addition to the failed IO with sense data 0xb 0x24 0x0.

The claim rule I applied was as follows:

esxcli nmp satp rule add -d "<naa.deviceid>" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

From what I can tell, this problem impacts QNAP and Netgear. I've also got OpenFiler and it didn't appear to be impacted, but I have done limited testing. Note that none of these storage systems are currently on the HCL. I believe the reason for the problem is that the iSCSI targets do not implement the T10 standards correctly. I'm going to be working with VMware support on this as well. So far the only iSCSI storage I've got that works is the HP P4000, aka Lefthand Networks VSAs with SAN/IQ9x.

MichaelW007
Enthusiast

For iSCSI access from vSphere 5 hosts, the host tries to reach every target for discovery from every vmkernel port that is bound to the initiator. It tries each combination a number of times before it finally gives up and moves on.
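As a back-of-the-envelope illustration of why this adds up, the worst case grows as the product of targets, bound ports, attempts and the time each attempt takes. The sketch below is just that arithmetic; the function name is made up, and the attempt count and per-attempt timeout are placeholders, not ESXi defaults (the thread does not state them).

```shell
#!/bin/sh
# Rough upper bound on boot-time iSCSI discovery delay, in seconds:
# targets x bound vmkernel ports x attempts per combination x seconds per attempt.
# The last two arguments are placeholders you have to measure or look up.
discovery_worst_case() {
    targets=$1 vmk_ports=$2 attempts=$3 timeout_s=$4
    echo $(( targets * vmk_ports * attempts * timeout_s ))
}
```

For example, `discovery_worst_case 3 2 4 5` prints 120: three targets, two bound ports, and a hypothetical four attempts of five seconds each would cost up to two minutes if every combination failed.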

gopinathan
Contributor

I see the KB is updated now. It's a pain to go through all of the ESX hosts and run the command against each RDM LUN.

The following link has a PowerCLI script to find the RDM LUNs. Hopefully it can be extended to apply the recommendation in the KB article.

http://www.virtu-al.net/2008/12/23/list-vms-with-rdm/

AARCO
Contributor

Hello,

We have a QNAP too. We tried the command:

esxcli nmp satp rule add -d "<naa.deviceid>" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

but got this error:

~ # esxcli nmp satp rule add -d naa.6001405a0f1cc60ddafed4daedbc09df  -s VMW_SATP_DEFAULT_AA -o disable_ssd
Error: Unknown command or namespace nmp satp rule add

We have ESXi 5.0 ....

ileidi
Contributor

I have the same problem too: ESXi 5.0 on an HP DL380 G7 with an iSCSI LUN on a QNAP 809 U PRO. Boot is slow; ESXi takes 8 minutes to boot:

2011-09-29T07:01:00.482Z cpu0:4120)ScsiDeviceIO: 2305: Cmd(0x4124003ed900) 0x12, CmdSN 0x43 to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.
2011-09-29T07:01:00.494Z cpu0:4120)ScsiDeviceIO: 2305: Cmd(0x4124003ed900) 0x12, CmdSN 0x43 to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.
2011-09-29T07:01:00.531Z cpu0:4120)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x12 (0x4124003ed900) to dev "naa.600140550c978cedf241d4b5fda8eedb" on path "vmhba32:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x24 0x0.Act:NONE

If I use the command indicated, I get this output:

# esxcli nmp satp rule add -d "naa.600140550c978cedf241d4b5fda8eedb" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

Error: Unknown command or namespace nmp satp rule add

Does anyone else have this problem?
Best Regards
Andrea

MichaelW007
Enthusiast

Hi Guys,

My fault, sorry, in ESXi 5.0 the esxcli commands changed slightly. Now you need to specify which top level namespace, i.e. esxcli storage, or esxcli network.

So the command you'd run would be:

esxcli storage nmp satp rule add -d "naa.600140550c978cedf241d4b5fda8eedb" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

Sorry that my earlier post was incorrect and neglected to include the "storage" part of the command.
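For anyone with older notes or scripts that still use the 4.x-style form, the rewrite is mechanical for the command in this thread. A small sketch, with a made-up function name, that handles only the `nmp` namespace discussed here and leaves every other line untouched:

```shell
#!/bin/sh
# Rewrite "esxcli nmp ..." (4.x style) into "esxcli storage nmp ..." (5.0 style).
# Reads command lines on stdin and writes the adjusted lines to stdout.
# Only the nmp namespace from this thread is covered.
to_esxcli5() {
    sed 's/^esxcli nmp /esxcli storage nmp /'
}
```

Usage: `echo 'esxcli nmp satp rule add -d naa.x -s VMW_SATP_DEFAULT_AA' | to_esxcli5`. After adding a rule, `esxcli storage nmp satp rule list` should show whether it was actually registered.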

ileidi
Contributor

Hi Michael,

I tried reconnecting my QNAP to ESX. I created a VMkernel port (bound only to vnic0, because I use NIC teaming), connected the iSCSI LUN, and after some minutes (the scan for new storage is very slow) ran the command from the CLI:

esxcli storage nmp satp rule add -d "naa.600140550c978cedf241d4b5fda8eedb" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

However, nothing changed. In the storage adapter section I see the QNAP LUN flapping up and down, going to the dead or error operational state :(

In vmkernel.log I now get this error:

2011-10-04T07:53:21.907Z cpu14:4110)ScsiDeviceIO: 2316: Cmd(0x412441852f40) 0x9e, CmdSN 0x291d to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2011-10-04T07:53:31.907Z cpu16:4112)ScsiDeviceIO: 2316: Cmd(0x412441852f40) 0x25, CmdSN 0x291e to dev "naa.600140550c978cedf241d4b5fda8eedb" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

The error code changed from 0xb 0x24 0x0 to 0x0 0x0 0x0.
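When comparing the two sets of log lines, it helps to look at the H:/D:/P: fields as well as the sense bytes. In the earlier lines the log says "Valid sense data" with D:0x2, which as I read the SCSI codes is a CHECK CONDITION returned by the target for an INQUIRY (0x12). In the new lines the host status is non-zero (H:0x5 and H:0x3, which VMware documents as abort and timeout) and the log only says "Possible sense data", so the 0x0 0x0 0x0 is most likely empty filler and not a new code from the QNAP. The sketch below just extracts those fields; the function name and output layout are made up for illustration.

```shell
#!/bin/sh
# Pull the opcode, host/device/plugin status and the sense bytes out of
# vmkernel.log "ScsiDeviceIO ... failed H:.. D:.. P:.." lines read from stdin,
# and report whether the log marked the sense data as Valid or Possible.
scsi_status() {
    sed -n 's/.*Cmd(0x[0-9a-f]*) \(0x[0-9a-f]*\),.* failed \(H:0x[0-9a-f]*\) \(D:0x[0-9a-f]*\) \(P:0x[0-9a-f]*\) \([A-Za-z]*\) sense data: \(0x[0-9a-f]* 0x[0-9a-f]* 0x[0-9a-f]*\).*/op=\1 \2 \3 \4 sense=\5 (\6)/p'
}
```

Usage: `scsi_status < /var/run/log/vmkernel.log | sort | uniq -c` gives a quick tally of which failure combinations dominate.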
