20 Replies Latest reply on Oct 4, 2011 5:22 AM by MichaelW007

    KB Article: 1016106 and vSphere ESXi 5

    gopinathan Enthusiast

      Has anyone experienced the issue described in this KB with ESXi 5? With the HBA disabled, the ESXi 5 host completes its boot in minutes. With the HBA enabled, and with a few RDMs and some LUNs that are not defined presented through it, the boot takes hours. I have a case open with VMware on this and am waiting to hear back. Please share your experience; any input will be appreciated.

        • 1. Re: KB Article: 1016106 and vSphere ESXi 5
          NuggetGTR Expert
          vExpert, VMware Employees

          Yeah, that's been an issue forever; I've had it since ESX 3.

           

          The resolution is in that KB: just lower the retries and timeout. It helps, but it's still painful. In reality you shouldn't be rebooting too often, so it doesn't matter much.

           

          What I did, being in a large organisation, was build a dedicated MSCS ESX cluster so that only a few hosts were affected; everything else sits on the main corporate cluster.

           

          I don't know if having a support case open will accomplish anything for this problem.

          • 2. Re: KB Article: 1016106 and vSphere ESXi 5
            AARCO Lurker

            Hi,

             

            I have the same problem. Previously, in ESXi 4.1, I changed the value of Scsi.CRTimeoutDuringBoot to 1 and that worked for me.

             

            Now, in ESXi 5, I don't see this parameter.

             

            Any idea or solution?

            • 3. Re: KB Article: 1016106 and vSphere ESXi 5
              gopinathan Enthusiast

              Still waiting to hear from VMware Support.

              I know we don't need to reboot the ESX hosts often once they are loaded and in service. But think about the load time if you have 500+ hosts to upgrade to ESXi 5.

              I do have MSCS isolated to only a few clusters.

              • 4. Re: KB Article: 1016106 and vSphere ESXi 5
                john23 Expert

                Have you tried to set the parameter mentioned in the KB?

                • 5. Re: KB Article: 1016106 and vSphere ESXi 5
                  gopinathan Enthusiast

                  As AARCO mentioned above, those advanced options are not available in 5.0.

                  • 6. Re: KB Article: 1016106 and vSphere ESXi 5
                    ashleyw Hot Shot

                    I see the same hang during boot under vSphere 5i on Cisco UCS blades, HP DL360G7s and nested vSphere 5i instances, all connecting to iSCSI devices. Some of our iSCSI SANs are no longer on the HCL for vSphere 5, so when I raised the issue with VMware support, their help was limited to confirming that my iSCSI configuration was correct. I can reproduce the issue on a nested ESX5i instance connected to a NexentaStor device. I suspect this is a generic issue with vSphere 5i. VMware, can you please look at this? We are running these iSCSI devices: HP MSA2000, HP MSA 2012i, NexentaStor. Boot time varies between 10 minutes and one hour depending on the configuration.

                    • 7. Re: KB Article: 1016106 and vSphere ESXi 5
                      kchowksey Novice
                      VMware Employees

                      For ESX 5.0:

                       

                      On the ESX hosts that are running MSCS VMs, identify the LUNs exported as RDMs to VMs,

                      e.g. naa.<lunid>

                       

                      For each LUN identified above, run this command from the ESX command line:

                       

                      esxcli storage core device setconfig -d naa.<lunid> --perennially-reserved yes

                       

                      The subsequent ESX reboot should no longer be slow. KB 1016106 will be updated ASAP with this information.
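For hosts with many RDM LUNs, the per-LUN step above can be wrapped in a small loop. This is only a sketch: the function name `mark_perennially_reserved` and the `ESXCLI` override variable are made up for illustration, and only the `setconfig` command itself is taken from the post above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): mark every LUN passed as an
# argument as perennially reserved, using the command from the post above.
# ESXCLI is an illustrative override; set ESXCLI=echo for a dry run that
# prints the commands instead of running them.
ESXCLI="${ESXCLI:-esxcli}"

mark_perennially_reserved() {
    for lun in "$@"; do
        # One setconfig call per device identifier (naa.<lunid>).
        $ESXCLI storage core device setconfig -d "$lun" --perennially-reserved yes
    done
}
```

It would be called as `mark_perennially_reserved naa.<lunid1> naa.<lunid2>`. As far as I know, `esxcli storage core device list -d naa.<lunid>` should then report the device as perennially reserved, but treat that check as an assumption to verify on your build.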

                       

                      Thanks.

                      • 8. Re: KB Article: 1016106 and vSphere ESXi 5
                        kchowksey Novice
                        VMware Employees

                        @ashleyw: it doesn't look like you are running MS Failover Clustering, are you? VMware doesn't support MSCS over iSCSI, so it looks like your slow boot problem is unrelated to MSCS.

                         

                        Could you please run these on the ESX command shell:

                         

                        ~# cd /var/run/log

                        ~# fgrep '0xb 0x24 0x' vmkernel.log

                        ~# for i in vmkern*gz; do gzip -cd "$i" | fgrep '0xb 0x24 0x' ; done

                         

                        If it turns up a bunch of matches, we know this issue exists with a number of iSCSI targets (a target bug, not an ESX one).

                         

                        If not, please open an SR, or just give me the SR ID if you have already provided VMware with full support logs.
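The three commands above can be folded into one function that returns a single count, which is easier to compare across hosts. A minimal sketch, assuming the default log directory and sense-data string from this post; the function name `count_sense_matches` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count lines containing a fixed string in
# vmkernel.log and in any rotated, gzipped vmkern*gz files.
# Arguments (both optional): log directory, fixed string to search for.
count_sense_matches() {
    logdir="${1:-/var/run/log}"
    pattern="${2:-0xb 0x24 0x}"
    total=0
    if [ -f "$logdir/vmkernel.log" ]; then
        # grep -c prints 0 and exits non-zero when nothing matches.
        n=$(grep -cF -- "$pattern" "$logdir/vmkernel.log" || true)
        total=$((total + n))
    fi
    for f in "$logdir"/vmkern*gz; do
        [ -f "$f" ] || continue   # the glob matched nothing
        n=$(gzip -cd "$f" | grep -cF -- "$pattern" || true)
        total=$((total + n))
    done
    echo "$total"
}
```

Running `count_sense_matches` with no arguments searches /var/run/log; a result of 0 corresponds to the "doesn't find anything" case.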

                        • 9. Re: KB Article: 1016106 and vSphere ESXi 5
                          gopinathan Enthusiast

                          Thanks. I will try this out and post the results later. 

                          • 10. Re: KB Article: 1016106 and vSphere ESXi 5
                            ashleyw Hot Shot

                            @kchowksey: no, I'm not running MS Failover Clustering.

                             

                            When I run the fgrep command it doesn't find anything. The vmkernel logs have not been gzipped yet, so there are no vmkern*gz files.

                            The case number I attached the log files to was 11096075809.

                             

                            I've attached the log file from our nested ESX5i host, which shows the same "hang" at boot time while connecting only to a NexentaStor box via iSCSI. The "hang" in this case lasts around 4 minutes. I also see a lot of "Network is unreachable" and "iscsid: Login Failed" errors, which is odd because there are no connectivity issues. I see the same type of messages on our production farm as well.

                             

                            Update on 14/09/2011 18:45: I have removed the log file to avoid confusion; see below.

                            • 11. Re: KB Article: 1016106 and vSphere ESXi 5
                              kchowksey Novice
                              VMware Employees

                              Thanks, ashley. I have forwarded your report to the right people. I suggest contacting Nexenta support too.

                              • 12. Re: KB Article: 1016106 and vSphere ESXi 5
                                ashleyw Hot Shot

                                Thanks for your help. To eliminate as much garbage as possible from the log files (as I may have appended some incorrect information), I cleared all logs and then rebooted. It took around 6 minutes on the nested ESXi box; the bulk of the time was spent in the iSCSI phase, after the "vmw_satp_alua loaded successfully" message on the console. On a UCS blade this process takes around 15 minutes, and on a DL360G7 it takes around 30 minutes. See

                                http://communities.vmware.com/thread/326077?tstart=0

                                 

                                I've summarised the logs as a single small attachment.

                                 

                                When I look closely at the vmkernel.log file I see the bulk of the time is spent in this section:

                                <pre>

                                ...

                                ...

                                2011-09-14T04:50:40.503Z cpu0:2604)ScsiDevice: 3121: Successfully registered device "naa.600144f02aa50c0000004e640a430001" from plugin "NMP" of type 0
                                2011-09-14T04:50:40.524Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
                                2011-09-14T04:50:40.555Z cpu0:2604)VC: 1449: Device rescan time 20 msec (total number of devices 5)
                                2011-09-14T04:50:40.555Z cpu0:2604)VC: 1452: Filesystem probe time 29 msec (devices probed 5 of 5)
                                2011-09-14T04:50:43.471Z cpu0:2050)LVM: 13188: One or more LVM devices have been discovered.
                                2011-09-14T04:51:06.754Z cpu1:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported
                                2011-09-14T04:51:06.775Z cpu1:2604)VC: 1449: Device rescan time 22 msec (total number of devices 5)
                                2011-09-14T04:51:06.775Z cpu1:2604)VC: 1452: Filesystem probe time 19 msec (devices probed 5 of 5)
                                2011-09-14T04:51:32.987Z cpu0:2604)FSS: 4333: No FS driver claimed device 'mpx.vmhba32:C0:T0:L0': Not supported

                                ...

                                </pre>

                                For some reason, it looks like it is repeatedly trying to access vmhba32, which appears to be the controller the CD-ROM device hangs off. Sigh.

                                I guess this is a bug in vSphere 5? Please advise.
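One way to find where the boot time goes, rather than reading timestamps by eye, is to print every pair of consecutive log lines separated by more than some threshold. The sketch below is illustrative only: `find_log_gaps` is a hypothetical name, it assumes timestamps shaped like the ones in the excerpt above, and it does not handle a log that crosses midnight.

```shell
#!/bin/sh
# Hypothetical helper: report gaps between consecutive timestamped lines.
# Usage: find_log_gaps <logfile> [threshold_seconds]   (default 10)
find_log_gaps() {
    threshold="${2:-10}"
    awk -v limit="$threshold" '
    # Only consider lines starting with e.g. 2011-09-14T04:50:40.503Z
    $1 ~ /^[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]/ {
        split($1, dt, "T")          # dt[2] = "04:50:40.503Z"
        split(dt[2], hms, ":")      # hours, minutes, seconds
        # awk reads the numeric prefix of "40.503Z" as 40.503
        now = hms[1] * 3600 + hms[2] * 60 + (hms[3] + 0)
        if (seen && now - prev > limit)
            printf "%.1fs gap before: %s\n", now - prev, $0
        prev = now
        seen = 1
    }' "$1"
}
```

Against the excerpt above, a threshold of 10 seconds would flag the lines logged at 04:51:06.754 and 04:51:32.987, each following a pause of more than 20 seconds.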

                                • 13. Re: KB Article: 1016106 and vSphere ESXi 5
                                  MichaelW007 Enthusiast

                                  I managed to make a little progress on this today, to the point where the host rescan times at least have come down to a minute. Thanks to @kchowksey for some good suggestions. I noticed that my QNAP was being picked up as an ALUA array. This was in addition to the failed IO with sense data 0xb 0x24 0x0.

                                   

                                  The claim rule I applied was as follows:

                                   

                                  esxcli storage nmp satp rule add -d "<naa.deviceid>" -s "VMW_SATP_DEFAULT_AA" -o "disable_ssd"

                                   

                                  From what I can tell this problem impacts QNAP and Netgear. I've also got OpenFiler and it didn't appear to be impacted, but I have done limited testing. Note that none of these storage systems are currently on the HCL. I believe the reason for the problem is that the iSCSI targets do not implement the T10 standards correctly. I'm going to be working with VMware support on this as well. So far the only iSCSI storage I've got that works is the HP P4000, aka Lefthand Networks, VSAs with SAN/IQ9x.
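To see which devices are being picked up as ALUA before adding a rule, the device listing can be filtered. This is a hedged sketch: it assumes the ESXi 5 command `esxcli storage nmp device list` and assumes its output has an unindented device identifier followed by an indented "Storage Array Type:" line. The function name `list_alua_devices` is hypothetical, so check the output format on your own host first.

```shell
#!/bin/sh
# Hypothetical filter: read "esxcli storage nmp device list" style output
# on stdin and print the identifiers of devices claimed by VMW_SATP_ALUA.
# Assumed layout: device id on an unindented line, details indented below.
list_alua_devices() {
    awk '
    /^[^ \t]/ { dev = $1 }                               # start of a device block
    /Storage Array Type: VMW_SATP_ALUA/ { print dev }    # claimed as ALUA
    '
}
```

On a host it would be used as `esxcli storage nmp device list | list_alua_devices`.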

                                  • 14. Re: KB Article: 1016106 and vSphere ESXi 5
                                    MichaelW007 Enthusiast

                                    For iSCSI access to targets, a vSphere 5 host will try to reach every target for discovery from every vmkernel port that is bound to the initiator. It will try a number of times for each combination before it finally gives up and moves on.
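The multiplication described above explains why boot time grows so quickly. As a rough worst-case estimate, assuming every attempt fails and waits out its full timeout, the total is targets x bound vmkernel ports x retries x timeout. The function name and all numbers below are illustrative, not values taken from ESXi.

```shell
#!/bin/sh
# Hypothetical worst-case estimate of iSCSI discovery time at boot.
# Arguments: targets, bound vmkernel ports, retries per combination,
# timeout in seconds per attempt. All values are supplied by the caller.
estimate_discovery_seconds() {
    targets=$1
    ports=$2
    retries=$3
    timeout=$4
    echo $((targets * ports * retries * timeout))
}
```

For example, `estimate_discovery_seconds 4 2 3 5` prints 120, meaning two minutes of waiting for four unreachable targets seen from two ports with those made-up retry and timeout values.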
