9 Replies Latest reply on Apr 26, 2007 3:00 PM by conradsia

    HBA has failed alarm?

    conradsia Hot Shot

      Does anyone have a way of getting an alert when an HBA fails?

       

      The client's fiber switch doesn't send alerts and no SNMP monitoring is configured, so I'm looking for a way in ESX or VC to be notified when I lose an HBA connection from the server to the switch. The same goes for a network card failure.

       

      Anyone?

       

      Thanks,

      Conrad

        • 1. Re: HBA has failed alarm?
          bretti Expert

          There is probably something in /var/log/vmkwarning that you could monitor for, or you could use some kind of syslog script.
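One way to act on this suggestion is a small poller in the spirit of `tail -f`. This is a minimal Python 3 sketch, not an ESX tool: it assumes /var/log/vmkwarning is a plain, append-only text file and that a Python 3 interpreter is available where it runs. The `follow` helper and its parameters are hypothetical.

```python
import time
from typing import Iterator, Optional, TextIO


def follow(f: TextIO, poll_seconds: float = 1.0,
           max_idle_polls: Optional[int] = None,
           start_at_end: bool = True) -> Iterator[str]:
    """Yield lines appended to an open text file, like `tail -f`.

    By default it starts at the current end of the file, so only new
    entries are reported. It stops after `max_idle_polls` consecutive
    empty polls; None means poll forever.
    """
    if start_at_end:
        f.seek(0, 2)  # jump to end of file: skip entries already logged
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        line = f.readline()
        if line:
            idle = 0
            yield line.rstrip("\n")
        else:
            idle += 1
            time.sleep(poll_seconds)  # nothing new yet; wait and retry
```

For example, `for line in follow(open("/var/log/vmkwarning")):` could hand each new line to whatever alerting you use. A line that is only partly written may be returned before its newline arrives, and log rotation is not handled.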

          • 2. Re: HBA has failed alarm?
            VTorque Enthusiast

            Or could your SAN controllers alert you?

             


            • 3. Re: HBA has failed alarm?
              conradsia Hot Shot

              I can get alarms from the SAN controllers but I would really like to get an alert from ESX as well.

              • 4. Re: HBA has failed alarm?
                bretti Expert

                I was thinking about this some more and was wondering if you are running dual HBAs and dual NICs on your host server?

                 

                You could try pulling one of the HBA cables and seeing what shows up in the logs. If you find a key phrase to search on in a particular log file, maybe DISK or VMHBA, that would help.

                 

                Also, are you looking for real-time notification, or is a daily notification OK? If daily is OK, you could throw together a script that looks for new entries in the log files and alerts you if there are any errors.
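If a daily check is good enough, the script could remember how far it read last time and report only newer matching lines. A minimal Python 3 sketch, assuming a plain-text log and a writable state file; the keyword list, `new_error_lines`, and the JSON state format are hypothetical choices to tune after a cable-pull test like the one suggested above.

```python
import json
import os
import re
from typing import List

# Assumed keywords; adjust after pulling a cable and reading what your
# ESX build actually logs.
ERROR_PATTERN = re.compile(r"error|fail|offline|switchover|vmhba", re.IGNORECASE)


def new_error_lines(log_path: str, state_path: str) -> List[str]:
    """Return matching lines appended to log_path since the previous run.

    The byte offset reached last time is kept in state_path (JSON). If the
    log is now shorter than that offset, assume it was rotated and rescan
    from the start.
    """
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as s:
            offset = json.load(s).get("offset", 0)
    if os.path.getsize(log_path) < offset:
        offset = 0  # log rotated or truncated
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
    end = data.rfind(b"\n") + 1  # consume complete lines only
    with open(state_path, "w") as s:
        json.dump({"offset": offset + end}, s)
    text = data[:end].decode("utf-8", errors="replace")
    return [line for line in text.splitlines() if ERROR_PATTERN.search(line)]
```

Run once a day from cron against /var/log/vmkernel (or /var/log/vmkwarning) and mail any non-empty result. A half-written last line is left for the next run; if the log has been rotated, anything written to the old file after the previous run is missed.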

                • 5. Re: HBA has failed alarm?
                  grasshopper Virtuoso

                  If a path fails (or fails over to a secondary HBA), you may see some or all of the following:

                  "Manual switchover to path vmhbaX:Y:Z begins"

                  "Changing active path to vmhbaX:Y:Z"

                  "Manual switchover to vmhbaX:Y:Z completed successfully"

                  -or-

                  "Delaying failover to path vmhbaX:Y:Z"

                  "Manual switchover to path vmhbaX:Y:Z begins."

                  "Manual switchover to vmhbaX:Y:Z completed successfully."

                  -or, our worst nightmare-

                  "Manual switchover to path vmhbaX:Y:Z begins."

                  "Did not switchover to vmhbaX:Y:Z"

                  (where X:Y:Z is the adapter:target:LUN, of course)
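To tell a successful failover from the "worst nightmare" case automatically, the messages can be grouped by path. A Python 3 sketch: the regexes copy the message texts quoted above and accept either `:` or `.` as the path separator, but the exact wording may differ between ESX builds, so treat them as assumptions to check against your own logs. `switchover_outcomes` is a hypothetical helper.

```python
import re
from typing import Dict, Iterable, Tuple

# vmhba<adapter>:<target>:<lun>; dots are also accepted as separators.
PATH = r"vmhba(\d+)[:.](\d+)[:.](\d+)"
BEGIN = re.compile(r"Manual switchover to path " + PATH + r" begins")
DONE = re.compile(r"Manual switchover to " + PATH + r" completed successfully")
FAILED = re.compile(r"Did not switchover to " + PATH)


def switchover_outcomes(lines: Iterable[str]) -> Dict[Tuple[int, int, int], str]:
    """Map each (adapter, target, lun) to the state of its latest switchover.

    "in progress": a switchover began but no result is logged yet,
    "ok": completed successfully, "FAILED": did not switch over.
    """
    state: Dict[Tuple[int, int, int], str] = {}
    for line in lines:
        for pattern, outcome in ((BEGIN, "in progress"), (DONE, "ok"), (FAILED, "FAILED")):
            m = pattern.search(line)
            if m:
                adapter, target, lun = map(int, m.groups())
                state[(adapter, target, lun)] = outcome  # later lines win
                break
    return state
```

Any path left at "FAILED" (or stuck at "in progress") is the one to alert on.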

                   

                  Further, if ESX 3.x detects a hardware failure of the HBA, it may also log the following text:

                  "Unrecoverable hardware error : Adapter being marked offline"


                   

                  Where to look?  If it has already been written to the logs, you can check for it here:

                  /var/log/vmkernel

                   

                  Or, for real-time access to log messages that haven't been written to disk yet:

                  /proc/vmware/log

                   

                  What to look for?  The key word here is 'switchover'
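Combining 'switchover' with the hardware-error text gives a simple per-line filter for real-time alerting. A Python 3 sketch; the two severity levels and the `severity` function are a hypothetical scheme, not something ESX reports itself, and the phrases are the ones quoted in this reply.

```python
from typing import Optional

# Phrases from the messages quoted above (assumed; verify on your build).
CRITICAL = ("did not switchover", "adapter being marked offline")
WARNING = ("switchover", "failover", "changing active path")


def severity(line: str) -> Optional[str]:
    """Return "critical", "warning", or None for one vmkernel log line."""
    lowered = line.lower()
    if any(phrase in lowered for phrase in CRITICAL):
        return "critical"  # path lost or HBA taken offline
    if any(phrase in lowered for phrase in WARNING):
        return "warning"   # a path change is happening; may recover
    return None
```

Feeding each new line from /var/log/vmkernel through this and alerting on anything not None keeps the check to a single keyword pass per line.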

                  • 6. Re: HBA has failed alarm?
                    jparnell Hot Shot

                    What hardware are you using?

                     

                    We've got HP DL585s. We installed the HP Insight agents on each of our hosts and set them up to send SNMP traps to HP OpenView Operations when things like HBA or NIC connections fail.

                     

                    I'm sure there are similar agents for Dell and IBM hardware.

                    • 7. Re: HBA has failed alarm?
                      conradsia Hot Shot

                      Thanks, everyone; it's all very helpful. We are using IBM x366 servers, and yes, I am looking for real-time notification to let me know an HBA has failed.

                       

                      Grasshopper, your input was very helpful, thanks.

                       

                      I can look through the logs myself, but I really don't want to have to; I'm looking into what damage loading a Director agent onto ESX would do. The last time I loaded it, it used Java, which I really didn't like.

                       

                      I don't get anything on the HBAs from SNMP or from the normal Director agent, but I'm going to install an RSA card and see if that does anything for me.

                       

                      Conrad

                      • 8. Re: HBA has failed alarm?
                        bretti Expert

                        Unfortunately, I don't think Director or an RSA card will alert you to failed HBAs.

                         

                        I agree with you on loading Director; it's not a good idea. In my opinion, the Director software causes more trouble than it's worth. The RSAs are really nice, but both Director and the RSAs seem limited in their ability to report on HBA problems.

                         

                        How are you at shell scripts?

                        • 9. Re: HBA has failed alarm?
                          conradsia Hot Shot

                          I'm starting to get the impression that the only way is going to be with a shell script. We've loaded the RSA card but so far it doesn't look too promising.
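Whichever way the log is read, the alert still has to reach someone. A minimal Python 3 sketch using the standard library's `email` and `smtplib` modules; the host name, the addresses, and an SMTP relay reachable from the ESX service console are assumptions, and `build_alert` / `send_alert` are hypothetical helpers.

```python
import smtplib
from email.message import EmailMessage
from typing import List


def build_alert(host: str, lines: List[str], sender: str, recipient: str) -> EmailMessage:
    """Compose a plain-text alert listing the suspicious log lines."""
    msg = EmailMessage()
    msg["Subject"] = f"[{host}] possible HBA/path failure: {len(lines)} log line(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines) + "\n")
    return msg


def send_alert(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Hand the message to an SMTP relay (assumed to accept mail from this host)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

For example, `send_alert(build_alert("esx01", lines, "esx01@example.com", "ops@example.com"))`, where `lines` is the output of whatever log check you run.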

                           

                          Thanks for everyone's input. If lightning strikes in the middle of the night and you get some genius idea, please let me know.

                           

                          If I script something that works I'll be more than happy to post it.

                           

                          Conrad