Solved: Re: Low level Hardware monitoring

arturo_hyperic · ‎03-27-2008

Folks :

We are currently evaluating Hyperic for a Solaris installation and would like to know whether it can address a key concern: HW monitoring. The customer does not like Sun Management Center (a Sun product whose general monitoring is quite poor), BUT it does provide good, detailed low-level HW monitoring (fans, power supply voltage levels, hard drive status, etc.).

With Hyperic I know you can monitor CPU / memory / processes, but I need to know whether it is possible to get at this hardware information somehow (the simpler the better) using Hyperic. Is this configurable in the agent?

Regards

Arturo

scottmf · ‎03-27-2008

SMC has a lot of hooks into SPARC hardware that you can't find anywhere else. Each SPARC model has totally different HW specs, which makes it very hard to keep updating the hooks that show you the box status.

One thing to consider is writing a custom plugin (not too difficult) that calls prtdiag -v and looks at the exit status. If the exit status is greater than 0 (which would indicate something is wrong with the hardware), you could capture the output of prtdiag and track it as a log event, so you can correlate the event and see what is wrong.
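To make the idea concrete, here is a minimal sketch of the check itself, not a finished Hyperic plugin. It assumes Python 3 is available on the box and that prtdiag lives at /usr/sbin/prtdiag (on some platforms it is under /usr/platform instead). The function name check_hardware is made up for illustration, and how the result gets fed into Hyperic depends on the Script Plugin docs.

```python
import subprocess
import sys

# Assumed location of prtdiag; adjust for your platform.
PRTDIAG_CMD = ("/usr/sbin/prtdiag", "-v")


def check_hardware(cmd=PRTDIAG_CMD):
    """Run the diagnostic command and return (ok, output).

    ok is True only when the command exits with status 0.
    output is the combined stdout/stderr text, so it can be
    logged when something is wrong.
    """
    try:
        proc = subprocess.run(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # The command is missing or not executable: report as a failure.
        return False, "could not run %s: %s" % (cmd[0], exc)
    return proc.returncode == 0, proc.stdout


def main():
    """Print the prtdiag output only on failure; return an exit code."""
    ok, output = check_hardware()
    if not ok:
        sys.stdout.write(output)
    return 0 if ok else 1
```

Calling sys.exit(main()) from a wrapper script would pass the status on to whatever runs it, and the printed text is what you would track as the log event.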

arturo_hyperic · ‎03-27-2008

Yep, SMC does its stuff great ... about the option you mentioned (a plugin calling prtdiag -v): can it provide detailed information on the "status" (in terms of "working / non-working") of specific components (hard drives, fans, etc.)? Do you have an idea of how hard this could be (in terms of time) for a sysadmin with some experience?

AM

scottmf · ‎03-27-2008

There is also this (for x64 boxes) -> http://support.hyperic.com/confluence/display/hypcomm/Sun+X64+Systems
(it uses SNMP to show the temps and fan status, and may take some customization)

prtdiag is a really good way to check anything on the motherboard, and it is now on x64 Solaris 10 (I think u2 and above). Unfortunately its output can be vastly different between releases / platforms / patch revisions, but the exit status is consistent, which is nice.

Here are the docs to create a script plugin ->

http://support.hyperic.com/confluence/display/DOC/Script+Plugin

For Solaris monitoring in general, watching prtdiag and /var/adm/messages is key. I believe most (if not all) disk failures are captured in /var/adm/messages. You could also monitor your disk suite (SVM, Veritas, ZFS, etc.).
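As a rough illustration of the /var/adm/messages side, the sketch below just filters log lines against a few keywords. The keyword list is only a guess at what disk trouble tends to look like, so tune it against real entries from your own boxes. The names suspicious_lines and scan_log are made up for this example.

```python
import re

# Illustrative keywords only; adjust to what your systems actually log.
PATTERNS = re.compile(r"WARNING|Fatal|transport failed|I/O error", re.IGNORECASE)


def suspicious_lines(lines):
    """Return the lines that match any of the keywords, without newlines."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]


def scan_log(path="/var/adm/messages"):
    """Read the log file and return its suspicious lines."""
    # errors="replace" keeps odd bytes in the log from aborting the scan.
    with open(path, "r", errors="replace") as handle:
        return suspicious_lines(handle)
```

A script like this rescans the whole file on every run, so for anything beyond a quick check you would want to remember the last position read, or let the monitoring tool's own log tracking do that part.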
