Dell™ Server Hardware Monitoring with OpenManage™ and Nagios®

 Contents:

1   Main features

Advanced hardware discovery
The plugin will search the monitored server for hardware components and monitor them. No need to tune the plugin to match different server models etc.
Blade detection
The plugin will identify blade servers as such and will not report fans and power supplies to be "missing" on blade systems.
Remote or local check
The plugin can check the system remotely via SNMP, or locally by using omreport commands.
Performance data
The plugin can give performance data with the -p or --perfdata switch. Performance data collected include temperatures, fan speeds and power usage (on servers that support it).
Highly customizable
A multitude of options lets the user tailor the plugin to meet his or her specific needs.

2   Basic Overview

check_openmanage is a plugin for Nagios which checks the hardware health of Dell PowerEdge (and some PowerVault) servers. It uses the Dell OpenManage Server Administrator (OMSA) software, which must be running on the monitored system. check_openmanage can be used remotely with SNMP or locally with NRPEcheck_by_ssh or similar, whichever suits your needs and particular taste. The plugin checks the health of the storage subsystem, power supplies, memory modules, temperature probes etc., and gives an alert if any of the components are faulty or operate outside normal parameters.

Nagios and check_openmanage

Storage components checked:

  • Controllers
  • Physical drives
  • Logical drives
  • Cache batteries
  • Connectors (channels)
  • Enclosures
  • Enclosure fans
  • Enclosure power supplies
  • Enclosure temperature probes
  • Enclosure management modules (EMMs)

Chassis components checked:

  • Processors
  • Memory modules
  • Cooling fans
  • Temperature probes
  • Power supplies
  • Batteries
  • Voltage probes
  • Power usage
  • Chassis intrusion
  • Removable flash media (SD cards)

Other:

  • ESM Log health
  • ESM Log content (default disabled)
  • Alert Log content (default disabled, not SNMP)

check_openmanage will identify blades and will not report "missing" power supplies, cooling fans etc. on blades. It will also accept that other components are "missing", unless for components that should be present in all servers. For example, all servers should have at least one temperature probe, but not all servers have logical drives (depends on the type and configuration of the controller).

screenshot

This nagios plugin is designed to be used by either NRPE or with SNMP. It is written in perl. In NRPE mode, it uses omreport to display status on various hardware components. In SNMP mode, it checks the same components as with omreport. Output is parsed and reported in a Nagios friendly way.

check_openmanage provides handy options that gives you the possibility to blacklist one or more hardware components that you won't fix, and to fine-tune which components are checked in the first place. Blacklisting and check control are described later in this document.

check_openmanage has been testet on a variety of Dell servers running RHEL4, RHEL5, VMware ESX and various Windows releases (with SNMP), with recent OpenManage versions. It has been tested and runs successfully on the following Dell PowerEdge models at our site: 1750, 1800, 1850, 1950, 1955, 2600, 2650, 2800, 2850, 2900, 2950, 6650, 6950, 750, 850, M600, M610, R610, R710, T710, R805, R900, R910.

3   Prerequisites

3.1   Perl interpreter

check_openmanage needs a normal perl interpreter, version 5.6.0 or later. The plugin assumes that perl is available as /usr/bin/perl, but you can easily change this as you wish by editing the first line in the script.

For SNMP, you'll also need the perl module Net::SNMP on the Nagios server (or the server running the queries). This module is not part of perl itself, but is available in all modern Linux distributions. Installing Net::SNMP is quite easy:

  • For RHEL/CentOS 5.x the best way is to use EPEL:

    rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
    yum install perl-Net-SNMP
    
  • For Fedora:

    yum install perl-Net-SNMP
    
  • For SuSE:

    rug install perl-Net-SNMP
    
  • For Debian and Ubuntu:

    aptitude install libnet-snmp-perl
    

3.2   Dell Openmanage Server Administrator

check_openmanage relies heavily on Dell Openmanage Server Administrator (OMSA) and will not function if this software is not installed on the monitored server. To obtain OMSA, go to http://support.dell.com/, select "Drivers and downloads", select your server model, and download the package called "Openmanage Server Administrator Managed Node" under "Systems Management".

Important

check_openmanage is tested with OMSA version 5.3 or later. Older versions may work, but no guarantees. YMMV.

There are also official Dell repositories for Red Hat and SuSE. For information about these, look here: http://linux.dell.com/repo/hardware/

4   Intelligent plugin

check_openmanage is an intelligent plugin. It will by itself discover which hardware components are present in the server and monitor them. It does this because it assumes that most systems administrators are lazy, and are not interested in configuring the plugin to match different server models, blade vs. standalone etc. It should just work. Missing hardware components are ignored by the plugin, unless they should be present in all servers, such as CPUs.

This comes with a price, however. For hardware components that are allowed to be missing, if OMSA doesn't display them the component will be ignored by the plugin.

The following components are allowed to be missing on all servers:

  • Amperage probes
  • Power supplies
  • Batteries
  • Intrusion detection sensor
  • Removable flash media (SD cards)

Note

As of version 3.6.0, storage controllers are no longer allowed to be missing.

In addition, the following components are allowed to be missing if the server is identified as a blade system:

  • Cooling (fans)

If you're using the plugin via SNMP, without blacklisting and without removing checks with the --check option, check_openmanage will also check the global health status for added security.

5   Multiline output

Since check_openmanage monitors several things, the plugin's output will sometime contain multiple lines. These lines will be separated by HTML linebreaks (<br/>) if run as a command within Nagios, via NRPE etc. If run from a console which has a TTY, i.e. if you log in via SSH or similar and run check_openmanage manually, the linebreaks will be regular linebreaks.

Nagios 3.x allows the following option in cgi.cfg:

# ESCAPE HTML TAGS
# This option determines whether HTML tags in host and service
# status output is escaped in the web interface.  If enabled,
# your plugin output will not be able to contain clickable links.
escape_html_tags=1

The default, as seen above in the sample cgi.cfg from the distribution, is that HTML tags are escaped. My advice is to turn this off. If not, you will see literal HTML linebreaks in the Nagios console, i.e. like this:

Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost<br/>Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date

instead of this:

Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date

With Nagios 3.x, plugins are allowed to output multiple lines with regular linebreaks, but only the first line is shown in the web interface (status.cgi).

6   Getting started

Attention!

This is a short HOWTO that describes how to get started with using check_openmanage. This HOWTO assumes that the prerequisites are met, and that you have a Nagios server up and running. Nagios version 3.x is assumed.

The examples below are simple examples with very basic usage of check_openmanage. There are many more or less advanced options that you might consider useful. Se theUsage section for info.

Also note that the examples below are just that: examples. They describe one way of doing things, that is simple and straightforward. There are many other ways of configuring Nagios, this is ultimately up to you.

The first thing to consider is by which mechanism you want check_openmanage to check your Dell servers. You can run the script remotely on the Nagios server, probing the Dell servers via SNMP, or you can use NRPE, check_by_ssh etc. and run the plugin locally on the Dell servers. This should be an informed choice, but you can always change it later. If you have mostly Linux boxes, you may want to use NRPE and run the plugin locally. If you have mostly Windows boxes, you may want to check via SNMP. You can easily do both, e.g. NRPE for Linux and SNMP for Windows, but you'll have to define different commands in your Nagios config for each mechanism.

6.1   Creating a hostgroup

The first thing you want to do is create a hostgroup that contains your Dell servers, if you haven't already done so. If you have very few Dell servers you can skip this step and use hosts in the service definition instead, but I think hostgroups are always better:

# hostgroup for Dell servers
define hostgroup {
    hostgroup_name  dell-servers
    alias           Dell Servers
}

6.2   Defining the hosts

You'll need a host definition for each of the servers. You probably already have this in place, but for the sake of completeness it is included in this mini-howto:

define host {
    host_name       my-server1.foo.org
    alias           my-server1
    address         192.168.10.12
    use             generic-host
    hostgroups      dell-servers
    contact_groups  example@foo.org
}

6.3   Creating a servicegroup

Next you want to create a servicegroup for this service. This is not required, but it makes things easier when you want to inspect your Dell servers via Nagios' web interface. Creating a servicegroup is simple:

# Servicegroup for Dell OpenManage
define servicegroup {
    servicegroup_name         dell-openmanage
    alias                     Dell server health status
}

The servicegroup is used later in the service definition.

6.4   Remote check via SNMP

6.4.1   Defining a command

The next step is to define a command for check_openmanage:

# Openmanage check via SNMP
define command {
    command_name    check_openmanage
    command_line    /path/to/check_openmanage -H $HOSTADDRESS$
}

(Replace /path/to with the actual path leading up to the plugin).

Note that is is a very basic example of check_openmanage usage. Refer to the usage section for info about the different options that alters the behaviour of check_openmanage.

6.4.2   Defining the service

Finally, you define the service:

# Dell OMSA status
define service {
    use                       generic-service
    hostgroup_name            dell-servers
    servicegroups             dell-openmanage
    service_description       Dell OMSA
    check_command             check_openmanage
    notes_url                 http://folk.uio.no/trondham/software/check_openmanage.html
}

The notes_url statement is optional.

6.5   Local check via NRPE

If you want to use NRPE, I assume that you have defined a check_nrpe command elsewhere in your config and are ready to use it. Usually, we don't define a command for check_openmanage when using NRPE, so we go right to the service definition:

# Dell OMSA status
define service {
    use                       generic-service
    hostgroup_name            dell-servers,!evil-dell-servers
    servicegroups             dell-openmanage
    service_description       Dell OMSA
    check_command             check_nrpe!check_openmanage
    notes_url                 http://folk.uio.no/trondham/software/check_openmanage.html
}

In this example we have opted to check the hostgroup "dell-servers", while ignoring the hostgroup "evil-dell-servers" (which contains Dell servers that can't run OMSA). Refer to the Nagios documentation to read up on hostgroups.

The NRPE config has the following:

command[check_openmanage]=/path/to/check_openmanage

(Replace /path/to with the actual path leading up to the plugin, or a correct macro such as $USER1$.)

Note that is is a very basic example of check_openmanage usage. Refer to the Usage section for info about the different options that alters the behaviour of check_openmanage.

7   Usage

7.1   Help output

The option -h or --help will give a short usage information, that includes the most commonly used options. For more information, see the manual page.

$ check_openmanage -h
Usage: check_openmanage [OPTION]...
GENERAL OPTIONS:
   -p, --perfdata      Output performance data [default=no]
   -t, --timeout       Plugin timeout in seconds [default=30]
   -c, --critical      Custom temperature critical limits
   -w, --warning       Custom temperature warning limits
   -d, --debug         Debug output, reports everything
   -h, --help          Display this help text
   -V, --version       Display version info
SNMP OPTIONS:
   -H, --hostname      Hostname or IP (required for SNMP)
   -C, --community     SNMP community string [default=public]
   -P, --protocol      SNMP protocol version [default=2]
   --port              SNMP port number [default=161]
   -6, --ipv6          Use IPv6 instead of IPv4 [default=no]
   --tcp               Use TCP instead of UDP [default=no]
OUTPUT OPTIONS:
   -i, --info          Prefix any alerts with the service tag
   -e, --extinfo       Append system info to alerts
   -s, --state         Prefix alerts with alert state
   -S, --short-state   Prefix alerts with alert state abbreviated
   -o, --okinfo        Verbosity when check result is OK
   -I, --htmlinfo      HTML output with clickable links
CHECK CONTROL AND BLACKLISTING:
   -a, --all           Check everything, even log content
   -b, --blacklist     Blacklist missing and/or failed components
   --only              Only check a certain component or alert type
   --check             Fine-tune which components are checked
   --no-storage        Don't check storage
For more information and advanced options, see the manual page or URL:
  http://folk.uio.no/trondham/software/check_openmanage.html

7.2   Local check

Run locally or via NRPE, check_openmanage will use omreport to display info on hardware components, and report the result:

$ check_openmanage
OK - System: 'PowerEdge R710', SN: 'XXXXXX', 24 GB ram (6 dimms), 1 logical drives, 2 physical drives

Any user is allowed to run omreport, so you don't need any sudo mechanisms or similar.

7.2.1   Local check on Windows

If SNMP just isn't your cup of tea, you can use check_openmanage natively on Windows by either

  • Install a Windows Perl interpreter, e.g. Strawberry Perl and use the plugin as a normal perl script, or
  • Use the file check_openmanage.exe, which is included in the ZIP achive and gzipped tarball.

The file check_openmanage.exe is a Win32 executable binary produced with Microsoft Visual Studio 2010, Strawberry Perl and the perl module PAR::Packer. See also this howto.

The Win32 executable can be dropped directly into the NSClient++ directory. Add to the file nsc.ini under [NRPE Handlers]:

command[check_openmanage]=check_openmanage.exe

The restart the nsc service. Call the plugin from Nagios as:

check_command check_nrpe!check_openmanage

7.3   Remote check

Note

If the -H or --hostname option is present, the plugin will automatically use SNMP to communicate with the monitored system.

The perl module Net::SNMP must be installed for check_openmanage to work in SNMP mode. See prerequisites.

check_openmanage is fully compatible with Nagios' embedded perl interpreter (ePN).

The plugin can query the monitored host remotely via SNMP. Prerequisites for this are that the monitored host is running an SNMP agent, that OMSA is installed and running with SNMP support, and that the Nagios server is allowed to communicate with the host over SNMP. The -H or --hostname option is needed for the hostname/IP you want to check.

$ check_openmanage -H myhost
OK - System: 'PowerEdge R710', SN: 'XXXXXXX', 24 GB ram (6 dimms), 1 logical drives, 2 physical drives

You can specify the SNMP community string (for SNMP version 1 and 2c) with the -C or --community option. Default community is set to "public" if the option is not present:

$ check_openmanage -H myhost -C mycommunity
OK - System: 'PowerEdge R710', SN: 'XXXXXXX', 24 GB ram (6 dimms), 1 logical drives, 2 physical drives

You can also specify SNMP protocol version with the -P or --protocol option. Default is 2 (i.e. SNMP version 2c) if the option is not present. Changing the SNMP protocol version is usually not needed. Note that SNMP protocol version 2c (default) is significantly faster than version 1. SNMP version 3 requires additional authentication options to be specified. Also note that Windows does not support SNMPv3 natively.

For other SNMP options, refer to the usage section and the manual page..

For details on how to enable SNMP on Windows 2003 server, refer to

Caution!

SNMP daemon on Windows is unstable with some OMSA versions

Many users have reported that the SNMP daemon on Windows dies occasionally. It seems that OMSA versions prior to 5.5.0 have this problem, while 5.5.0 (or better yet: 5.5.0.1) and later versions do not. If you're using check_openmanage with SNMP to monitor Windows hosts, make sure that you have a recent OMSA version running.

7.4   Output control

The default behaviour of the plugin is to print all alerts on separate lines with no extra fuzz:

$ check_openmanage
Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date

There are several options that allows you to alter this, as listed below.

7.4.1   Prefix alerts with the service state

The -s or --state option will prefix each alert with the full service state:

$ check_openmanage -s
CRITICAL: Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
WARNING: Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date

7.4.2   Prefix alerts with the service state (abbreviated)

Example output with the -S or --short-state option, which does the same, except that the service state is abbreviated to only one letter, i.e. C for CRITICALW for WARNINGetc.:

$ check_openmanage -S
C: Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
W: Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date

7.4.3   Prefix alerts with the service tag

The option -i or --info will prefix all alerts with the service tag:

$ check_openmanage -i
[JV8KH0J] Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
[JV8KH0J] Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date

7.4.4   System info after the alert(s)

The option -e or --extinfo will print the server model and service tag on a separate line at the end of the alert:

$ check_openmanage -e
Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date
------ SYSTEM: PowerEdge 1950, SN: JV8KH0J

7.4.5   Custom line after the alert(s)

If this isn't exactly what you want, you can also specify your own string to be shown on a separate line at the end of the alert, with the --postmsg option:

$ check_openmanage --postmsg 'NOTE: Service tag: %s - Dell support: 555-1234-5678'
Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date
NOTE: Service tag: JV8KH0J - Dell support: 555-1234-5678

The argument is either a string with the message, or a file containing that string. You can control the format with the following interpreted sequences:

CodeReplaced with
%mSystem model
%sService tag
%bBIOS version
%dBIOS release date
%oOperating system name
%rOperating system release
%pNumber of physical drives
%lNumber of logical drives
%nLine break
%%A literal %

The full range of the control format for --postmsg is also available in the manual page.

7.4.6   Combination of output options

You can combine any of these options. A simple example:

$ check_openmanage -s -e
CRITICAL: Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
WARNING: Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date
------ SYSTEM: PowerEdge 1950, SN: JV8KH0J

A more advanced example:

$ check_openmanage -s -i --postmsg 'NOTE: Handled in RT ticket #123456'
CRITICAL: [JV8KH0J] Power Supply 0 [AC]: Presence Detected, Failure Detected, AC Lost
WARNING: [JV8KH0J] Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date
NOTE: Handled in RT ticket #123456

Which (combination) of these options you choose to use, if any, depends on how you use Nagios and your personal preference.

7.4.8   Verbosity when everything is ok

The default behaviour of the plugin is to output a single line when there are no alerts:

$ check_openmanage
OK - System: 'PowerEdge M600', SN: 'XXXXXXX', 24 GB ram (6 dimms), 1 logical drives, 2 physical drives

The option -o or --ok-info takes an integer as argument, and lets you control the amount of output given by the plugin in an "OK" state. The higher the integer, the more output. The argument is cumulative. A value of 1 outputs BIOS and firmware info on a separate line:

$ check_openmanage -H myhost -o 1
OK - System: 'PowerEdge 2950', SN: 'XXXXXXX', 4 GB ram (4 dimms), 4 logical drives, 47 physical drives
----- BIOS='2.5.0 09/12/2008', DRAC5='1.45', BMC='2.37'

A value of 2 also outputs firmware, driver etc. for storage controllers and enclosures (including backplane):

$ check_openmanage -o 2
OK - System: 'PowerEdge 2950', SN: 'XXXXXXX', 16 GB ram (8 dimms), 4 logical drives, 47 physical drives
----- BIOS='2.5.0 09/12/2008', DRAC5='1.45', BMC='2.37'
----- Ctrl 0 [PERC 6/i Integrated]: Fw='6.2.0-0013', Dr='00.00.04.01-RH1'
----- Ctrl 1 [PERC 6/E Adapter]: Fw='6.2.0-0013', Dr='00.00.04.01-RH1'
----- Ctrl 2 [PERC 6/E Adapter]: Fw='6.2.0-0013', Dr='00.00.04.01-RH1'
----- Encl 0:0:0 [Backplane]: Fw='1.05'
----- Encl 1:0:0 [MD1000]: Fw='A.04'
----- Encl 2:0:0 [MD1000]: Fw='A.04'
----- Encl 2:1:0 [MD1000]: Fw='A.04'

A value of 3 (or above) will also include the OMSA version:

$ check_openmanage -o 3
OK - System: 'PowerEdge 2950', SN: 'XXXXXXX', 16 GB ram (8 dimms), 4 logical drives, 47 physical drives
----- BIOS='2.5.0 09/12/2008', DRAC5='1.45', BMC='2.37'
----- Ctrl 0 [PERC 6/i Integrated]: Fw='6.2.0-0013', Dr='00.00.04.01-RH1'
----- Ctrl 1 [PERC 6/E Adapter]: Fw='6.2.0-0013', Dr='00.00.04.01-RH1'
----- Ctrl 2 [PERC 6/E Adapter]: Fw='6.2.0-0013', Dr='00.00.04.01-RH1'
----- Encl 0:0:0 [Backplane]: Fw='1.05'
----- Encl 1:0:0 [MD1000]: Fw='A.04'
----- Encl 2:0:0 [MD1000]: Fw='A.04'
----- Encl 2:1:0 [MD1000]: Fw='A.04'
----- OpenManage Server Administrator (OMSA) version: '5.5.0'

Most would go for the default (value of 0), i.e. a single line.

Warning

When the plugin is run locally, i.e. via NRPE, the only way for check_openmanage to determine the OMSA version is to run the command "omreport about". This command is very slow, and will effectively double the amount of time check_openmanage takes to run.

When used via SNMP, this is not an issue.

7.5   Debug output

If given option -d or --debug, check_openmanage will output messages about all the checked components, along with their respectible alert states and a unique identifier, if applicable. The identifier is used whenever you want to blacklist a component, i.e. prevent it from being checked. Blacklisting is discussed later. An example debug output is given below.

   System:      PowerEdge R710 II        OMSA version:    6.3.0
   ServiceTag:  XXXXXXX                  Plugin version:  3.6.2
   BIOS/date:   2.1.9 05/21/2010         Operation mode:  SNMPv2 UDP/IPv4
-----------------------------------------------------------------------------
   Storage Components
=============================================================================
  STATE  |    ID    |  MESSAGE TEXT
---------+----------+--------------------------------------------------------
      OK |        0 | Controller 0 [PERC H700 Integrated] is Ready
      OK |  0:0:0:0 | Physical Disk 0:0:0 [SAS-HDD 146GB] on ctrl 0 is Online
      OK |  0:0:0:1 | Physical Disk 0:0:1 [SAS-HDD 146GB] on ctrl 0 is Online
      OK |  0:0:0:2 | Physical Disk 0:0:2 [SATA-SSD 100GB] on ctrl 0 is Online
      OK |  0:0:0:3 | Physical Disk 0:0:3 [SATA-SSD 100GB] on ctrl 0 is Online
      OK |  0:1:0:4 | Physical Disk 1:0:4 [SAS-HDD 600GB] on ctrl 0 is Online
      OK |  0:1:0:5 | Physical Disk 1:0:5 [SAS-HDD 600GB] on ctrl 0 is Online
      OK |      0:0 | Logical Drive '/dev/sda' [RAID-1, 136.12 GB] is Ready
      OK |      0:1 | Logical Drive '/dev/sdb' [RAID-1, 92.62 GB] is Ready
      OK |      0:2 | Logical Drive '/dev/sdc' [RAID-1, 558.38 GB] is Ready
      OK |      0:0 | Cache Battery 0 in controller 0 is Ready
      OK |      0:0 | Connector 0 [SAS] on controller 0 is Ready
      OK |      0:1 | Connector 1 [SAS] on controller 0 is Ready
      OK |    0:0:0 | Enclosure 0:0:0 [Backplane] on controller 0 is Ready
      OK |    0:1:0 | Enclosure 0:1:0 [Backplane] on controller 0 is Ready
-----------------------------------------------------------------------------
   Chassis Components
=============================================================================
  STATE  |  ID  |  MESSAGE TEXT
---------+------+------------------------------------------------------------
      OK |    0 | Memory module 0 [DIMM_A1, 8192 MB] is Ok
      OK |    1 | Memory module 1 [DIMM_A2, 8192 MB] is Ok
      OK |    2 | Memory module 2 [DIMM_A3, 8192 MB] is Ok
      OK |    3 | Memory module 3 [DIMM_A4, 8192 MB] is Ok
      OK |    4 | Memory module 4 [DIMM_A5, 8192 MB] is Ok
      OK |    5 | Memory module 5 [DIMM_A6, 8192 MB] is Ok
      OK |    6 | Memory module 6 [DIMM_A7, 8192 MB] is Ok
      OK |    7 | Memory module 7 [DIMM_A8, 8192 MB] is Ok
      OK |    8 | Memory module 8 [DIMM_A9, 8192 MB] is Ok
      OK |    9 | Memory module 9 [DIMM_B1, 8192 MB] is Ok
      OK |   10 | Memory module 10 [DIMM_B2, 8192 MB] is Ok
      OK |   11 | Memory module 11 [DIMM_B3, 8192 MB] is Ok
      OK |   12 | Memory module 12 [DIMM_B4, 8192 MB] is Ok
      OK |   13 | Memory module 13 [DIMM_B5, 8192 MB] is Ok
      OK |   14 | Memory module 14 [DIMM_B6, 8192 MB] is Ok
      OK |   15 | Memory module 15 [DIMM_B7, 8192 MB] is Ok
      OK |   16 | Memory module 16 [DIMM_B8, 8192 MB] is Ok
      OK |   17 | Memory module 17 [DIMM_B9, 8192 MB] is Ok
      OK |    0 | Chassis fan 0 [System Board FAN 1 RPM]: 3600
      OK |    1 | Chassis fan 1 [System Board FAN 2 RPM]: 3600
      OK |    2 | Chassis fan 2 [System Board FAN 3 RPM]: 3600
      OK |    3 | Chassis fan 3 [System Board FAN 5 RPM]: 3600
      OK |    4 | Chassis fan 4 [System Board FAN 4 RPM]: 3600
      OK |    0 | Power Supply 0 [AC]: Presence detected
      OK |    1 | Power Supply 1 [AC]: Presence detected
      OK |    0 | Temperature Probe 0 [System Board Ambient Temp] reads 20 C (min=8/3, max=42/47)
      OK |    0 | Processor 0 [Intel Xeon L5640 2.27GHz] is Present
      OK |    1 | Processor 1 [Intel Xeon L5640 2.27GHz] is Present
      OK |    0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
      OK |    1 | Voltage sensor 1 [CPU2 VCORE PG] is Good
      OK |    2 | Voltage sensor 2 [CPU2 0.75 VTT CPU2 PG] is Good
      OK |    3 | Voltage sensor 3 [CPU1 0.75 VTT CPU1 PG] is Good
      OK |    4 | Voltage sensor 4 [System Board 1.5V PG] is Good
      OK |    5 | Voltage sensor 5 [System Board 1.8V PG] is Good
      OK |    6 | Voltage sensor 6 [System Board 3.3V PG] is Good
      OK |    7 | Voltage sensor 7 [System Board 5V PG] is Good
      OK |    8 | Voltage sensor 8 [CPU2 MEM PG] is Good
      OK |    9 | Voltage sensor 9 [CPU1 MEM PG] is Good
      OK |   10 | Voltage sensor 10 [CPU2 VTT PG] is Good
      OK |   11 | Voltage sensor 11 [CPU1 VTT PG] is Good
      OK |   12 | Voltage sensor 12 [System Board 0.9V PG] is Good
      OK |   13 | Voltage sensor 13 [CPU2 1.8 PLL  PG] is Good
      OK |   14 | Voltage sensor 14 [CPU1 1.8 PLL PG] is Good
      OK |   15 | Voltage sensor 15 [System Board 8.0 V PG] is Good
      OK |   16 | Voltage sensor 16 [System Board 1.1 V PG] is Good
      OK |   17 | Voltage sensor 17 [System Board 1.0 LOM PG] is Good
      OK |   18 | Voltage sensor 18 [System Board 1.0 AUX PG] is Good
      OK |   19 | Voltage sensor 19 [System Board 1.05 V PG] is Good
      OK |   20 | Voltage sensor 20 [PS 1 Voltage] is 234.000 V
      OK |   21 | Voltage sensor 21 [PS 2 Voltage] is 228.000 V
      OK |    0 | Battery probe 0 [System Board CMOS Battery] is Presence Detected
      OK |    0 | Amperage probe 0 [PS 1 Current] reads 0.4 A
      OK |    1 | Amperage probe 1 [PS 2 Current] reads 0.4 A
      OK |    2 | Amperage probe 2 [System Board System Level] reads 175 W
      OK |    0 | Chassis intrusion 0 detection: Ok (Not Breached)
      OK |    0 | SD Card 0 [vFlash] is Absent
-----------------------------------------------------------------------------
   Other messages
=============================================================================
  STATE  |  MESSAGE TEXT
---------+-------------------------------------------------------------------
      OK | ESM log health is Ok (less than 80% full)

The debug output will vary from model to model and server to server. Among other things, it depends on what components exist in the server in the first place. The above example is typical, of a regular PowerEdge R710 server with a few disks.

Warning

The option -d or --debug is intended for diagnostics and debugging purposes only. Do not use this option from within Nagios, i.e. in your Nagios config.

7.6   Custom temperature thresholds

OMSA temperature thresholds

Openmanage (OMSA) has its own temperature thresholds. You can easily view these with omreport:

myhost # omreport chassis temps
Temperature Probes Information
------------------------------------
Main System Chassis Temperatures: Ok
------------------------------------
Index                     : 0
Status                    : Ok
Probe Name                : System Board Ambient Temp
Reading                   : 17.0 C
Minimum Warning Threshold : 8.0 C
Maximum Warning Threshold : 42.0 C
Minimum Failure Threshold : 3.0 C
Maximum Failure Threshold : 47.0 C

Check_openmanage will also output the OMSA thresholds when run in debug mode, example:

$ check_openmanage -H myhost --only temp -d
   System:      PowerEdge M600
   ServiceTag:  XXXXXXX                  OMSA version:    6.1.0
   BIOS/date:   2.1.4 08/15/2008         Plugin version:  3.5.5
-----------------------------------------------------------------------------
   Chassis Components
=============================================================================
  STATE  |  ID  |  MESSAGE TEXT
---------+------+------------------------------------------------------------
      OK |    0 | Temperature Probe 0 [System Board Ambient Temp] reads 17 C (min=8/3, max=42/47)

As you can see the thresholds are given inside the parentheses at the end of the line. Check_openmanage will treat these values as absolute, i.e. if the temperature is beyond these limits, the plugin will issue an alert.

7.6.1   Narrowing the field

OMSA and check_openmanage temperature thresholds

If you wish to narrow the field of OK temperatures (e.g. setting the warning limit for the maximum temperature lower than the OMSA threshold), you can override the OMSA temperature warning and critical thresholds with the -w|--warning and -c|--critical options:

$ check_openmanage -H myhost -w 0=30 -c 0=40
Temperature Probe 0 [System Board Ambient Temp] reads 31 C (custom max=30)

The option takes either a string or a file containing the string with the limits. Syntax is id1=max[/min],id2=max[/min],.... Each of these options can be specified multiple times if needed.

You can also specify a custom minimum temperature:

$ check_openmanage -H myhost -w 0=30/15 -c 0=40/10
Temperature Probe 0 [System Board Ambient Temp] reads 14 C (custom min=15)

7.6.2   Expanding the field

OMSA expanded temperature thresholds

If you wish to expand the field, the check_openmanage options mentioned above won't help you. In this case, use omconfig to adjust the warning thresholds of OMSA itself. If, for the machine above, we wanted the warning threshold to be 45 degrees instead of 43 degrees (the default), we would say:

myhost # omconfig chassis temps index=0 maxwarnthresh=45
Temperature probe warning threshold(s) set successfully.

If you want to reset the thresholds to the system default, use this command:

myhost # omconfig chassis temps index=0 warnthresh=default
Temperature probe warning threshold(s) set successfully.

Note that only the OMSA warning thresholds can be adjusted like this. The failure/critical thresholds are absolute and can't be set manually.

7.7   Blacklisting

You can blacklist failed/missing components that you won't fix. Blacklisting means that the particular component is never checked. The option -b|--blacklist is used for blacklisting, takes either a string or file as input, and can be specified multiple times:

$ check_openmanage -s -H myhost
WARNING: Controller 0 [PERC 6/i Integrated]: Driver '00.00.03.15-RH1' is out of date
WARNING: Controller 1 [PERC 6/E Adapter]: Driver '00.00.03.15-RH1' is out of date
$ check_openmanage -s -H myhost -b ctrl_driver=0,1
OK - System: 'PowerEdge 1950', SN: 'XXXXXXX', 4 GB ram (4 dimms), 2 logical drives, 8 physical drives

Syntax for blacklisting is:

component1=<all|id1,id2,...>/component2=<all|id1,id2,..>/...

I.e. a single line separated by slashes. The component names are listed in the manual page. Here is the list:

ComponentComment
ctrlController
ctrl_fwSuppress the "special" warning message about old controller firmware. Use this if you can't or won't upgrade the firmware.
ctrl_driverSuppress the "special" warning message about old controller driver. Particularly useful on systems where you can't upgrade the driver.
ctrl_stdrSuppress the "special" warning message about old Windows storport driver.
pdiskPhysical disk.
vdiskLogical drive (virtual disk)
batController cache battery
bat_chargeIgnore warnings related to the controller cache battery charging cycle, which happens approximately every 40 days on Dell servers. Note that using this blacklist keyword makes check_openmanage ignore non-critical cache battery errors.
connConnector (channel)
enclEnclosure
encl_fanEnclosure fan
encl_psEnclosure power supply
encl_tempEnclosure temperature probe
encl_emmEnclosure management module (EMM)
dimmMemory module
fanFan (Cooling device)
psPowersupply
tempTemperature sensor
cpuProcessor (CPU)
voltVoltage probe
bpSystem battery
ampAmperage probe (power consumption monitoring)
intrIntrusion sensor
sdRemovable flash media (SD card)

The component IDs are listed in the debug output:

$ check_openmanage -H myhost -d
   System:      poweredge 2850
   ServiceTag:  XXXXXXX                  OMSA version:    6.1.0
   BIOS/date:   A06 10/03/2006           Plugin version:  3.5.4
-----------------------------------------------------------------------------
   Storage Components
=============================================================================
   LVL   |    ID    |  STATE
---------+----------+--------------------------------------------------------
      OK |        0 | Controller 0 [PERC 4e/Di] is Ready
CRITICAL |    0:0:0 | Physical disk 0:0 [Maxtor ATLAS15K2_146SCA, 146GB] on controller 0 needs attention: Failed
      OK |    0:0:1 | Physical disk 0:1 [146GB] on controller 0 is Online
[...etc...]

If we in the above example wished to blacklist the failed disk, we would use the following as input to the -b|--blacklist option:

pdisk=0:0:0

Now the failed disk is not checked at all, and Nagios is happy.

You can also use all instead of the component IDs. For example, if you want to ignore old drivers for all controllers:

$ check_openmanage -b ctrl_driver=all

7.8   Check control

check_openmanage lets you fine-tune which components you want to check via the --check option. By default almost everything is checked (as listed in the basic overview).

The syntax for the --check option is as follows:

component1=<0|1>,component2=<0|1>,...

A value of 0 will turn checking off for the specified component, while a value of 1 will turn checking on. Example:

$ check_openmanage --check storage=0,esmlog=1

In the above example, we turn off checking of the storage subsystem, and adds checking of the ESM log content. You can specify the --check option multiple times. The following example will have the same effect as the one above:

$ check_openmanage --check storage=0 --check esmlog=1

The argument to the --check option can also be a file containing the actual arguments. If we for example make a file /etc/check_openmanage.check that contains the following:

storage=0,esmlog=1

The following example will then have the same effect as the other examples in this paragraph:

$ check_openmanage --check /etc/check_openmanage.check

If the specified file (here /etc/check_openmanage.check) doesn't exist, it is simply ignored. You can also mix "check files" and actual arguments:

$ check_openmanage  --check /etc/check_openmanage.check --check power=0

This option is versatile for a reason. If, for example, you're running check_openmanage locally via NRPE, and want to have the same command and NRPE config for all servers, you can specify a "check file" and still be able to control the checks performed individually on each server.

The list of legal parameters to the --check option is the same as for the --only option described below, except for "critical" and "warning". The full list is also given in themanual page.

7.9   Only check one component type or alert type

You can use the option --only to specify what type of component is desired. If this option is used, only that type of component is checked. Example:

$ check_openmanage --only voltage
VOLTAGE OK - 14 voltage probes checked

The following keywords are accepted by the --only option:

KeywordEffect
criticalOnly output critical alerts. It is possible to use the --check option together with this option to adjust checks.
warningOnly output warning alerts. It is possible to use the --check option together with this option to adjust checks.
chassisOnly check chassis components, i.e. everything but storage and log content.
storageOnly check storage components
memoryOnly check memory modules
fansOnly check fans
powerOnly check power supplies
tempOnly check temperatures
cpuOnly check processors
voltageOnly check voltage probes
batteriesOnly check batteries
amperageOnly check power usage
intrusionOnly check chassis intrusion
sdcardOnly check removable flash media
esmhealthOnly check ESM log health
esmlogOnly check ESM log content
alertlogOnly check alertlog content

A couple of other examples:

$ check_openmanage --only storage -H myhost
STORAGE OK - 4 physical drives, 2 logical drives
$ check_openmanage --only fans -H myhost
FANS OK - 6 fan probes checked
$ check_openmanage --only memory -H myhost
MEMORY OK - 6 memory modules, 24576 MB total memory

Note

The option --only replaces the usage of different basenames that occured in previous versions of check_openmanage.

7.10   Check everything

Use the option -a or --all to turn on checking of everything, even log content:

$ check_openmanage -a
ESM log content: 3 critical, 0 non-critical, 4 ok

8   A note about charging cache batteries

The Dell RAID controllers usually have a cache battery, which is useful in case of a sudden power outage. This battery will on occasion drain out, and needs recharging. The hardware takes care of this, but gives a warning. What happens is this (assuming that the storage subsystem is checked, which is the default):

  1. The hardware senses that the battery has low power, reports it and as a result check_openmanage outputs the following:

    Cache battery 0 in controller 0 is Power Low [probably harmless]
    
  2. After a while, the battery enters "learning state", where the battery learns its capacity, and check_openmanage will report this as:

    Cache battery 0 in controller 0 is Learning (Active) [probably harmless]
    
  3. The battery then enters a recharge state, which check_openmanage will also report:

    Cache battery 0 in controller 0 is Charging [probably harmless]
    

One could argue that check_openmanage should simply ignore these warnings, as they occur regularly on all controllers with a cache battery. They could be viewed as informational. However, for the odd case where the charge cycle never finishes, the warnings are reported by check_openmanage, but with an included note that says that the warning is probably harmless.

Note

You can use the blacklist keyword bat_charge to disable messages about the battery charge cycle.

If the state never changes, e.g. if the warning persists for days, you should act on it. Otherwise, the warnings can be ignored.

9   Performance data

Important

The performance output from check_openmanage has changed slightly as of version 3.5.7. If you are using PNP4Nagios and are upgrading to 3.5.7 or later, you should upgrade the PNP4Nagios template as well. Note also that the latest template is included in PNP4Nagios 0.6.3 and later.

check_openmanage will output performance data if the --perfdata or -p option is used. The performance data gathered will vary depending on the type and model of the monitored server. An example graph using PNP4Nagios is given below.

pnp4nagios

The template used to generate these graphs are available as check_openmanage.php in the downloadable ZIP archive and tarball.

10   Download

10.1   Latest version

Packaged

The RPM packages are created on RHEL5. They will install and work fine on RHEL/CentOS 4, 5, 6 and recent Fedoras. Not sure about SuSE and various other RPM-based distros. The only difference between the i386 and x86_64 RPMs is the install path (/usr/lib vs. /usr/lib64).

The Debian package should work on Debian and Ubuntu. Thanks to Morten Werner Forsbring for help with the Debian package.

Single files

10.2   Changelog / Old versions

VersionDateChanges
3.6.42011-01-04
  • Minor feature enhancements
  • Added more robustness wrt. values from OMSA obtained via SNMP, to avoid internal errors where non-important values are missing.
  • The RPM files are changed to be compliant with Fedora/EPEL guidelines. The new RPM file "nagios-plugins-check-openmanage" obsoletes the previous one "check_openmanage".
3.6.32010-12-13
  • Minor feature enhancements
  • A few compatibility fixes for OMSA 6.4.0 were added
3.6.22010-11-25
  • Minor feature enhancementsMinor bugfixes
  • Added support for IPv6 when checking via SNMP. IPv6 can be turned on with the option -6 or --ipv6. The default is IPv4 if the option is not present.
  • Added support for TCP when checking vis SNMP. The option --tcp can be used to turn on TCP. The default transport protocol is UDP if the option is not present.
  • The mode of operation (local or SNMP) is shown in the debug output. If SNMP is used, the debug output will also show the SNMP protocol version, IP version and transport protocol (UDP or TCP).
  • Amperage probe status via SNMP is of type "probe status", not regular status. This has been fixed.
  • Massive overall robustness improvements to handle OMSA bugs where some information from OMSA is missing.
  • Memory module enumeration via SNMP changed somewhat to reflect enumeration provided by omreport. This ensures that the plugin's output is identical in SNMP or local mode wrt. dimms IDs.
  • Fan enumeration via SNMP changed somewhat to reflect enumeration provided by omreport. This ensures that the plugin's output is identical in SNMP or local mode wrt. fan IDs.
3.6.12010-11-02
  • Minor feature enhancementsMinor bugfixes
  • Included new check for SD cards. Newer servers such as the R710 can have SD cards installed, these should be monitored. The SD card check is on by default. A new blacklisting keyword "sd" has been added. The SD card check can be turned off with "--check sdcard=0".
  • Handle special cases where power monitoring capability is disabled due to non-redundant and/or non-instrumented power supplies.
  • For physical disks probed via SNMP, check that values for vendor, product ID and capacity is available before attempting to display those values.
  • If a physical disk is in sufficiently bad condition, the vendor field reported by OMSA may be empty. The plugin now handles this situation without throwing an internal error.
3.6.02010-08-30
  • Major feature enhancementsMinor bugfixes
  • Storage is no longer allowed to be absent. If the plugin doesn't find a storage controller, it will give an alert. For diskless systems or servers without a Dell controller that OMSA recognizes you will now have to specify "--no-storage" or "--check storage=0" to work around this.
  • Added an option "--no-storage", which is equivalent to the general option "--check storage=0".
  • Report the system revision (if applicable) wherever the model name is printed. E.g. "PowerEdge 2950 III" instead of "PowerEdge 2950".
  • Small change in search path for omreport: The new location for OMSA 6.2.0 and later on Linux will be attempted first.
  • Small bugfix for the "--check" parameter, if the argument is a filename. The file could not contain a linebreak, this has been fixed.
3.5.102010-07-14
  • Minor feature enhancementsMinor bugfixes
  • If a physical disk is a hot spare, display this information in the debug output
  • Report the bus protocol (e.g. SAS, SATA) and media type (e.g. HDD, SDD) for physical disks in the debug output, if applicable
  • Minor fix for 100GB physical disks, write "100GB" instead of "99GB"
  • SNMP: Use new features of OMSA 6.3.0 to display occupied and total slots in storage enclosures, if applicable. This information is not available with omreport and check_openmanage will not display this info in local mode.
  • SNMP: Added new processor IDs from the OMSA 6.3.0 MIBs
  • SNMP: Use connection tables in a proper way to determine controller and enclosure IDs, for use with physical disks and enclosure components (fan, temp sensors etc.). This fixes a long standing bug for servers with more than one controller, if checked via SNMP.
  • SNMP: Use the nexus ID as last resort to find the controller for physical disks. Workaround for older, broken OMSA versions.
  • SNMP: Identify enclosures (e.g. '2:0:0') properly so that the reporting with SNMP corresponds to the same report with omreport.
  • SNMP: added a couple of workarounds for pre-historic OMSA versions
3.5.92010-06-30
  • Minor feature enhancementsMinor bugfixes
  • More fine-grained reporting of temperature warnings for enclosure temperature probes.
  • Max/min temperature limits for enclosure temp probes are reported in the debug output
  • Report enclosure temperature probes that are "Inactive" as ok
  • Don't try to print out the reading of enclosure temperature probes if the reading doesn't exist or is not an integer
  • Report enclosure EMMs that are "Not Installed" as ok, instead of critical
  • Corrected typo in the PNP4Nagios template
3.5.82010-06-17
  • Minor feature enhancements
  • Remove reporting of which controller a logical drive is "attached" to, since this information can't be reliably extracted via SNMP.
  • avoid collecting Lun ID via SNMP for virtual disks, we don't use it
  • report total memory and number of dimms in the ok output
  • difference in reporting if amperage probes have discrete readings
  • workaround for broken amperage probes
  • added workaround for bad temperature probes that yields no reading in SNMP mode
  • get OMSA version via SNMP slightly more efficiently
3.5.72010-03-19
  • Minor feature enhancementsMinor bugfixes
  • Added robustness for received SNMP values that are not defined in the MIB. Instead of throwing a perl warning when this happens, the plugin will not report the undefined value.
  • Defined "Replacing" as a defined state for physical disks in SNMP mode, even though this state is not defined in the MIB. It is reported as such by omreport.
  • Physical disk brand/model is now reported when the state of the disk is "Rebuilding" or "Replacing".
  • The state of a physical disk is reported in parentheses when predictive failure is detected. It is useful to know if a disk is online, offline, spare or even failed when predictive failure is reported.
  • Handling of physical disk predictive failure has been improved overall.
  • Refactoring of the perfdata code. In conformance with the plugin development guidelines, the UOM (unit of measure) previously reported in the perfdata output has been removed.
  • The -p or --perfdata option now takes an optional agrument "minimal", which triggers shorter names for the perfomance data labels. This shortens the output and is a workaround for systems where the amount of output exceeds the 1024 char limit of NRPE.
  • The PNP4Nagios template has been updated. Users of check_openmanage and PNP4Nagios are advised to upgrade. This version of check_openmanage needs the new template.
  • Lots of other small improvements and updates.
3.5.62010-02-23
  • Minor feature enhancements
  • New option --use-get_table is added as a workaround for SNMPv3 on Windows using net-snmp. This option will make check_openmanage use the Net::SNMP function get_table() instead of get_entries() to collect information via SNMP.
  • Include a blacklisting option ctrl_pdisk which takes the controller number as argument. This blacklisting option only works with omreport and is a workaround for broken disk firmwares that contain illegal XML characters. These characters makes openmanage barf and exit with an error. Thanks to Bas Couwenberg for a patch.
  • If the blacklisting keyword "all" is supplied for a component type, that component type is not checked at all, i.e. the commands are never executed. This will make check_openmanage execute faster if blacklisting is heavily used.
  • Option --htmlinfo now has a shorter equivalent -I
  • The option --short-state now has a shorter equivalent -S
3.5.52010-01-22
  • Minor bugfixes
  • Fixed an SNMP bug where the plugin didn't handle OID indexes that were not sequential. Thanks to Gianluca Varenni for reporting.
  • Fixed an SNMP bug when checking old hardware such as the PE 2650 and PE 750. The controller id for physical drives were collected and displayed incorrectly. This release uses an additional OID to fetch this info, which would otherwise be unavailable. Thanks to Gianluca Varenni for reporting this bug.
  • Should use %snmp_probestatus, not %snmp_status when checking the status of voltage probes. Thanks to Ken McKinlay for a patch.
  • Fix when identifying blades via SNMP with very old OMSA versions. Patch from Ken McKinlay.
  • Better way of finding the ID of physical drives via SNMP
3.5.42010-01-13
  • Minor feature enhancements
  • Added support for storport driver version for controllers, only applicable on servers running Windows. A new blacklisting keyword for suppressing storport driver messages was added.
  • The "all" keyword in blacklisting is now case insensitive.
  • More fine-grained reporting in the rare case where a controller battery fails during learning and charging states.
  • New improved way of reporting perl warnings during execution of the plugin.
3.5.32009-12-17
  • Minor bugfixes
  • Fix for path to omreport on Linux with OMSA 6.2.0
  • A couple of other small fixes
3.5.22009-11-17
  • Minor bugfixes
  • Fix for undefined device name for logical drives (thanks to Pontus Fuchs for a patch)
  • Fixed a bug in the PNP4Nagios template, that prevented the template from working with PNP4Nagios 0.6. Thanks to the PNP4Nagios team for the fix.
  • Other small fixes
3.5.12009-10-22
  • Minor feature enhancements
  • CPU type, family etc. are now reported in case of a CPU failure (and in the debug output)
  • The debug output now reports Openmanage version and plugin version
3.5.02009-10-13
  • Major feature enhancements
  • New option -a or --all turns on checking of everything
  • The manual page (POD info) is removed from the script and is now in a separate file, to make check_openmanage fully ePN compatible
  • ePN is no longer disabled by default, check_openmanage no longer has an opinion on whether it should run under ePN or not
  • The -m or --man option is no longer available
  • The option -v or --verbose is renamed to -d or --debug, which makes more sense wrt. its usage
  • The -g or --global option is removed. Checking the global health status is now default if applicable
  • Checking intrusion detection is now turned on by default
  • The obsolete option --snmp is removed
  • The option --state now has a shorter equivalent -s
  • The basename stuff and options --only-critical and --only-warning are now replaced by an option --only
  • If plugin is run by Nagios, redirect stderr to stdout, useful for detecting errors within the plugin
  • Added option --omreport, that lets the user specify the full path to the omreport binary
  • Added non-8bit-legacy default search paths for omreport.exe for Windows boxen
  • Minor changes to the plugin output, for consistency
  • New blacklisting keyword bat_charge disables warning messages related to controller cache battery charging. Thanks to Robert Heinzmann for a patch.
  • For blacklisting, the component ID kan now be "ALL", in which all components of that type is blacklisted.
  • Man page is moved to section 8
3.4.92009-08-07
  • Minor bugfixes
  • Fixed a bug that could cause errors and weird results when checking cooling devices (fans) via SNMP. Thanks to Ken McKinlay for spotting this bug and reporting it.
3.4.82009-07-31
  • Minor feature enhancements
  • For failed physical drives, check_openmanage will now output the drive's vendor, model and size in GB or TB.
3.4.72009-07-24
  • Minor feature enhancements
  • The -s|--snmp option was redundant and no longer does anything. SNMP is triggered automatically if the -H|--hostname option is present. The -s|--snmp option is kept for compatibility, but has no effect.
3.4.62009-07-07
  • Minor feature enhancements
  • Added support for performance data (temperatures) from attached storage enclosures such as the MD1000
3.4.52009-06-22
  • Minor bugfixes
  • Fixed a regression in the --htmlinfo option when it is not supplied with an argument
3.4.42009-06-22
  • Minor feature enhancements
  • New option --htmlinfo adds clickable HTML links in the plugin's output
3.4.32009-06-11
  • Minor bugfixes
  • Fixed a regression bug in CPU and power supply reporting that only affects verbose output
  • If blacklisting is used, the global health check (via the --global option) is now negated. Checking the global health doesn't make sense when one or more components is blacklisted. Thanks to Rene Beaulieu for reporting this bug
  • The PNP4Nagios template is now included in the tarball and zip archive
3.4.22009-06-03
  • Minor feature enhancements
  • Improved memory error reporting, when using omreport
  • Collect performance data from pwrmonitoring (amperage probes) that were previously ignored when using omreport
3.4.12009-05-25
  • Minor feature enhancements
  • Improved memory error reporting, when using SNMP
  • Other small ehnancements
3.4.02009-05-25
  • Major feature enhancements
  • The plugin is now compatible with the Nagios embedded Perl interpreter (ePN) in theory. However, the plugin will not not use ePN by default. We don't want any "accidents".
  • License is now GPLv3, previously only specified as "GPL"
  • New options --only-critical and --only-warning. With these options the plugin will only print critical or warning alerts, respectively.
  • Bugfixes and speed enhancements in the storage section, when checking enclosure components via omreport
  • The -o|--okinfo option is now less verbose and more to the point
  • Lots of code refactoring for readability, maintainability and robustness
3.3.22009-05-05
  • Fixed a bug in the storage section, when checking controllers. This is an obscure bug that only manifests itself in the odd case where a server has multiple controllers, and one of the controllers are missing some of the OIDs, in which case these OIDs will be missing for the other controllers as well. The change is minor and only includes using get_table() instead of get_entries() to collect the SNMP result. Thanks to Stephan Bovet for reporting this bug.
3.3.12009-04-28
  • The --perfdata option can now optionally take an argument "multiline", which makes the plugin produce multiline performance data output in a Nagios 3.x way. Not really needed, but the plugin output is prettier.
  • Added comment within the 10 first lines to disable the nagios embedded perl (ePN) interpreter by default for Nagios 3.x
  • Improvements in the performance data output. Units are now included
3.3.02009-04-07
  • Feature Release
  • Added --global option, which turns on checking of everything. If used with SNMP, the global system health status is also probed, to protect the user against bugs in the plugin. If used with omreport, the overall chassis health is used.
  • Support for SNMP version 3
  • New check added: esmhealth. This checks the overall health of the ESM log, i.e. the fill grade. More than 80% means a warning message
  • Fixed alert log reporting to use the same format as for the ESM log
  • Output messages are now sorted by severity
  • Minor changes in how out-of-date controller firmware/driver is reported
  • Code refactoring and cleanup
3.2.72009-03-29
  • Use "omreport about" to collect OMSA version. Slightly faster than "omreport system version". This should give a small speedup in certain configurations
  • Fixed typo in output when a logical drive is rebuilding. Thanks to Andreas Olsson for reporting
  • Improved reporting of ESM log content
  • Added omreport.sh as alternate omreport path
  • Lots of other small fixes and enhancements

Plus: A few changes to make the plugin work with old PowerEdge models (e.g. 2550, 2450) and/or old OMSA versions (e.g. version 4.5):

  • Use the chassisModelName OID to determine if SNMP works (instead of BaseboardType)
  • No longer require a response when checking baseboard type via SNMP. If there is no response, we assume that we're not dealing with a blade server

Thanks to Christian McHugh for help with testing and debugging this stuff

3.2.62009-03-05
  • Use omreport system operatingsystem to collect OS info, instead of omreport system version which is incredibly slow. This should speed things up in certain configurations.
  • A few speedups, don't collect information that isn't needed
  • Man page fixes
3.2.52009-02-24
  • New option --linebreak to specify the separator between line in case of multiline output
  • Added support for 64bit Windows. Thanks to Patrick Hemmen for a patch
  • [Patrick Hemmen] Added install.bat for Windows installation
  • [Patrick Hemmen] Improvements on install.sh. Will now install in /usr/lib64 for x86_64
  • RPMs are now architecture dependent, because of different libdir
3.2.42009-02-17
  • New option -o|--ok-info to display extra information when everything is ok. The plugin can now display storage firmware and driver info, DRAC and BMC firmware, and OMSA version
  • Support for setting custom minimum temperature thresholds via the -c|--critical and -w|--warning options
  • Better and more detailed temperature error reporting
  • Bugfix in the amperage report (including performance data). The plugin now takes into account the correct unit and measurement for amperage probes (other than watts)
  • New option --port lets the user specify the remote SNMP port number
3.2.32009-02-09
  • Regression fix: Use the older Processor Device SNMP OIDs for older PowerEdge models, that don't have the new Processor Device Status OIDs. Thanks to Nicole Hähnel for reporting this bug.
  • Default output (when there are no alerts) now shows RAC firmware, BMC firmware, info about controllers and enclosures (firmware, driver).
3.2.22009-02-03
  • Regression fix: Ignore unoccupied CPU slots with SNMP probing. This fixes a bug introduced in versjon 3.2.1, which would output something like this if one or more CPU slots were empty: CPU 1 needs attention ()
3.2.12009-02-03
  • Use Processor Device Status Table OIDs instead of Processor Device Table when checking CPUs via SNMP
  • Bugfix: don't report throttled CPUs as warnings when checking via SNMP (same as for checking locally)
3.2.02009-01-27
  • Feature release
  • New options --state and --short-state for displaying service state along with the alert
  • Lots of small fixes for code readability and maintainability
3.1.12009-01-12
  • Support for running natively on Windows (using omreport.exe). Thanks to Peter Jestico for a patch.
  • Support for compiled Windows version, i.e. check_openmanage.exe is now a legal script name.
  • Exit with error if script basename is illegal/unknown
  • Various small fixes
3.1.02008-12-26
  • Feature Release: Alternate basenames
  • Use of alternate basenames for checking only one class of components
  • Added support for checking the ESM log via SNMP
  • Code refactoring for robustness and maintainability
  • Numerous small fixes and enhancements
  • Added install script in distribution tarball
3.0.22008-12-20
  • The script no longer aborts if it can't get system information via SNMP. Give a warning instead, as this is not a critical error
  • Increased robustness when checking controllers
3.0.12008-12-11
  • Man page fix in the 'check' section. Thanks to Ansgar Dahlen for reporting this.
  • Allow invalid command error from 'omreport chassis pwrmonitoring'
  • Various small fixes
3.0.02008-12-04
  • Major release
  • Use unique IDs for storage components with regard to blacklisting, which means that the blacklisting API has changed
  • Added checks for storage components: connectors (channels), enclosures, enclosure fans, enclosure power supplies, enclosure temperature probes and enclosure management modules (EMMs)
  • Improved verbose output
  • New option -t|--timeout for setting the plugin timeout
  • New option -w|--warning for setting custom temperature warning thresholds
  • New option -c|--critical for setting custom temperature critical thresholds
  • Option --check can no longer be specified in its short form (-c)
  • Code cleanup and improvements
2.1.12008-11-24
  • The workaround for the OMSA bug introduced in OMSA 5.5.0 didn't take multiple controllers into account. This has been fixed.
2.1.02008-11-19
  • Feature Release: output control
  • New option -i|--info prefixes all alerts with the service tag
  • New option -e|--extinfo gives and extra line of output in case of an alert (model and service tag)
  • New option --postmsg lets the user specify a post message string, with info such as model, service tag etc.
  • Options -b|--blacklist and -c|--check can now be specified multiple times (actually quite useful)
2.0.92008-11-17
  • Slightly improved output for alerts on logical drives (vdisks)
  • Now shows a rebuilding physical disk as a warning, as this is usually accompanied by a degraded vdisk. Previous versions didn't show this at all (omreport classifies it as "OK").
2.0.82008-11-14
  • Slightly improved output for charging controller batteries
2.0.72008-11-12
  • Bugfix for reporting physical drives with predictive failure (both via NRPE and SNMP)
2.0.62008-10-30
  • Fix bug in option handling (ambiguous options)
  • Slightly improved output if checking the storage subsystem is turned off
  • Don't complain if there are no logical drives. This is OK. Thanks to Jamie Henderson for reporting this
2.0.52008-10-29
  • Fix bug in SNMP status level table
2.0.42008-10-29
  • Added workaround for a BUG introduced in OpenManage 5.5.0. OM sometimes adds a newline in the controller driver version name, which leads to problems parsing the output. Thanks to Hiren Patel for bringing this to my attention.
2.0.32008-10-28
  • (snmp) Improved handling of cases where OM is not working properly
2.0.22008-10-27
  • Fixed issue where controller number for physical disks can't be established via SNMP (now identifies as controller no. -1)
2.0.12008-10-23
  • Correctly identifies and reports error condition in which OpenManage has stopped working (it happens)
2.0.02008-10-23
  • Major release: SNMP support
  • Same options for checking, blacklisting etc. supported with SNMP
  • Same output with SNMP as with NRPE
1.2.12008-09-25
  • Collects performance data if the option -p | --perfdata is supplied. The following data is collected:
    • Temperatures in Celcius
    • Fan speeds in rpm
    • Power consumption in Watts (on systems that support it)
  • New blacklisting directives ctrl_fw and ctrl_driver added. Suppresses the "special" warning messages concerning outdated controller firmware and driver. Useful if you can't or won't upgrade.

1.1.22008-08-06
  • Bugfix release. Fix getting system model and serial number for newer blades
1.1.12008-08-06
  • Three new checks added:
    • System battery probes (typical CMOS battery). Newer poweredge models have these
    • Power consumption monitoring (if the server supports it)
    • ESM log, with same functionality as the alert log check. Disabled by default.
1.1.02008-08-04
  • Internal refactoring: use ssv-formatted output from openmanage, resulting in slightly faster execution and increased robustness.
  • If /usr/bin/omreport doesn't exist, try /opt/dell/srvadmin/oma/bin/omreport.
  • Allow for no instrumented/redundant power supplies. Needed on low-end poweredge models and blades.
1.0.32008-07-25
  • Openmanage reports non-critical warning about throttled CPUs on new hardware models. Most og us use ondemand CPU frequency scaling (with throttled CPUs as a result). This specific non-critical warning (CPU Throttled) is ignored from now on.
  • Remove superfluous Celcius sign when reporting temperatures.
1.0.22008-07-25
  • Accommodate blade systems with no fans or powersupplies, i.e. accept errors from omreport when trying to check fans or powersupplies on blade servers.
  • Accommodate newer hardware with slightly different omreport options. Use the newer options if they exist. Not really necessary yet, but deprecated options may be removed in future versions of Dell OpenManage.
1.0.12008-07-18
  • When everything is OK, check_openmanage now outputs the same info as Gerhard Lausser's excellent check_hpasm plugin does for HP servers:

    OK - System: 'poweredge 2850', S/N: 'XXXXXXX', ROM: 'A06 10/03/2006', hardware working fine, 2 logical drives, 4 physical drives
    
1.0.02008-07-15
  • Initial release

11   Frequently Asked Questions (FAQ)

11.1   General

11.1.1   Why did you make check_openmanage?

I wanted a monitoring tool for our Dell servers that was as good as Gerhard Lausser's check_hpasm plugin is for HP servers. None of the existing Dell plugins offered the features and detailed output that I needed, so I made my own plugin. After a while, I decided to share it with the community, and I've never regretted this decision. Constructive feedback from users have improved this plugin immensely.

11.1.2   How does check_openmanage compare to other Dell plugins?

Like most Nagios plugins that check the hardware health of Dell servers, check_openmanage uses OpenManage Server Administrator (OMSA) from Dell. The difference is in the level of detail in the output. Most other plugins simply check the status level of the subsystems and return the status code, while check_openmanage also tries to figure out exactly what is wrong and report it in a detailed and concise manner.

11.1.3   Why both 32bit and 64bit RPM packages?

This is a perl script, which is architecture independent by itself, so one would think that the RPM arch would be "noarch". However, since the official Nagios plugins package (and others, e.g. Fedora) put Nagios plugins under /usr/lib for 32bit and /usr/lib64 for 64bit platforms, I wanted to apply this to check_openmanage as well. This is the onlydifference between the two RPM packages.

11.1.4   I don't like check_openmanage, are there other Dell plugins that you would recommend?

There are many plugins out there. The ones I would recommend are Jason Ellison's excellent check_dell_openmanage plugin and HPC Community's "official" plugin. They are both SNMP only, and are simpler then check_openmanage in the sense that they don't check as many OIDs and don't use as many features of OMSA. That way, they are less dependent on recent OMSA versions, have faster execution time, and are probably also less error prone.

11.1.6   Is VMware ESXi supported?

Unfortunately, no. OMSA for ESXi has serious limitations that impacts monitoring with check_openmanage. The following quote is from the OMSA 6.3 documentation intro:

NOTE: While ESXi supports SNMP traps, it does not support hardware inventory through SNMP.

For ESXi you can't use check_openmanage. Options for monitoring ESXi include setting up an SNMP trap receiver on the Nagios server, and configuring it as the trap destination on the ESXi hosts.

11.1.7   Is check_openmanage ePN compatible?

("ePN" is the Nagios' embedded perl interpreter. Many people turn this off by default, as it often does more damage than good.)

Yes, check_openmanage if fully ePN compatible, and should work fine with or without ePN. If your Nagios server has ePN turned on, and for some reason you want to disable ePN for this plugin, add the following line among the first 10 lines of the script:

# nagios: -epn

If you are running Nagios 2.x, you should specify perl <script> in your Nagios config if you have ePN enabled and want to disable it for this plugin.

11.1.8   How to get notified of new releases?

There isn't an email list or similar, but new releases are announced on Freshmeat and NagiosExchange. The easiest way to be notified of new releases is to subscribe to thecheck_openmanage project on Freshmeat. You'll need a Freshmeat account, getting one is easy.

11.1.9   How can I contact you if I have problems, bug reports, feature requests etc.?

You can contact me via email (see the top of the document), but I prefer that you use the Nagios users mailing list for discussions about check_openmanage:

nagios-users@lists.sourceforge.net

I won't be angry or upset if you email me directly, but in general it's better to discuss problems in a public forum.

Depending on the time of day etc., you can also reach me on the IRC, as trondham on the #nagios channel on Freenode.

11.2   OpenManage (OMSA)

11.2.1   Which version of OMSA is OK?

Check_openmanage is tested with OMSA versions 5.3 and later. All these should be OK to use. Note that some versions don't play well with Windows and SNMP. For the best results, use the newest release available.

11.2.2   Why do I get weird results from check_openmanage on my old server and/or old OMSA version?

Upgrade your Openmanage (OMSA) version. Check_openmanage is developed with OMSA versions 5.3 and later. You should always try to use the latest version. Older versions of OMSA lack some of the features that check_openmanage needs to give accurate and detailed output.

Some really old servers (5.generation and older, e.g. 2550, 2450) can't run newer OMSA than 4.5. Hence, these old servers are not supported by check_openmanage.

11.2.3   How can I find out which version of OMSA my server is running?

This can be done in many ways. Log in to your server and type the command omreport about (or omreport.exe about on Windows):

$ omreport about
Product name : Server Administrator
Version      : 6.1.0
Copyright    : Copyright (C) Dell Inc. 1995-2009. All rights reserved.
Company      : Dell Inc.

You can also use check_openmanage to display the OMSA version, with the -d or --debug option:

$ check_openmanage -H myhost -d | head -n 3
 System:      PowerEdge M600
 ServiceTag:  88CBS3J                  OMSA version:    6.1.0
 BIOS/date:   2.1.4 08/15/2008         Plugin version:  3.5.5

11.2.4   My boss won't let me upgrade OMSA

Your boss is a moron. Also, that wasn't technically a question.

11.2.5   Our security policy doesn't allow OMSA upgrades

Your security policy is broken. That wasn't a question either.

11.3   Common Errors

11.3.1   ERROR: Dell OpenManage Server Administrator (OMSA) is not installed

This error normally indicates that OpenManage is not installed on the host. However, it may be that OpenManage is installed in a location other than the default. In this case, the omreport binary is not in the search path of check_openmanage. You can work around this by using the --omreport option, like this:

check_openmanage --omreport /usr/local/bin/omreport

The above should be configured in nrpe.cfg if you're using NRPE. If you're using the .exe file for Windows with NSClient++, you should configure this in the file nsc.ini on the host. Example:

check_openmanage.exe --omreport P:\dellopenmanage\oma\bin\omreport.exe

If the omreport or omreport.exe binary is installed in a place where you think that check_openmanage should look by default, send me a note.

11.3.2   ERROR: You need perl module Net::SNMP to run check_openmanage in SNMP mode

The perl module Net::SNMP is not installed, or not available to the perl interpreter. This module is required for SNMP. See the prerequisites section.

11.3.3   SNMP CRITICAL: No response from remote host '10.1.2.3'

This error indicates that the SNMP daemon is not running or not responding. There is no reply from the monitored server on port 161 (or the port specified by the --portoption). Check the SNMP service.

You will also get this error if the SNMP community name doesn't match. If this is the case, inspect the SNMP settings on the server and verify that the community name matches the -C option given to the plugin.

11.3.4   ERROR: (SNMP) OpenManage is not installed or is not working correctly

This error indicates that the SNMP service is responding, but OpenManage OIDs are not present. To be specific, check_openmanage checks the OID1.3.6.1.4.1.674.10892.1.300.10.1.9.1 for the chassis model name, to determine if OpenManage is running or not.

The error could be caused by different problems. OpenManage could simply not be running. The SNMP daemon may not be configured with OpenManage OIDs. The SNMP part of OpenManage may not be installed or running. For Linux, the snmpd.conf file should have the following:

smuxpeer .1.3.6.1.4.1.674.10892.1

This should be added by OpenManage at install time. The simple solution may be to reinstall OpenManage and look for errors during installation.

11.3.5   SNMP CRITICAL: Received genError(5) error-status at error-index 1

Some OpenManage versions perform poorly on Windows, especially versions 5.4.0 and earlier. If you get this error on a Windows host, check the OpenManage version and consider upgrading.

This error is from the Net::SNMP perl module.

11.3.6   PLUGIN TIMEOUT: check_openmanage timed out after 30 seconds

If your server is under very heavy load, it may take some time for check_openmanage to finish, especially if you run the plugin locally via NRPE or similar. The default plugin timeout is 30 seconds, but you can set a different timeout via the -t|--timeout option. Under normal load, the plugin should finish in about 2 seconds when run locally. Checking via SNMP is even faster.

Be aware that NRPE has it's own timeout, also adjustable.

11.3.7   INTERNAL ERROR: blah blah

The plugin will output any perl warnings that occur during execution as internal errors with unknown state. If you get one or more internal errors like this, you may have hit a bug in the plugin. Please contact me if you get internal errors.

11.3.8   UNKNOWN: Problem running 'omreport foo bar'

UNKNOWN: Problem running 'omreport chassis fans': Error! No fan probes found on this system.
UNKNOWN: Problem running 'omreport chassis temps': Error! No temperature probes found on this system.
UNKNOWN: Problem running 'omreport chassis volts': Error! No voltage probes found on this system.
[...etc...]

These are general errors that can have different causes. OMSA, especially old versions, are known to have bugs that cause the occasional hiccup. A simple restart of OMSA (Linux: "srvadmin-services.sh restart") may prove to be the solution and should be attempted first.

Other known causes include the following:

  • Old or incompatible BIOS version. For example, later versions of OMSA may expect the BIOS version to be relatively up-to-date. Make sure that the BIOS version is new enough.
  • Exhausted semaphore pool on Linux. OMSA requires a few semaphores in order to run commands. Check the semaphore resource pool with ipcs and consider increasing the number of semaphores available.

11.3.9   Storage Error! No controllers found

From version 3.6.0 of the plugin, storage is no longer allowed to be absent. The reason for this is that there have been several issues with OMSA that prevents storage from being displayed. Check_openmanage will no longer silently ignore this when it happens. Instead, it will give this alert.

There may be cases where this alert is a false positive:

  • Diskless systems
  • Servers without a Dell storage controller, which OMSA does not recognize.

If you get this alert and your system falls into one of the above categories, you should specify either --no-storage or --check storage=0 to prevent check_openmanage to check the storage system in the first place:

check_openmanage --no-storage [other options...]
check_openmanage --check storage=0 [other options...]

If the server in question has storage that should be monitored, you'll have to check OMSA, particularly which components of it are installed.

12   Reporting bugs, proposing new features etc.

Please let me know if you are experiencing bugs, have feature requests, or suggestions on how to improve check_openmanage. We use this plugin in production at theUniversity of Oslo, on many Dell servers of different models, but we don't use all the different features of the plugin. While the plugin is bug-free for us, it might not be for you, so let me know if you have problems.

Please send bug reports or feature requests to the Nagios users mailing list. I read postings to this list frequently:

nagios-users@lists.sourceforge.net

You can also email me directly, but then other users won't benefit from the discussion. Unless you have security issues or other concerns are preventing you from using the mailing list, it is better to discuss problems in a public forum.

Depending on the time of day etc., you can also reach me on the IRC, on the #nagios channel on Freenode.

13   Disclaimer

This is free software. Use at your own risk.

此文章由 flyinweb 于 2011-01-13 09:54:18 编辑

本日志由 flyinweb 于 2011-01-13 09:25:40 发表,目前已经被浏览 4954 次,评论 0 次;

作者添加了以下标签: check_openmanage

引用通告:http://www.517sou.net/Article/564/Trackback.ashx

评论订阅:http://www.517sou.net/Article/564/Feeds.ashx

评论列表

    暂时没有评论
(必填)
(必填,不会被公开)