Health Monitoring and vovnotifyd

You can set up basic Altair Accelerator health monitoring tests with mail notification. When Accelerator enters any condition that you have defined as "unhealthy", a configured list of users receives alert emails.

Tests are provided that check for the following conditions:

Long jobs are stuck (stuck jobs do not use any CPU)
A user has jobs waiting in the queue for too long
A user has an unusually high ratio of failed jobs
One or more hosts fail all jobs
The server size (or another server-related parameter) is growing
There are too many out-of-queue jobs (Allocator, also known as MultiQueue, setups only)

Each check is a procedure whose name starts with "doTestHealth". The checks are defined in one or more of the following files:

Role	Location	Notes
Global	$VOVDIR/tcl/vtcl/vovhealthlib.tcl	Part of distribution
Site specific	$VOVDIR/local/vovhealthlib.tcl	Optional
Project specific	PROJECT.swd/vovnotifyd/vovhealthlib.tcl	Optional

Configuring Health Monitoring

By default, all checks defined in the vovhealthlib.tcl files are enabled.

If you want to change parameters in the health checks, you need to change the config.tcl file in the vovnotifyd directory.

The following listing is an example of vovnotifyd/config.tcl:

set NOTIFYD(server)        "tiger"
set NOTIFYD(port)          25
set NOTIFYD(sourcedomain)  "mycompany.com"

set admin  "dexin"
set cadmgr "john"

#
# Check if we have long stuck jobs. Check every 1 minute. If we have such
# jobs, send alert emails to the owner of the job (@USER@) and
# admin (here "dexin").
#
# Definition of "long stuck job": has been running at least 10 hours but has used no
# more than 20 seconds CPU time in total.
#
registerHealthCheck "doTestHealthLongJobsNoCpu -longJobDur 10h -minCpu 20"  -checkFreq 1m  -mailFreq  1d  -recipients "@USER@ $admin"

#
# Check if we have jobs stuck, i.e., jobs that are not burning any CPU at all. And this situation
# has been persistent for at least 10 minutes.
#
# Check every 1 minute. If we have such jobs, send alert emails to the owner of the
# job (@USER@) and admin (here "dexin").
#
# This check is similar to doTestHealthLongJobsNoCpu but will be quicker to detect stuck jobs.
#
registerHealthCheck "doTestHealthJobStuck -maxNoCpuTime 10m"  -checkFreq 1m  -mailFreq  1d  -recipients "@USER@ $admin"

#
# Check if any user has too many failed jobs. Check every 30 minutes. If we
# have such users, send alert emails to the owner of the job (@USER@),
# admin (here "dexin") and cadmgr ( here "john" ).
#
# Definition of "too many failures": a user has at least 1000 jobs in NetworkComputer
# with at least 90% failures
#
registerHealthCheck "doTestHealthTooManyFailures -minJobs 1000 -failRatio 0.9"  -checkFreq 30m  -mailFreq  1d  -recipients "@USER@ $admin $cadmgr"


#
# Check if the server size is growing. Check other server related parameters as
# well, including number of jobs, number of queued jobs, etc.
#
# For everything that is checked, if the number grows over 60.0% compared to last time
# it is checked, send alert emails to the admin (here "dexin")
# and cadmgr ( here "john" ).
#
# Also send alert emails if the number of files is at least 5.0 times the number
# of jobs.
#
# Check every 2 hours and send such alert emails once a day (1d).
#
registerHealthCheck "doTestHealthServerSize -filejobsRatio 5.0 -warnPercent 60.0"  -checkFreq 2h  -mailFreq  1d  -recipients "$admin $cadmgr"

#
# Check if some user has jobs sitting in the queue for too long.
# Check every 2 hours.
#
# If we find such users, send alert emails to the user
# and cadmgr ( here "john" ). Send such alert emails once a day (1d).
#
# Definition of "waiting for too long": none of the jobs in a category (bucket)
# has been dispatched in the last 4 hours.
#
#
registerHealthCheck "doTestHealthJobsWaitingForTooLong -maxQueueTime 4h"  -checkFreq 2h  -mailFreq  1d  -recipients "@USER@ $cadmgr"

The config.tcl file is checked for updates at regular intervals controlled by the variable NOTIFYD(timeout).

vovnotifyd

The vovnotifyd daemon performs two functions:

Sends mail notification based on job events, according to the MAILTO property of the jobs (Altair Accelerator only)
Performs periodic system health checks and sends email when it discovers issues (Monitor and Accelerator)

A number of predefined system health check procedures are included with Altair Accelerator. You can also write health check procedures to monitor specialized conditions at your site, as described in Write Localized Health Checks below.

Configuring vovnotifyd

Table 1. Summary information for vovnotifyd
Config files	vnc.swd/vovnotifyd/config.tcl vnc.swd/vovnotifyd/config_smtp.tcl vnc.swd/vovnotifyd/config_export.tcl
Info file	vnc.swd/vovnotifyd/info.tcl
Auxiliary files	$VOVDIR/tcl/vtcl/vovhealthlib.tcl $VOVDIR/local/vovhealthlib.tcl vnc.swd/vovnotifyd/vovhealthlib.tcl

The easiest way to configure notification is from the Admin page of the browser interface. Click on the Daemons item on the left-hand menu, then click config in the row for vovnotifyd.

To manually configure vovnotifyd, you need to create the directory vovnotifyd inside of the server working directory (.swd) and copy the configuration file template, $VOVDIR/etc/config/vovnotifyd/config.tcl into the newly-created vovnotifyd directory. The configuration template will need to be modified to match the settings of your mail server environment.

% cd `vovserverdir -p .`
% mkdir vovnotifyd
% cp $VOVDIR/etc/config/vovnotifyd/config.tcl vovnotifyd/config.tcl

File: $VOVDIR/etc/config/vovnotifyd/config.tcl

# Notification configuration file.
# Should be placed in the vovnotifyd directory of the .swd.
# All settings are required unless specified otherwise.
# Unused optional settings should be commented out.

# Create an e-mail address map, stackable, optional
addUserToEmailAddressMap  rtdamgr john@mydomain.com

### Altair Monitor-specific settings
# See notification configuration documentation in Altair Monitor Admin Guide
# ConfigureTag     TAG OPTION VALUE
# ConfigureFeature FEATURE OPTION VALUE

### Examples:
# ConfigureTag MGC -poc { john mary }
# ConfigureFeature EDA/MATLAB -longcheckout 2d -userlongcheckout john 1w -mincap 5 -triggerperc 90
# ConfigureFeature SIMULINK -poc bob -mincap 10 -triggeruse 12

To start the daemon, either use

% nc cmd vovdaemonmgr start vovnotifyd

or start it manually from within the vovnotifyd directory:

% vovproject enable vnc
% cd `vovserverdir -p vovnotifyd`
% vovnotifyd

Autostart vovnotifyd

To start vovnotifyd automatically, create the directory vnc.swd/autostart if it does not exist and copy the script start_vovnotifyd.tcl into it from the distribution:

% cd `vovserverdir -p .`
% mkdir autostart
% cp $VOVDIR/etc/autostart/start_vovnotifyd.tcl  autostart/start_vovnotifyd.tcl

Configure Email Addresses

You can use the config.tcl file to set the email addresses to be used for each user. You can choose one of the following methods:

Call addUserToEmailAddressMap USERNAME EMAIL
Override the entire procedure getEmailAddress

# Fragment of vovnotifyd/config.tcl file.

# Method 1.
addUserToEmailAddressMap john  John.Smith@my.company.com

# Method 2. Assume we can get an address from LDAP
#           The LDAP subsystem needs to be configured.
proc getEmailAddress { user } {
    set email [VovLDAP::getEmail $user]
    if { $email != "" } {
       return $email
    } else {
       return $user
    }
}

Write Localized Health Checks

The vovnotifyd daemon runs in the vovsh binary, so all the VTK API procedures are available to you.

The standard check procedures are defined in the file $VOVDIR/tcl/vtcl/vovhealthlib.tcl.

The health check procedures are loaded using a search path. First, they are loaded from the file given above, then from $VOVDIR/local/vovhealthlib.tcl, and then from vovhealthlib.tcl in the vovnotifyd working directory. This permits you to redefine health check procedures on a site-wide or project-specific basis.
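
The override mechanism relies on ordinary Tcl behavior: a later proc definition with the same name replaces the earlier one. The following minimal sketch illustrates this with a hypothetical procedure name, doTestHealthExample, and made-up return values. Real check procedures take their own arguments and follow their own return conventions, which are not shown here.

```tcl
# Stand-in for a definition loaded first, e.g. from the global vovhealthlib.tcl.
# doTestHealthExample is a hypothetical name used only for this illustration.
proc doTestHealthExample {} {
    return "global definition"
}

# Stand-in for a definition loaded later, e.g. from the site or project file.
# Defining a proc with the same name replaces the earlier one.
proc doTestHealthExample {} {
    return "project definition"
}

puts [doTestHealthExample]
```

Running this prints "project definition", because the definition loaded last is the one in effect.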

Note: Any local procedure names should begin with 'doTestHealth' like the system ones; some things depend on this convention.

For local procedures to be shown by the browser UI, you need to add a line into your vovhealthlib.tcl file like:

set HEALTHLIB_PRODUCT_MAP(doTestHealthYourProcedure)        "nc"

The product names are those returned by the procedure vtk_product_get_info -name: nc, lm, lam, ft, wa

Be sure your procedures are robust and handle error conditions well: wrap exec, open, and any other command that can fail in catch.
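
As an illustration of this advice, the following sketch wraps exec and open in catch so that a failure yields a fallback value instead of an uncaught error. The helper names safeExec and safeReadFile are hypothetical and are not part of the VTK API. The sketch uses only core Tcl commands and assumes Tcl 8.5 or later for the {*} expansion syntax.

```tcl
# Hypothetical helper: run an external command and return its output,
# or the given fallback if the command fails for any reason.
proc safeExec { fallback args } {
    if { [catch { exec {*}$args } result] } {
        return $fallback
    }
    return $result
}

# Hypothetical helper: return the contents of a file,
# or the given fallback if the file cannot be opened or read.
proc safeReadFile { path fallback } {
    if { [catch { open $path r } fh] } {
        return $fallback
    }
    # Close the channel whether or not the read succeeded.
    set failed [catch { read $fh } data]
    catch { close $fh }
    if { $failed } {
        return $fallback
    }
    return $data
}

# Example calls: neither raises an error, even though the command
# and the file do not exist.
set out  [safeExec "" no-such-command-xyz]
set data [safeReadFile /no/such/file ""]
```

A health check built on helpers like these can test for the fallback value and skip the check, rather than failing part-way through.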

Note: Changes made to any of the vovhealthlib.tcl files will not take effect until the vovnotifyd daemon is restarted.

Alternate Methods of Sending Email

When direct SMTP does not work, other methods of sending email may.

Using a Mailer Program

The standard sendMail procedure checks the config file variable NOTIFYD(mailprog), which has a default of SMTP. This may also be set to the name of a mail program that can accept a message body on its 'stdin' stream, like /bin/mail on UNIX. This program is also expected to have an -s option to specify the subject.
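
For example, routing mail through /bin/mail instead of direct SMTP might look like the following in vovnotifyd/config.tcl. The path is an assumption; use whichever mailer exists on your host.

```tcl
# Fragment of vovnotifyd/config.tcl.
# Use a mailer program instead of direct SMTP. The program must read the
# message body on stdin and accept -s for the subject.
set NOTIFYD(mailprog) "/bin/mail"
```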

Overriding the sendMail Procedure

The procedure sendMail defined in the file vovnotifydlib.tcl uses SMTP (Simple Mail Transfer Protocol) to send email notification, by connecting directly to a mail transport agent (MTA), such as sendmail, postfix, or Microsoft Exchange. vovnotifyd supports only simple unauthenticated, unencrypted SMTP, and may not work if the MTA is configured to require authentication or transport-level security.

If your MTA is so configured, you may need to use an alternate method to send mail. One method is to redefine the daemon's sendMail procedure; another, described at the end of this section, is to relay through a local MTA.

# Overriding sendMail in the vovnotifyd/config.tcl file.
proc sendMail { recList subj msg } {
    catch {exec yourMailScript.csh $recList $subj $msg }
}

The script that you write can perhaps be a simple wrapper around /bin/mail on UNIX, or you can even exec such a mailer program directly in the sendMail procedure.
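
The following is a sketch of the second approach, calling the mailer directly from sendMail. It assumes that recList is a Tcl list of addresses, that /bin/mail exists and accepts -s for the subject with the recipients as trailing arguments, and that vovsh provides Tcl 8.5 or later (for the {*} expansion syntax). None of these assumptions is confirmed here, so check them against your installation.

```tcl
# Overriding sendMail in the vovnotifyd/config.tcl file.
proc sendMail { recList subj msg } {
    # "<< $msg" feeds the message text to the mailer's stdin.
    # catch keeps a mailer failure from raising an error in the daemon.
    if { [catch { exec /bin/mail -s $subj {*}$recList << $msg } err] } {
        puts stderr "sendMail: mailer failed: $err"
    }
}
```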

If the host on which vovnotifyd runs has a local MTA that relays to the domain's MTA, and the domain's MTA requires authentication or TLS, the local MTA may still accept unauthenticated cleartext SMTP. If this is the case, you may be able to use direct SMTP with 'localhost' as the mailer host.
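
Using the NOTIFYD settings from the example configuration earlier, such a setup might look like the following. The port is an assumption: it presumes the local MTA listens on the standard SMTP port.

```tcl
# Fragment of vovnotifyd/config.tcl.
# Send through the MTA on the local host, which relays to the domain's MTA.
set NOTIFYD(server) "localhost"
set NOTIFYD(port)   25
```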

Health Checks

This is a list, with short descriptions, of all built-in health checks run by vovnotifyd.

By default, checks run every 10 minutes unless noted otherwise below.

AllJobsFailedOnHost: This will check for hosts on which all jobs have failed during a given time window (default 3600).
CheckAlerts: This will send a summary of all the current alerts via notification email.
CheckDownTaskers: This will check for taskers that are enabled in taskers.tcl but not running.
CheckJobsReqstRam: This will alert you for jobs that have used more RAM than requested.
CheckTaskerset: This will check the status of running taskers.
CheckVendorLicenseExpiration: This will alert you when a vendor license is going to expire in a few days.
Daemons: This will check the daemons.
FalseLicenceUsage: This will alert you when jobs use licenses that have not been declared
JobStuck: This will find stuck jobs, which are running but burning no CPU. The default check frequency on this is 2 hours.
JobsWaitingForTooLong: Just as the name suggests, this reports on jobs that have been waiting for too long.
LongJobs: This will check for jobs that run too long.
LongJobsNoCpu: Like the previous check, but will check for jobs that are not using any CPU.
RamSentry: This will check jobs under Ram Sentry protection.
ServerSize: This will check the growth of vovserver.
TooManyFailures: As the name suggests, this will check for a user submitting too many failed jobs.
TooManyOutOfQueueJobs: This will check for too many jobs being run outside of Altair Accelerator.