Health Monitoring and vovnotifyd

You can set up basic Altair Accelerator health monitoring tests with mail notification. When the Accelerator system gets into any of your defined "unhealthy" conditions, a list of configured users will receive alert email notifications.

Tests are provided that check for the following conditions:
  • Long jobs that are stuck: stuck jobs do not use any CPU
  • Someone has jobs waiting in queue for too long
  • Any user has an unusually high ratio of failed jobs
  • All jobs fail on one or more hosts
  • The server size (and other server related parameters) is growing
  • There are too many out of queue jobs (for Allocator (also known as MultiQueue) setup only)
The checks are all procedures whose names start with "doTestHealth". They are defined in one or more of the following files:
Role              Location                                   Notes
Global            $VOVDIR/tcl/vtcl/vovhealthlib.tcl          Part of distribution
Site specific     $VOVDIR/local/vovhealthlib.tcl             Optional
Project specific  PROJECT.swd/vovnotifyd/vovhealthlib.tcl    Optional

Configuring Health Monitoring

By default, all checks defined in the vovhealthlib.tcl files are enabled.

To change the parameters of the health checks, edit the config.tcl file in the vovnotifyd directory.

The following is an example of vovnotifyd/config.tcl:
set NOTIFYD(server)        "tiger"
set NOTIFYD(port)          25
set NOTIFYD(sourcedomain)  "mycompany.com"

set admin  "dexin"
set cadmgr "john"

#
# Check if we have long stuck jobs. Check every 1 minute. If we have such
# jobs, send alert emails to the owner of the job (@USER@) and
# admin (here "dexin").
#
# Definition of "long stuck job": has been running for at least 10 hours but has used no
# more than 20 seconds of CPU time in total.
#
registerHealthCheck "doTestHealthLongJobsNoCpu -longJobDur 10h -minCpu 20"  -checkFreq 1m  -mailFreq  1d  -recipients "@USER@ $admin"

#
# Check if we have jobs stuck, i.e., jobs that are not burning any CPU at all. And this situation
# has been persistent for at least 10 minutes.
#
# Check every 1 minute. If we have such jobs, send alert emails to the owner of the
# job (@USER@) and admin (here "dexin").
#
# This check is similar to doTestHealthLongJobsNoCpu but will be quicker to detect stuck jobs.
#
registerHealthCheck "doTestHealthJobStuck -maxNoCpuTime 10m"  -checkFreq 1m  -mailFreq  1d  -recipients "@USER@ $admin"

#
# Check if any user has too many failed jobs. Check every 30 minutes. If we
# have such users, send alert emails to the owner of the job (@USER@),
# admin (here "dexin") and cadmgr ( here "john" ).
#
# Definition of "too many failures": a user has at least 1000 jobs in NetworkComputer
# with at least 90% failures
#
registerHealthCheck "doTestHealthTooManyFailures -minJobs 1000 -failRatio 0.9"  -checkFreq 30m  -mailFreq  1d  -recipients "@USER@ $admin $cadmgr"


#
# Check if the server size is growing. Check other server related parameters as
# well, including number of jobs, number of queued jobs, etc.
#
# For everything that is checked, if the number grows over 60.0% compared to last time
# it is checked, send alert emails to the admin (here "dexin")
# and cadmgr ( here "john" ).
#
# Also send alert emails if the number of files is 5.0 times or more than the number
# of jobs.
#
# Check every 2 hours and send such alert emails once a day (1d).
#
registerHealthCheck "doTestHealthServerSize -filejobsRatio 5.0 -warnPercent 60.0"  -checkFreq 2h  -mailFreq  1d  -recipients "$admin $cadmgr"

#
# Check if some user has jobs sitting in the queue for too long.
# Check every 2 hours.
#
# If we find such users, send alert emails to the user
# and cadmgr ( here "john" ). Send such alert emails once a day (1d).
#
# Definition of "waiting for too long": none of the jobs in one category (bucket)
# has been dispatched in the last 4 hours.
#
#
registerHealthCheck "doTestHealthJobsWaitingForTooLong -maxQueueTime 4h"  -checkFreq 2h  -mailFreq  1d  -recipients "@USER@ $cadmgr"

The config.tcl file is checked for updates at regular intervals controlled by the variable NOTIFYD(timeout).

vovnotifyd

The vovnotifyd daemon performs two functions:
  • It sends mail notifications based on job events, according to the MAILTO property of the jobs. (Altair Accelerator only)
  • It performs periodic system health checks and sends email when it discovers issues. (Monitor and Accelerator)

A number of predefined system health check procedures are included with Altair Accelerator. You can also write your own health check procedures to monitor specialized conditions at your site; see Write Localized Health Checks below.

Configuring vovnotifyd

Table 1. Summary information for vovnotifyd
Config files     vnc.swd/vovnotifyd/config.tcl
                 vnc.swd/vovnotifyd/config_smtp.tcl
                 vnc.swd/vovnotifyd/config_export.tcl
Info file        vnc.swd/vovnotifyd/info.tcl
Auxiliary files  $VOVDIR/tcl/vtcl/vovhealthlib.tcl
                 $VOVDIR/local/vovhealthlib.tcl
                 vnc.swd/vovnotifyd/vovhealthlib.tcl

The easiest way to configure notification is from the Admin page of the browser interface. Click on the Daemons item on the left-hand menu, then click config in the row for vovnotifyd.

To manually configure vovnotifyd, you need to create the directory vovnotifyd inside of the server working directory (.swd) and copy the configuration file template, $VOVDIR/etc/config/vovnotifyd/config.tcl into the newly-created vovnotifyd directory. The configuration template will need to be modified to match the settings of your mail server environment.
% cd `vovserverdir -p .`
% mkdir vovnotifyd
% cp $VOVDIR/etc/config/vovnotifyd/config.tcl vovnotifyd/config.tcl
File: $VOVDIR/etc/config/vovnotifyd/config.tcl
# Notification configuration file.
# Should be placed in the vovnotifyd directory of the .swd.
# All settings are required unless specified otherwise.
# Unused optional settings should be commented out.

# Create an e-mail address map, stackable, optional
addUserToEmailAddressMap  rtdamgr john@mydomain.com

### Altair Monitor-specific settings
# See notification configuration documentation in Altair Monitor Admin Guide
# ConfigureTag     TAG OPTION VALUE
# ConfigureFeature FEATURE OPTION VALUE

### Examples:
# ConfigureTag MGC -poc { john mary }
# ConfigureFeature EDA/MATLAB -longcheckout 2d -userlongcheckout john 1w -mincap 5 -triggerperc 90
# ConfigureFeature SIMULINK -poc bob -mincap 10 -triggeruse 12
To start the daemon, either use nc cmd vovdaemonmgr start vovnotifyd or start it manually from within the vovnotifyd directory:
% vovproject enable vnc
% cd `vovserverdir -p vovnotifyd`
% vovnotifyd

Autostart vovnotifyd

To start vovnotifyd automatically, copy the script start_vovnotifyd.tcl from the distribution into the directory vnc.swd/autostart, creating that directory if it does not exist:
% cd `vovserverdir -p .`
% mkdir autostart
% cp $VOVDIR/etc/autostart/start_vovnotifyd.tcl  autostart/start_vovnotifyd.tcl

Configure Email Addresses

You can use the config.tcl file to set the email addresses to be used for each user. You can choose one of the following methods:
  • Call addUserToEmailAddressMap USERNAME EMAIL
  • Override the entire procedure getEmailAddress
# Fragment of vovnotifyd/config.tcl file.

# Method 1.
addUserToEmailAddressMap john  John.Smith@my.company.com

# Method 2. Assume we can get an address from LDAP
#           The LDAP subsystem needs to be configured.
proc getEmailAddress { user } {
    set email [VovLDAP::getEmail $user]
    if { $email != "" } {
       return $email
    } else {
       return $user
    }
}
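
The two methods can also be combined. The following is a minimal sketch, not part of the distribution, of a getEmailAddress override that looks up a site table first and otherwise appends a default domain. The names SITE_EMAIL and SITE_DEFAULT_DOMAIN are hypothetical, and the sketch assumes, as in Method 2 above, that the procedure takes a user name and returns a single address.

```tcl
# Fragment of vovnotifyd/config.tcl (sketch).
# SITE_EMAIL and SITE_DEFAULT_DOMAIN are hypothetical names chosen for
# this example; they are not defined by the product.
array set SITE_EMAIL {
    john  John.Smith@my.company.com
}
set SITE_DEFAULT_DOMAIN "my.company.com"

proc getEmailAddress { user } {
    global SITE_EMAIL SITE_DEFAULT_DOMAIN
    # Use the explicit mapping when one exists.
    if { [info exists SITE_EMAIL($user)] } {
        return $SITE_EMAIL($user)
    }
    # Otherwise fall back to user@default-domain.
    return "$user@$SITE_DEFAULT_DOMAIN"
}
```

With this table, getEmailAddress john returns John.Smith@my.company.com, and a user without an entry, such as mary, is mapped to mary@my.company.com.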

Write Localized Health Checks

The vovnotifyd daemon runs in the vovsh binary, so all the VTK API procedures are available to you.

The standard check procedures are defined in the file $VOVDIR/tcl/vtcl/vovhealthlib.tcl.

The health check procedures are loaded using a search path. First, they are loaded from the file given above, then from $VOVDIR/local/vovhealthlib.tcl, and then from vovhealthlib.tcl in the vovnotifyd working directory. This permits you to redefine health check procedures on a site-wide or project-specific basis.
Note: Any local procedure names should begin with 'doTestHealth' like the system ones; some things depend on this convention.
For local procedures to be shown by the browser UI, you need to add a line into your vovhealthlib.tcl file like:
set HEALTHLIB_PRODUCT_MAP(doTestHealthYourProcedure)        "nc"

The product names are those returned by the procedure vtk_product_get_info -name: nc, lm, lam, ft, wa

Be sure your procedures are robust and handle error conditions well. In particular, wrap calls to exec, open, and other commands that can fail in catch.
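
As an illustration of this advice, the sketch below shows two small helpers that a local vovhealthlib.tcl could define so that a failing exec or open cannot raise an error inside the daemon. The names localSafeExec and localReadFile are hypothetical, and the {*} expansion syntax assumes Tcl 8.5 or later. How a doTestHealth procedure reports its findings is not covered here; follow the existing procedures in $VOVDIR/tcl/vtcl/vovhealthlib.tcl for that.

```tcl
# Hypothetical helpers for a local vovhealthlib.tcl (sketch).

# Run an external command. Returns a two-element list: {1 output} on
# success, {0 errorMessage} on failure. Note that exec also reports an
# error when the command exits non-zero or writes to stderr.
proc localSafeExec { args } {
    if { [catch { exec {*}$args } result] } {
        return [list 0 $result]
    }
    return [list 1 $result]
}

# Read a whole file. Returns its contents, or the empty string if the
# file cannot be opened or read.
proc localReadFile { path } {
    if { [catch { open $path r } fp] } {
        return ""
    }
    if { [catch { read $fp } data] } {
        set data ""
    }
    catch { close $fp }
    return $data
}
```

A check can then test the first element of the list returned by localSafeExec and skip the rest of its logic when the command failed, instead of letting the error propagate.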

Note: Changes made to any of the vovhealthlib.tcl files will not take effect until the vovnotifyd daemon is restarted.

Alternate Methods of Sending Email

When direct SMTP does not work, other methods of sending email may.

Using a Mailer Program

The standard sendMail procedure checks the config file variable NOTIFYD(mailprog), which has a default of SMTP. This may also be set to the name of a mail program that can accept a message body on its 'stdin' stream, like /bin/mail on UNIX. This program is also expected to have an -s option to specify the subject.
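
For example, a config.tcl line along the following lines would select a mailer program instead of SMTP. This is a sketch: the variable name comes from the paragraph above, while the path is an assumption about your hosts.

```tcl
# Fragment of vovnotifyd/config.tcl (sketch).
# /bin/mail is the example named above; the path on your hosts may differ.
# The program must read the message body on stdin and accept -s SUBJECT.
set NOTIFYD(mailprog)  "/bin/mail"
```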

Overriding the sendMail Procedure

The procedure sendMail defined in the file vovnotifydlib.tcl uses SMTP (Simple Mail Transfer Protocol) to send email notifications by connecting directly to a mail transport agent (MTA), such as sendmail, postfix, or Microsoft Exchange. vovnotifyd supports only simple unauthenticated, unencrypted SMTP, and may not work if the MTA is configured to require authentication or transport-level security.

If your MTA is so configured, you may need to use an alternate method to send mail. One method is to redefine the daemon's sendMail procedure. Another option, relaying through a local MTA, is described at the end of this section.
# Overriding sendMail in the vovnotifyd/config.tcl file.
proc sendMail { recList subj msg } {
    catch {exec yourMailScript.csh $recList $subj $msg }
}

The script that you write can be a simple wrapper around /bin/mail on UNIX, or you can exec such a mailer program directly in the sendMail procedure.
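
A minimal sketch of the second option, calling the mailer directly from sendMail, might look like the following. It assumes that recList is a Tcl list of recipient addresses and that /bin/mail accepts -s for the subject and reads the body on stdin. Both are assumptions to verify against your installation.

```tcl
# Overriding sendMail in the vovnotifyd/config.tcl file (sketch).
# Assumptions: recList is a Tcl list of addresses; /bin/mail exists and
# accepts "-s SUBJECT RECIPIENT" with the message body on stdin.
proc sendMail { recList subj msg } {
    foreach rec $recList {
        # "<< $msg" makes exec feed the message to the mailer's stdin.
        if { [catch { exec /bin/mail -s $subj $rec << $msg } err] } {
            # Report the failure but keep going with the other recipients.
            puts stderr "sendMail: could not send mail to $rec: $err"
        }
    }
}
```

Sending one message per recipient keeps a single bad address from blocking the others. Whether that is preferable to one message with several recipients depends on your mailer.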

If the host on which vovnotifyd runs has a local MTA that relays to the domain's MTA, the local MTA may accept unauthenticated cleartext SMTP even when the domain's MTA requires authentication or TLS. If this is the case, you may be able to use direct SMTP with 'localhost' as the mailer host.
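
In config.tcl this could look like the fragment below. It is a sketch that assumes NOTIFYD(server) and NOTIFYD(port) name the SMTP host and port, as the sample config.tcl earlier in this section suggests.

```tcl
# Fragment of vovnotifyd/config.tcl (sketch).
# Assumption: NOTIFYD(server) and NOTIFYD(port) select the SMTP host and
# port used by the default sendMail procedure.
set NOTIFYD(server)  "localhost"
set NOTIFYD(port)    25
```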

Health Checks

This is a list, with short descriptions, of all built-in health checks monitored by vovnotifyd.

By default, checks run every 10 minutes unless noted otherwise.
AllJobsFailedOnHost
This checks for hosts on which all jobs have failed during a given time window (default 3600).
CheckAlerts
This check will send a summary of all the current alerts via notification email
CheckDownTaskers
This checks for taskers that are enabled in taskers.tcl but are not running.
CheckJobsReqstRam
This will alert you for jobs that have used more RAM than requested.
CheckTaskerset
This will check the status of running taskers.
CheckVendorLicenseExpiration
This will alert you when a vendor license is going to expire in a few days.
Daemons
This will check the daemons.
FalseLicenceUsage
This will alert you when jobs use licenses that have not been declared
JobStuck
This will find stuck jobs, which are running but burning no CPU. The default check frequency on this is 2 hours.
JobsWaitingForTooLong
Just as the name suggests, this reports on jobs that have been waiting for too long.
LongJobs
This will check for jobs that run too long.
LongJobsNoCpu
Like the previous check, but will check for jobs that are not using any CPU.
RamSentry
This will check jobs under Ram Sentry protection.
ServerSize
This will check the growth of vovserver.
TooManyFailures
As the name suggests, this will check for a user submitting too many failed jobs.
TooManyOutOfQueueJobs
This will check for too many jobs being run outside of Altair Accelerator.