Health Monitoring and vovnotifyd
You can set up basic Altair Accelerator health monitoring tests with mail notification. When the Accelerator system enters any of the "unhealthy" conditions you have defined, a configured list of users receives alert emails. Conditions that can be checked include:
- Long jobs that are stuck: stuck jobs do not use any CPU
- Someone has jobs waiting in queue for too long
- Any user has an unusually high ratio of failed jobs
- Any host(s) fails all jobs
- The server size (and other server related parameters) is growing
- There are too many out-of-queue jobs (only for Allocator setups; Allocator is also known as MultiQueue)
Role | Location | Notes |
---|---|---|
Global | $VOVDIR/tcl/vtcl/vovhealthlib.tcl | Part of distribution |
Site specific | $VOVDIR/local/vovhealthlib.tcl | Optional |
Project specific | PROJECT.swd/vovnotifyd/vovhealthlib.tcl | Optional |
Configuring Health Monitoring
By default, all checks defined in the vovhealthlib.tcl files are enabled.
If you want to change parameters in the health checks, you need to change the config.tcl file in the vovnotifyd directory.
set NOTIFYD(server) "tiger"
set NOTIFYD(port) 25
set NOTIFYD(sourcedomain) "mycompany.com"
set admin "dexin"
set cadmgr "john"
#
# Check if we have long stuck jobs. Check every 1 minute. If we have such
# jobs, send alert emails to the owner of the job (@USER@) and
# admin (here "dexin").
#
# Definition of "long stuck job": has been running at least 10 hours but has used no
# more than 20 seconds CPU time in total.
#
registerHealthCheck "doTestHealthLongJobsNoCpu -longJobDur 10h -minCpu 20" -checkFreq 1m -mailFreq 1d -recipients "@USER@ $admin"
#
# Check if we have jobs stuck, i.e., jobs that are not burning any CPU at all. And this situation
# has been persistent for at least 10 minutes.
#
# Check every 1 minute. If we have such jobs, send alert emails to the owner of the
# job (@USER@) and admin (here "dexin").
#
# This check is similar to doTestHealthLongJobsNoCpu but will be quicker to detect stuck jobs.
#
registerHealthCheck "doTestHealthJobStuck -maxNoCpuTime 10m" -checkFreq 1m -mailFreq 1d -recipients "@USER@ $admin"
#
# Check if any user has too many failed jobs. Check every 30 minutes. If we
# have such users, send alert emails to the owner of the job (@USER@),
# admin (here "dexin") and cadmgr ( here "john" ).
#
# Definition of "too many failures": a user has at least 1000 jobs in NetworkComputer
# with at least 90% failures
#
registerHealthCheck "doTestHealthTooManyFailures -minJobs 1000 -failRatio 0.9" -checkFreq 30m -mailFreq 1d -recipients "@USER@ $admin $cadmgr"
#
# Check if the server size is growing. Check other server related parameters as
# well, including number of jobs, number of queued jobs, etc.
#
# For everything that is checked, if the number grows over 60.0% compared to last time
# it is checked, send alert emails to the admin (here "dexin")
# and cadmgr ( here "john" ).
#
# Also send alert emails if the number of files is 5.0 times or more than the number
# of jobs.
#
# Check every 2 hours and send such alert emails once a day (1d).
#
registerHealthCheck "doTestHealthServerSize -filejobsRatio 5.0 -warnPercent 60.0" -checkFreq 2h -mailFreq 1d -recipients "$admin $cadmgr"
#
# Check if some user has jobs sitting in the queue for too long.
# Check every 2 hours.
#
# If we find such users, send alert emails to the user
# and cadmgr ( here "john" ). Send such alert emails once a day (1d).
#
# Definition of "waiting for too long": none of the jobs in one category (bucket)
# has been dispatched in the last 4 hours.
#
#
registerHealthCheck "doTestHealthJobsWaitingForTooLong -maxQueueTime 4h" -checkFreq 2h -mailFreq 1d -recipients "@USER@ $cadmgr"
The config.tcl file is checked for updates at regular intervals controlled by the variable NOTIFYD(timeout).
vovnotifyd
- Sends mail notifications based on job events, according to the MAILTO property of the jobs (Altair Accelerator only)
- Performs periodic system health checks and sends email when it discovers issues (Monitor and Accelerator)
There are a number of predefined system health check procedures included with the Altair Accelerator. You can also write health check procedures to monitor specialized conditions at your site. See below.
Configuring vovnotifyd
Config files | vnc.swd/vovnotifyd/config.tcl, vnc.swd/vovnotifyd/config_smtp.tcl, vnc.swd/vovnotifyd/config_export.tcl |
Info file | vnc.swd/vovnotifyd/info.tcl |
Auxiliary files | $VOVDIR/tcl/vtcl/vovhealthlib.tcl, $VOVDIR/local/vovhealthlib.tcl, vnc.swd/vovnotifyd/vovhealthlib.tcl |
The easiest way to configure notification is from the Admin page of the browser interface. Click on the Daemons item on the left-hand menu, then click config in the row for vovnotifyd.
% cd `vovserverdir -p .`
% mkdir vovnotifyd
# Notification configuration file.
# Should be placed in the vovnotifyd directory of the .swd.
# All settings are required unless specified otherwise.
# Unused optional settings should be commented out.
# Create an e-mail address map, stackable, optional
addUserToEmailAddressMap rtdamgr john@mydomain.com
### Altair Monitor-specific settings
# See notification configuration documentation in Altair Monitor Admin Guide
# ConfigureTag TAG OPTION VALUE
# ConfigureFeature FEATURE OPTION VALUE
### Examples:
# ConfigureTag MGC -poc { john mary }
# ConfigureFeature EDA/MATLAB -longcheckout 2d -userlongcheckout john 1w -mincap 5 -triggerperc 90
# ConfigureFeature SIMULINK -poc bob -mincap 10 -triggeruse 12
% nc cmd vovdaemonmgr start vovnotifyd
or start it manually from within the vovnotifyd directory:
% vovproject enable vnc
% cd `vovserverdir -p vovnotifyd`
% vovnotifyd
Autostart vovnotifyd
% cd `vovserverdir -p .`
% mkdir autostart
% cp $VOVDIR/etc/autostart/start_vovnotifyd.tcl autostart/start_vovnotifyd.tcl
Configure Email Addresses
- Call addUserToEmailAddressMap USERNAME EMAIL
- Override the entire procedure getEmailAddress
# Fragment of vovnotifyd/config.tcl file.
# Method 1.
addUserToEmailAddressMap john John.Smith@my.company.com
# Method 2. Assume we can get an address from LDAP
# The LDAP subsystem needs to be configured.
proc getEmailAddress { user } {
    set email [VovLDAP::getEmail $user]
    if { $email ne "" } {
        return $email
    } else {
        return $user
    }
}
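Where LDAP is not available, getEmailAddress can also be overridden with plain Tcl. The following is a minimal sketch rather than the shipped implementation: the siteEmailMap array, the siteDefaultDomain variable, and the policy of falling back to a default domain are assumptions made for illustration. Only the procedure name and its single user argument come from the fragment above.

```tcl
# Hypothetical site table; the array name and its entries are illustrative only.
array set siteEmailMap {
    john John.Smith@my.company.com
}
# Hypothetical default domain appended to bare user names.
set siteDefaultDomain "my.company.com"

proc getEmailAddress { user } {
    global siteEmailMap siteDefaultDomain
    # An explicit entry in the site table wins.
    if { [info exists siteEmailMap($user)] } {
        return $siteEmailMap($user)
    }
    # A value that already looks like an address is passed through unchanged.
    if { [string first "@" $user] >= 0 } {
        return $user
    }
    # Otherwise assume the user has a mailbox in the default domain.
    return "${user}@${siteDefaultDomain}"
}
```

With these assumed settings, a user without an explicit entry is mapped to an address in the default domain instead of being returned as a bare login name.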
Write Localized Health Checks
The vovnotifyd daemon runs in the vovsh binary, so all the VTK API procedures are available to you.
The standard check procedures are defined in the file $VOVDIR/tcl/vtcl/vovhealthlib.tcl.
set HEALTHLIB_PRODUCT_MAP(doTestHealthYourProcedure) "nc"
The product names are those returned by the procedure vtk_product_get_info -name: nc, lm, lam, ft, wa
Be sure your procedures are robust and handle error conditions well, for example by wrapping exec, open, and other calls that can sometimes fail in catch.
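As an illustration of that advice, the sketch below wraps open and exec in catch so that a missing file or a failing command yields a default value instead of aborting the check. It uses only core Tcl commands. The helper names readFileOrDefault and execOrDefault are hypothetical and are not part of vovhealthlib.tcl.

```tcl
# Read a whole file; return $default if it cannot be opened or read.
proc readFileOrDefault { path default } {
    if { [catch { open $path r } fp] } {
        return $default
    }
    if { [catch { read $fp } data] } {
        set data $default
    }
    catch { close $fp }
    return $data
}

# Run an external command given as a Tcl list; return $default on failure.
# exec raises an error on a non-zero exit status or on any output to stderr.
proc execOrDefault { cmd default } {
    if { [catch { eval [linsert $cmd 0 exec] } out] } {
        return $default
    }
    return $out
}
```

A health check procedure could then call, for example, `execOrDefault {uptime} ""` and treat an empty result as "no data" rather than letting the error propagate.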
Alternate Methods of Sending Email
When direct SMTP does not work, there are other methods of sending email that may work.
Using a Mailer Program
The standard sendMail procedure checks the config file variable NOTIFYD(mailprog), which has a default of SMTP. This may also be set to the name of a mail program that can accept a message body on its 'stdin' stream, like /bin/mail on UNIX. This program is also expected to have an -s option to specify the subject.
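A configuration fragment for this setting might look like the following. The variable name and its default come from the description above; the path /bin/mail is only an example and should be replaced by whatever mailer exists on the host that runs vovnotifyd.

```tcl
# Fragment of vovnotifyd/config.tcl.
# Use an external mailer instead of direct SMTP (the default value is SMTP).
# The program must read the message body on stdin and accept -s SUBJECT.
set NOTIFYD(mailprog) "/bin/mail"
```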
Overriding the sendMail Procedure
The procedure sendMail defined in the file vovnotifydlib.tcl uses SMTP (Simple Mail Transfer Protocol) to send email notification, by connecting directly to a mail transport agent (MTA), such as sendmail, postfix, or Microsoft Exchange. vovnotifyd supports only simple unauthenticated, unencrypted SMTP, and may not work if the MTA is configured to require authentication or transport-level security. In that case, you can override sendMail in the config file:
# Overriding sendMail in the vovnotifyd/config.tcl file.
proc sendMail { recList subj msg } {
    catch { exec yourMailScript.csh $recList $subj $msg }
}
The script that you write can perhaps be a simple wrapper around /bin/mail on UNIX, or you can even exec such a mailer program directly in the sendMail procedure.
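A sketch of the second option, calling a mailer directly from sendMail, is shown here. It rests on several assumptions that are not confirmed by this document: that recList is a Tcl list of addresses, that the mailer accepts "-s SUBJECT RECIPIENT" and reads the body on stdin, and that the caller does not depend on the return value. The SITE_MAILER variable is hypothetical.

```tcl
# Hypothetical setting: path of a mailer that reads the body on stdin
# and accepts "-s SUBJECT RECIPIENT" (for example /bin/mail on UNIX).
set SITE_MAILER "/bin/mail"

proc sendMail { recList subj msg } {
    global SITE_MAILER
    set failures 0
    foreach rec $recList {
        # "<< $msg" makes exec feed the message text to the mailer's stdin.
        if { [catch { exec $SITE_MAILER -s $subj $rec << $msg } err] } {
            puts stderr "sendMail: could not mail $rec: $err"
            incr failures
        }
    }
    # Assumption: 0 means every recipient was handed to the mailer.
    return $failures
}
```

Sending one message per recipient keeps a bad address from blocking the others. Note that exec treats arguments beginning with characters such as < or | as redirections, so subjects built from job data may need sanitizing first.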
If the host on which vovnotifyd runs has a local MTA that relays to a domain MTA requiring authentication or TLS, the local MTA may still accept unauthenticated cleartext SMTP. If this is the case, you may be able to use direct SMTP with 'localhost' as the mailer host.
Health Checks
This is a list, with short descriptions, of all built-in health checks monitored by vovnotifyd.
AllJobsFailedOnHost
- This checks for hosts on which all jobs have failed during a given time window (default 3600)
CheckAlerts
- This check will send a summary of all the current alerts via notification email
CheckDownTaskers
- This checks for taskers that are enabled in taskers.tcl but not running
CheckJobsReqstRam
- This will alert you for jobs that have used more RAM than requested.
CheckTaskerset
- This will check the status of running taskers.
CheckVendorLicenseExpiration
- This will alert you when a vendor license is going to expire in a few days.
Daemons
- This will check the daemons.
FalseLicenceUsage
- This will alert you when jobs use licenses that have not been declared
JobStuck
- This will find stuck jobs, which are running but burning no CPU. The default check frequency on this is 2 hours.
JobsWaitingForTooLong
- Just as the name suggests, this reports on jobs that have been waiting for too long.
LongJobs
- This will check for jobs that run too long.
LongJobsNoCpu
- Like the previous check, but will check for jobs that are not using any CPU.
RamSentry
- This will check jobs under Ram Sentry protection.
ServerSize
- This will check the growth of vovserver.
TooManyFailures
- As the name suggests, this will check for a user submitting too many failed jobs.
TooManyOutOfQueueJobs
- This will check for too many jobs being run outside of Altair Accelerator.