vovtaskermgr
The vovtaskermgr command is the main way to start, configure, and stop taskers. It acts on the VOV project that is enabled in the shell where it is issued.
A vovtasker listed in the taskers.tcl file may be running or stopped. The show subcommand gives information on the running vovtaskers currently connected to the vovserver. The list subcommand gives the names of all the vovtaskers defined in the taskers.tcl file, whether running or stopped.
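As an illustration of how show differs from list, the sketch below checks whether one named tasker is currently connected by searching the output of show -nameonly. It assumes that show -nameonly prints one tasker name per line, which the usage message does not spell out. The VOVTASKERMGR variable is a hypothetical override introduced here so that the function can be tried without a running project; it is not something the tool itself reads.

```shell
# Return 0 if the named tasker is connected to the vovserver, 1 otherwise.
# Assumption: "vovtaskermgr show -nameonly" prints one tasker name per line.
# VOVTASKERMGR is a hypothetical override for dry runs, not read by the tool.
tasker_is_connected() {
    name=$1
    cmd=${VOVTASKERMGR:-vovtaskermgr}
    # -x matches the whole line, -F treats the name as a literal string,
    # so "farm1" does not match "farm11".
    $cmd show -nameonly | grep -qxF -- "$name"
}
```

A caller could use it as: if tasker_is_connected pluto; then echo "pluto is up"; fi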
vovtaskermgr: Usage Message
USAGE:
vovtaskermgr <SUBCOMMAND> [options] [taskerList]
SUBCOMMAND is case-insensitive.
The taskerList consists of tasker names or tasker ids.
SUBCOMMAND is one of:
LIST -- List all hosts named in the taskers.tcl file.
RESTART -- Same as STOP followed by START.
REFRESH -- Refresh cached environments and equivalences.
The default behavior is for taskers to obtain the
equivalences from the server. If changes are made to the
equiv.tcl file, the server will need to be instructed to
reread the file using the "vovproject reread" command
prior to requesting a tasker refresh.
If VOVEQUIV_CACHE_FILE is set to "legacy", a host-based
equivalence cache file will be created and updated in
the SWD/equiv.caches directory. If VOVEQUIV_CACHE_FILE
is set to a file path, the specified file will be used
instead.
SHOW -- Show info about connected or down taskers.
PRINTSTATUS -- Tell taskers to print their status in their log file.
START -- Start configured taskers. If a list of hosts is
given, start taskers only on those hosts. Otherwise,
start all configured taskers that are not running.
UPDATE -- Update configuration of running taskers.
RESERVE -- Reserve specified taskers.
RESERVESHOW -- Show current tasker reservations.
CONFIGURE -- Reconfigure the specified taskers on-the-fly.
Changes only persist until the tasker is stopped.
STOP -- Stop taskers; let jobs finish, unless -force is given.
CANCELSHUTDOWN -- Revert stopped but still running taskers to normal
so they continue running and accept new jobs.
ROTATELOG -- Recreate new log files for specified taskers
if log files are missing, create tasker log directories
if needed, and have no impact on tasker startup logs.
CLOSE [MSG] -- Close taskers from accepting jobs. Closed taskers will
start and run, but will do so in a suspended state,
displaying the closure message, until opened by the
administrator. The default closure message is
'Closed by administrator'.
OPEN [MSG] -- Open taskers to accept jobs. The accompanying message
will be displayed on running taskers until another
message is generated during the course of normal
operation. Taskers that are not running will not display
the message after starting. The default opening message
is an empty string.
Global Options are:
-l -- Use longer format with LIST (may be repeated).
-v -- Increase verbosity of messages.
-cfgfile -- Specify path to tasker config file, relative to SWD.
Default: taskers.tcl
-failover -- Restrict operation to dedicated failover taskers only.
Options for SHOW are:
-nameonly -- Show only the names of the connected taskers.
-nameid -- Show only the names and ids of the connected taskers.
-resourceonly -- Show only the resources of the connected taskers.
-down -- Show names of configured taskers that are down.
-license -- Show licensed capabilities of connected taskers.
-taskergroups -- Show tasker group for each connected tasker.
Options for START and RESTART are:
-server -- Start the taskers by rsh/ssh from the vovserver host.
By default, the taskers are started
by the host that executes this script.
-random -- Start taskers in random order.
This is useful to start a large pool of taskers,
by running multiple concurrent commands like:
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
-nolog -- Redirect tasker output to /dev/null.
Useful to avoid huge log files in /usr/tmp
-confirmafter <TIMESPEC>
-- Wait for the given time specification after the last start
request for the list of taskers being started, then print
whether each tasker has successfully started and connected
to the vovserver. Only taskers in the READY, WRKNG, FULL, or
OVRLD state will be considered as running.
Options for RESERVE are:
-user -- Reserve the tasker(s) for given list of users
(comma separated list)
-usergroup -- Reserve the tasker(s) for given list of user groups
(comma separated list)
-group -- Reserve the tasker(s) for given list of FairShare groups
(comma separated list)
-jobclass -- Reserve the tasker(s) for given list of jobclasses
(comma separated list)
-jobproj -- Reserve the tasker(s) for given list of job projects
(comma separated list)
-osgroup -- Reserve the tasker(s) for given list of Unix groups
(comma separated list)
-bucketid -- Reserve the tasker(s) for given list of queue buckets
(comma separated list)
-id -- Reserve the tasker(s) for given list of jobs
(comma separated list of job ids)
-start -- Reservation start time
-end -- Reservation end time
-duration -- Reservation duration (VOV timespec)
-cancel -- Cancel the reservation on tasker(s)
-hardfill -- Backfill the reservation with only
autokill jobs
-softfill -- Backfill the reservation with only
autokill or xdur jobs
Options for STOP are:
-force -- Stop taskers with force. BEWARE: kills running jobs.
-noconfirm -- Do not prompt for confirmation. Default is to prompt.
-all -- Stop all running taskers.
-sick <TIMESPEC>
-- Stop all taskers that have been sick for at least the given
time specification, as compared against the last time a
heartbeat was received by the server for each sick tasker.
All jobs running on a sick tasker being stopped will be
marked as failed in the server, even if the job does,
or has, completed successfully while the tasker is sick.
It is recommended to check tasker host connectivity before
using this function and allow for the tasker to reconnect
and send a heartbeat in case connectivity is restored.
Parameters for CONFIGURE are:
-allowcoredump <bool> -- Control core-dump behavior.
-autokillmethod <d|n|v> -- Control autokill method.
-capacity <CAP>[MAXCAP] -- Specify capacity and optionally the
max-capacity of the tasker. The capacity is
the maximum number of jobs that can be run by
tasker. The max_capacity is the maximum slots
a tasker can be expanded to have when jobs are
suspended. The default value for capacity is
equal to the number of CORES present. The
default value for max_capacity is 2*CAPACITY.
Use N, N/N, CORES[-+*/]N, CORES[-+*/]N/N,
N/CORES[-+*/]N, CORES[-+*/]N/CORES[-+*/]N to
make adjustments from the default.
Examples: 4, 4/8, CORES-2, CORES*0.8,
CORES+0/20, CORES+2/CORES*2
-cpus <N> -- Number of CPUs in this machine.
-debugcontainers <bool> -- Enable debug logging of container activity.
-debugjobcontrol <bool> -- Enable debug logging of job control activity.
-debugmultienv <bool> -- Enable debug logging of environment switching.
-debugnuma <bool> -- Enable debug logging of NUMA activity.
-debugusageinfo <bool> -- Enable debug logging of memory usage analysis.
-maxload <MAXLOAD> -- Maximum load above which new jobs are refused.
The default value for max_load is
CAPACITY+0.5.
Use 0 or less than 0 to specify default value.
Use N or CAPACITY[-+*/]N to make adjustments
from the default.
Examples: 12.0, CAPACITY+2, CAPACITY*2
-maxwaitnostart <N> -- How long to wait for a job to start.
-maxwaittoreconnect <N> -- How long to wait before reconnect.
-message <string> -- Set vovtasker message.
-numabindtonode <bool> -- Bind to entire NUMA node or individual cores.
Default is to bind to entire NUMA node.
-resources <string> -- vovtasker resources.
-taskergroup <string> -- The tasker group.
-minramfree <N> -- Minimum amount of free RAM in MB.
-name <string> -- Name of vovtasker.
-ramsentry <bool> -- Activate/Deactivate RAM SENTRY.
-efftotram <N> -- Effective total RAM in MB.
-retrychdir <N> -- Specify number of retries for failed chdirs.
-retrychdirsleep <N> -- Specify the sleep interval time between
retries for failed chdirs.
-retrychdirbackoff <N> -- Specify the factor multiplied to the sleep
interval to increase sleep interval between
retries for failed chdirs.
-liverecorder on|off -- Enable/disable LiveRecorder debugging
capability (linux64 only).
-liverecorder.logdir <string>
-- Specify the directory in which the LiveRecorder
recording file should be saved. The
directory must exist. Default is "/tmp".
-liverecorder.logsize <N> --
Specify the LiveRecorder log size in MB.
Default: 256, Min: 256, Max: 65536.
-liverecorder.mode <string>
-- Specify the LiveRecorder mode, which is one of
the following: tasker, subtasker, both.
Note that enabling subtasker recording results
in a recording file for each job executed on
the tasker.
Default: tasker.
-rawpower -- Specify a raw power figure for initial tasker
startup.
-mindisk -- Specify minimum /tmp disk in MB or
percentage (0%-99%, for example, 10%)
required for tasker startup.
-coeff -- Specify a scaling factor from 0.01-100.0
used to derate tasker power.
-sendenv <name> -- Send a named environment to a tasker.
-setenv VAR=VALUE -- Set a variable in the tasker environment.
("VAR=VALUE" must be quoted on Windows)
-taskerheartbeat <N> -- Specify the heartbeat for a tasker.
-unsetenv VAR -- Unset a variable in the tasker environment.
-hardbound <bool> -- Dispatch autokill jobs only.
-softbound <bool> -- Dispatch autokill or xdur jobs only.
EXAMPLES:
% vovtaskermgr show
% vovtaskermgr show -nameid
% vovtaskermgr start
% vovtaskermgr start unix1
% vovtaskermgr start -random -- Start taskers in random order.
% vovtaskermgr update
% vovtaskermgr restart
% vovtaskermgr stop -- Stop all taskers, let running
jobs finish.
% vovtaskermgr stop -noconfirm -- Like above, no confirmation
required.
% vovtaskermgr stop -force -- Kill running jobs now
(-noconfirm implied).
% vovtaskermgr reserve -user john \
-duration 3h jupiter -- Reserve tasker jupiter for user
john for 3h from now
% vovtaskermgr configure -message "shutdown 1PM" farm11 farm12
% vovtaskermgr printstatus farm11
% vovtaskermgr rotatelog -- Recreate missing log files for
all connected taskers
% vovtaskermgr rotatelog farm2 farm11 -- Recreate missing log files for
taskers farm2 and farm11
% vovtaskermgr configure jupiter -sendenv BASE
-- send the BASE environment to
tasker jupiter
% vovtaskermgr reserve -hardfill -user john \
-duration 3h jupiter -- Reserve tasker jupiter for user
john for 3h from now, and backfill with autokill jobs
% vovtaskermgr configure jupiter -softbound 1
-- Reserve tasker jupiter for jobs with autokill or xdur specified
Starting Many Taskers in Parallel
If you have hundreds of taskers to start, startup may take some time. You can speed up the process by running several start commands concurrently, each with the -random option, which starts taskers in random order:
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
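The repeated commands above can also be issued from a small loop. The following is a minimal sketch, not part of the product: the function name and the VOVTASKERMGR override are hypothetical and exist only so the loop can be dry-run without a project. The number of concurrent commands is a parameter; 4 is an arbitrary default.

```shell
# Launch N concurrent "vovtaskermgr start -random" commands and wait for all.
# VOVTASKERMGR is a hypothetical override for dry runs, not read by the tool.
start_taskers_parallel() {
    workers=${1:-4}
    cmd=${VOVTASKERMGR:-vovtaskermgr}
    i=0
    while [ "$i" -lt "$workers" ]; do
        $cmd start -random &
        i=$((i + 1))
    done
    # Block until every background start command has exited.
    wait
}
```

For example, start_taskers_parallel 6 is equivalent to the six background commands shown above, followed by a wait.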
Tasker Configuration on the Fly
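You can change some tasker parameters on the fly with vovtaskermgr configure. The changes persist only until the tasker is stopped. For example, you can change the capacity of a tasker, that is, the maximum number of jobs that the tasker can take, with: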
% vovtaskermgr configure -capacity 8 pluto
% vovtaskermgr configure -capacity 0 pluto
% vovtaskermgr configure -message "Temporarily disabled by John" pluto
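The three commands above suggest a common pattern: take a tasker out of service with an explanatory message, then restore its capacity later. The sketch below wraps that pattern in two functions. It assumes that a capacity of 0 means the tasker accepts no new jobs, which the example implies but does not state. The function names and the VOVTASKERMGR override are hypothetical.

```shell
# Temporarily stop a tasker from taking jobs, then restore it later.
# Assumption: capacity 0 means "accept no new jobs".
# VOVTASKERMGR is a hypothetical override for dry runs, not read by the tool.
disable_tasker() {
    tasker=$1
    reason=$2
    cmd=${VOVTASKERMGR:-vovtaskermgr}
    # Issue the two changes as separate commands, as in the examples above.
    $cmd configure -capacity 0 "$tasker" &&
    $cmd configure -message "$reason" "$tasker"
}

enable_tasker() {
    tasker=$1
    capacity=$2
    cmd=${VOVTASKERMGR:-vovtaskermgr}
    $cmd configure -capacity "$capacity" "$tasker"
}
```

For example, disable_tasker pluto "Temporarily disabled by John" followed later by enable_tasker pluto 8. Because configure changes last only until the tasker is stopped, a restarted tasker returns to its configured capacity regardless.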
Tasker Capacity
Manual overrides of vovtasker cores and capacity behave as follows. By default, the capacity follows the core count. It can also be set manually with the -T option, or by defining the consumable resource SLOTS/N with the -r option, where N is a positive integer. In all cases, the capacity directly determines the number of slot licenses that will be requested.
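For a tasker that is already running, the usage message above lists configure -capacity, which takes a capacity and optionally a max-capacity in forms such as 4, 4/8, CORES-2, or CORES*0.8. The following sketch builds that argument and passes it quoted, because an unquoted expression containing * may be expanded by the shell as a file name pattern. The function name and the VOVTASKERMGR override are hypothetical.

```shell
# Build the -capacity argument as CAP or CAP/MAXCAP and apply it to a tasker.
# VOVTASKERMGR is a hypothetical override for dry runs, not read by the tool.
set_capacity() {
    tasker=$1
    cap=$2
    maxcap=${3:-}
    cmd=${VOVTASKERMGR:-vovtaskermgr}
    spec=$cap
    if [ -n "$maxcap" ]; then
        spec=$cap/$maxcap
    fi
    # Quote the spec: expressions such as CORES*0.8 contain a glob character.
    $cmd configure -capacity "$spec" "$tasker"
}
```

For example, set_capacity pluto 4 8 issues configure -capacity 4/8 pluto, and set_capacity pluto 'CORES*0.8' passes the expression through unchanged.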
Tasker Reservation
Below is an example of using vovtaskermgr to set a reservation on a tasker. In this case, you want to reserve the tasker called 'pluto' for user 'john' for 2 days.
% vovtaskermgr reserve -user john -duration 2d pluto
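Since -user accepts a comma-separated list, the same reservation can be made for several users at once. The sketch below joins user names into that list and also shows cancelling the reservation. It assumes that -cancel is given to the reserve subcommand together with the tasker name, which the usage message implies but does not show. The function names, the user 'mary', and the VOVTASKERMGR override are hypothetical.

```shell
# Reserve a tasker for one or more users, and cancel a reservation.
# Assumption: "-cancel" is passed to "reserve" along with the tasker name.
# VOVTASKERMGR is a hypothetical override for dry runs, not read by the tool.
reserve_for_users() {
    if [ "$#" -lt 3 ]; then
        echo "usage: reserve_for_users TASKER DURATION USER..." >&2
        return 2
    fi
    tasker=$1
    duration=$2
    shift 2
    cmd=${VOVTASKERMGR:-vovtaskermgr}
    # Join the remaining arguments into the comma-separated list -user expects.
    users=$(IFS=,; printf '%s' "$*")
    $cmd reserve -user "$users" -duration "$duration" "$tasker"
}

cancel_reservation() {
    cmd=${VOVTASKERMGR:-vovtaskermgr}
    $cmd reserve -cancel "$1"
}
```

For example, reserve_for_users pluto 2d john mary issues reserve -user john,mary -duration 2d pluto. The reservesubcommand's companion, reserveshow, lists the current reservations.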