This method is usually carried out alongside a physical overhaul that is part of a
major IT event. The shutdown kills all running jobs, which can be highly disruptive.
To reduce the impact, Accelerator can be instructed to stop
accepting and dispatching new jobs for a period prior to the shutdown event,
enabling some of the running jobs to complete. How many jobs will complete depends
on the jobs' duration and the time allowed before shutdown.
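As a rough illustration of that dependence, the sketch below counts how many running jobs would finish inside a given drain window. It is plain POSIX shell, not an Accelerator command; the function name count_finishing and the sample remaining runtimes are hypothetical, and in practice a job's remaining time is only an estimate.

```shell
#!/bin/sh
# Hypothetical helper: count jobs whose estimated remaining runtime
# fits inside the drain window. All values are in minutes.
# Usage: count_finishing WINDOW_MIN REMAINING_MIN...
count_finishing() {
    window=$1
    shift
    finished=0
    for remaining in "$@"; do
        # A job completes if its remaining time does not exceed the window.
        if [ "$remaining" -le "$window" ]; then
            finished=$((finished + 1))
        fi
    done
    echo "$finished"
}

# Four running jobs with 10, 45, 90 and 240 minutes left, 60-minute window.
count_finishing 60 10 45 90 240   # prints 2
```

A longer window lets more jobs complete but also keeps the queue closed to new work for longer, so the window is a trade-off rather than a fixed value.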
-
Download the Accelerator upgrade software.
-
Install the new software.
-
Using the new version, create a separate, temporary test queue (to validate the
new version while production continues).
-
Validate the installation using the test queue that you created.
-
Schedule and announce the upgrade.
-
If you have multiple Accelerator queues, it is recommended
that you set the NC_QUEUE environment variable to
the name of the queue that is undergoing maintenance. This helps prevent
accidentally shutting down the wrong queue. Use the command:
setenv NC_QUEUE vncNameOfQueue
- Optional:
Suspend the vovtaskers with the command
below. This command puts the vovtaskers in
the SUSP state; running jobs continue, but the vovtaskers do not accept new jobs.
When a vovtasker completes its current set of jobs, it exits.
- Optional:
At the start of the scheduled downtime, record the IDs of the running jobs,
because those jobs will be forcefully terminated. This list can be used to inform
users that their jobs were terminated by the maintenance event.
nc list -r -a -O @ID@ @USER@ @COMMAND@
- Optional:
To automatically identify jobs when the queue is restarted, place the jobs in a
special set.
Example:
nc cmd vovset create "ImpactedByQueueRestart" "isjob status==RETRACING"
- Optional:
Stop the remaining taskers with the -force option. This terminates the jobs
that are still running and should stop the taskers within a few minutes.
Example:
vovtaskermgr stop -force -all
-
Stop the vovserver of the queue with the
command ncmgr stop.
-
Proceed with any necessary infrastructure maintenance.
-
Restart the queue.
-
Ensure that your shell is configured to support the correct number of file
descriptors.
Note: This value cannot be changed after starting the queue.
-
Ensure that you are pointing to the appropriate version of Accelerator with the
command which nc.
-
Start the queue with the command ncmgr start.
A confirmation dialog will open.
-
Review the parameters carefully (especially the number of file descriptors) before
replying 'yes'. Starting the queue automatically starts the taskers, but this
takes some time.
-
Validate that jobs are dispatching normally.
Note: Restarting a large compute farm will take several minutes.
- Optional:
Re-queue the jobs that were impacted by the shutdown. Use the following
command:
nc rerun -f -set ImpactedByQueueRestart
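The two checks made before restarting the queue (the file descriptor limit and the nc that the shell resolves) can be sketched as a small pre-start script. This is a minimal POSIX shell sketch, not part of Accelerator: the function check_fd_limit and the REQUIRED_FDS threshold of 4096 are assumptions for illustration, so substitute the value your site requires. In csh, the equivalent of ulimit -n is limit descriptors.

```shell
#!/bin/sh
# Hypothetical pre-start check, to be run in the shell that will start the queue.
# REQUIRED_FDS is an assumed threshold, not a documented Accelerator value.
REQUIRED_FDS=${REQUIRED_FDS:-4096}

# check_fd_limit CURRENT REQUIRED
# Succeeds when the current soft limit is "unlimited" or at least REQUIRED.
check_fd_limit() {
    current=$1
    required=$2
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}

fd_limit=$(ulimit -n)
if check_fd_limit "$fd_limit" "$REQUIRED_FDS"; then
    echo "file descriptors: OK ($fd_limit)"
else
    echo "file descriptors: too low ($fd_limit < $REQUIRED_FDS)"
fi

# Report which nc is first on PATH, so a different version (or an unrelated
# program with the same name) is noticed before the queue is started.
if nc_path=$(command -v nc); then
    echo "nc resolves to: $nc_path"
else
    echo "nc not found on PATH"
fi
```

Because the file descriptor limit cannot be changed after the queue starts, running a check like this in the same shell immediately before ncmgr start is the point of the exercise; a check run in a different shell may report a different limit.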