Troubleshooting
A good first step is to run vovcheck, which checks the installation and configuration and produces a report similar to the following:
% vovcheck
vovcheck: message: Creating report /usr/tmp/vovcheck.report.15765
Test: BasicVariables [OK]
Test: EnvBase [OK]
Test: FlowTracerLicense [OK]
Test: FlowTracerPermissions [OK]
Test: GuiCustomization [OK]
Test: HostPortConflicts [ERROR]
Test: Installation [OK]
Test: OldMakeDefault [OK]
Test: Rhosts [OK]
Test: Rsh [OK]
Test: SecurityPermissions [WARN]
Test: TaskerRoot [OK]
Test: WritableLocal [WARN]
Test: WritableRegistry [OK]
Test: vovrc [OK]
vovcheck: message: Detailed report available in /usr/tmp/vovcheck.report.15765
The Server Does Not Start
- Make sure you have a valid RLM license.
Type:
% rlmstat -a
- Check if the server for your project is already running on the same machine.
  Do not start an Accelerator Plus project server more than once.
  For example, you can try:
% vovproject enable project
% vsi
- Check if the server is trying to use a port number that is already used by another vovserver or by another application. VOV computes the port number in the range [6200, 6455] by hashing the project name. If necessary, select another project name, change host, or use the variable VOV_PORT_NUMBER to specify a known unused port number. The best place to set this variable is in the setup.tcl file for the project (see the sketch after this list).
- Check if the server is trying to use an inactive port number that cannot be
bound. This can happen when an application, perhaps the server itself,
terminates without closing all its sockets. The server will exit with a message similar to the following:
...more output from vovserver...
vs52 Nov 02 17:34:55 0 3 /home/john/vov
vs52 Nov 02 17:34:55 Adding licadm@venus to notification manager
vs52 Nov 02 17:34:55 Socket address 6437 (net=6437)
vs52 ERROR Nov 02 17:34:55 Binding TCP socket: retrying 3
vs52 Nov 02 17:34:55 Forcing reuse...
vs52 ERROR Nov 02 17:34:58 Binding TCP socket: retrying 2
vs52 Nov 02 17:34:58 Forcing reuse...
vs52 ERROR Nov 02 17:35:01 Binding TCP socket: retrying 1
vs52 Nov 02 17:35:01 Forcing reuse...
vs52 ERROR Nov 02 17:35:04 Binding TCP socket: retrying 0
vs52 Nov 02 17:35:04 Forcing reuse...
vs52 ERROR Nov 02 17:35:04 PROBLEM:
    The TCP/IP port with address 6437 is already being used.
POSSIBLE EXPLANATION:
    - A VOV server is already running (please check)
    - The old server is dead but some of its old clients are still alive (common)
    - Another application is using the address (unlikely)
ACTION: Do you want to force the reuse of the address?
In this case:
- List all VOV processes that may be running on the server host and
that may still be using the port. For example, you can use:
% /usr/ucb/ps auxww | grep vov
john  3732  0.2  1.5  2340  1876  pts/13  S  17:36:18  0:00 vovproxy -p acprose -f - -b
john  3727  0.1  2.2  4816  2752  pts/13  S  17:36:16  0:01 vovsh -t /opt/rtda/latest/linux64/tcl/vtcl/vovresourced.tcl -p acprose
...
- You can wait for the process to die on its own, or you can kill it, for example with vovkill:
% vovkill pid
- Restart the server.
- Make sure you run the server as the Accelerator Plus administrator user. Check the ownership of the file security.tcl in the server configuration directory vwx.swd.
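If you choose to pin the port with VOV_PORT_NUMBER, a minimal setup.tcl fragment might look like the sketch below; the port value 6300 is only a placeholder for a number you know to be unused, and the setenv form assumes the standard VOV Tcl environment:
# Fragment of setup.tcl: force the vovserver to use a known free TCP port.
# 6300 is a placeholder value; choose a port that is not in use on your host.
setenv VOV_PORT_NUMBER 6300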
The UNIX Taskers Do Not Start
Taskers on UNIX hosts are normally started by means of rsh or ssh.
- If using rsh, try the following:
% rsh host vovarch
where host is the name of a machine on which there are problems starting a tasker.
This command should return a platform-dependent string (such as "linux") and nothing else. Otherwise, there are problems with either the remote execution permission or the shell start-up script.
- If the error message is similar to "Permission denied", check the file .rhosts in your home directory. The file should contain a list of host names from which remote execution is allowed. You may have to work with your system administrators to find out if your network configuration allows remote execution.
- If using ssh, perform the test above but use ssh instead of rsh. For more details about ssh, check SSH Setup.
- If you get extraneous output from the above command, the problem is probably
  in your shell start-up script. If you are a C-shell user, check your
  ~/.cshrc file. Following are guidelines for a
  remote-execution-friendly .cshrc file:
- Echo messages only if the calling shell is interactive. You can test
if a shell is interactive by checking the existence of the variable
prompt, which is defined for interactive shells. Example:
# Fragment of .cshrc file.
if ( $?prompt ) then
    echo "I am interactive"
endif
- Many .cshrc scripts exit early if they detect a
non-interactive shell. It is possible that the scripts exit before
sourcing ~/.vovrc, which causes Accelerator Plus to not be available in non-interactive
shells. Compare the following fragments of
.cshrc files and make sure the code in your
file works properly: The following example will not work properly for non-interactive shells:
if ( $?prompt ) exit
source ~/.vovrc
This example is correct; source ~/.vovrc and then check the prompt variable:
source ~/.vovrc
if ( $?prompt ) exit
This example is also correct:
if ( $?prompt ) then
    # Define shell aliases
    ...
endif
source ~/.vovrc
- Do not apply exec to a sub-shell. This will cause the rsh command to hang.
# Do not do this in a .cshrc file
exec tcsh
License Violation
Accelerator Plus is licensed by restricting the number of tasker slots. This is the sum of the tasker slots from both elastic and statically defined taskers (if any).
Check the available RLM licenses with:
% rlmstat -avail
The file $VOVDIR/../../vnc/vwx.swd/taskers.tcl defines the list of static taskers that are managed by the server. Make sure the number of tasker hosts is within the license capability.
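For reference, a minimal sketch of such a file is shown below, assuming the usual vtk_tasker_define form; the host names and options are placeholders:
# Fragment of taskers.tcl: two statically defined taskers (example host names).
vtk_tasker_define host1
vtk_tasker_define host2 -resources "linux64" -maxload 8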
Crash Recovery
In the event of a crash or failover, you can find a checklist of what to do at http://wx-host:wx-port/cgi/sysrecovery.cgi.
You can also open this page from the command line, for example:
% wx cmd vovbrowser -url /cgi/sysrecovery.cgi
When Things Go Wrong
The following problems may occur after a major infrastructure event, a problem with the submission scripts, or a change to the workload or to the configuration (such as Limits).
- Check whether the elastic daemon has stopped by looking at the elastic daemon log.
  - If the timestamp of the log is not fresh, you will need to restart:
    - Save off the existing log file (and send that file to Altair for diagnosis).
    - Restart:
      % nc -f $WxQueueName cmd vovautostart
- If the elastic daemon is running:
  - Check that tasker jobs are being submitted and that they are being executed by the base scheduler. To do this, connect to the correct Accelerator cluster through the web browser and locate the job set called vovelasticd. This set contains other sets, one for each Accelerator Plus session.
  - Locate the appropriate set for your Accelerator Plus session. Look at the name of the set.
  - If you see only cyan (scheduled) jobs, the problem is that the base scheduler cannot schedule these tasker jobs. You need to debug why these jobs are not being run by the underlying scheduler.
  - If the jobs are run but keep failing (turning red), debug and determine the reason for those failures.
When Jobs Are Not Running
You may discover that a tasker job is asking for an impossible resource. This is often due to an error in the resource requirements or an error in the configuration. You fix the problem, but Accelerator Plus still does not run jobs. In this case, those "impossible" jobs are still seen by the base scheduler (and are therefore not runnable), but they are also seen by the elastic daemon, which assumes that these jobs are runnable and that no new jobs should be submitted: a live-lock scenario.
The recommendation here is to dequeue any queued jobs in the base scheduler after changing the problematic resource request. This lets the elastic daemon launch replacement jobs.
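As a hedged sketch, assuming the base scheduler is Altair Accelerator and that stopping the queued tasker jobs with nc stop is acceptable (the job IDs below are placeholders):
# After fixing the resource request, dequeue the stuck tasker jobs in the base scheduler.
# The job IDs are hypothetical; identify the actual queued tasker jobs first.
% nc stop 012345678 012345679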
A different scenario is that tasker jobs are being scheduled, dispatched and run in the base scheduler, but no jobs are getting executed inside the Accelerator Plus session. The symptom is a growing list of green (valid) nodes with a small number of orange (running) and cyan (scheduled) jobs. In this case, the base scheduler is dispatching the tasker jobs without a problem, but Accelerator Plus is not making use of them. The result: the taskers execute, waiting for end-user jobs that never come, until they hit their maximum idle limit.
For this case, we recommend checking the Accelerator Plus session using either a vovconsole (wx -q $WxQueueName cmd vovconsole) or an Accelerator Plus monitor. If you see no activity in the LED bar (taskers connecting, waiting and then terminating, often yellow-to-green-to-black), most likely there is a problem with the configuration: the taskers that the base scheduler is executing are not connecting back to the desired Accelerator Plus session.
- If the LED monitor is active, then taskers are connecting and the failure to launch is most likely in Accelerator Plus. A common problem is that the job requests a limit or a special resource: the limit must be satisfied both in Accelerator Plus and in the base scheduler. In this case, the elastic daemon tries to detect expandable limits such as Limit:foo_@USER@_N, sets the limit to unlimited in Accelerator Plus, and then passes the limit request on to the base scheduler, where it is honored. Occasionally, this process goes wrong.
- If the limit does not exist or is set to 0 in Accelerator Plus, then the job will not launch. In this state, the job will appear to be queued (cyan color), but you will not see a bucket created for it (wx mon can be used to help diagnose that case). To add the missing resources, edit resources.tcl, as in the sketch below. In general, set the resource value to "unlimited" in Accelerator Plus; the restrictive value in the base scheduler will still be honored.
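A minimal resources.tcl fragment might look like the following sketch; the limit name is a placeholder, and vtk_resourcemap_set is assumed to be the mechanism used for resource maps in this configuration:
# Fragment of resources.tcl (hypothetical limit name).
# Make the limit effectively unlimited in Accelerator Plus; the base scheduler
# still enforces its own, more restrictive value.
vtk_resourcemap_set Limit:foo_john_N unlimited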
What to Check When Jobs Are Not Being Submitted
Accelerator Plus reports the tasker as SICK with a missing heartbeat. Check in the base scheduler whether the tasker has been suspended. In this case someone or something has suspended the tasker; both the tasker and its underlying job are suspended (T-state/CTRL-Z). Find out why: it could be a user or a RAM sentry, and the tasker job may have property information telling you what happened. When resumed, the tasker becomes healthy again (that is, no longer SICK) and continues normally within the base scheduler and Accelerator Plus.
Everything seems fine, but jobs are not being submitted. Check the base scheduler: if the tasker jobs are not running for FairShare reasons, it is possible that a regression is using the same FairShare node and subgroup (either from Accelerator Plus or natively in the base scheduler). In this case, the jobs are treated on a first-come, first-served basis.
Avoid Suspending Accelerator Plus Taskers in the Base Scheduler
Elastic tasker jobs in the base scheduler should not be subjected to suspension events from users or the preemption system. Accelerator Plus supports a form of preemption called modulation. In this case, the base scheduler queue requests the Accelerator Plus tasker to terminate on the next Accelerator Plus job boundary.