Configure a Failover Server Replacement

If a server crashes suddenly, VOV has the capability to start a replacement server on a pre-selected host. This capability requires that the pre-selected host is configured as a failover server.

The configuration instructions follow.

Note: The vovserverdir command only works from a VOV-enabled shell when the project server is running.

Edit or create the file servercandidates.tcl in the server configuration directory. Use the vovserverdir command with the -p option to find the pathname to this file.
```
% vovserverdir -p servercandidates.tcl
/home/john/vov/myProject.swd/servercandidates.tcl
```
The servercandidates.tcl file should set the Tcl variable ServerCandidates to a list of possible failover hosts. This list may include the original host on which the server was started.
```
set ServerCandidates {
    host1
    host2
    host3
}
```

Install the autostart/failover.csh script as follows:

% cd `vovserverdir -p .`
% mkdir autostart
% cp $VOVDIR/etc/autostart/failover.csh autostart/failover.csh
% chmod a+x autostart/failover.csh

Activate the failover facility by running vovautostart.
```
% vovautostart
```
For example:
```
% vovtaskermgr show -taskergroups
ID         taskername        hostname         taskergroup
000404374  localhost-2      titanus          g1
000404375  localhost-1      titanus          g1
000404376  localhost-5      titanus          g1
000404377  localhost-3      titanus          g1
000404378  localhost-4      titanus          g1
000404391  failover         titanus          failover
```
Note: Each machine listed as a server candidate must be a vovtasker machine; the vovtasker running on that machine acts as its agent in selecting a new server host. Taskers can be configured as dedicated failover candidates that are not allowed to run jobs by using the -failover option in the taskers definition.

Preventing jobs from running on the candidate machine eliminates the risks of machine stability being affected by demanding jobs. The -failover option also enables some failover configuration validation checks. Finally, failover taskers are started before the regular queue taskers, which helps ensure a failover tasker is available as soon as possible for future failover events.

Refer to the tasker definition documentation for details on the -failover option.

How vovserver Failover Works

If the vovserver crashes, after a period of time, the vovtasker process on each machine notices that it has had no contact from the server, and it initiates a server machine election.

In this election, each vovtasker votes for itself (precisely, the host that this particular tasker runs on) as a server candidate. The election is conducted by running the script vovservsel.tcl.

After the time interval during which the vovtaskers vote expires, (default 60 seconds) the host that appears earliest on the list will be selected to start a new vovserver.

In the following example, the servercandidates.tcl, file contains three hosts:

set ServerCandidates {
    host1
    host2
    host3
}

When the server crashes, if there are vovtaskers running on host1, host2 and host3, then these hosts will be voted as server candidates. Then host2 will be the best candidate and a new vovserver will be started on host2. This server will start in Crash Recovery Mode.

Note: For failover recovery to be successful, an active vovtasker process must be running on at least one of the hosts named in the ServerCandidates list. Usually, these vovtaskers have been defined with the -failover option so they can not accept any jobs, and are members of the failover taskergroup.

The failover vovserver will read the most-recently-saved PR file from the .swd/trace.db directory, and then read in the transactions from the 'cr*' (crash recovery) files to recover as much of the pre-crash state as possible.

The vovserver writes a new serverinfo.tcl file in the .swd that vovtaskers read to determine the port and host. When it starts, the failover vovserver appends the new host and port information to the $NC_CONFIG_DIR/<queue-name.tcl> as well as to the setup.tcl in the server configuration directory. The vovserver then runs the scripts in the Autostart Directory. This should include the failover.csh script, which resets the failover directory so that failover can repeat. This script removes the registry entry, and removes the server_election directory and creates a new empty one. At the end, it calls vovproject reread to force the failover vovserver to create an updated registry entry.

The failover vovserver remains in crash recovery mode for an interval, usually one minute, waiting for any vovtaskers that have running jobs to reconnect:

For Accelerator, Accelerator Plus, Monitor and Allocator, vovtaskers wait up to 4 days for a new server to start.
For FlowTracer, vovtaskers wait up to 3 minutes for a new server to start.

After reconnecting to vovserver, vovtaskers automatically exit after all of their running jobs are completed. After the vovserver transitions from crash recovery mode to normal mode, it will try to restart any configured vovtaskers that are not yet running.

Any of the following conditions will prevent successful failover server restart:

The filesystem holding the .swd directory is unavailable.
The file servercandidates.tcl does not exist.
The ServerCandidates list is empty.
There is no vovtasker running on any host in the ServerCandidates list when the server crashes.
The autostart/failover.csh script file is not in place.

In this case, the failover server will not be automatically started; the server will have to be manually started.

Tips for Configuring Failover

Following are tips for failover configuration:

Make the first failover host the regular one. This way, if the vovserver dumps core or is killed by mistake, it will restart on the regular host.
Configure special vovtaskers only for failover by passing the -failover option to vtk_tasker_define.
Test that failover works before depending on it.

Migrating vovserver to a New Host

The failover mechanism provides the underpinnings of a convenient user CLI command that can be used to migrate vovserver to a new host:

ncmgr rehost -host  NEWHOST

The specified NEWHOST must be one of the hosts eligible for failover of vovserver.

Crash Recovery Mode

Crash Recovery Mode is activated the next time the server is restarted, if the server was not shut down cleanly. Crash Recovery Mode is part of the Failover Server capability of VOV, which is mainly used in Accelerator. This capability allows VOV to start a new server to manage the queue of a server which has crashed unexpectedly.

When you shut down an Accelerator instance cleanly using the ncmgr stop command, the server will save its database to disk just before exiting. When the server is restarted, it will read the state of the trace from disk, and immediately be ready for new work.

Sometimes the vovserver will be stopped unexpectedly, such as due to a hardware problem like a machine crash or memory exhaustion. In such cases, the server will not have a chance to save the project database before terminating.

In VOV, the main concern is usually the state of the trace, which stores the status all the jobs in your project.
In Accelerator, there is no trace, and the important thing to preserve is the state of the queue, so jobs do not lose their position and need to be re-queued in the case of a server crash.

Journal Files

The vovserver keeps crash recovery journal files of the events that affect the state of the server. These 'CR' files are flushed whenever the trace data are saved to disk. During crash recovery, the vovserver first reads the last saved state of the trace from the disk data, then applies the events from the CR files.

Crash Recovery Restart

When the server is next restarted after such a crash, the server enters what is called Crash Recovery Mode, which usually lasts about two minutes, but may take longer if the CR files are very large. During this period:

The server waits for vovtaskers with running jobs to reconnect.
No jobs are dispatched.
The server does not accept VOV or HTML TCP connections from vovsh or browser clients.
At the end, the server performs a global sanity check.
A crash_recovery_report <timestamp> logfile is written. It logs any jobs lost by the crash recovery sequence.

If the server is properly shut down, the next time it restarts, it will not enter Crash Recovery Mode, and will be immediately functional.

Note: If the sanity check command cannot connect, the server is still recovering its state from the DB and CR files. Check the size of the vnc.swd/trace.db/CR* files.

Example commands:

% vovproject enable <project-name>
% vovproject sanity