NFS Cache

NFS caches some information about files and directories. In flows that use multiple machines, this can lead to corrupt data: one machine may use cached data that is out of date, or may fail to see a file that was just created on another machine.

To minimize failures caused by NFS caches, the VOV wrappers can check that the timestamps of all inputs match those expected by the vovserver.

To activate this functionality, use one of the following methods:
  • Use the option -n
  • Define the variable VOV_VW_NFS_PROTECTION
Examples:
% vw -n cp aa bb
% env VOV_VW_NFS_PROTECTION=1 vw cp aa bb
The following diagram illustrates the NFS cache problem and why jobs need to be wrapped with vov:


Figure 1.

Each UNIX host maintains its own local time and its own NFS cache, which holds the timestamps of files. A cache entry stays valid for a few seconds before the file information expires.

As long as everything runs on one machine, the file timestamps are correct for all processes on that machine. Problems may arise only when two processes run back-to-back on different machines.
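The effect can be illustrated with a toy model. The sketch below is not NFS client code: the `Server` and `Host` classes, the 3-second expiry, and the explicit `now` argument are all assumptions made for illustration. It only shows how a host that cached information about a file shortly before another host wrote it can keep seeing the old state until its cache entry expires.

```python
# Toy model of per-host attribute caching; not real NFS client code.

class Server:
    """Holds the authoritative modification time of each file."""
    def __init__(self):
        self.mtime = {}


class Host:
    """A client that caches file attributes for `ttl` seconds (assumed value)."""
    def __init__(self, server, ttl=3.0):
        self.server = server
        self.ttl = ttl
        self.cache = {}  # path -> (mtime or None, time the entry was cached)

    def stat(self, path, now):
        entry = self.cache.get(path)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # served from the cache, possibly stale
        mtime = self.server.mtime.get(path)  # None means "file not found"
        self.cache[path] = (mtime, now)
        return mtime

    def write(self, path, now):
        self.server.mtime[path] = now
        self.cache[path] = (now, now)  # the writer sees its own update


server = Server()
lnx11 = Host(server)
lnx22 = Host(server)

lnx22.stat("bb", now=0.0)                      # lnx22 caches "not found"
lnx11.write("bb", now=1.0)                     # lnx11 creates the file
seen_by_writer = lnx11.stat("bb", now=1.5)     # correct on the writing host
seen_too_early = lnx22.stat("bb", now=2.0)     # cache entry still valid: stale
seen_after_expiry = lnx22.stat("bb", now=4.0)  # entry expired: fresh data
```

In this model, the writing host always sees the new timestamp, while the other host sees the file only after its cached entry has expired.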

First, consider a flow with two wrapped jobs, "jobone" and "jobtwo", which run on lnx11 and lnx22 respectively. The vovtasker is just a remote execution process: it does not check any timestamps. It simply changes to the directory, sets the environment, and starts the command line (such as vov mytool).

When the vov wrapper wrapping jobone starts on lnx11:
  1. It checks the inputs of jobone
  2. It runs jobone
  3. After the process jobone completes, it rechecks the inputs of jobone (they should not have changed) and then checks the outputs (they must all exist, with timestamps later than the start of the job)

All of this happens on one machine (lnx11), so it is safe.
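The three steps can be sketched as follows. This is a minimal illustration and not the actual wrapper implementation: the function names `snapshot` and `verify` and the idea of returning a list of problem strings are assumptions made for the example.

```python
import os


def snapshot(paths):
    """Map each path to its modification time, or None if it does not exist."""
    result = {}
    for path in paths:
        try:
            result[path] = os.stat(path).st_mtime
        except FileNotFoundError:
            result[path] = None
    return result


def verify(inputs_before, inputs_after, outputs_after, job_start):
    """Return the problems found after the job; an empty list means success."""
    problems = []
    # Inputs should not have changed while the job was running.
    for path, mtime in inputs_before.items():
        if inputs_after.get(path) != mtime:
            problems.append(f"input changed: {path}")
    # Outputs must all exist, with timestamps later than the job start.
    for path, mtime in outputs_after.items():
        if mtime is None:
            problems.append(f"output missing: {path}")
        elif mtime <= job_start:
            problems.append(f"output not newer than job start: {path}")
    return problems
```

A caller would take `snapshot(inputs)` before running the job, then take snapshots of the inputs and outputs again afterwards and pass them to `verify` together with the job start time.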

Then the vov wrapper on lnx22 starts jobtwo:
  1. It checks the inputs of jobtwo
  2. It runs jobtwo
  3. After the process jobtwo completes, it rechecks the inputs of jobtwo (they should not have changed) and then checks the outputs (they must all exist, with timestamps later than the start of the job)

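If jobtwo reads files that jobone has just written on lnx11, the NFS cache on lnx22 may still hold out-of-date information about them, so the input check in step 1 can fail. When VOV_VW_NFS_PROTECTION is defined, the vov, vrt, and vw wrappers wait 10 seconds and retry, up to 6 times, before finally exiting with an error if the problem is still present. The resulting 60-second wait is sufficient for all the caches to synchronize over the network.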
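The wait-and-retry behavior can be sketched as a simple loop. This is an illustration under stated assumptions, not the wrappers' actual code: `check_with_retries` and its parameters are hypothetical names, and `check` stands for whatever timestamp comparison the wrapper performs. The defaults mirror the figures given above (6 retries, 10 seconds apart, 60 seconds in total).

```python
import time


def check_with_retries(check, retries=6, delay=10.0, sleep=time.sleep):
    """Call `check()` until it returns True, waiting `delay` seconds between tries.

    Returns True as soon as a check passes, or False once all retries fail.
    """
    if check():
        return True
    for _ in range(retries):
        sleep(delay)  # give the caches time to expire and refresh
        if check():
            return True
    return False


# Demonstration with a fake sleep that records the waits instead of blocking.
waits = []
attempts = iter([False, False, True])  # the third check succeeds
ok = check_with_retries(lambda: next(attempts), sleep=waits.append)

waits_on_failure = []
failed = check_with_retries(lambda: False, sleep=waits_on_failure.append)
```

In the demonstration, the first call succeeds after two 10-second waits, while the second exhausts all six retries and reports failure, at which point a wrapper would exit with an error.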

When a job is not wrapped (like jobthree on lnx07), the timestamps are checked by the vovserver itself. When the vovserver is on a different machine, it is exposed to NFS cache problems. To avoid them, always wrap your jobs with vov, vrt, or vw. FlowTracer cannot wrap all jobs by default because it does not know which dependency method the user requires, and because the user may want to run extremely short jobs on the same machine with no overhead penalty.