Rice University
This file describes in detail the checkpointing scheme that was implemented by K. Shriram and A. A. Schaffer. Checkpointing means periodically saving the state of a computation. The purpose of checkpointing is to be able to recover from a crash of the underlying computer that causes one of the FASTLINK programs to stop for a reason that has nothing to do with its computation. Two common causes for such crashes are power failures and lightning hits. Right now checkpointing works only for the sequential versions of FASTLINK on UNIX, and for MLINK and LINKMAP on VMS.
After seeing the checkpointing scheme in LODSCORE and ILINK for versions 2.0 and 2.1, several users who had suffered machine crashes during LINKMAP runs clamored for extending the scheme to the other two programs. As of version 2.2, all four programs have checkpointing and crash-recovery.
Through version 1.1, FASTLINK provided the same level of functionality as LINKAGE 5.1. Checkpointing adds new functionality, so we decided to write more detailed documentation about the checkpointing facility. Any questions, comments, or complaints should be directed to Alejandro Schaffer (schaffer@cs.rice.edu).
Frequent LINKAGE users have almost certainly had the computer crash during a long run, only to have to start the computation again. We have now included a "checkpointing" package in the code that occasionally saves the state of the computation, so that a crashed program can be restarted without much computation lost. The folklore wisdom seems to be that this form of augmentation to programs is the proper mechanism for recovering from crashes. This file briefly describes the checkpointing process and explains the files connected with our implementation.
There are standard packages that do checkpointing of programs for specific operating systems, but we wanted our code to be somewhat portable because LINKAGE is used on a variety of operating systems.
Unless otherwise specified, the descriptions that follow apply equally to all of ILINK, LODSCORE, LINKMAP, and MLINK. To distinguish the files belonging to each program, we use the names of the programs in the filenames. We denote this by the string "<>", which should be replaced by the name of the program in question. Thus, for instance, the filename `checkpoint.<>.bak' would denote `checkpoint.ILINK.bak' or `checkpoint.LODSCORE.bak', depending upon context.
Before getting into details, there are three VERY IMPORTANT cautions in using the FASTLINK crash-recovery scheme.
rm checkpoint* script* outf* main*
Note: extreme care should be taken when removing these files that you don't have other meaningful files in the same directory with any of these prefixes. If "rm" doesn't normally prompt you for each file before removing it, it is probably wiser to delete these files by hand.
The programs ILINK and LODSCORE perform checkpointing at two distinct types of locations. A checkpoint is created at the start of each iteration (in the function iterate()); it is also made at the beginning of the functions initialize(), outf(), firststep(), decreaset() and increaset(), and at the beginning of the loops in gforward() and gcentral(). We distinguish between these two types by the terms "iteration-" and "function-checkpoint", respectively; the latter term is used since the program proceeds to make one or more calls to the routine fun() shortly after the location of checkpointing.
In the case of LINKMAP a simple checkpoint is taken after each likelihood function evaluation. MLINK is the same except we do not checkpoint on the first function evaluation where the moving marker is unlinked to the others.
The files final.dat and stream.dat (if requested) primarily contain the output, so a checkpointing mechanism must take care to ensure the contents of these files are not altered in any way by the process. In ILINK and LODSCORE all output to these files takes place in the routine outf() (and in the routines it calls); hence, these files are checkpointed before entry into outf(). More details on this follow under the discussion of the actual files created. In MLINK and LINKMAP these files are updated after each function evaluation, so they have to be checkpointed as well.
It is important to ensure that none of these files is present at the start of a fresh run. However, do not delete any of them after a run has begun, and especially not when trying to recover from a previous run.
NOTE: The file protections set by the program may not be what you desire. These can be changed by altering the value of the variable CopyAppendPerms in the file checkpointdefs.h, where the value specified should be as given to the chmod(1) command. (The additional leading `0' is essential; it causes the value that follows to be treated as an octal constant, as required by chmod(1).)
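The effect of the leading `0' can be seen in a minimal sketch. The value 0644 below is only an example and is not necessarily the default in checkpointdefs.h; the names ExamplePerms, apply_example_perms, and decimal_644_as_mode are hypothetical and do not appear in FASTLINK.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Example value only; the actual default in checkpointdefs.h may differ. */
#define ExamplePerms 0644   /* octal: rw-r--r-- */

/* Hypothetical helper: apply the example permissions to a file.
   Returns 0 on success and -1 on failure, as chmod(2) does. */
int apply_example_perms(const char *path)
{
    return chmod(path, (mode_t) ExamplePerms);
}

/* Without the leading 0 the literal is decimal. Decimal 644 equals
   octal 01204, which is a quite different set of permission bits. */
int decimal_644_as_mode(void)
{
    return 644;
}
```

In other words, writing 644 instead of 0644 compiles without complaint but requests the wrong protections.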
checkpoint.<> text, binary
For ILINK and LODSCORE: This file is written at two types of places, namely an iteration- and a function-checkpoint. Only three parts of this file are in text mode; they are:
Following these are the bytes that constitute the actual values being stored; these are in an architecture-dependent binary format.
Finally, the end-marker provides us with a means of partially checking for the integrity of the data written in the checkpoint. For LINKMAP and MLINK only some counters indicating how many function evaluations are complete need to be stored in this file.
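The general idea of a text header, a binary payload, and a closing end-marker can be sketched as follows. This is an illustration under stated assumptions and not the FASTLINK file layout: the header fields, the marker text END_MARKER, and the functions write_checkpoint and read_checkpoint are all hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define END_MARKER "END-OF-CHECKPOINT"   /* hypothetical marker text */

/* Write a text header, n raw doubles, and a text end-marker.
   Returns 0 on success, -1 on any I/O error. */
int write_checkpoint(const char *path, int iteration,
                     const double *vals, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%d %lu\n", iteration, (unsigned long) n) > 0
          && fwrite(vals, sizeof(double), n, fp) == n
          && fprintf(fp, "\n%s\n", END_MARKER) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the file back. Succeeds only if the header parses, all n values
   are present, and the end-marker is intact. */
int read_checkpoint(const char *path, int *iteration,
                    double *vals, size_t max_n, size_t *n_out)
{
    char marker[64];
    unsigned long n;
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    int ok = fscanf(fp, "%d %lu", iteration, &n) == 2
          && fgetc(fp) == '\n'              /* newline that ends the header */
          && n <= max_n
          && fread(vals, sizeof(double), n, fp) == n
          && fscanf(fp, "%63s", marker) == 1
          && strcmp(marker, END_MARKER) == 0;
    fclose(fp);
    if (ok)
        *n_out = (size_t) n;
    return ok ? 0 : -1;
}
```

A file that was cut short by a crash fails the marker comparison, which is the partial integrity check described above. Because the doubles are written in raw form, such a file can only be read back on a machine with the same binary representation.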
checkpoint.<>.bak text, binary
When a checkpoint is to be written and a checkpoint file is already found, the existing file is moved to this backup name and the new one is written in its place. The main purpose of doing this is to increase security against crashes: should the crash have damaged the checkpoint file but have left the backup untarnished, the backup may be copied into the checkpoint and computation can be resumed, even if from a slightly earlier stage in the run. The format of this file is the same as that of the checkpoint file, which is copied into the backup without modification.
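The move-then-write sequence described above might look like the following sketch. The function rotate_and_write is hypothetical and uses only standard C library calls; the actual FASTLINK routines may differ in detail.

```c
#include <stdio.h>

/* Hypothetical helper: if a checkpoint already exists, move it to the
   backup name, then write the new contents under the checkpoint name.
   Returns 0 on success, -1 on failure. */
int rotate_and_write(const char *ckpt, const char *backup,
                     const char *contents)
{
    FILE *fp = fopen(ckpt, "rb");
    if (fp != NULL) {
        fclose(fp);
        /* On some systems rename() fails if the target exists, so the
           old backup is removed first. The existing checkpoint is still
           intact at this point. */
        remove(backup);
        if (rename(ckpt, backup) != 0)
            return -1;
    }
    fp = fopen(ckpt, "wb");
    if (fp == NULL)
        return -1;
    int ok = fputs(contents, fp) >= 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

If the program dies while the new checkpoint is being written, the previous state is still available under the backup name.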
outf.LODSCORE.stream.dat text
outf.ILINK.stream.dat text
main.LINKMAP.stream.dat text
main.MLINK.stream.dat text
outf.ILINK.final.dat text
main.LINKMAP.final.dat text
main.MLINK.final.dat text
outf.LODSCORE.recfile.dat text
These files are created by the subroutine outf() or main(). Their purpose is to maintain copies of the files stream.dat and final.dat (for ILINK, LINKMAP, or MLINK) or recfile.dat (for LODSCORE), respectively, so that if recovery needs to take place after these files have been written to, the two files can be restored to the state they had.
script.<>.final.out text
script.<>.stream.out text
Since the standard scripts being used delete the files final.out and stream.out at the start of execution, the program makes a copy of the current state of these files into the names listed. Thus, when recovering in the midst of a script, the files can be restored to their state when the programs were last entered.
main.LODSCORE.stream.dat text
main.LODSCORE.recfile.dat text
Since a crash can occur in the middle of an iteration in LODSCORE and the output of the previous call to outf() would then be lost, these files are created at the start of the loop in main() so as to preserve the old output (which hasn't yet been appended to final.out and stream.out).
When the checkpoint cannot be recovered accurately, the program checks to see whether the backup exists. Depending upon its presence (but not upon its integrity), one of the two following files is displayed:
recoveryFoundText text
A backup has been found.
recoveryNotFoundText text
No backup was found.
In either case, the user is advised of the circumstance, of a possible cause for it, and of what corrective action might be taken to repair the situation as best as possible.
This section applies if you use script-level checkpointing and wish to modify the scripts in the region surrounding the calls to ILINK, LODSCORE, MLINK, or LINKMAP, or wish to affect operations done to the files final.out and stream.out. We presume that the user is using shell scripts made with the auxiliary program lcp that comes with LINKAGE. It would be impossible to make a script-level checkpointing scheme that could handle arbitrary scripts. We also assume that the user puts output in final.out and stream.out, using the default options of lcp.
The "standard" scripts for which we support script-level checkpointing affect final.out (and stream.out) on each run as follows for each ILINK run (and similarly for LODSCORE, MLINK, and LINKMAP):
lsp [...]
if [ $? = '0' -o $? = '1' ]
then
    cat lsp.log >> final.out
    cat lsp.stm >> stream.out
    unknown
    if [ $? = '0' ]
    then
        ilink
        if [ $? = '0' ]
        then
            cat final.dat >> final.out
            cat stream.dat >> stream.out
        fi
    fi
fi
To ensure that final.out is in the same state after our program has
finished execution as it would be after this piece of script code
has run, we have the following code toward the end of ILINK:
copyFile ( "final.out" , ScriptILINKFinalOut ) ;
appendFile ( "final.dat" , ScriptILINKFinalOut ) ;
if ( dostream )
{
    copyFile ( "stream.out" , ScriptILINKStreamOut ) ;
    appendFile ( "stream.dat" , ScriptILINKStreamOut ) ;
}
which simulates the operation of the script. This is necessary since,
at the stage where this code is run, the script-level checkpoint
routine assumes that the run of ILINK has completed successfully, so
that this entire invocation of ILINK will be ignored, and the next
invocation will copy final.out and stream.out from the files named by
the #define'd names above.
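For readers who want to see what a copy followed by an append amounts to, here is a minimal sketch. The functions copy_file_sketch and append_file_sketch are hypothetical stand-ins written for this illustration; they take their arguments in the same (source, destination) order as the calls shown above, but they are not the FASTLINK implementations of copyFile() and appendFile(), whose error handling may differ.

```c
#include <stdio.h>

/* Transfer all bytes of `from` to `to`, which is opened with `mode`
   ("wb" to overwrite, "ab" to append). Returns 0 on success, -1 on error. */
static int transfer(const char *from, const char *to, const char *mode)
{
    char buf[4096];
    size_t got;
    FILE *in = fopen(from, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(to, mode);
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    int ok = 1;
    while ((got = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, got, out) != got) {
            ok = 0;
            break;
        }
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Overwrite `to` with the contents of `from`. */
int copy_file_sketch(const char *from, const char *to)
{
    return transfer(from, to, "wb");
}

/* Add the contents of `from` to the end of `to`. */
int append_file_sketch(const char *from, const char *to)
{
    return transfer(from, to, "ab");
}
```

Calling copy_file_sketch("final.out", saved) and then append_file_sketch("final.dat", saved) leaves in saved exactly what `cat final.dat >> final.out' would leave in final.out. In this sketch a missing source file is reported as an error.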
Hence, modifying the scripts in the light of script-level checkpointing requires one to study carefully the operation of the main programs, of the scripts, and of the program ckpt. In general, it is necessary to mimic in the program that which would be done in the script, so that after recovery it is impossible to tell whether the script was ever stopped in the first place. However, these mimicking operations must be carefully placed, for if they are placed before the script-level checkpoint file is written to, then the operations would be performed one extra time, which is undesirable.
% ckpt lodscore aLodscoreScript
or
% ckpt ilink anIlinkScript itsArgument
or
% ckpt linkmap aLinkmapScript itsArgument
or
% ckpt mlink anMlinkScript itsArgument
where the first parameter to ckpt tells it what kind of script it is
going to run. The second parameter is the name of the actual script.
If there are additional parameters for the script itself, these can be
specified after the name of the script, as in the second example
(where "itsArgument" is provided). The second run would, hence, be
equivalent to running
% anIlinkScript itsArgument
but with the script-level checkpointing facility in action.
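The way a wrapper of this kind can turn its trailing parameters into a command and run it is sketched below. This is an illustration only, assuming a POSIX system and assuming the script is launched through the C library's system() call; the functions build_command and run_command are hypothetical and are not taken from ckpt.c.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Join argv[2..argc-1] with single spaces into buf. argv[1] is the
   script type and is skipped. Returns 0 on success, -1 if no script
   name was given or buf is too small. Arguments are not quoted, so an
   argument containing spaces would be split by the shell. */
int build_command(int argc, char **argv, char *buf, size_t size)
{
    size_t used = 0;
    int i;
    if (argc < 3 || size == 0)
        return -1;
    buf[0] = '\0';
    for (i = 2; i < argc; i++) {
        size_t need = strlen(argv[i]) + (i > 2 ? 1 : 0);
        if (used + need >= size)
            return -1;
        if (i > 2)
            strcat(buf, " ");
        strcat(buf, argv[i]);
        used += need;
    }
    return 0;
}

/* Run the command through the shell. Returns its exit status, or -1 if
   it could not be started or did not exit normally (for example, if it
   was killed by a signal). */
int run_command(const char *cmd)
{
    int status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With the parameters of the second example, build_command() produces the string "anIlinkScript itsArgument", which is the command that would otherwise be typed directly.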
The code for ckpt is in the file ckpt.c. To make an executable version, run the command:
make ckpt
Unfortunately, detecting that the invoked shell halted prematurely depends on the value returned by the system() call. This may not behave as desired on all operating systems and architectures, so halting the shell is an unreliable way of stopping execution, should stopping be desired. It is recommended that the user instead do the following: