From: softlib.cs.rice.edu
Last mod: June 8, 1995

Checkpointing in FASTLINK

by K. Shriram and A. A. Schaffer

Rice University


This README file is meant to accompany version 2.3P and beyond of FASTLINK. See the top-level README file for a roadmap to all the documentation.

This file describes in detail the checkpointing scheme that was implemented by K. Shriram and A. A. Schaffer. Checkpointing means periodically saving the state of a computation. The purpose of checkpointing is to be able to recover from a crash of the underlying computer that causes one of the FASTLINK programs to stop for a reason that has nothing to do with its computation. Two common causes for such crashes are power failures and lightning hits. Right now checkpointing works only for the sequential versions of FASTLINK on UNIX, and for MLINK and LINKMAP on VMS.

The scheme is described in the paper:

A. A. Schaffer, S. K. Gupta, K. Shriram, and R. W. Cottingham Jr., Avoiding Recomputation in Genetic Linkage Analysis, Human Heredity 44(1994), pp. 225-237.

This paper can be found in paper2.ps that comes with the FASTLINK distribution. At the time the paper was written, the checkpointing scheme had been implemented only in LODSCORE and ILINK; these are the two difficult cases for checkpointing and the programs where it is most needed.

After seeing the checkpointing scheme in LODSCORE and ILINK for versions 2.0 and 2.1, several users who had suffered machine crashes during LINKMAP runs clamored for extending the scheme to the other two programs. As of version 2.2, all four programs have checkpointing and crash-recovery.

Through version 1.1, FASTLINK provided the same level of functionality as LINKAGE 5.1. Checkpointing adds new functionality, so we decided to write more detailed documentation about the checkpointing facility. Any questions, comments, or complaints should be directed to Alejandro Schaffer (schaffer@cs.rice.edu).

Frequent LINKAGE users almost certainly have had the computer crash during a long run, only to have to start the computation again. We have now included a "checkpointing" package in the code that occasionally saves the state of the computation, so that a crashed program can be restarted without much computation lost. The folklore wisdom seems to be that this form of augmentation to programs is the proper mechanism for recovering from crashes. This file briefly describes the checkpointing process and explains the files connected with our implementation.

There are standard packages that do checkpointing of programs for specific operating systems, but we wanted our code to be somewhat portable because LINKAGE is used on a variety of operating systems.

Unless otherwise specified, the descriptions that follow apply equally to all of ILINK, LODSCORE, LINKMAP, and MLINK. In particular, to distinguish among the programs, we use their names in the filenames. We denote this by the string "<>", which should be replaced by the name of the program in question. Thus, for instance, the filename `checkpoint<>.bak' would denote `checkpointILINK.bak' or `checkpointLODSCORE.bak', depending upon context.

Before getting into details, there are three VERY IMPORTANT cautions in using the FASTLINK crash-recovery scheme.

  1. After a crash occurs, if you run the program in the same directory where it was running before, the program will assume that you want to restart the crashed run. The only way to have the program start a different run is to delete all the files created by the checkpointing scheme. The files created during checkpointing will have names with one of the following prefixes: checkpoint, script, outf, main. To remove these files, you can use the command:
       rm checkpoint* script* outf* main*
    
    Note: extreme care should be taken when removing these files that you don't have other meaningful files in the same directory with any of these prefixes. If "rm" doesn't normally prompt you for each file before removing it, it is probably wiser to delete these files by hand.

  2. The time to save the state to a file is not zero. Therefore, if a crash occurs while the state is being saved, the program may be a little confused on restart. In particular, it may unnecessarily redo one or two likelihood function evaluations. When this happens with LINKMAP or MLINK, it means that duplicate data will show up in the output file because they write out their output after each likelihood function evaluation.

  3. The checkpointing scheme has been extensively tested with simulated crashes, but we did not induce a crash of the whole system in testing. Furthermore, system-wide crashes can have bizarre and unimaginable side-effects. Therefore, user feedback based on what happened during real crashes and real runs will be invaluable in making the checkpointing system more robust.

    The Process

    Most of the discussion below focuses on the programs ILINK and LODSCORE. At the end we explain the much simpler method of checkpointing used in LINKMAP and MLINK.

    The programs ILINK and LODSCORE perform checkpointing at two distinct types of locations. A checkpoint is created at the start of each iteration (in the function iterate()); it is also made at the beginning of the functions initialize(), outf(), firststep(), decreaset() and increaset(), and at the beginning of the loops in gforward() and gcentral(). We distinguish between these two types by the terms "iteration-" and "function-checkpoint", respectively; the latter term is used since the program proceeds to make one or more calls to the routine fun() shortly after the location of checkpointing.

    In the case of LINKMAP a simple checkpoint is taken after each likelihood function evaluation. MLINK is the same except we do not checkpoint on the first function evaluation where the moving marker is unlinked to the others.

    The files final.dat and stream.dat (if requested) primarily contain the output, so a checkpointing mechanism must take care to ensure the contents of these files are not altered in any way by the process. In ILINK and LODSCORE all output to these files takes place in the routine outf() (and in the routines it calls); hence, these files are checkpointed before entry into outf(). More details on this follow under the discussion of the actual files created. In MLINK and LINKMAP these files are updated after each function evaluation, so they have to be checkpointed as well.


    The Files

    The following is a list of the files created for the purposes of checkpointing. All of these files are placed in the working directory of the current run of the program.

    It is important to ensure that none of these files are present at the start of a fresh run; however, do not delete any of them after a run has begun, and especially not when trying to recover from a previous run.

    NOTE: The file protections set by the program may not be what you desire. They can be changed by altering the value of the variable CopyAppendPerms in the file checkpointdefs.h; the value should be specified as it would be given to the chmod(1) command. (The additional leading `0' is essential; it causes the value that follows to be treated as an octal constant, as required by chmod(1).)
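
    The effect of the leading `0' can be seen in a small sketch. It assumes a POSIX system; the names ExamplePerms, WrongPerms, and applyExamplePerms() are hypothetical and only stand in for the CopyAppendPerms setting, whose actual default value is not given here.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical stand-in for CopyAppendPerms: rw-r--r-- written in octal. */
#define ExamplePerms 0644

/* Without the leading 0 the same digits are read as decimal 644,
   which is octal 01204 and selects quite different permission bits. */
#define WrongPerms 644

/* Apply the example permissions to a file.
   Returns 0 on success and -1 on error, as chmod(2) does. */
int applyExamplePerms(const char *path)
{
  return chmod(path, (mode_t) ExamplePerms);
}
```

    With the octal constant, the file ends up readable by everyone and writable only by its owner, which is what "chmod 644" would give from the shell.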

    checkpoint.<>                                             text, binary
    

    For ILINK and LODSCORE: This file is written at two types of places, namely an iteration- and a function-checkpoint. Only three parts of this file are in text mode; they are:

    Following these are the bytes that constitute the actual values being stored; these are in an architecture-dependent binary format.

    Finally, the end-marker provides us with a means of partially checking for the integrity of the data written in the checkpoint. For LINKMAP and MLINK only some counters indicating how many function evaluations are complete need to be stored in this file.
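
    The idea of an end-marker can be illustrated with a minimal sketch. It is not the FASTLINK code: the marker text, the function names writeState() and readState(), and the file layout (raw bytes followed directly by the marker) are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical end-marker; the marker actually used is not specified here. */
#define EndMarker "END-OF-CHECKPOINT\n"

/* Write nbytes of state followed by the end-marker. Returns 0 on success. */
int writeState(const char *path, const void *state, size_t nbytes)
{
  FILE *fp = fopen(path, "wb");
  int ok;

  if (fp == NULL)
    return -1;
  ok = fwrite(state, 1, nbytes, fp) == nbytes
       && fputs(EndMarker, fp) != EOF;
  if (fclose(fp) != 0)
    ok = 0;
  return ok ? 0 : -1;
}

/* Read nbytes of state and accept it only if the end-marker follows.
   Returns 0 if the file looks complete, -1 if it is short or damaged. */
int readState(const char *path, void *state, size_t nbytes)
{
  char marker[sizeof EndMarker];
  size_t mlen = strlen(EndMarker);
  FILE *fp = fopen(path, "rb");
  int ok;

  if (fp == NULL)
    return -1;
  ok = fread(state, 1, nbytes, fp) == nbytes
       && fread(marker, 1, mlen, fp) == mlen
       && memcmp(marker, EndMarker, mlen) == 0;
  fclose(fp);
  return ok ? 0 : -1;
}
```

    A file that was cut short by a crash during the write fails the check in readState(), because the marker is missing. The check is only partial: it does not detect damage to the bytes that precede the marker.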

    checkpoint.<>.bak                                         text, binary
    

    When a checkpoint is to be written and a checkpoint file is already found, the existing file is moved to this backup name and the new one is written in its place. The main purpose of doing this is to increase security against crashes: should the crash have damaged the checkpoint file but have left the backup untarnished, the backup may be copied into the checkpoint and computation can be resumed, even if from a slightly earlier stage in the run. The format of this file is the same as that of the checkpoint file, which is copied into the backup without modification.
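
    The move-then-write step described above might look roughly as follows. This is a sketch under stated assumptions, not the code in FASTLINK: the function name rotateAndWrite() is hypothetical, and the caller supplies both file names.

```c
#include <stdio.h>

/* Move an existing checkpoint to the backup name, then write the new one.
   Returns 0 on success and -1 on any error. */
int rotateAndWrite(const char *ckpt, const char *bak,
                   const void *state, size_t nbytes)
{
  FILE *fp = fopen(ckpt, "rb");

  if (fp != NULL) {            /* a checkpoint already exists */
    fclose(fp);
    remove(bak);               /* rename() may refuse to overwrite on some systems */
    if (rename(ckpt, bak) != 0)
      return -1;
  }

  fp = fopen(ckpt, "wb");
  if (fp == NULL)
    return -1;
  if (fwrite(state, 1, nbytes, fp) != nbytes) {
    fclose(fp);
    return -1;
  }
  return fclose(fp) == 0 ? 0 : -1;
}
```

    At every moment at least one complete file exists under one of the two names: the old checkpoint is still in place while the old backup is removed, and the backup is in place while the new checkpoint is being written.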

    outf.LODSCORE.stream.dat                                          text
    outf.ILINK.stream.dat                                             text
    main.LINKMAP.stream.dat                                           text
    main.MLINK.stream.dat                                             text
    outf.ILINK.final.dat                                              text
    main.LINKMAP.final.dat                                            text
    main.MLINK.final.dat                                              text
    outf.LODSCORE.recfile.dat                                         text
    
    These files are created by the subroutine outf() or main(). Their purpose is to maintain copies of stream.dat together with final.dat (for ILINK, LINKMAP, or MLINK) or recfile.dat (for LODSCORE), so that if recovery needs to take place after these files have been written to, they can be restored to the state they had.

    script.<>.final.out                                               text
    script.<>.stream.out                                              text
    
    Since the standard scripts being used delete the files final.out and stream.out at the start of execution, the program makes a copy of the current state of these files into the names listed. Thus, when recovering in the midst of a script, the files can be restored to their state when the programs were last entered.

    main.LODSCORE.stream.dat                                          text
    main.LODSCORE.recfile.dat                                         text
    
    Since a crash can occur in the middle of an iteration in LODSCORE and the output of the previous call to outf() would then be lost, these files are created at the start of the loop in main() so as to preserve the old output (which hasn't yet been appended to final.out and stream.out).

    When the checkpoint cannot be recovered accurately, the program checks to see whether the backup exists. Depending upon its presence (but not upon its integrity), one of the two following files is displayed:

    recoveryFoundText                                                  text
    
    A backup has been found.

    recoveryNotFoundText                                               text
    
    No backup was found.

    In either case, the user is advised of the circumstance, of a possible cause for it, and of what corrective action might be taken to repair the situation as best as possible.


    Modifying Scripts and Checkpointing

    Our experience shows that some users request multiple runs of a FASTLINK program with one shell script. As a consequence a crash may occur after some (but not all) of the requested runs are complete. When this happens, it would be nice not to lose the results of the completed runs. A user who restarts the crashed script would not like the runs that were completed previously to be redone. We have made a primitive facility to do this type of checkpointing, which we call "script-level checkpointing". However, for users who want to be safe we recommend doing only one run per shell script.

    This section applies if you use script-level checkpointing, and wish to modify the scripts in the region surrounding the calls to ILINK, LODSCORE, MLINK, or LINKMAP, or wish to affect operations done to the files final.out and stream.out. We presume that the user is using shell scripts made with the auxiliary program lcp that comes with LINKAGE. It would be impossible to make a script-level checkpointing scheme that could handle arbitrary scripts. We also assume that the user puts output in final.out and stream.out, using the default options of lcp.

    The "standard" scripts for which we support script-level checkpointing update final.out (and stream.out) as follows on each ILINK run (and similarly for LODSCORE, MLINK, and LINKMAP):

        lsp [...]
        if [ $? = '0' -o $? = '1' ]
        then
          cat lsp.log >> final.out
          cat lsp.stm >> stream.out
          unknown
          if [ $? = '0' ]
          then
            ilink
            if [ $? = '0' ]
            then
              cat final.dat  >> final.out
              cat stream.dat >> stream.out
            fi
          fi
        fi
    
    To ensure that final.out is in the same state after our program has finished execution as it would be after this piece of script code has run, we have the following code toward the end of ILINK:
        copyFile ( "final.out" , ScriptILINKFinalOut ) ;
        appendFile ( "final.dat" , ScriptILINKFinalOut ) ;
        
        if ( dostream )
        {
          copyFile ( "stream.out" , ScriptILINKStreamOut ) ;
          appendFile ( "stream.dat" , ScriptILINKStreamOut ) ;
        }
    
    which simulates the operation of the script. This is necessary since, at the stage where this code is run, the script-level checkpoint routine assumes that the run of ILINK has completed successfully, so that this entire invocation of ILINK will be ignored, and the next invocation will copy final.out and stream.out from the files named by the #define'd names above.
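
    For readers who want to see what such copy and append helpers involve, here is a minimal sketch. It is not the FASTLINK source: the names transfer(), copyFileSketch(), and appendFileSketch() are hypothetical, and the (source, destination) argument order is an assumption inferred from the calls shown above.

```c
#include <stdio.h>

/* Open dst in the given mode and copy all of src into it.
   Returns 0 on success and -1 on any error. */
static int transfer(const char *src, const char *dst, const char *mode)
{
  char buf[4096];
  size_t n;
  int ok = 1;
  FILE *in = fopen(src, "rb");
  FILE *out;

  if (in == NULL)
    return -1;
  out = fopen(dst, mode);
  if (out == NULL) {
    fclose(in);
    return -1;
  }
  while (ok && (n = fread(buf, 1, sizeof buf, in)) > 0)
    ok = (fwrite(buf, 1, n, out) == n);
  if (ferror(in))
    ok = 0;
  fclose(in);
  if (fclose(out) != 0)
    ok = 0;
  return ok ? 0 : -1;
}

/* Replace dst with a copy of src. */
int copyFileSketch(const char *src, const char *dst)
{
  return transfer(src, dst, "wb");
}

/* Add the contents of src to the end of dst, like "cat src >> dst". */
int appendFileSketch(const char *src, const char *dst)
{
  return transfer(src, dst, "ab");
}
```

    Calling copyFileSketch("final.out", saved) followed by appendFileSketch("final.dat", saved) leaves in the saved file exactly what "cat final.dat >> final.out" would have left in final.out, which is the property the script-level checkpoint relies on.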

    Hence, modifying the scripts in light of script-level checkpointing requires careful study of the operation of the main programs, the scripts, and the program ckpt. In general, the program must mimic what the script would do, so that after recovery it is impossible to tell whether the script was ever stopped. However, these mimicking operations must be placed carefully: if they are placed before the script-level checkpoint file is written, the operations would be performed one extra time, which is undesirable.


    Using the Script-Level Checkpointing Facility

    The program ckpt implements the script-level checkpointing facility (with cooperation from ilink and lodscore, as appropriate). Its primary task is to accept the name of a script to be run, and a specification of which program (ILINK, LODSCORE, LINKMAP, or MLINK) the script is for. A typical invocation might look like this (we use `%' to denote the user's prompt):
        % ckpt lodscore aLodscoreScript
    
    or
        % ckpt ilink anIlinkScript itsArgument
    
    or
        % ckpt linkmap aLinkmapScript itsArgument
    
    or
        % ckpt mlink  anMlinkScript itsArgument
    
    where the first parameter to ckpt tells it what kind of script it is going to run. The second parameter is the name of the actual script. If there are additional parameters for the script itself, these can be specified after the name of the script, as in the second example (where "itsArgument" is provided). The second run would, hence, be equivalent to running
        % anIlinkScript itsArgument
    
    but with the script-level checkpointing facility in action.

    The code for ckpt is in the file ckpt.c. To make an executable version, run the command:

        make ckpt
    

    Important Caution on Breaking a ckpt Run

    The ckpt program executes a system(3) call to invoke a shell in which to run the named script (with its arguments, if any). Hence, if the user decides to abort execution by hitting, say, Control-C (^C), this will certainly stop the invoked shell, but will not necessarily abort the calling process (i.e., ckpt). This has the following deleterious effect: when control returns to ckpt, if ckpt cannot tell that the invoked shell was halted prematurely, it erases its data file, so the next time it is run, it will assume that the previous run exited normally. This is clearly not the desired effect.

    Unfortunately, detecting premature halting of the invoked shell depends on the value returned by the system() call. This may not work as desired on all operating systems and architectures, which makes Control-C an unreliable way of stopping execution. We recommend that the user instead do the following:

    1. Suspend the executing process(es), typically by hitting a key like Control-Z (^Z).
    2. Kill the suspended process, usually by typing a command such as "kill %+".
    Again, this is not guaranteed to succeed, but should work on most systems. Note, of course, that it requires a shell that supports job control and was built with that feature enabled.
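
    The dependence on the value returned by system() can be made concrete with a short sketch. It assumes a POSIX system that provides the <sys/wait.h> status macros; the function name runAndClassify() and its return codes are hypothetical and are not taken from ckpt.c.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Run a command through system() and classify how the shell ended.
   Returns 0 for a normal exit with status 0, 1 for a normal exit with a
   nonzero status, 2 if the shell was killed by a signal, -1 otherwise. */
int runAndClassify(const char *command)
{
  int status = system(command);

  if (status == -1)
    return -1;                 /* the shell could not be started */
  if (WIFSIGNALED(status))
    return 2;                  /* halted prematurely by a signal */
  if (WIFEXITED(status))
    return WEXITSTATUS(status) == 0 ? 0 : 1;
  return -1;
}
```

    Only the case that returns 2 clearly identifies an interrupted run. A shell that catches the interrupt and then exits normally with a nonzero status is reported as 1, which a caller cannot tell apart from a script that failed on its own; this is one reason the suspend-and-kill procedure above is recommended instead.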