USER'S GUIDE TO THE TYPENEXT PROGRAM


X. Xie, J. Terwilliger, and J. Ott
New York State Psychiatric Institute,
and Columbia University, New York
Copyright (C) 1992-1993 Jurg Ott

25 December 1993

Contents

INTRODUCTION

INSTALLATION

OVERVIEW OF PROGRAM USAGE

AN EXAMPLE

  1. Create a pedigree file
  2. Create a simulation parameter file
  3. Create an analysis parameter file
  4. Run TYPENEXT
  5. Analyze the output file

REFERENCES


INTRODUCTION

TYPENEXT is an application of computer simulation which can assist an investigator in selecting which untyped individuals in a pedigree would contribute most to the expected lod score if they were to be typed (Ott et al. 1992).

WARNING: WE ASSUME YOU ARE FAMILIAR WITH THE LINKAGE PROGRAMS, ESPECIALLY THE SLINK PROGRAM PACKAGE. YOU CANNOT USE THIS PROGRAM CORRECTLY IF YOU ARE UNFAMILIAR WITH THE LINKAGE PROGRAMS. SEE LINKAGE ANALYSIS PACKAGE, USER'S GUIDE, AND THE SLINK DOCUMENTATION FOR MORE INFORMATION.

The power to detect linkage in a pedigree depends on many factors, such as pedigree structure, number of affected individuals, marker informativity and availability of family members for genotyping. The usefulness of typing further available family members depends not only on pedigree structure but also on which other individuals are already typed. When faced with a complex pedigree structure with many untyped individuals, on whom information could be collected, one might wish to know which individuals would be the most useful to type (i.e., which would have the greatest effect on the lod score). The computer program TYPENEXT was written for this purpose. TYPENEXT can determine how much additional information would be gained by typing each presently untyped individual in the pedigree.

The TYPENEXT program is written in Prospero Pascal and is available from the authors at no charge in versions running under DOS or OS/2. For a given pedigree, it uses the SLINK program (Weeks et al. 1990) to simulate marker phenotypes for the untyped individuals, conditional on their known disease phenotypes, the known disease and marker phenotypes of the other pedigree members, and the recombination fraction(s). Let n be the number of individuals available for typing, and let i be the number of individuals to be typed simultaneously. Thus i=1 means that only one additional individual is to be typed, and i=2 that two additional individuals are to be typed. If an investigator wants to find out how much additional information can be gained by typing a range of k1 to k2 additional individuals, the program will compute the increase in lod score expected after typing any subset of i individuals, k1 ≤ i ≤ k2. For example, assume an investigator has a family with 17 members, six of whom are untyped, and wants to type another two or three of the six untyped individuals. With the proper interactive input to TYPENEXT, the program's output will tell how much information would be gained by typing any set of 2 and any set of 3 out of the 6 untyped individuals. The output file produced by TYPENEXT also contains information about how the data were simulated, and the lod scores in the original pedigree.
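
The number of sets that must be evaluated grows quickly with n, k1 and k2. The following Python sketch is an illustration only and is not part of TYPENEXT; the function names are hypothetical. It counts and enumerates the candidate sets for the example just given (six untyped individuals, of whom two or three are to be typed).

```python
from itertools import combinations
from math import comb


def count_sets(n, k1, k2):
    """Number of distinct sets of i individuals out of n, for k1 <= i <= k2."""
    return sum(comb(n, i) for i in range(k1, k2 + 1))


def enumerate_sets(candidates, k1, k2):
    """Yield every subset of the candidates that has between k1 and k2 members."""
    for i in range(k1, k2 + 1):
        yield from combinations(candidates, i)


# Six untyped individuals, two or three of whom are to be typed.
print(count_sets(6, 2, 3))  # 15 pairs + 20 triples = 35
```

Each of these sets requires its own analysis of the simulated replicates, so a narrow range of k1 to k2 keeps run times manageable.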

INSTALLATION

The TYPENEXT program calls the simulation program SLINK, the UNKNOWN program, and the analysis programs LSIMTN and MSIMTN, which are modified versions of LSIM and MSIM (Weeks et al. 1990). We recommend that these five program files be placed in one directory, for example, C:\TN. You can then access these programs either by making C:\TN the current directory or by adding C:\TN to your path.

OVERVIEW OF PROGRAM USAGE

The TYPENEXT program requires four input files: a pedigree file with an additional code column, a parameter file for use by both SLINK and MSIMTN, a parameter file for LSIMTN, and a threshold file for MSIMTN and LSIMTN. It produces one output file, TYPENEXT.DAT. No SLINKIN.DAT file is required because TYPENEXT will create that file. To use the program, you must do the following:

  1. Create a pedigree file (SIMPED.DAT) defining the pedigree structure, phenotypes and availability codes (see SLINK documentation for code definitions). It is convenient to prepare a pedigree input file, add the availability codes, and run the MAKEPED program to create SIMPED.DAT (see SLINK manual).

  2. Create a simulation parameter file (SIMDATA.DAT) defining the locus systems and the "true" recombination fractions, i.e., the ones used to simulate (generate) the data. This file must be in standard MLINK format and may be produced by PREPLINK.

  3. Create an analysis parameter file (DATAIN.DAT) defining how the simulated data should be analyzed. This file must be in standard LINKMAP format and may also be produced by PREPLINK.

  4. Create a file called LIMIT.DAT holding lod score limits (thresholds), e.g., 0.5 1 2 (on the same line or on different lines). If the file is empty, the default values 1, 2, and 3 will be used.

  5. Run TYPENEXT.

  6. Analyze the output file (TYPENEXT.DAT).
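
The LIMIT.DAT layout described in step 4 can be illustrated with a short Python sketch. This is not TYPENEXT's own reader; the function name is hypothetical and the code only mirrors the rules stated above (values separated by blanks or line breaks, defaults 1, 2 and 3 for an empty file).

```python
DEFAULT_LIMITS = [1.0, 2.0, 3.0]  # defaults stated in the guide for an empty file


def parse_limits(text):
    """Parse whitespace-separated lod score thresholds; use defaults if none."""
    values = [float(token) for token in text.split()]
    return values if values else list(DEFAULT_LIMITS)


print(parse_limits("0.5 1 2"))    # thresholds on one line
print(parse_limits("0.5\n1\n2"))  # thresholds on separate lines
print(parse_limits(""))           # empty file, so the defaults apply
```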

AN EXAMPLE

The example pedigree file contains one pedigree from Sherrington et al. (1988) with marker locus 599Ha, the schizophrenia locus, and marker locus 153Ra, in that order, with 8 cM between 599Ha and the disease locus, and 5.7 cM between the disease locus and 153Ra. For the purposes of this exercise, we assume that among the seventeen individuals, seven are untyped at the marker loci while the remaining individuals have known marker genotypes (previously assigned by us at random).

                        [1]─┬─(2)
                            │  a
                         23 │
                         22 │
 ┌────┬────┬────┬────┬────┬─┴──┬─────┬────────────┬─────┐
(3)  (4)  [5]  (6)  [7]  [8]  [9]  (10)─┬─[11]  [12]  [13]
 a         a              a    a     a  │         a
      33   23   33        22   23       │         23    23
      12   22   12        22   22       │         22    22
                                        │
                              ┌─────┬───┴──┬─────┐
                             (14)  [15]  [16]  (17)
 a = affected                        a     a
                              33    22
                              22    22
  1. Create a pedigree file defining the pedigree structure, phenotypes and availability codes. Its default name for the TYPENEXT program is SIMPED.DAT, but any other name may be chosen. The file corresponding to the family shown above is given below.

       1   1   0   0   3   0   0 1  1  2 3   2 3   2 2  2
       1   2   0   0   3   0   0 2  0  0 0   2 2   0 0  0
       1   3   1   2   0   4   4 2  0  0 0   2 2   0 0  2
       1   4   1   2   0   5   5 2  0  3 3   2 1   1 2  2
       1   5   1   2   0   6   6 1  0  2 3   2 2   2 2  2
       1   6   1   2   0   7   7 2  0  3 3   2 1   1 2  2
       1   7   1   2   0   8   8 1  0  0 0   2 1   0 0  2
       1   8   1   2   0   9   9 1  0  2 2   2 2   2 2  2
       1   9   1   2   0  10  10 1  0  2 3   2 2   2 2  2
       1  10   1   2  14  12  12 2  0  0 0   2 2   0 0  2
       1  11   0   0  14   0   0 1  0  0 0   2 3   0 0  2
       1  12   1   2   0  13  13 1  0  2 3   2 2   2 2  2
       1  13   1   2   0   0   0 2  0  2 3   2 1   2 2  2
       1  14  11  10   0  15  15 2  0  3 3   2 1   2 2  2
       1  15  11  10   0  16  16 1  0  2 2   2 2   2 2  2
       1  16  11  10   0  17  17 1  0  0 0   2 2   0 0  2
       1  17  11  10   0   0   0 2  0  0 0   2 1   0 0  2
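
As a cross-check on a file like the one above, the following Python sketch lists the individuals who are untyped at marker 1 and those among them who remain candidates for typing. It is an illustration, not part of TYPENEXT. The column interpretation is an assumption read off this example: field 2 is the individual ID, fields 10 and 11 are the alleles at the first marker, and the last field is the availability code, with 0 taken to mean unavailable.

```python
SIMPED = """\
1   1   0   0   3   0   0 1  1  2 3   2 3   2 2  2
1   2   0   0   3   0   0 2  0  0 0   2 2   0 0  0
1   3   1   2   0   4   4 2  0  0 0   2 2   0 0  2
1   4   1   2   0   5   5 2  0  3 3   2 1   1 2  2
1   5   1   2   0   6   6 1  0  2 3   2 2   2 2  2
1   6   1   2   0   7   7 2  0  3 3   2 1   1 2  2
1   7   1   2   0   8   8 1  0  0 0   2 1   0 0  2
1   8   1   2   0   9   9 1  0  2 2   2 2   2 2  2
1   9   1   2   0  10  10 1  0  2 3   2 2   2 2  2
1  10   1   2  14  12  12 2  0  0 0   2 2   0 0  2
1  11   0   0  14   0   0 1  0  0 0   2 3   0 0  2
1  12   1   2   0  13  13 1  0  2 3   2 2   2 2  2
1  13   1   2   0   0   0 2  0  2 3   2 1   2 2  2
1  14  11  10   0  15  15 2  0  3 3   2 1   2 2  2
1  15  11  10   0  16  16 1  0  2 2   2 2   2 2  2
1  16  11  10   0  17  17 1  0  0 0   2 2   0 0  2
1  17  11  10   0   0   0 2  0  0 0   2 1   0 0  2
"""


def untyped_individuals(text):
    """Return (untyped, candidates) as lists of individual IDs.

    Assumed layout: ID in field 2, marker 1 alleles in fields 10-11,
    availability code in the last field (0 = unavailable).
    """
    untyped, candidates = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        person = int(fields[1])
        marker1 = fields[9:11]
        availability = int(fields[-1])
        if marker1 == ["0", "0"]:
            untyped.append(person)
            if availability != 0:
                candidates.append(person)
    return untyped, candidates


untyped, candidates = untyped_individuals(SIMPED)
print(untyped)     # [2, 3, 7, 10, 11, 16, 17]
print(candidates)  # [3, 7, 10, 11, 16, 17]
```

The seven untyped individuals and the six candidates agree with the counts used in the rest of this example.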
    
    

  2. Create a simulation parameter file defining the locus systems and the "true" (simulated) recombination fractions:
    3 0 0 5  << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
    0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF 1)
    1 2  3
    3 3  << ALLELE NUMBERS, NO. OF ALLELES
    0.3200  0.1600  0.5200   << GENE FREQUENCIES
    1 2  << AFFECTION, NO. OF ALLELES
    0.9915  8.5000008E-03   << GENE FREQUENCIES
    3 << NO. OF LIABILITY CLASSES
    1.0000 0.1400 0.0000
    0.0000 0.8600 0.0000
    1.0000 0.0000 0.0000 << PENETRANCES
    3 2  << ALLELE NUMBERS, NO. OF ALLELES
    0.3300  0.6700   << GENE FREQUENCIES
    0 0  << SEX DIFFERENCE, INTERFERENCE (IF 1 OR 2)
    0.07390 0.05390 << RECOMBINATION VALUES
    1 0.60000 0.45000 << REC VARIED, INCREMENT, FINISHING VALUE
    
    The default name for the input file to SLINK and MSIMTN is SIMDATA.DAT. It defines the locus systems and the true recombination fractions under which the simulation will be performed. Define the locus order in the file as the assumed chromosomal order. In this case the disease locus is locus 2, and the order of loci is 1-2-3. The last line of this parameter file must be set as follows. The values of the first and third (last) numbers are irrelevant in this context; set the first number to 1 and the third number to 0.5, for example (the file above uses 0.45). The middle (second) number must be larger than 0.5; set it to 0.6, for example. This file must be in standard MLINK format, as shown above, and may be produced by PREPLINK.
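
The recombination values 0.0739 and 0.0539 in the file above are consistent with converting the map distances of 8 cM and 5.7 cM by Haldane's map function, which assumes no interference. The guide does not state which map function was used, so the following Python sketch should be read as an inference rather than a description of the authors' procedure; the function name is hypothetical.

```python
import math


def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane's map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


for cm in (8.0, 5.7):
    print(f"{cm} cM -> theta = {haldane_theta(cm):.4f}")
# 8.0 cM -> theta = 0.0739
# 5.7 cM -> theta = 0.0539
```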

  3. Create an analysis parameter file (DATAIN.DAT) defining how the simulated data should be analyzed:
    3 0 0 4  << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
    0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF 1)
    2 1 3
    3 3  << ALLELE NUMBERS, NO. OF ALLELES
    0.3200  0.1600  0.5200   << GENE FREQUENCIES
    1 2  << AFFECTION, NO. OF ALLELES
    0.9915  0.0085    << GENE FREQUENCIES
    3 << NO. OF LIABILITY CLASSES
    1.0000 0.1400 0.0000
    0.0000 0.8600 0.0000
    1.0000 0.0000 0.0000 << PENETRANCES
    3 2  << ALLELE NUMBERS, NO. OF ALLELES
    0.3300  0.6700   << GENE FREQUENCIES
    0 0  << SEX DIFFERENCE, INTERFERENCE (IF 1 OR 2)
    0.50000 0.11980 << RECOMBINATION VALUES
    2 0.000  3 << LOCUS VARIED, FINISHING VALUE, NO OF EVALUATIONS
    
    The above parameter file for LSIMTN, called DATAIN.DAT by default, must be in standard LINKMAP format, and may also be produced by the PREPLINK program. All marker loci must be of the Allele Numbers type. To obtain lod scores from these runs, the trait locus must start out as the left-most locus and must be placed "off" the map on the left at recombination fraction 50%. The recombination fraction, θ12, between the two markers can be calculated as θ1 + θ2 - 2θ1θ2 assuming no interference (Ott 1991, p. 18), where θ1 is the recombination fraction between marker 1 and the disease locus and θ2 is the recombination fraction between the disease locus and marker 2.
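
As a worked check of the no-interference formula, the Python sketch below combines the two simulated recombination fractions from SIMDATA.DAT (0.0739 and 0.0539) into the marker-to-marker value entered in DATAIN.DAT. The function name is hypothetical.

```python
def theta_between_markers(theta1, theta2):
    """Recombination fraction between two flanking markers, no interference."""
    return theta1 + theta2 - 2.0 * theta1 * theta2


theta12 = theta_between_markers(0.0739, 0.0539)
print(round(theta12, 6))  # 0.119834
```

The result agrees with the value 0.11980 in the RECOMBINATION VALUES line above, to the precision given there.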

  4. Run TYPENEXT

    Assume we have all the input files shown above. The total number, n, of untyped individuals is 7, and individual 2 is unavailable for typing. After typing 'TYPENEXT' and hitting the <ENTER> key, the program will prompt you with the following questions. The answers in square brackets are the defaults.

    	Pedigree file [SIMPED.DAT]:
    	Parameter file [DATAIN.DAT]:
    	Simparameter file [SIMDATA.DAT]:
    	TYPENEXT output file [TYPENEXT.DAT]:
    

    You can just hit <ENTER> to accept the default answer to each question, or you can type the new file names.

    	Detailed output file [y]?
    

    You get the third section (see below) of the output file only if you say [Y]es to this question.

    	Up to how many individuals do you want to type?
    
    Input the maximum number of individuals you want to type. If you enter 7, for instance, then the program will prompt you with:
    	Do you want to test all possible sets of 1 to 7 
    	additional individuals [N]?
    
    Note that the program proposes typing up to 7 individuals; in practice, because one of the 7 individuals is unavailable, at most 6 individuals will be evaluated. The default answer to the above question is [N]o, and you can accept it by pressing the <ENTER> key. If you say [Y]es to this question, the output file will contain the information gained by typing any one additional individual, any set of 2 additional individuals, and so on. If your answer is [N]o, the program will prompt you with the following question:
    	Input the minimum number of individuals you want to type:
    
    If you enter 2, for instance, the output file will contain the information gained by typing any set of 2, 3, etc. additional individuals. If you enter 4 then the output file will only contain the information gained by typing any set of 4 or 5 or 6 (or 7, if all individuals are available) additional individuals.
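
The effect of these answers on the set sizes that are evaluated can be summarised in a few lines of Python. This is only a sketch of the behaviour described above, with hypothetical names; it is not taken from the TYPENEXT source.

```python
def set_sizes(max_requested, n_available, min_requested=1):
    """Set sizes evaluated, capped by the number of available individuals."""
    upper = min(max_requested, n_available)
    return list(range(min_requested, upper + 1))


print(set_sizes(7, 6))     # answer [Y]es: [1, 2, 3, 4, 5, 6]
print(set_sizes(7, 6, 2))  # minimum of 2: [2, 3, 4, 5, 6]
print(set_sizes(7, 6, 4))  # minimum of 4: [4, 5, 6]
```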

  5. Analyze the output file (TYPENEXT.DAT)

    By default, the output file produced by TYPENEXT is called TYPENEXT.DAT. It is divided into four sections.

    5.1 The first section, as shown below, is an output file from SLINK, which contains information defining the simulation, such as the random number seed, the number of replicates, the requested proportion of unlinked families, and the trait locus number. The random seed, an input quantity in SLINK, is generated internally by the TYPENEXT program, so the user need not furnish one. As a consequence, the results will differ each time a problem is run.

     The random number seed is: 25086
     The number of replications is:    10
     The requested proportion of unlinked families is:  0.000
     The trait locus is locus number:   2
        Summary Statistics about simped.dat
     Number of pedigrees       1
     Number of people         17
     Number of females         8
     Number of males           9
     There were   1 in category:  Marker Unknown; Trait original
     There were   0 in category:  Marker Available; Trait simulated
     There were  16 in category:  Marker Available; Trait original
     There were   0 in category:  Marker Unknown; Trait simulated
    LINKAGE/SLINK (V2.50) WITH  3-POINT AUTOSOMAL DATA
    --------------------------------------------------
    LINKED ORDER OF LOCI:   1  2  3
    ----------------------------------
    ----------------------------------
    TRUE THETAS FOR LINKED ORDER    0.073900   0.053900
    ----------------------------------
    ----------------------------------
    UNLINKED ORDER OF LOCI:   2  1  3
    ----------------------------------
    ----------------------------------
    TRUE THETAS FOR UNLINKED ORDER    0.500000   0.119834
    ----------------------------------
     Elapsed Time for one replicate =     1.90 seconds
    
     Elapsed Time =     0.32 min. or     0.01 hours
    ----------------------------------
    Actual proportion of unlinked families:  0.000
    
    *******  End of most recent SIMOUT.DAT *******
    

    5.2 The second section lists the positions (POS) of the disease locus as it moves across the fixed map of marker loci, and the locus order (ORDER) and recombination fractions (THETAS) in the different intervals for each disease position. The positions of the disease locus are numbered 0, 1, 2 etc.:

           POS          ORDER                   THETAS
            0          2  1  3            0.500        0.120
            1          2  1  3            0.333        0.120
            2          2  1  3            0.167        0.120
            3          2  1  3            0.000        0.120
            4          1  2  3            0.000        0.120
            5          1  2  3            0.040        0.087
            6          1  2  3            0.080        0.048
            7          1  2  3            0.120        0.000
            8          1  3  2            0.120        0.000
            9          1  3  2            0.120        0.167
           10          1  3  2            0.120        0.333
           11          1  3  2            0.120        0.500
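
For positions 4 to 7, where the disease locus lies between the two markers, the pair of recombination fractions follows from the no-interference relation θ12 = θ1 + θ2 - 2θ1θ2 solved for θ2. The Python sketch below recomputes those rows under the assumption that θ12 is rounded to 0.12 and that θ1 takes the rounded values shown in the table; the function name is hypothetical.

```python
def theta_to_second_marker(theta1, theta12):
    """Solve theta12 = theta1 + theta2 - 2*theta1*theta2 for theta2."""
    return (theta12 - theta1) / (1.0 - 2.0 * theta1)


theta12 = 0.12
for theta1 in (0.0, 0.04, 0.08, 0.12):
    theta2 = theta_to_second_marker(theta1, theta12)
    print(f"{theta1:.3f}  {theta2:.3f}")
```

Rounded to three decimals, the printed pairs match the THETAS columns for positions 4 to 7.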
    

    5.3 If you choose detailed output, the third section of the output file contains, for each set of additionally typed individuals, the expected lod score (the average of the lod scores over replicates) at each disease locus position and the number of replicates that have their maximum lod score at that position. A lod score value of -9999 stands for negative infinity, that is, a likelihood of zero. Below, the numbers 1, 2, and so on in the header line indicate the disease position, and the "No." after each disease position is the number of replicates with their maximum lod score at that position. For example, the second line of the table gives the expected lod score at each disease position when individual #3 is typed; at disease position 3, the values 0.26 and 9 indicate that the expected lod score is 0.26 and that 9 replicates have their maximum lod score at that position. From the second section above, one knows that at disease position 3 the locus order is 2 1 3 and the recombination fractions are 0 between loci 2 (the disease locus) and 1, and 0.12 between loci 1 and 3. As another example, line 8 of the table refers to the situation in which two individuals, #3 and #7, are genotyped. The following output is edited for space:

    
     Added
     Indiv.    1  No.    2  No.    3  No.    4  No.    5  No.    6  No.
    ====================================================================
         0 | 0.50  0 | 1.12  0 | 1.60 10 | 1.60  0 | 1.55  0 | 1.50  0 |
    --------------------------------------------------------------------
         3 | 0.58  0 | 1.26  0 | 0.26  9 |-9999  0 | 1.79  0 | 1.77  0 |
    --------------------------------------------------------------------
         7 | 0.56  0 | 1.23  0 | 1.73  9 | 1.73  0 | 1.74  0 | 1.71  0 |
    --------------------------------------------------------------------
        10 | 0.53  0 | 1.16  0 | 1.64 10 | 1.64  0 | 1.59  0 | 1.54  0 |
    --------------------------------------------------------------------
        11 | 0.50  0 | 1.12  0 | 1.60 10 | 1.60  0 | 1.55  0 | 1.50  0 |
    --------------------------------------------------------------------
        16 | 0.58  0 | 1.28  0 | 1.82 10 | 1.82  0 | 1.75  0 | 1.68  0 |
    --------------------------------------------------------------------
        17 | 0.52  0 | 1.17  0 | 1.65  9 | 1.65  0 | 1.60  0 | 1.55  1 |
    --------------------------------------------------------------------
     3   7 | 0.64  0 | 1.37  0 | 0.39  8 |-9999  0 | 1.97  0 | 1.99  0 |
    --------------------------------------------------------------------
     3  10 | 0.61  0 | 1.30  0 | 0.30  9 |-9999  0 | 1.83  0 | 1.82  0 |
    --------------------------------------------------------------------
     7  10 | 0.59  0 | 1.27  0 | 1.77  9 | 1.77  0 | 1.78  0 | 1.76  0 |
    --------------------------------------------------------------------
     3  11 | 0.58  0 | 1.26  0 | 0.26  9 |-9999  0 | 1.79  0 | 1.77  0 |
    --------------------------------------------------------------------
     7  11 | 0.56  0 | 1.23  0 | 1.73  9 | 1.73  0 | 1.74  0 | 1.71  0 |
    --------------------------------------------------------------------
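
Rows of this table are easy to process outside TYPENEXT. The Python sketch below is an illustration with hypothetical function names: it splits one data row into the added individuals and the (lod, count) pairs, treats -9999 as negative infinity, and reports the position with the largest expected lod score. It assumes the first column shown corresponds to disease position 1 and should be applied to data rows only, not to the header or separator lines.

```python
import math


def parse_row(line):
    """Split one data row into (added individuals, [(lod, count), ...])."""
    cells = [cell for cell in line.split("|") if cell.strip()]
    added = tuple(int(token) for token in cells[0].split())
    scores = []
    for cell in cells[1:]:
        lod_text, count_text = cell.split()
        lod = float(lod_text)
        # -9999 is the program's code for negative infinity
        scores.append((-math.inf if lod == -9999 else lod, int(count_text)))
    return added, scores


def best_position(scores, first_position=1):
    """Return (position, lod) for the largest expected lod score in a row."""
    lods = [lod for lod, _ in scores]
    best = max(range(len(lods)), key=lods.__getitem__)
    return best + first_position, lods[best]


row = "     3 | 0.58  0 | 1.26  0 | 0.26  9 |-9999  0 | 1.79  0 | 1.77  0 |"
added, scores = parse_row(row)
print(added, best_position(scores))  # (3,) (5, 1.79)
```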
    

    5.4 The fourth section, shown below, is also edited for space and presents the crucial results of the simulations. The first three lines give two pieces of information about the original pedigree (without additional marker typing): the maximum lod score with the disease position at which it occurred, and the lod score at the true (simulated) recombination fractions as specified in the SIMDATA.DAT file. The table that follows indicates the information gained by typing additional individuals. Three statistical criteria are used to measure linkage information: the expected maximum lod score (EZm), the expected lod score at the true recombination fractions (EZ), and the maximum of the expected lod score over disease positions (maxEZ).

    For the original pedigree, with no additional typings, we have

          Maximum Lod Score         =     1.6072 at disease position 3
          Z(specified theta) =      1.5129
    

    Information gained from typing additional individuals:

              Expected max. lod score   Exp.lod at θs      Maximum of EZ
             -------------------------  -------------  ----------------------
     Added         Change        SE of         Change        Change
     Indiv.   EZm  in EZm  diff   EZm     EZ   in EZ  maxEZ in maxEZ  SE  POS
    
         3 | 1.87 | 0.26 |      | 0.01 | 1.78 | 0.26 | 1.79 | 0.18 | 0.05 | 5
        16 | 1.81 | 0.21 | 0.05 | 0.04 | 1.69 | 0.18 | 1.81 | 0.21 | 0.04 | 3
         7 | 1.80 | 0.19 | 0.01 | 0.01 | 1.72 | 0.21 | 1.74 | 0.13 | 0.03 | 5
        17 | 1.66 | 0.05 | 0.14 | 0.08 | 1.55 | 0.04 | 1.65 | 0.04 | 0.08 | 3
        10 | 1.64 | 0.03 | 0.01 | 0.00 | 1.55 | 0.04 | 1.64 | 0.03 | 0.00 | 3
        11 | 1.60 | 0.00 | 0.03 | 0.00 | 1.51 |-0.00 | 1.60 | 0.00 | 0.00 | 3
     3  16 | 2.08 | 0.47 |      | 0.04 | 1.96 | 0.45 | 1.99 | 0.38 | 0.05 | 5
     3   7 | 2.07 | 0.46 | 0.00 | 0.02 | 1.99 | 0.48 | 1.99 | 0.38 | 0.02 | 6
     7  16 | 2.01 | 0.40 | 0.06 | 0.04 | 1.90 | 0.39 | 1.94 | 0.34 | 0.08 | 3
     3  17 | 1.92 | 0.31 | 0.09 | 0.07 | 1.82 | 0.31 | 1.84 | 0.23 | 0.07 | 5
     3  10 | 1.91 | 0.30 | 0.00 | 0.01 | 1.82 | 0.31 | 1.83 | 0.22 | 0.05 | 5
     3  11 | 1.87 | 0.26 | 0.03 | 0.01 | 1.78 | 0.26 | 1.79 | 0.18 | 0.05 | 5
    16  17 | 1.86 | 0.25 | 0.00 | 0.08 | 1.74 | 0.22 | 1.86 | 0.25 | 0.09 | 3
     7  17 | 1.85 | 0.24 | 0.01 | 0.08 | 1.77 | 0.25 | 1.79 | 0.18 | 0.09 | 5
    10  16 | 1.85 | 0.24 | 0.00 | 0.04 | 1.74 | 0.22 | 1.85 | 0.24 | 0.04 | 3
     7  10 | 1.84 | 0.23 | 0.01 | 0.01 | 1.76 | 0.25 | 1.78 | 0.17 | 0.03 | 5
    11  16 | 1.82 | 0.21 | 0.01 | 0.04 | 1.71 | 0.19 | 1.82 | 0.21 | 0.04 | 3
     7  11 | 1.80 | 0.19 | 0.02 | 0.01 | 1.72 | 0.21 | 1.74 | 0.13 | 0.03 | 5
    10  17 | 1.71 | 0.10 | 0.09 | 0.07 | 1.61 | 0.10 | 1.70 | 0.09 | 0.08 | 3
    11  17 | 1.69 | 0.09 | 0.01 | 0.07 | 1.59 | 0.08 | 1.68 | 0.08 | 0.08 | 3
    10  11 | 1.64 | 0.03 | 0.05 | 0.00 | 1.55 | 0.04 | 1.64 | 0.03 | 0.00 | 3
            SE = standard error = standard deviation/√(no. of replicates)
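
The standard error defined in the last line of the table can be computed as in the Python sketch below. The replicate lod scores are hypothetical values made up for illustration, and the use of the sample standard deviation (divisor n-1) is an assumption, since the guide does not say which form TYPENEXT uses.

```python
import math
import statistics


def mean_and_se(lods):
    """Mean of the replicate lod scores and its standard error."""
    n = len(lods)
    mean = statistics.mean(lods)
    # assumption: sample standard deviation (divisor n - 1)
    se = statistics.stdev(lods) / math.sqrt(n)
    return mean, se


# Hypothetical maximum lod scores from 10 replicates (not program output).
replicate_lods = [1.9, 1.8, 2.0, 1.7, 1.9, 1.8, 2.1, 1.9, 1.8, 1.8]
mean, se = mean_and_se(replicate_lods)
print(f"EZm = {mean:.2f}, SE = {se:.2f}")
```

Because the SE shrinks with the square root of the number of replicates, quadrupling the number of replicates roughly halves it.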
    
    In the output, lines are combined into groups depending on the number of individuals typed additionally. Within each group, lines are ordered by decreasing value of the expected maximum lod score, EZm.

    The column labelled "Change in EZm" gives the increase in EZm relative to the "Maximum Lod Score" of the original data (given on the second output line). The column labelled "diff" gives the difference in EZm relative to the preceding line. The standard error, SE, measures the variability of the expected maximum lod score due to random error in the simulation.

    For the expected lod score at the true θs, the column labelled "Change in EZ" gives the difference relative to the lod score of the original data at the true θs (given on output line 3 above).

    By construction, EZ ≤ maxEZ ≤ EZm. From the output above we know that the maximum lod score of the original pedigree is 1.6072, with locus order 2 1 3 and corresponding recombination fractions of 0 and 0.12. When individual #3 is typed, EZm increases by 0.2664 to 1.8736, EZ by 0.2688 to 1.7816, and maxEZ by 0.1884 to 1.7956. Similarly, typing individuals #3 and #16 increases EZm by 0.4736, EZ by 0.4546, and maxEZ by 0.3868.

    It is instructive to look at the difference between the most useful and second most useful sets of individuals. If this difference is small, either choice is about equally useful. For example, the difference in EZm between adding individuals #3 and #16 and adding individuals #3 and #7 is only 0.0041, so there is little to choose between them. Based on this information, an investigator can choose the optimal set of individuals to type. For instance, suppose one wants to type two additional individuals. The output shown above indicates that individuals #3 and #16 form the most informative set. If, however, individual #16 lives far away while individual #7 is readily available, one may choose individuals #3 and #7 instead, since the loss in EZm is only 0.0041.
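
The comparison described above can be sketched in Python as follows. The EZm values are the two-decimal figures from the first rows of the table for pairs of individuals; the tolerance of 0.02 is a hypothetical choice that an investigator might relate to the reported standard errors, and the function name is likewise hypothetical.

```python
# EZm for some pairs of added individuals, as printed in the table (rounded).
EZM_PAIRS = {
    (3, 16): 2.08,
    (3, 7): 2.07,
    (7, 16): 2.01,
    (3, 17): 1.92,
    (3, 10): 1.91,
}


def near_best(ezm_by_set, tolerance):
    """Sets whose EZm lies within `tolerance` of the best, best first."""
    best = max(ezm_by_set.values())
    ranked = sorted(ezm_by_set.items(), key=lambda item: item[1], reverse=True)
    return [added for added, ezm in ranked if best - ezm <= tolerance]


print(near_best(EZM_PAIRS, 0.02))  # [(3, 16), (3, 7)]
```

Any set in the returned list is a reasonable choice on statistical grounds, so practical considerations such as the accessibility of individuals can decide among them.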

REFERENCES

Ott J (1989), Computer simulation methods in human linkage analysis. Proc Natl Acad Sci USA 86:4175-4178

Ott J (1991), Analysis of Human Genetic Linkage. Johns Hopkins University Press, Baltimore

Weeks DE, Ott J, Lathrop GM (1990), SLINK: a general simulation program for linkage analysis. Am J Hum Genet 47:A204 (Supplement)

Ott J, Terwilliger JD, Xie X (1992), Determining the informativeness of untyped individuals in a pedigree analysis. Am J Hum Genet 51:A197 (Supplement, abstract)


Converted to HTML by Wentian Li, August 8, 1996