linkage user's guide (version 5.2)


2.6 Description of Loci (DATAFILE)

Descriptions of loci and other information are contained in DATAFILE. The information in this file is divided into four parts:
  1. general information on loci and locus order;
  2. description of loci;
  3. information on recombination;
  4. program-specific information.

In explaining the structure of DATAFILE we will use two concepts of locus order. The first is the input order, or the order in which the phenotypes corresponding to the loci appear in PEDFILE (see section 2.7). The second is chromosome order, or the physical order assumed for the loci. The input order is fixed once PEDFILE is created, but the chromosome order can be changed to test various hypotheses.

Various parameters, such as recombination rates, gene frequencies, and penetrances, are specified in the DATAFILE. The values given there are initial values; the analysis programs can modify some of them for specific purposes, e.g. maximum likelihood estimation. This feature is explained in Chapter 3.

The DATAFILE can be prepared with the program PREPLINK.

Example

Before we attempt to explain the format of various parts of the DATAFILE, it is useful to consider a complete file as an example. The following is the DATAFILE for three sex-linked loci, one of which is Duchenne muscular dystrophy; creatine kinase measurements are available for heterozygote testing in women:

3 0 1 5           << no loci, risk locus, sexlinked (if 1), program code
3 0.001  0.001 0  << mut locus, mut mal, mut fem, hap freq (if 1)

1 3 2             << order of loci

2 2               <<< binary factors, # alleles
5.00000E-01  5.00000E-01   << gene freqs
2                 << number of binary factors
1 0
0 1               << allelic codes

2 2               <<< binary factors, # alleles
5.00000E-01   5.00000E-01   << gene freqs
2                 << number of binary factors
1 0
0 1               << allelic codes

0 2               <<< quan, # alleles
9.99800E-01  2.00000E-04   << gene freqs
1                 << number of traits
1.57000E+00  2.10000E+00  2.10000E+00  << genotype means
5.90000E-02       << variance
2.90000E+00       << multiplier for variance in heterozygotes
0 0               << sex difference (if 1) and interference (if 1)
0.1  0.1          << recombination values
1  0.5  0.5
The last line contains information for the MLINK program, as indicated by the program code 5 on the first line. The other parameters are identified by the comments that follow certain lines (introduced by <<). Comments are allowed on some lines to make the file easier to read.

Loci and Locus Order

The first two lines of DATAFILE contain information on a variety of parameters, including the number of loci (nlocus), a risk locus (risklocus), sex-linked or autosomal data (sexlink), a mutation locus (mutsys) and mutation rates (mutmale and mutfem), linkage disequilibrium (disequil), and a program code (nprogram). The first two lines are followed by a third line giving the chromosome order for the loci. The format is:
     nlocus    risklocus sexlink   nprogram
     mutsys    mutmale   mutfem    disequil
     (chromosome order)
Mutsys and the chromosome order of the loci must begin on new lines; comments can follow at the end of each line. Nprogram is not used by the LINKAGE programs themselves, but it is required for interfacing with the shell program LCP, where it identifies the program for which the file was constructed. Because LCP can use a file constructed for one program as input for a different program, the DATAFILE need not be changed for different programs when using LCP.

Valid values for the variables are:

nlocus         =  1 to maxlocus (as specified by a constant in the
                    programs)

risklocus      =  0 if risk is not to be calculated
               =  disease locus number (input order) if risk is to
                    be calculated

sexlink        =  0 for autosomal data
               =  1 for sex-linked data

nprogram       =  1 CILINK
                  2 CMAP
                  3 ILINK
                  4 LINKMAP
                  5 MLINK
                  6 LODSCORE
                  7 CLODSCORE

mutsys         =  0 if mutation rates are zero
               =  mutation locus number (input order) for non-zero
                    mutation rates

mutmale        =  male mutation rate

mutfem         =  female mutation rate

disequil       =  0 if loci are assumed to be in linkage equilibrium
               =  1 if loci are in linkage disequilibrium

When loci are in linkage equilibrium, allele frequencies must be given under each locus description; otherwise, haplotype frequencies are provided. When risk is calculated, a disease allele is provided in the locus description for the "risklocus." As an example, consider the analysis of 3 autosomal loci in the chromosome order 1 3 2. The first three lines of the DATAFILE could be:
3 0 0 3   << no loci, risk locus, sexlinked (if 1), program code
3 0.1 0.1 0 << mut locus, mut mal, mut fem, haplotype freq (if 1)
1 3 2       << order of loci
The data are autosomal with mutation at the third locus.
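The three-line header described above can be read mechanically. The following is a minimal Python sketch, not part of the LINKAGE distribution. The function names strip_comment and parse_header are hypothetical, and the sketch assumes that everything from << to the end of a line is a comment and that the order line lists each locus number exactly once.

```python
def strip_comment(line):
    """Drop everything from the first '<<' onward and split the rest into tokens."""
    return line.split("<<", 1)[0].split()


def parse_header(lines):
    """Parse the first three non-blank lines of a DATAFILE into a dict."""
    rows = [strip_comment(line) for line in lines]
    rows = [row for row in rows if row]  # skip blank or comment-only lines
    if len(rows) < 3:
        raise ValueError("header needs three lines")

    # Line 1: nlocus risklocus sexlink nprogram
    nlocus, risklocus, sexlink, nprogram = (int(tok) for tok in rows[0][:4])
    # Line 2: mutsys mutmale mutfem disequil
    mutsys = int(rows[1][0])
    mutmale, mutfem = float(rows[1][1]), float(rows[1][2])
    disequil = int(rows[1][3])
    # Line 3: chromosome order
    order = [int(tok) for tok in rows[2][:nlocus]]

    # Assumption: the order line is a permutation of 1..nlocus.
    if sorted(order) != list(range(1, nlocus + 1)):
        raise ValueError("chromosome order must list each locus exactly once")

    return {
        "nlocus": nlocus, "risklocus": risklocus, "sexlink": sexlink,
        "nprogram": nprogram, "mutsys": mutsys, "mutmale": mutmale,
        "mutfem": mutfem, "disequil": disequil, "order": order,
    }


example = [
    "3 0 0 3   << no loci, risk locus, sexlinked (if 1), program code",
    "3 0.1 0.1 0 << mut locus, mut mal, mut fem, haplotype freq (if 1)",
    "1 3 2       << order of loci",
]
header = parse_header(example)
print(header["nlocus"], header["mutsys"], header["order"])
```

The sketch only checks the header; the locus descriptions and recombination lines that follow have their own layouts, described in the sections below.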

Description of Loci

The loci are described in the order in which they appear in the PEDFILE (see section 2.7). Assuming linkage equilibrium, the gene frequencies are specified as part of the locus description (linkage disequilibrium will be documented in a later version). The descriptions differ according to the type of locus. A numeric code distinguishes each of the types:
     0  =      Quantitative variable
     1  =      Affection status
     2  =      Binary factors
     3  =      Numbered alleles
The format for each locus type, assuming linkage equilibrium, is as follows:

Numbered alleles

The locus description consists of two lines. The first gives the code for numbered alleles and the total number of alleles. The second gives the gene frequencies. For example:
     3 2       << numbered alleles code, total number of alleles
     0.5  0.5  << gene frequencies
specifies two alleles with equal gene frequencies.

Binary factors

The first two lines are similar to those in the previous example. After this the number of factors is specified on a separate line, followed by one line for each allele specification. As an example, consider the case of a recessive trait:

     2 2                 << binary factor code, number of alleles
     0.999  0.001        << gene frequencies
     2                   << number of factors
     1 1
     0 1                 << alleles
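To see why this table describes a recessive trait, the allele lines can be combined into genotype patterns. The Python sketch below is illustrative only. It assumes that a genotype shows a factor whenever at least one of its two alleles carries that factor; this rule is not stated in this section, and the name genotype_factors is hypothetical.

```python
def genotype_factors(allele_a, allele_b):
    """Combine two allele factor rows; a factor is present if either allele has it.

    Assumption: factors combine by logical OR across the two alleles.
    """
    return [a | b for a, b in zip(allele_a, allele_b)]


# Allele rows copied from the example above (allele 1, then allele 2).
alleles = {1: [1, 1], 2: [0, 1]}

patterns = {
    (i, j): genotype_factors(alleles[i], alleles[j])
    for i, j in [(1, 1), (1, 2), (2, 2)]
}
for genotype, pattern in patterns.items():
    print(genotype, pattern)
```

Under that assumption, genotypes 1/1 and 1/2 show the same pattern and only 2/2 differs, which is the behavior expected when allele 2 is recessive.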

Affection status

The number of liability classes replaces the number of factors, and penetrances are given for each genotype in each class:
     1 2                 << affection status code, number of alleles
     0.999  0.001        << gene frequencies
     1                   << number of liability classes
     0.0  1.0  1.0       << penetrances
describes a fully penetrant, dominant disease locus. The genotypes are in the order 11, 12, 22 where 1 is the first allele and 2 is the second allele specified in the gene frequency list. For three alleles, the genotype order is 11, 12, 13, 22, 23, 33. The same pattern is followed for more alleles. To describe a similar locus, but with reduced penetrance and two liability classes, use the following:
     1 2                 << affection status code, number of alleles
     0.999  0.001        << gene frequencies
     2                   << number of liability classes
     0.0  0.5  0.5
     0.0  0.9  0.9       << penetrances
With sex-linked data, male penetrances must also be defined for each allele. The following describes a sex-linked disease with 50% penetrance in males:
     1 2                 << affection status code, number of alleles
     0.999  0.001        << gene frequencies
     1                   << number of liability classes
     0.0  0.0  1.0
     0.0  0.5            << female followed by male penetrances
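The genotype order used for the penetrance lines (11, 12, 22, or 11, 12, 13, 22, 23, 33 for three alleles) can be generated for any number of alleles. The following Python sketch is illustrative; the names genotype_order and penetrance are hypothetical. It covers the genotype lines only, not the per-allele male line used with sex-linked data.

```python
from itertools import combinations_with_replacement


def genotype_order(n_alleles):
    """Return genotypes as (i, j) pairs with i <= j, in the order 11, 12, ..., 22, ..."""
    return list(combinations_with_replacement(range(1, n_alleles + 1), 2))


def penetrance(class_penetrances, n_alleles, allele_a, allele_b):
    """Look up the penetrance of an unordered genotype in one liability class."""
    genotype = tuple(sorted((allele_a, allele_b)))
    return class_penetrances[genotype_order(n_alleles).index(genotype)]


print(genotype_order(3))

# First liability class of the reduced-penetrance example: 0.0  0.5  0.5
print(penetrance([0.0, 0.5, 0.5], 2, 2, 1))
```

Each liability class therefore needs n(n+1)/2 penetrance values for n alleles.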

Quantitative trait

Quantitative traits are described by a first line containing the quantitative code (0) and the number of alleles, and a second line with gene frequencies, as in the previous examples. These are followed by lines indicating the number of quantitative variables, genotypic means for each variable, a variance-covariance matrix, and a constant that gives the ratio of variance-covariance in heterozygotes to homozygotes.

For a single quantitative variable, the format is:

     0  2                << quantitative variable code, number of alleles
     0.999  0.001        << gene frequencies
     1                   << number of quantitative variables
     10.0  12.0  14.0    << genotypic means
     1.5                 << variance
     1.0                 << multiplier for heterozygote variance
The genotypes are 1/1, 1/2 and 2/2, respectively, where allele 1 has the frequency 0.999. For two quantitative variables, the description is:
     0  2                << quantitative variable code, number of alleles
     0.999  0.001        << gene frequencies
     2                   << number of quantitative variables
     10.0   12.0   14.0
    -10.0    0.0   10.0  << genotypic means
     1.5  10.0  100.0    << variance-covariance
     1.0                 << multiplier for heterozyg. variance-covariance
Only the upper triangle of the variance-covariance matrix is given; the order is V11, V12, V13, ..., V22, V23, ..., and so on. Here, the variance of the first variable is 1.5, the covariance is 10.0, and the variance of the second variable is 100.0.

When describing the "risk locus," the disease allele (risk allele) must be designated at the end of the locus description. For example:
     1  2                << affection status code, number of alleles
     0.999  0.001        << gene frequencies
     1                   << number of liability classes
     0.0  1.0  1.0       << penetrances
     2                   << risk allele
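Returning to the quantitative-trait description, the upper-triangle order V11, V12, ..., V22, ... can be expanded into a full symmetric matrix. The Python sketch below is illustrative; the function name full_covariance is hypothetical.

```python
def full_covariance(upper, n_vars):
    """Expand a row-wise upper triangle (V11, V12, ..., V22, ...) into a full matrix."""
    expected = n_vars * (n_vars + 1) // 2
    if len(upper) != expected:
        raise ValueError(f"expected {expected} values, got {len(upper)}")

    matrix = [[0.0] * n_vars for _ in range(n_vars)]
    values = iter(upper)
    for i in range(n_vars):
        for j in range(i, n_vars):
            # Fill both halves so the result is symmetric.
            matrix[i][j] = matrix[j][i] = next(values)
    return matrix


# Values from the two-variable example: 1.5  10.0  100.0
print(full_covariance([1.5, 10.0, 100.0], 2))
```

For k quantitative variables the variance-covariance line therefore carries k(k+1)/2 values.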

Recombination Information

In addition to recombination rates, sex-differences and interference must be specified in this section. Sex-difference options are indicated by an integer variable that takes the following values:
               
     0    =    no sex-difference
     1    =    constant sex-difference (the ratio of female/male
                 genetic distance is the same in all intervals)
     2    =    variable sex-difference (the female/male distance
                 ratio can be different in each interval)
The interference option can take the following values:
     0    =    no interference
     1    =    interference without a mapping function
     2    =    user-specified mapping function
Interference (i.e., option 1 or 2) is allowed only in some of the analysis programs, and only with three loci. The programs, as distributed, contain Kosambi interference as the user-specified mapping function.

First, consider a case without interference. When the sex-difference option is "0," one recombination rate is given for each of the nlocus-1 segments (see the complete example above). If the sex-difference option is "1," the male recombination rates are given on one line, and the female/male ratio of genetic distance is specified on the next line, e.g.:

     1  0                << sex difference, interference
     0.1  0.2  0.1       << male recombination
     2.0                 << female/male ratio of genetic distance
When the sex-difference option is "2," the male recombination rates are followed on the next line by female recombination rates:
     2  0                << sex difference, interference
     0.1  0.2  0.1       << male recombination
     0.2  0.1  0.2       << female recombination
Interference can be specified for three loci. With the interference option 1, three recombination rates are given. These are the recombination rates between adjacent loci in the two segments and the recombination rate between the flanking loci. An example is:
     1  1                << sex difference, interference
     0.1  0.1  0.18      << male recombination
     2.0                 << female/male ratio of genetic distance
With the interference option 2, only the rates between the adjacent loci are provided:
     1  2                << sex difference, interference
     0.1  0.1            << male recombination
     2.0                 << female/male ratio of genetic distance
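The rules above determine how many values each recombination line should contain. The Python sketch below summarizes them. The function names are hypothetical, and the sketch assumes that with sex-difference option 2 the female line has the same number of values as the male line even under interference option 1, a combination not shown in the examples. The second function uses the standard no-interference relation for the flanking recombination rate, which reproduces the third value (0.18) in the interference option 1 example.

```python
def recombination_layout(nlocus, sexdiff, interference):
    """Return the number of values expected on each recombination line."""
    if sexdiff not in (0, 1, 2) or interference not in (0, 1, 2):
        raise ValueError("options must be 0, 1 or 2")
    if interference != 0 and nlocus != 3:
        raise ValueError("interference requires exactly three loci")

    # Option 1 adds the rate between the flanking loci to the two adjacent rates.
    n_rates = 3 if interference == 1 else nlocus - 1

    layout = [n_rates]          # recombination rates (male rates if sexdiff > 0)
    if sexdiff == 1:
        layout.append(1)        # single female/male distance ratio
    elif sexdiff == 2:
        layout.append(n_rates)  # assumption: female line mirrors the male line
    return layout


def flanking_rate_no_interference(r12, r23):
    """Recombination rate between flanking loci when crossovers are independent."""
    return r12 + r23 - 2.0 * r12 * r23


print(recombination_layout(4, 1, 0))
print(recombination_layout(3, 1, 1))
print(round(flanking_rate_no_interference(0.1, 0.1), 6))
```

A file reader could use such a layout to check each line before passing the DATAFILE to an analysis program.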

Program-specific information

The program-specific information consists of a series of lines at the end of the DATAFILE describing which parameters should be varied iteratively by the analysis programs. The format for each program is described in Chapter 3.

