In explaining the structure of DATAFILE we will use two concepts of locus order. The first is the input order, or the order in which the phenotypes corresponding to the loci appear in PEDFILE (see section 2.7). The second is chromosome order, or the physical order assumed for the loci. The input order is fixed once PEDFILE is created, but the chromosome order can be changed to test various hypotheses.
Various parameters, such as recombination rates, gene frequencies and penetrances, are specified in the DATAFILE. The values given there are initial values; the analysis programs can modify some of them for specific purposes, e.g. maximum likelihood estimation. This feature is explained in Chapter 3.
The DATAFILE can be prepared with the program PREPLINK.
Before we attempt to explain the format of various parts of the DATAFILE, it is useful to consider a complete file as an example. The following is the DATAFILE for three sex-linked loci, one of which is Duchenne muscular dystrophy; creatine kinase measurements are available for heterozygote testing in women:
3 0 1 5 << no loci, risk locus, sexlinked (if 1), program code
3 0.001 0.001 0 << mut locus, mut mal, mut fem, hap freq (if 1)
1 3 2 << order of loci
2 2 <<< binary factors, # alleles
5.00000E-01 5.00000E-01 << gene freqs
2 << number of binary factors
1 0
0 1 << allelic codes
2 2 <<< binary factors, # alleles
5.00000E-01 5.00000E-01 << gene freqs
2 << number of binary factors
1 0
0 1 << allelic codes
0 2 <<< quan, # alleles
9.99800E-01 2.00000E-04 << gene freqs
1 << number of traits
1.57000E+00 2.10000E+00 2.10000E+00 << genotype means
5.90000E-02 << variance
2.90000E+00 << multiplier for variance in heterozygotes
0 0 << sex difference (if 1) and interference (if 1)
0.1 0.1 << recombination values
1 0.5 0.5
The last line contains information for the MLINK program; this is indicated by the program code 5 on the first line. Other parameters are specified as indicated in the comments following certain lines (indicated by << ). Comments are allowed on some lines for easy interpretation of the file.
nlocus risklocus sexlink nprogram
mutsys mutmale mutfem disequil
(chromosome order)
Mutsys and the chromosome order of the loci must begin on new
lines; comments can follow at the end of each line. Nprogram is
not used by the LINKAGE programs, but is required for interfacing
with the shell program LCP. It identifies the program for which
the file was constructed. Because LCP can use a file constructed
for one program as input for a different program, the datafile
need not be changed when switching programs under LCP.
Valid values for the variables are:
nlocus = 1 to maxlocus (as specified by a constant in the
programs)
risklocus = 0 if risk is not to be calculated
= disease locus number (input order) if risk is to
be calculated
sexlink = 0 for autosomal data
= 1 for sex-linked data
nprogram = 1 CILINK
2 CMAP
3 ILINK
4 LINKMAP
5 MLINK
6 LODSCORE
7 CLODSCORE
mutsys = 0 if mutation rates are zero
= mutation locus number (input order) for non-zero
mutation rates
mutmale = male mutation rate
mutfem = female mutation rate
disequil = 0 if loci are assumed to be in linkage equilibrium
= 1 if loci are in linkage disequilibrium
When loci are in linkage equilibrium, allele frequencies must be
given under each locus description; otherwise, haplotype frequencies
are provided. When risk is calculated, a disease allele is
provided in the locus description for the "risklocus." As an
example, consider the analysis of 3 autosomal loci in the chromosome
order 1 3 2. The first three lines of the DATAFILE could be:
3 0 0 3 << no loci, risk locus, sexlinked (if 1), program code
3 0.1 0.1 0 << mut locus, mut mal, mut fem, haplotype freq (if 1)
1 3 2 << order of loci
The data are autosomal with mutation at the third locus.
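The three header lines described above can be read mechanically. The following Python sketch is only an illustration and is not part of the LINKAGE package; the helper names `strip_comment` and `parse_header` are hypothetical. It assumes that everything from the first `<<` onward is a comment and that the chromosome order line lists each locus number from 1 to nlocus exactly once.

```python
def strip_comment(line):
    """Drop a trailing '<<' comment and return the whitespace-separated fields."""
    return line.split("<<", 1)[0].split()


def parse_header(lines):
    """Parse the first three DATAFILE lines into a dict (hypothetical helper)."""
    nlocus, risklocus, sexlink, nprogram = (int(x) for x in strip_comment(lines[0]))
    fields = strip_comment(lines[1])
    mutsys, disequil = int(fields[0]), int(fields[3])
    mutmale, mutfem = float(fields[1]), float(fields[2])
    order = [int(x) for x in strip_comment(lines[2])]
    # Assumption: the chromosome order is a permutation of 1..nlocus.
    if sorted(order) != list(range(1, nlocus + 1)):
        raise ValueError("chromosome order must list each locus 1..nlocus once")
    return {
        "nlocus": nlocus, "risklocus": risklocus,
        "sexlink": sexlink, "nprogram": nprogram,
        "mutsys": mutsys, "mutmale": mutmale,
        "mutfem": mutfem, "disequil": disequil,
        "order": order,
    }


header = parse_header([
    "3 0 0 3 << no loci, risk locus, sexlinked (if 1), program code",
    "3 0.1 0.1 0 << mut locus, mut mal, mut fem, haplotype freq (if 1)",
    "1 3 2 << order of loci",
])
print(header["order"], header["mutsys"], header["sexlink"])
```

For the example above, the parsed header reports autosomal data (sexlink 0), mutation at locus 3, and the chromosome order 1 3 2.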
The locus type codes are:
0 = Quantitative variable
1 = Affection status
2 = Binary factors
3 = Numbered alleles
The format for each locus type, assuming linkage equilibrium, is
as follows:
3 2 << numbered alleles code, total number of alleles
0.5 0.5 << gene frequencies
specifies two alleles with equal gene frequencies.
2 2 << binary factor code, number of alleles
0.999 0.001 << gene frequencies
2 << number of factors
1 1
0 1 << alleles
1 2 << affection status code, number of alleles
0.999 0.001 << gene frequencies
1 << number of liability classes
0.0 1.0 1.0 << penetrances
describes a fully penetrant, dominant disease locus. The genotypes
are in the order 11, 12, 22 where 1 is the first allele and
2 is the second allele specified in the gene frequency list. For
three alleles, the genotype order is 11, 12, 13, 22, 23, 33. The
same pattern is followed for more alleles. To describe a similar
locus, but with reduced penetrance and two liability classes, use
the following:
1 2 << affection status code, number of alleles
0.999 0.001 << gene frequencies
2 << number of liability classes
0.0 0.5 0.5
0.0 0.9 0.9 << penetrances
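The genotype order described above (11, 12, 22 for two alleles; 11, 12, 13, 22, 23, 33 for three) is the order in which unordered allele pairs are enumerated with the first allele varying slowest. A minimal Python sketch, using the hypothetical helper name `genotype_order`, reproduces it for any number of alleles and pairs each genotype with its penetrance:

```python
from itertools import combinations_with_replacement


def genotype_order(n_alleles):
    """Return genotypes as (allele, allele) pairs in the order used for penetrances."""
    return list(combinations_with_replacement(range(1, n_alleles + 1), 2))


# Pair the penetrances of the fully penetrant dominant example with their genotypes.
penetrances = dict(zip(genotype_order(2), [0.0, 1.0, 1.0]))
for genotype, value in penetrances.items():
    print(genotype, value)
```

A locus with n alleles therefore needs n(n+1)/2 penetrance values per liability class.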
With sex-linked data, male penetrances must also be defined for
each allele. The following describes a sex-linked disease with
50% penetrance in males:
1 2 << affection status code, number of alleles
0.999 0.001 << gene frequencies
1 << number of liability classes
0.0 0.0 1.0
0.0 0.5 << female followed by male penetrances
For a single quantitative variable, the format is:
0 2 << quantitative variable code, number of alleles
0.999 0.001 << gene frequencies
1 << number of quantitative variables
10.0 12.0 14.0 << genotypic means
1.5 << variance
1.0 << multiplier for heterozygote variance
The genotypes are 1/1, 1/2 and 2/2, respectively, where allele 1
has the frequency 0.999. For two quantitative variables, the
description is:
0 2 << quantitative variable code, number of alleles
0.999 0.001 << gene frequencies
2 << number of quantitative variables
10.0 12.0 14.0
-10.0 0.0 10.0 << genotypic means
1.5 10.0 100.0 << variance-covariance
1.0 << multiplier for heterozyg. variance-covariance
Only the upper triangle of the variance-covariance matrix is
given; the order is V11, V12, V13 ... V22, V23 ... etc. Here, the
variance of the first variable is 1.5, the covariance is 10.0,
and the variance of the second variable is 100.0.
When describing the "risk locus," the disease allele (risk allele)
must be designated at the end of the locus description. For example:
1 2 << affection status code, number of alleles
0.999 0.001 << gene frequencies
1 << number of liability classes
0.0 1.0 1.0 << penetrances
2 << risk allele
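Returning to the variance-covariance line of a quantitative locus: the upper-triangle order V11, V12, V13 ... V22, V23 ... can be expanded into the full symmetric matrix as in the following Python sketch. The function name `full_covariance` is hypothetical and the code only illustrates the ordering; it is not taken from the LINKAGE programs.

```python
def full_covariance(upper, n_traits):
    """Expand row-wise upper-triangle values into a symmetric n x n matrix."""
    expected = n_traits * (n_traits + 1) // 2
    if len(upper) != expected:
        raise ValueError(f"expected {expected} values, got {len(upper)}")
    matrix = [[0.0] * n_traits for _ in range(n_traits)]
    values = iter(upper)
    for i in range(n_traits):
        for j in range(i, n_traits):
            # V(i,j) and V(j,i) share the same value.
            matrix[i][j] = matrix[j][i] = next(values)
    return matrix


# The two-variable example: variances 1.5 and 100.0, covariance 10.0.
print(full_covariance([1.5, 10.0, 100.0], 2))
```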
The sex-difference option can take the following values:
0 = no sex-difference
1 = constant sex-difference (the ratio of female/male
genetic distance is the same in all intervals)
2 = variable sex-difference (the female/male distance
ratio can be different in each interval)
The interference option can take the following values:
0 = no interference
1 = interference without a mapping function
2 = user-specified mapping function
Interference (i.e. options 1 or 2) is allowed only in some
analysis programs with three loci. The programs, as distributed,
contain Kosambi interference as the user-specified mapping
function.
First, consider a case without interference. When the sex-difference option is "0," one recombination rate is given for each of the nlocus-1 segments (see the complete example above). If the sex-difference option is "1," the male recombination rates are given on one line, and the female/male ratio of genetic distance is specified on the next line, e.g.:
1 0 << sex difference, interference
0.1 0.2 0.1 << male recombination
2.0 << female/male ratio of genetic distance
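Because the ratio applies to genetic distance rather than to the recombination rate itself, the implied female recombination rates depend on a mapping function. The Python sketch below assumes the Haldane mapping function (no interference); that choice is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def haldane_distance(theta):
    """Map distance in Morgans for recombination fraction theta (Haldane)."""
    return -0.5 * math.log(1.0 - 2.0 * theta)


def haldane_theta(distance):
    """Recombination fraction for a map distance in Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance))


def female_rates(male_rates, ratio):
    """Scale each male map distance by the female/male ratio, then convert back."""
    return [haldane_theta(ratio * haldane_distance(t)) for t in male_rates]


# Male rates and ratio from the example above.
print([round(t, 4) for t in female_rates([0.1, 0.2, 0.1], 2.0)])
```

Under this assumption, male rates of 0.1 and 0.2 with a ratio of 2.0 correspond to female rates of 0.18 and 0.32, not 0.2 and 0.4.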
When the sex-difference option is "2", the male recombination
rates are followed on the next line by female recombination
rates:
2 0 << sex difference, interference
0.1 0.2 0.1 << male recombination
0.2 0.1 0.2 << female recombination
Interference can be specified for three loci. With the interference
option 1, three recombination rates are given. These are
the recombination rates between adjacent loci in the two segments
and the recombination rate between the flanking loci. An example is:
1 1 << sex difference, interference
0.1 0.1 0.18 << male recombination
2.0 << female/male ratio of genetic distance
With the interference option 2, only the rates between the
adjacent loci are provided:
1 2 << sex difference, interference
0.1 0.1 << male recombination
2.0 << female/male ratio of genetic distance
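With interference option 2 the flanking rate is not supplied, so it must follow from the mapping function. As an illustration, the Python sketch below applies the standard Kosambi addition formula to the two adjacent rates and compares the result with the value expected under no interference. How the LINKAGE programs carry out this calculation internally is not shown here; the function names are hypothetical.

```python
def kosambi_flanking(theta12, theta23):
    """Flanking recombination rate under the Kosambi addition formula."""
    return (theta12 + theta23) / (1.0 + 4.0 * theta12 * theta23)


def no_interference_flanking(theta12, theta23):
    """Flanking recombination rate when crossovers are independent."""
    return theta12 + theta23 - 2.0 * theta12 * theta23


print(round(kosambi_flanking(0.1, 0.1), 4))
print(round(no_interference_flanking(0.1, 0.1), 4))
```

For adjacent rates of 0.1 and 0.1, the no-interference value is 0.18, which is the flanking rate used in the option 1 example above; the Kosambi value is slightly larger.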