User's Guide to the TDTae program 2.0
=========================================
TABLE OF CONTENTS
1.0 OVERVIEW
1.1 Theory
2.0 RUNNING THE TDTae
PROGRAM
2.1
Input files
and usage
2.1.1 Command line options
2.2
Example
files with this distribution
2.3
Error models
3.0 INTERPRETING RESULTS
FROM TDTae OUTPUT
3.1 Example runs
3.2 A
note about robustness to population stratification
4.0 MAXIMIZATION
PROCEDURES
4.1
Maximization
method used
4.2
Potential
maximization issues
4.3
A note about
computation time
5.0 PROBLEMS? COMMENTS?
(contact information)
6.0 ACKNOWLEDGEMENTS
7.0 REFERENCES
1.0 OVERVIEW
This program is compiled to work in UNIX Solaris, LINUX, and
Windows (PC) operating systems. All commands are executed from the command line
in UNIX and LINUX or DOS prompt in Windows.
This program is designed to perform a likelihood-based
transmission disequilibrium test (TDT) on genotype data from families in which
there is at least one affected child. Unlike other programs that perform
analyses with TDT, our program will allow Mendelian inconsistencies to be
present in the data. A two-stage search procedure is implemented (see Section
4.1) to compute the maximum log-likelihoods under a null (H0) and alternative
hypothesis (H1), and the difference of these log likelihoods provides the value
of the test statistic.
A key assumption in our method is that errors that occur in the
data are random and independent. If this assumption is incorrect, or if the
genotype data are "cleaned" (Mendelian inconsistencies removed before
use of the TDTae program) results of our analysis are likely to be invalid.
1.1 Theory
As mentioned above, the TDTae method performs a likelihood ratio
test to test for linkage in the presence of association. Our motivation for
developing this method is robustness. Two issues regarding
the robustness of the original transmission disequilibrium test (TDT) developed
by Spielman et al. (Spielman
et al. 1993) are: (i)
missing parental genotype data and (ii) the presence of undetected genotype
errors. While extensions of the TDT that are robust to items (i) and (ii) have
been developed, there was no single TDT statistic that is robust to both for
general pedigrees. We developed a likelihood method (Gordon
et al. 2001; Gordon et al. 2004), the TDTae,
which is robust to these items in general pedigrees. The TDTae assumes a more
general disease model than the traditional TDT, which assumes a multiplicative
inheritance model for genotypic relative risk. Our model is based on Weinberg’s
work (Weinberg
1999).
Full details of the TDTae method may be found in our paper (Gordon
et al. 2004). The TDTae
statistic is given by the formula:
, (*)
where
is the set of
observed (and possibly inconsistent)
genotypes for all pedigrees,
and
are the genotypic relative risks for an (unobserved)
di-allelic trait locus with low-risk (wild-type) allele + and high-risk (disease) allele d,
is the population genotype frequency of the 11 genotype,
is the population genotype frequency of the 12 genotype, and E
is the vector of parameters for a given error model (see below for list of
error models and their parameters). The null hypothesis is that
The likelihood
, which is a function of the parameters
, is maximized under the alternative (single carat in
equation (*)) and under the null (double carat in equation (*), with
set equal to 1.0). Twice the log-difference is asymptotically
distributed as a central
distribution with 2
degrees of freedom. By placing constraints on the genotypic relative risks, the
degrees of freedom for the TDTae statistic may be reduced to 1 (see Section
2.1.1).
This software program
computes the maximum log-likelihoods under the alternative and null hypotheses
and thus produces maximum likelihood estimates (MLEs) of
under each
hypothesis. Discussion of the maximization procedure is described below
(section 4.1).
2.0 RUNNING THE TDTae PROGRAM
2.1 Input files and usage
With our new version of TDTae (version 2.0), users only need one
file to run the TDTae program. It is a pedigree file (described below). Another
optional file is a marker file that contains the list of markers being used in
the analysis.
(Pedigree file)
This file contains information on pedigree structure, affection
status, and genotypes. This file is the same as the “pedin.pre” or “pedin.dat”
file in the LINKAGE and FASTLINK programs (see Terwiller and Ott 1994). This
file can be in either pre-MAKEPED (default) or post-MAKEPED LINKAGE format,
delimited by either spaces or tabs. If there are any observed Mendelian
inconsistencies in the dataset, our program requires that such inconsistencies
NOT BE REMOVED from the pedigree file. We make this requirement to insure that
the results from the TDTae analyses are valid (Gordon et al. 2001).
An example of a pedigree file (in pre-MAKEPED format) is given
below.
Pedigree Ind Father
Mother Gender Aff Locus 1 Genotype Locus 2 Genotype etc
1 1
0 0 1
1 1 1 1
2
1 2
0 0 2
1 2 1 1
2
1 3
1 2 2
2 2 2 1
2
2 1
0 0 1
1 1 1 1
2
2 2
0 0 2
1 2 2 0
0
2 3
1 2 2
2 2 2 1
2
2 4
1 2 1
2 1 1 2
2
…
Note that in this example file, there are Mendelian
inconsistencies at the first marker locus for the first and second families.
(Marker file)
This file is a list of marker names corresponding to the markers
in the pedigree file. There is no header line - one just lists the names of
each marker, starting with the first locus in your list.
An example of a marker file is given below.
D8S1125
D8S565
GATA3266
SNP1
…
Usage of the program is as follows:
> tdtae [OPTIONS] <input file> <error model>
[<locus> ..]
We explain each item:
[OPTIONS]: A list of options that may be used when running TDTae
(see Section 2.1.1).
<input file>: The name of the pedigree file.
<error model>: The selected error model for use in the
analysis – The options are GLHO, DSB, SPL, or MA. More details on each of these
error models is provided below (see Section 2.3.1).
[<locus>..] (optional): The list of markers for which the
TDTae analysis will be performed. Note that this list must be a list of
positive integers corresponding to ordered markers in the pedigree file. If no
list is provided, the program will run on all markers.
It is important to note that the list of options MUST come before
the pedigree file and error model chosen on the command line.
Example:
Suppose the name of the pedigree file is “pederr.pre” and the name
of the marker file is “markers.txt”. Suppose further that pederr.pre has 5
marker loci that have been genotyped. If the name of our executable code is
“tdtae” then we can run the TDTae program by typing:
>tdtae pederr.pre MA 2 3
at the command line. Here, the phrase MA indicates what error will
be used when performing the analysis. In this case, it is the Mote-Anderson
error model (Mote and Anderson 1965) (see Section 2.3). The numbers following “MA” indicate that
only markers 2 and 3 will be analyzed (out of a possible 5 markers).
If we type
>tdtae pederr.pre MA
then all 5 loci will be analyzed.
2.1.1 Command line options
The TDTae program version 2.0 comes with a list of features that
are available at the command line. To access these features, type
> tdtae
at the command line (prompt) and hit the Enter (Return) key. You
should see the following list of command line features.
Missing arguments
Program TDTAE - Version 2.0 using NR library
Usage: tdtae [OPTIONS] <input file> <error model>
[<locus> ..]
OPTIONS:
-a Specify minimum allele percentage
(default: 10)
-g Group
alleles with low count
-s Calculate support interval
-b Set support bound (default: 2)
-n Specify number of search results to
use (default: 5)
-po Input file is in POSTMAKE format
-o Specify output file
-f Specify file containing marker
names
-t Specify file for trimming output
-x Specify maximum number of founders
(default: 9)
-v Verbose
-e <outfile>
Calculate deviations from Hardy-Weinberg Equilibrium
-d Use dominant model
-r Use recessive model
-m Use mult model
-c Specify number of cuts (default: 5)
Valid Error Models
are: DSB, GLHO, SPL, MA
If no loci are
specified all will be analyzed
We explain each of these options in the order of their appearance
above.
-a: The default setting for the minimum
minor allele frequency of any allele being tested is 10%. That means, for a SNP
with two alleles coded 1 and 2, unless either allele has a frequency of 10%,
TDTae will not analyze that marker. For multi-allelic loci, with coded alleles
1, 2, 3, etc, unless an allele i or not i (i.e., all other
alleles) each have frequency of at least 10%, TDTae will not analyze that
allele. This option, which requires a
positive integer to follow it, allows the user to change the minimum frequency.
For example, typing “-a 20” changes the minimum allele frequency requirement to
20%. Our experience with this software is that the minimum allele frequency
should be at least 10%. The maximization method does not perform well when the
minor allele frequency is very small (see also Section 4.2).
-g: This
option groups together all alleles whose number of appearances is below the
minimum count (default = 30; also see –a command above) into one allele.
-s: This option allows for calculation of
support intervals (Edwards 1992) for each of the maximum likelihood
parameters under H0 and H1. The default setting is 2; that is, the endpoints of
the 2-unit support interval (i.e., 100:1 odds) of the MLEs of each parameter
are provided. The default setting can be changed by using the “-b” option (see
directly below).
-b: This option enables the user to specify
the length of the support interval when calculating using the “-s” option
above. This option must be followed by a positive integer. For example, typing
“-b 3” produces a 3-unit support interval instead of the default 2-unit
interval.
-n: When performing maximization under H0 or
H1, a two-stage procedure is employed (see Section 4.1 below on maximization).
Once the grid search (1st stage) is finished, parameter values
corresponding to the largest n log-likelihoods are used as starting
points for the Powell maximization method (Acton
1970; Brent 1973; Jacobs 1977). This option allows the user to specify the number of largest n
log-likelihoods that will be followed up (default is 5). When using this
option, the user must specify a positive integer indicating the number of
largest log-likelihoods that will be followed up.
-po: This option instructs the program that the
format of the pedigree file is post-MAKEPED format. If this option is not used,
then the program will assume that the format is pre-MAKEPED.
-o: This option enables the user to specify
the name of the output file. It is followed by the user-specified name of the
file. If this option is not used, the results will be written to the screen
only.
-f: Invoking this option enables the user
to specify marker names that will be used when reporting results. If no such
file of marker names is provided, then the output file will label each of the
markers “Locus #1, Locus #2, etc”.
-t: With this option, the user can view
what individuals were “trimmed” by the TDTae program to decrease the
computational load. Also see the “-x”
option (next).
-x: In its present formulation, TDTae’s
computational time to produce results increases with the number of individuals
in a pedigree. This option enables the user to trim the number of founders from
the pedigree so that the size of the pedigree that is analyzed is reduced. This
option must be followed by a positive integer. The default maximum number of
founders in a pedigree is 9.
-v: This option allows the user to view
progress of the maximizations for each marker locus and allele. It also provides
an estimation of the time till completion for each TDTae analysis with a given
allele.
-e: With this option the user can test
whether recoded genotypes on founders are in Hardy-Weinberg proportions. This
option uses the method employed in the HWE program (http://linkage.rockefeller.edu/ott/linkutil.htm#HWE). This option requires that the user
specify an output file for the results.
The next three options are related. The default analysis for TDTae
involves maximization over the parameters
and
under the
alternative, with no constraints placed on the relationship of these
parameters. As such, the resulting test statistic has 2 degrees of freedom
(df). The following options allow the user to perform a 1 df test, subject to
constraints on the parameters
and
. These tests might be used if the user has some prior
knowledge of the mode of inheritance of the trait being studied and wishes to
potentially increase power by reducing the degrees of freedom. The following
constraints are invoked when using the three options:
-d:
(dominant
mode of inheritance)
-r:
(recessive
mode of inheritance)
-m:
(multiplicative
mode of inheritance)
It is interesting to note that using the “-m” option is equivalent
to performing a TDT analysis with the original TDT statistic (Weinberg 1999).
-c: This option allows the user to specify
the number of “cuts” c that are used in the first stage of the search
procedure (see Section 4.1 below). This option must be followed by a positive
integer greater than 1. The default number of cuts c used is 5.
2.2 Example files with this distribution
Example pedigree files, marker files, and output files are provided
with the distribution of this software. The pedigree files are: pedsim-err.pre
(simulated data), psor17.pre (real data from a study of psoriasis pedigrees on
chromosome 17 (Helms et al. 2003)) and sito.pre (real data from a study of
sitosterolemia pedigrees on chromosome 2 (Lee et al. 2001)). The corresponding marker files are:
markers-pedsim.txt and markers-sito.txt (there is no marker file for the
psoriasis data).
2.3 Error models
To run the TDTae program with your data, you will need (as
mentioned above) a pedigree file in either pre- or post-MAKEPED format. You
must also specify the particular error model that you will use when performing
the TDTae analyses. The choices are:
GLHO (Gordon Liu Heath Ott)
(Gordon et al. 2001)
DSB (Douglas Skol Boehnke) (Douglas et al. 2002)
SPL (Sobel Papp Lange) (Sobel et al. 2002)
MA (Mote Anderson) (Mote and Anderson 1965)
A brief description of each error model is provided here.
Notationally, we assume that all markers have two (possibly down-coded) alleles
labeled 1 and 2. The parameter list for each error model is provided below this
description. Also see our website, http://linkage.rockefeller.edu/pawe/. The GLHO model introduces errors into
alleles as opposed to genotypes. It is described by 2 parameters. The DSB model
introduces errors into genotypes, and is the only model for which it is not
possible for a homozygous 11 genotype to be incorrectly recoded as a homozygous
22 genotype, or vice versa. It is
described by 2 parameters. The SPL
model is, for di-allelic loci, described by 3 parameters. It is the most
general error model possible for di-allelic loci, under the constraint that
errors are independent of the particular allele. The MA model, which is the most general error model possible in
the sense that it can describe all other error models, is described by 6
parameters. The GLHO, SPL, and MA error
models all allow for errors in which one homozygote is incorrectly miscoded as
another homozygote.
Gordon Heath Liu Ott (GHLO) error model parameters
The parameter settings for this error model are:
E1 = Pr(1 allele incorrectly coded as 2 allele)
E2 = Pr(2 allele incorrectly coded as 1 allele)
Both entries must be positive real numbers less than 1.0.
Douglas Skol Boehnke (DSB) error model
The parameter settings for this error model are:
Gamma = Pr(homozygous 11 or 22 genotype incorrectly coded as
heterozygote 12)
Eta = Pr(heterozygote 12 genotype incorrectly coded as homozygote
11 or 22)
Both entries must be positive real numbers less than 1.0.
Note: for the Eta parameter, it is assumed that the 12 genotype
has an equal probability (0.5) of being incorrectly coded as 11 or 22. Also,
the notation used here comes from the Gordon et al. (2002) reference.
Sobel Papp Lange (SPL) error model
The parameter settings for this error model are:
V1 = Pr(true homozygote incorrectly coded as
heterozygote)
V2 = Pr(one homozygote incorrectly coded as another
homozygote)
V3 = Pr(true heterozygote incorrectly coded as a
homozygote)
Note: This parameterization of the SPL error model is an
improvement over the parameterization previously used (Gordon et al. 2002) in
that it only requires three parameter settings. The author gratefully
acknowledges S. Seaman and P. Holmans for the improvement.
All entries must be positive real numbers less than 1.0, subject
to the following constraints:
V1 + V2 < 1.0
V3 < 0.5
Mote and Anderson (MA) error model
The parameter settings for this error model are:
e21 = Pr(12 genotype observed | 11 true)
e31 = Pr(22 genotype observed | 11 true)
e12 = Pr(11 genotype observed | 12 true)
e32 = Pr(22 genotype observed | 12 true)
e13 = Pr(11 genotype observed | 22 true)
e23 = Pr(12 genotype observed | 22 true)
The following constraints are needed for the MA error model:

The MA error model is the most robust error model in that it
completely characterizes all other error models given certain constraints.
Therefore, it is the “best” error model to use. However, it comes with a
computational price. It requires three more parameters to be maximized than the
SPL model, and four more than the GLHO and DSB error models.
3.0 INTERPRETING RESULTS FROM TDTae OUTPUT
A critical ingredient in running the TDTae analysis is
interpretation of the outcome. The program produces MLEs of parameter
estimates, values for the TDTae statistic, and uncorrected and corrected (for
multiple testing) p-values. Headings for each of the parameters are as follows:
r1: MLE of the genotypic relative risk
under alternative (H1) and null (H0) hypotheses.
r2: MLE of the genotypic relative risk
under alternative (H1) and null (H0) hypotheses.
p11: MLE of the genotype frequency
under alternative (H1) and null (H0) hypotheses – note that
the allele being tested is considered the “2” allele for estimation purposes.
p12: MLE of the genotype frequency
under alternative (H1) and null (H0) hypotheses – note that
the allele being tested is considered the “2” allele for estimation purposes.
LogLike: Maximum log-likelihood estimates of the data under
alternative (H1) and null (H0) hypotheses using two-stage search procedure (for
purposes of programming,
(LogLike) is
minimized rather than LogLike being maximized).
LRT: The TDTae statistic - this quantity is given by the formula
-2 [LogLike(H1)-LogLike(H0)].
P: p-value (uncorrected for multiple testing) corresponding to the
LRT statistic for allele being tested.
Corrected: P-value corrected for multiple testing. The correction
is done as follows: if k alleles at marker locus are tested, and p
is the uncorrected p-value corresponding to a particular allele, then the
corrected p-value is given by
. See our paper (Gordon et al. 2004) for more details. Note that for SNPs, no
correction for multiple testing is performed.
Also provided are MLEs for all error model parameters. See Section
2.3.1 above for the list of different error model parameters.
3.1 Example runs
We present here the results of some example runs. The first
example uses the simulated data provided in this distribution (pedsim-err.txt
and markers-pedsim.txt).
We comment that the results file below was created by typing:
>tdtae -f marker-pedsim.txt –n 20 –o pedsim-tdtae.out
pedsim-err.pre GLHO
Note that we chose the error model of Gordon et al. (Gordon et al. 2001) for this analysis, because we simulated
the data according to that error model.
Results from
program TDTAE Version 2.01 using NR library
Written By
Chad Haynes and Derek Gordon
Please email
tdtae@linkage.rockefeller.edu with any bugs or problems
Locus #1
SNP1
Allele #1 (875 occurrences - 29.2%)
MLE r1 r2
p11 p12 E1 E2
LogLike LRT P Corrected
H1: 1.048007 1.508604
0.583394 0.378448 0.079448 0.000010 1287.559815
1.568215 0.456527 0.456527
H0: 1.000000 1.000000
0.575082 0.385035 0.075341 0.000010 1288.343922
Locus #2
SNP2
Allele #1 (770 occurrences - 25.7%)
MLE r1 r2
p11 p12 E1 E2
LogLike LRT P Corrected
H1: 0.860711 0.338686 0.599816
0.358233 0.055971 0.000010 1214.811515 5.065659 0.079434 0.079434
H0: 1.000000
1.000000 0.448872 0.466392
0.000010 0.180059 1217.344344
Locus #3
SNP3
Allele #1 (883 occurrences - 29.6%)
MLE r1 r2
p11 p12 E1 E2
LogLike LRT P Corrected
H1: 0.942563 1.324528
0.580013 0.345977 0.091583 0.084099 1330.769422
0.683097 0.710669 0.710669
H0: 1.000000 1.000000
0.608122 0.328371 0.101303 0.047451 1331.110970
Locus #4
SNP4
Allele #1 (662 occurrences - 22.4%)
MLE r1 r2
p11 p12 E1 E2
LogLike LRT P Corrected
H1: 0.775813 0.123752
0.559178 0.362844 0.032243 0.186968 1159.912489
15.579787 0.000414 0.000414
H0: 1.000000 1.000000
0.716827 0.251595 0.085297 0.014342 1167.702383
These data were simulated so that the first three markers (SNPs
1-3) are null and the last SNP is in both linkage and linkage disequilibrium
with a trait locus. Also, approximately 25% of the parents in this file were
not genotyped, and genotyping error was simulated according the GLHO model with
each error parameter being set to 0.10. Because all data were simulated
independently, a Bonferroni correction is appropriate. Thus, we see that, even
in the presence of missing parental data and genotyping errors, the TDTae
method provides accurate information in that it indicates that the trait locus
is located near SNP 4. The TDTae statistic is not significant at the 5% level
for any other marker after the Bonferroni correction.
We also note that, despite the relatively large sample size (500
trios), error parameter estimation is not consistent from marker to marker.
Thus, error parameter estimation should be used with caution when considering
data from trios.
Note that for two of the loci (#2 and #4), MLEs for genotypic
relative risk values
and
for allele 1 are both less than 1. These values can be
converted to genotypic relative risks for the “non-1” allele using the formulas
where the prime superscript indicates genotypic relative risk
for the “non-1” allele.
In the next example, we present results for selected markers from
the Sitosterolemia data (Lee et al. 2001) provided in this distribution. The
pedigree file is sito-ped.txt and the marker file is markers-sito.txt.
We choose the DSB model for our error model, although, as the output file
indicates, there are no observed genotyping errors in this data set. Also,
because we know that the disease is inherited in a recessive fashion, we chose
the “-r” option when running our analyses (Section 2.1.1 – Command Line Options).
The advantage of using this option is that there is only one degree of freedom
for the corresponding TDTae (LRT) statistic. Also, we chose the “-a 20” option
to allow testing for alleles whose minimal number of occurrences is 20.
We comment that the results file below was created by typing:
>tdtae –a 20 -f markers-sito.txt –n 20
–v –r -c 10 –o sito-tdtae.out sito-ped.txt DSB 13 15 20
Results from
program TDTAE Version 2.01 using NR library
Written By
Chad Haynes and Derek Gordon
Please email
tdtae@linkage.rockefeller.edu with any bugs or problems
Locus #13
D2S4009
Allele #2 (39 occurrences - 23.5%)
MLE r1 r2
p11 p12 Gamma Eta
LogLike LRT P Corrected
H1: 1.000000 8.987662
0.606503 0.305744 0.000000 0.000000
52.606985 6.274741 0.012267
0.012267
H0: 1.000000 1.000000
0.592418 0.318584 0.000000 0.000000 55.744355
Locus #15
D2S2298
Allele #2 (98 occurrences - 57.6%)
MLE r1 r2
p11 p12 Gamma Eta
LogLike LRT P Corrected
H1: 1.000000
22.562454 0.245505 0.554387 0.000000
0.000000 59.997096 35.35158
0.000000 0.000000
H0: 1.000000 1.000000
0.228886 0.552130 0.000000 0.000000 77.672887
Locus #20
D2S2174
Allele #1 (37 occurrences - 22.3%)
MLE r1 r2 p11 p12 Gamma Eta LogLike
LRT P Corrected
H1: 1.000000 1.479179
0.575724 0.327053 0.000000 0.000000
55.185774 0.158508 0.690551 0.904241
H0: 1.000000 1.000000
0.574890 0.326453 0.000000 0.000000 55.265027
Allele #2 (38 occurrences - 22.9%)
MLE r1 r2
p11 p12 Gamma Eta LogLike LRT
P Corrected
H1: 1.000000 10000.00
0.691292 0.270247 0.000000 0.000000
41.537543 21.31348 0.000004
0.000008
H0: 1.000000 1.000000
0.682654 0.278131 0.000000 0.000000 52.194284
Allele #4 (45 occurrences - 27.1%)
MLE r1 r2
p11 p12 Gamma Eta
LogLike LRT P Corrected H1: 1.000000 4.452777
0.532816 0.397743 0.000000 0.000000 57.419163 4.988331 0.025541 0.050429
H0: 1.000000 1.000000
0.527557 0.397515 0.000000 0.000000 59.913329
There are a few interesting things to note about this output.
First, the TDTae statistic is performed for several alleles at each marker. As
can be noted by studying the maximum LRT value for each marker and the
corresponding minimal corrected p-value, the results are highly significant. We
comment that genotype relative risk estimates for
are large for each marker.
We comment that the location of the “Sitosterolemia” genes, ABCG5
and ABCG8, are approximately 20,000 base pairs from marker D2S2298 (Lee et al. 2001; Lu et al. 2001).
3.2 A note about
robustness to population stratification
We comment that our likelihood method as presently designed may
not be robust to population stratification, the original reason why the TDT and
other statistics were developed. We quote from our European Journal of Human
Genetics paper: “We note that we assumed that the mating
type frequencies of founders is given by the product of the individual genotype
frequencies, unlike Weinberg (Weinberg
1999).
We make this simplification to reduce the number of parameters that we must
maximize in finding the maximum log-likelihood of the data. While it may be
more powerful to use the six mating types, we comment that our simplification
reduces the number of parameters to be estimated by three. However, our
assumption does make our statistic potentially non-robust to population
stratification, the original condition for which the TDT and other statistics
were developed (Falk
and Rubinstein 1987; Spielman et al. 1993).
We plan to extend our method to handle the more general mating-type frequencies
proposed in Weinberg’s work (Weinberg
1999).”
We therefore caution researchers regarding interpretation of
results when using TDTae version 2.0 on data that is potentially stratified due
to population admixture.
4.0 MAXIMIZATION PROCEDURES
4.1 Maximization method used
When applying our test statistic,
we perform a two-stage maximization procedure. We first compute the
log-likelihood under the null and alternative hypotheses using a lattice of
points from a multi-dimensional rectangle. We “cut” the cube into a pre-specified
number of intervals, and compute the log-likelihoods for the endpoints of each
of the intervals. The number of cuts can be user-specified (see “-c” option –
Section 2.1.1) and the default setting is five cuts. For example, if we
consider the SPL error model, and specify four cuts, then the parameters
,
, and V1 through V3, all of which have
values in the interval [0,1], will be tested at 4+1 = 5 values: 0.0, 0.25, 0.5,
0.75, and 1.0. For the relative risk parameters
and
, we initially consider the interval [0, 20]. Thus, in the
first stage of our maximization, the log-likelihood is computed for
values under the
alternative hypothesis, and for
values under the null, where c is the number of cuts
specified, and e is the number of error model parameters for a given
error model. The parameter e = 2, 3, and 6 for the DSB, SPL, and MA
error models, respectively.
Once the log-likelihoods are computed in the first stage, the parameter values that provide the top n log-likelihoods under each hypothesis are then used as starting values for the Powell maximization procedure (Acton 1970; Brent 1973; Jacobs 1977). The value n may be user specified (see “-n” option – Section 2.1.1). The default setting for this number is five. We use the Powell procedure as implemented in the “Numerical Recipes in C” text (Press et al. 2002). The largest log-likelihood from each set of n runs is then chosen as the maximum log-likelihood for each hypothesis.
4.2 Potential maximization issues
We have performed extensive analyses with this program.
Occasionally, we have seen LRT values for certain alleles that are less than 0,
indicating that the maximum log-likelihoods were not found under H1.Typically,
this result is caused by using alleles whose minor allele frequency is small (less
than 10%). One potential solution is to re-run the analysis using a
sufficiently large minor allele frequency (i.e., choose the command line option
“-a x”) where x is an integer greater than or equal to 20. If the
problem persists, please contact us at the e-mail listed below (Section 5.0).
4. 3 A note about computation time
We comment that the time necessary to complete maximization and
therefore to determine the LRT value for a given marker allele is dependent
upon the number of individuals in a pedigree and also the number of parameters
being maximized (see also Section 4.1). As the number of individuals in a
pedigree grows, the computation time for our method also increases. Therefore,
our method at present is most efficient for small nuclear families in which
there are no genotype errors. We are presently researching approximate
likelihood solutions to reduce the computational time necessary to compute the
LRT values for any pedigree.
5.0 PROBLEMS? COMMENTS?
If there are problems in the execution or compilation of this
program or if you would like to provide some feedback, please e-mail
tdtae@linkage.rockefeller.edu.
6.0 ACKNOWLEDGEMENTS
The authors of this software gratefully acknowledge grant
K01-HG00055 from the National Institutes of Health. The
psoriasis study for which example data are provided is funded in part by NIH grant AR049049.
7.0 REFERENCES
Below are references for this README file. Please cite:
Gordon D, Haynes C, Johnnidis C, Patel S, Bowcock AM, Ott J (2004) A transmission disequilibrium test for general pedigrees to the presence of random genotyping errors and any number of untyped parents. European Journal of Human Genetics 12:752-61
Gordon D, Heath SC, Liu X, Ott J (2001) A
transmission/disequilibrium test that allows for genotyping errors in the
analysis of single-nucleotide polymorphism data. American Journal of Human
Genetics 69:371-380
when reporting results
obtained by using the TDTae program.
Acton FS (1970) Numerical methods that work. Mathematical Association of America, Washington, DC
Brent RP (1973) Chapter 7. In: Algorithms for minimization without derivatives. Prentice-Hall, Englewood Cliffs, NJ
Douglas JA, Skol AD, Boehnke M (2002) Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am J Hum Genet 70:487-495
Edwards AWF (1992) Likelihood. The Johns Hopkins University Press, Baltimore
Falk CT, Rubinstein P (1987) Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations. Annals of Human Genetics 51:227-233
Gordon D, Haynes C, Johnnidis C, Patel S, Bowcock AM, Ott J (2004) A transmission disequilibrium test for general pedigrees to the presence of random genotyping errors and any number of untyped parents. European Journal of Human Genetics 12:752-61
Gordon D, Heath SC, Liu X, Ott J (2001) A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. American Journal of Human Genetics 69:371-380
Helms C, Cao L, Krueger JG, Wijsman EM, Chamian F, Gordon D, Heffernan M, Daw JA, Robarge J, Ott J, Kwok PY, Menter A, Bowcock AM (2003) A putative RUNX1 binding site variant between SLC9A3R1 and NAT9 is associated with susceptibility to psoriasis. Nat Genet 35:349-356
Jacobs DAH (1977) The state of the art in numerical analysis. Academic Press, London
Lee MH, Gordon D, Ott J, Lu K, Ose L, Miettinen T, Gylling H, Stalenhoef AF, Pandya A, Hidaka H, Brewer B, Jr., Kojima H, Sakuma N, Pegoraro R, Salen G, Patel SB (2001) Fine mapping of a gene responsible for regulating dietary cholesterol absorption; founder effects underlie cases of phytosterolaemia in multiple communities. European Journal of Human Genetics 9:375-384
Lu K, Lee MH, Hazard S, Brooks-Wilson A, Hidaka H, Kojima H, Ose L, Stalenhoef AF, Mietinnen T, Bjorkhem I, Bruckert E, Pandya A, Brewer HB, Jr., Salen G, Dean M, Srivastava A, Patel SB (2001) Two genes that map to the STSL locus cause sitosterolemia: genomic structure and spectrum of mutations involving sterolin-1 and sterolin-2, encoded by ABCG5 and ABCG8, respectively. American Journal of Human Genetics 69:278-290
Mote VL, Anderson RL (1965) An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika 52:95-109
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2002) Numerical Recipes in C. The art of scientific computing. Cambridge University Press, Cambridge
Sobel E, Papp JC, Lange K (2002) Detection and integration of genotyping errors in statistical genetics. American Journal of Human Genetics 70:496-508
Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics 52:506-516
Weinberg CR (1999) Allowing for missing parents in genetic studies of case-parent triads. American Journal of Human Genetics 64:1186-1193