User's Guide to the LRTae program (Version 1.3.1)

=========================================

 

TABLE OF CONTENTS

 

1.0   OVERVIEW AND THEORY

            1.1       Terminology

1.1.1        Maximum likelihood estimates of parameters using Expectation-Maximization algorithm

1.1.2        Asymptotic and permutation p-values

 

2.0   RUNNING THE LRTae PROGRAM

2.1         Input files

2.1.1              Description file

2.1.2              Fallible data file

2.1.3              Infallible data file

2.1.4              Distinguishing among different types of genomic data

2.2         Usage

2.2.1              Command line options

2.3          Example files with this distribution

3.0   INTERPRETING RESULTS FROM LRTae OUTPUT

3.1                    A note about selection of double-samples

3.2                    Usage of LRTae with haplotype pairs

 

4.0  PROBLEMS? COMMENTS? (contact information)

 

5.0  ACKNOWLEDGEMENTS

 

6.0  REFERENCES

 

1.0  OVERVIEW AND THEORY

 

This program is compiled to work in UNIX Solaris, LINUX, and Windows (PC) operating systems. All commands are executed from the command line in UNIX and LINUX or DOS prompt in Windows.

 

One of the more vexing problems in association analysis with cases and controls is misclassification error (e.g., a true case is incorrectly labeled as a control or a homozygote genotype 1 1 is labeled as a heterozygote 1 2). Consequences of such misclassifications are a reduction in the power to detect association and biased estimates or parameters such as genotype, haplotype, or multi-locus genotype frequencies in cases and controls. These misclassifications are of primary concern for case/control genetic association studies since there is no way to detect such errors in phenotype or genotype data short of repeated sampling (or double-sampling – see below).  We developed a method, the Likelihood Ratio Test Allowing for Errors (LRTae) to treat this problem (1). The purpose of this software is to compute the LRTae on genotype and phenotype data provided by the user.

 

1.1 Terminology

A key assumption for use of this program is that double-sample phenotype and/or genomic data (either genotypes or haplotypes – see below) are available for some subset of the individuals. By double-sample we mean a sample (either phenotype, genotype, or haplotype) that has been classified twice in a particular manner (2, 3). The first classification (called the fallible classification) is determined by a standard procedure that is subject to misclassification error. It is available on every individual. The second classification (called the infallible classification) is determined by a “gold-standard” procedure that has much lower misclassification error than the fallible classification mechanism. It is typically available on only a subset of individuals. An example of a double-sample for phenotype comes from Alzheimer’s Disease. The fallible classification procedure is phenotype determination using a clinical dimentia instrument and the infallible procedure is autopsy-proven plaques and tangles. For genotypes, the fallible classification is the current genotyping technology and the infallible classification procedure may be forward and reverse sequencing. For haplotypes, the fallible classification procedure may be statistically assigned haplotypes based on multi-locus genotype data and the infallible classification may be molecular haplotype calls. Individuals for whom both classifications are available are called double-samples. We also use the term genomic data to describe either genotype, haplotype-pair, or multi-locus genotype data.

This program performs a likelihood ratio test on genomic data for cases and control individuals. It uses double-sample data to: (i) estimate the phenotype and genomic data misclassification probabilities; (ii) compute asymptotically unbiased estimates of genomic data frequencies in cases and controls; and (iii) increase power for genetic association.   While we do not present full details of the theory in this user guide, we note that full documentation for our method is available online (http://www.bepress.com/sagmb/vol3/iss1/art26/) or from the authors (lrtae@linkage.rockefeller.edu).

 

1.1.1    Maximum likelihood estimates of study parameters using Expectation-Maximization algorithm

Maximum likelihood estimates (MLEs) of parameters such as the sampling proportions of cases and controls in the data set and the specific genomic data frequencies in cases and controls under the null and alternative hypotheses are determined using the Expectation-Maximization (EM) method (4). Parameter estimates are updated until the absolute difference between the sum of the nth step and n + 1st step estimates is no greater than 10-9 (summed over all parameters). That is, the stopping condition for parameter estimation is that , where is the nth step estimate of the ith parameter.  

 

1.1.2    Asymptotic and permutation p-values

The null hypothesis (H0) for our statistic is that each observed genotype, haplotype-pair, or multi-locus genotype frequency (depending upon the genomic data) in cases and controls is equal. The alternative hypothesis (H1) is that at least one genotype, haplotype-pair, or multi-locus genotype has a different frequency between cases and controls. Log-likelihoods for the data under each of the hypotheses are computed and our LRTae test statistic is twice the difference of the log-likelihoods. Asymptotically, under H0, the LRTae statistic is distributed as a central chi-square distribution with k-1 degrees of freedom, where k is the number of distinct genotype (or haplotype-pair) categories observed (5). In situations where the number of observations of a particular genomic classification may be small, the asymptotic distribution may not be valid. In such situations, we compute p-values for the LRTae statistic via permutation. However, because we have double-sample data, permutation testing is not so straightforward. We consider two situations:

(a)    Double-sample information is available only for genomic data – In such situations, we randomly permute phenotype status for all individuals, keeping the total number of cases and controls fixed in each replicate. The permutation p-value is then the proportion of replicates for which the LRTae statistic exceeds the observed LRTae statistic. Exact confidence intervals for the permutation p-value are computed using the method implemented in the BINOM software (6). 

(b)   Double-sample information is available only for phenotypes – In such situations, we randomly permute genomic data status for all individuals, keeping the total number of observations of a particular genomic classification fixed in each replicate. As above, the permutation p-value is the proportion of replicates for which the LRTae statistic exceeds the observed LRTae statistic. Exact confidence intervals for the permutation p-value are computed using the method implemented in the BINOM software (6).

While we conjecture that the p-values determined via our permutation procedures (described in items (a) and (b)) will achieve correct significance levels, we note that we have performed no simulation studies to date to verify these procedures. Therefore, users should interpret p-values using the permutation procedure with caution.

            Finally, we note that we perform the same procedures for the LRTstd method.

 

2.0 RUNNING THE LRTae PROGRAM

 

2.1 Input files

 

The LRTae program (version 1.0) requires three files as input. They are: (i) the phenotype and genotype description file; (ii) the fallible data file; and (iii) the infallible data file. Full details regarding format and examples for each type of file are presented below.  

 

2.1.1 Description file

 

This file contains the information identifying the phenotypes and genotypes in the fallible and infallible data files (items 2.1.2 and 2.1.3). There are a minimum of two space-separated fields per row for this file. The first field indicates the nature of the categorical variable in second column for the corresponding row. Choices are: P = Phenotype; A = Genomic data marker with genotypes or haplotype-pairs coded as two separate alleles or haplotypes; C = Marker locus with genotypes coded by a single number; M = Multi-locus genotype or haplotype-pair. As an example to distinguish between symbols A and C, consider a SNP locus containing genotypes DD, Dd, and dd. Using the symbol A, individual genotypes are written in the fallible and infallible data files using two fields. For example, the two fields “1 1”, “1 2”, and “2 2” may represent the genotypes DD, Dd, and dd. Note that this is the same coding that is used for genotypes in the LINKAGE programs (7). Using the symbol “C”, genotypes are coded using a single number. Using our example, we may make the assignments: “1” = DD, “2” = Dd, “3” = dd, although this assignment is arbitrary. This coding is similar to the coding for genotypes used in the EH program (8). Also see http://linkage.rockefeller.edu/ott/eh.htm .

The second field in this file is the name of the variable corresponding to the symbol in the first field. This variable name is also used in the 1st line of both the fallible and infallible data files to indicate the nature of the corresponding columns in those files.

An optional field may follow the variable name for non-multi-locus genotype data indicating the symbol that is used for missing data in the fallible and infallible data files.  If no value is provided, then the program will assume the default settings that “-1” is the code for missing phenotype data while “0” is the missing code for genomic data. Using the default settings, one uses “0 0” for an individual’s missing genomic data if the data are coded using the A symbol and “-1” for an individual’s missing genomic data if the data are coded using the C symbol. If one wants to use the value of –99 for missing genomic data, then similarly one types “-99 -99” for an individual’s missing genomic data when the data are coded using the A symbol and “-99” for an individual’s missing genomic data when the data are coded using the C symbol.  We present an example description file’s contents below.

For markers specified by the “C” symbol, allele configurations can be specified for use in down-coded analysis.  The code definitions are added to the row after the marker name and, if present, after the unknown code.  The format should be “code=allele1,allele2” where code is the genotype or haplotype-pair code and allele1 and allele2 are the alleles/haplotypes that correspond to code.  Note that there are no spaces around the ‘=’ or ‘,’.  There must be a definition for each code if you wish to use the allele down-coding analysis option –d (see Section.

            Multi-locus markers require a space-separated list of marker names to appear following the multi-locus marker name.  Each marker in the list must appear in the data description file and be of type “A” or “C”.

            Any thing appearing after a “#” symbol is considered a comment and is ignored.

 

(Example description file)

P   pheno1    –1   # comments

P   pheno2    –99

A   SNP1      95

C   Mkr2      1=1,1 2=1,2 3=2,2

C   Mkr3      -2 3=3,3 4=3,4 5=3,5 6=4,4 7=4,5 8=5,5

M   multi SNP1 Mkr2

 

In this file, we see that there are six variables of interest. Two correspond to phenotypes and four correspond to genomic data. The first phenotype variable is called “pheno1” and the second is called “pheno2”. The first genomic variable is labeled “SNP1”, the second is labeled “Mkr2”, the third is labeled “Mkr3”, and the fourth is “multi”.  For the variable pheno1, missing data is indicated by a “-1” code, while for the variable pheno2, missing data is indicated by a “-99” code. For variable SNP1, missing data is indicated by a “95” code. There is no specified code for missing data with the variable Mkr2, so the default code “-1” is used.  The allele definitions for the 3 different genotypes are listed after the name label.  For variable Mkr3, missing data is specified by a “-2” value. The list of the 6 genotype codes follows the missing data specification.  The multi-locus genotype marker “multi” is comprised of markers “SNP1” and “Mkr2”.

 

2.1.2    Fallible data file

This file contains phenotype and genomic classifications for all individuals as measured with the fallible measuring instrument (see section 1.1 for definition of fallible). The format for this file is as follows:

 

Ind_ID   Order_of_ phenotype_and genomic_data

 

There are two key formatting issues regarding this file and the infallible data file (2.1.3). They are:

 

1)      The first line of the file always consists of the order of the phenotype and genomic data. The order is determined by using the variable names provided in the description file (item 2.1.1).

2)      The first column of each row is always the individual ID (Ind_ID), which must be an alphanumeric string of characters.

 

The format for this file is best explained through an example. Suppose that we have an example description file as provided in item 2.1.1 above. Then a fallible data file might look like:

 

(Example fallible data file)

 

pheno1 SNP1 Mkr2 pheno2

A1   0   1  1   2   1

U2   1   1  2   1 -99

A3   0   2  2   3   0

A4   1  95 95   1   1

U12  1   1  2   2   0

 

In this example, we provide fallible phenotype and genomic classifications for five individuals. As mentioned above, the first column for both the fallible and the infallible data files MUST be the individual ID and it must be an alphanumeric string.

As indicated in the example description file, the first column after the Individual code column is the individual’s classification for the phenotype pheno1. All phenotype classifications must always be either 0 (case) or 1 (control). In this example file, we see that the fallible classification of the phenotype “pheno1” for the five individuals are (respectively): case, control, case, control, control.

Because the 1st line of this file indicates that the data after the phenotype pheno1 is genotype data for the variable SNP1 and genotypes are coded using two columns (one for each allele – symbol A in the example description file), we can determine that the respective genotypes for individuals at this marker locus are: 1 1, 1 2, 2 2, missing data, and 12.

The 1st line of the example fallible data file indicates that the next data are coded genotypes for the variable “Mkr2” that are coded using a single column (symbol C in the example description file). In the example fallible data file, we observe three different haplotypes. The respective coded haplotypes for these individuals are: 2, 1, 4, 1, 2.

Finally, the last column in this example data file represents classification for a second phenotype, pheno2. Individuals have the respective phenotypes: control, missing data, case, control, case, using the fallible measuring instrument for phenotype pheno2.

 

2.1.3    Infallible data file

This file contains phenotype and genomic classifications for individuals as measured with the infallible measuring instrument (see section 1.1 for definition of infallible). Note that the list of individuals in this file is a subset of the list of individuals in the fallible data file. An individual is listed in this file only if the individual has an infallible measurement for at least one of the phenotypes and/or genomic data variables in the description file. The format for this file is as follows:

 

Ind_ID   Order_of_ phenotype_and genomic_data

 

As above, we illustrate with an example file. Suppose that the data below is the infallible data file corresponding to the example description and fallible data files above. Notice that individual A4 is missing from what we report. That means that there is no double-sample data for that individual. Also notice that there is no information for variable pheno2; that means that any analysis with the phenotype pheno2 will not include double-sample information on pheno2.

 

(Example infallible data file)

 

pheno1 SNP1 Mkr2

A1   0   1 1   2  

U2   1   2 2   0  

A3   0   2 2   3

U12  0   1 2   2        

 

In this example, we provide infallible measures for the phenotype variable pheno1, for the SNP marker locus variable SNP1, and for the coded genotype variable Mkr2. As mentioned above, the first column for both the fallible and the infallible data files MUST be the individual ID and it must be an alphanumeric string.

As indicated in the example description file, the first column after the Individual code column is the individual’s classification for the phenotype pheno1. All phenotype classifications must always be either 0 (case) or 1 (control). In this example file, we see that the infallible classification for the four individuals are (respectively): case, control, case, case. It is interesting to note that in these example files, individual U12 is classified as a case using the infallible classifier and as a control using the fallible classifier (Example fallible data file). This is an example of a phenotype misclassification. This information is used when computing the LRTae statistic (1).

Because the 1st line of this example file indicates that the data after the phenotype pheno1 is a genotype consisting of two columns (one for each allele – symbol A in the example description file), we can determine that the respective genotypes for individuals at this marker locus are: 1 1, 2 2, 2 2, and 1 2.

The 1st line of this example file indicates that the next data are coded genotypes for the variable Mkr2 (coded using a single column – symbol C in the example description file). In the example infallible data file, the respective coded genotypes for the four individuals are: 2, missing data (coded 0), 4, and 2.

Finally, the last column in this example data file represents classification for a second phenotype, P2. Individuals A1-A4 have the respective phenotypes: case, case, control, case using the fallible measuring instrument for phenotype P2.

 

2.1.4 Distinguishing among different types of genomic data

 

Here, we add a note about the distinction among genotype, haplotype-pair, and multi-locus genotype data. By genotype data, we mean data for a single marker locus. By haplotype-pair data, we mean data for haplotype pairs (see example in Section 3.2). By multi-locus genotype data, we mean the concatenation of single-locus genotypes. For example, suppose we have genotype data for three SNPs labeled SNP1, SNP2 and SNP3. Furthermore, suppose we have used assumed codes of 1, 2, and 3 for the more common homozygote, the heterozygote, and the less common homozygote at each SNP (see Section 2.1.1 above for description of genotype coding). Then, using all three SNPs, the multi-locus genotype (coded) for any individual consists of the string of the three individual genotype codes. For example, if an individual has coded genotypes 1, 2 and 2 at SNPs 1, 2 and 3, then the person’s multi-locus genotype is 122.  Similarly, if an individual has coded genotypes 3, 3, and 1, then the person’s multi-locus genotype is 331. Testing for association with multi-locus genotypes is accomplished by using the “multi” coding in the Description file (Section 2.1.1). Note that, if a “multi” coding is presented in the Description file, LRTae will automatically perform association testing using the multi-locus genotypes; no additional information is needed in either the Fallible or Infallible data files (Sections 2.1.2 and 2.1.3) other than genotype or haplotype-pair information for the individual loci.

 

2.2              Program usage

 

Usage of the program is as follows:

 

Usage: lrtae [OPTIONS] <fallible file> <infallible file> <description file>

 

We explain each item:

 

[OPTIONS]: A list of options that may be used when running LRTae (see Section 2.2.1).

 

<fallible file>: The name of the fallible data file (item 2.1.2)

<infallible file>: The name of the infallible data file (item 2.1.3)

<description file>: The name of the description file (item 2.1.1)

 

2.2.1    Command line options

 

        -p <# perms>         Permute phenotypes only

        -g <# perms>         Permute genomic data only

        -b <# perms>         Permute both phenotypes and genomic data

 

        -j <genomic_list>    Run statistics for the genomic data specified;

                                         Genomic_list should be a comma separated list

        -k <phenotype_list>  Run statistics for the phenotypes specified;

                                          Phenotype_list should be a comma separated list

 

        -c <conf_interval>   Set the confidence interval (default is 0.95 = 95%)

 

-d                        Downcode multi-allelic loci to snp loci (for testing association with a specific risk allele or haplotype)

 

        -o                   Specify output file

        -h                   Show help

        -v                   Show version

 

Note: If no genomic data or phenotype is specified all will be used       

 

We explain each of these options in the order of their appearance above.

 

-p <# perms>: This option assumes that double-sample information is only available for genomic data. Permutations are performed by randomly reassigning phenotypes, keeping each individual’s information on genomic classification fixed (including double-sample information). The total number of cases and controls is also fixed for each replicate. The number of permutations is specified by the user (#permutations). For example, typing “-p 10000” means that 10,000 permutations will be performed by randomly reassigning observed case and control status. For more details on p-values, see section 1.1.2 above. It is important to comment that if this option is chosen for a phenotype variable in which double-sample data is available, permutation p-values may not be valid.   

 

-g <# perms>: This option assumes that double-sample information is only available for phenotypes. Permutations are performed by randomly reassigning genomic data classification, keeping each individual’s information on phenotypes fixed (including double-sample information). The total number of each genomic classification is also fixed for each replicate. The number of permutations is specified by the user (#permutations). For example, typing “-g 10000” means that 10,000 permutations will be performed by randomly reassigning genomic classifications. For more details on p-values, see section 1.1.2 above. It is important to comment that if this option is chosen for a genomic data variable in which double-samples are available, permutation p-values may not be valid.   

 

Note: For any run of the LRTae program, only one of the options “-p” or “-g” may be chosen.

 

-b <# perms>: With this option, permutations are performed on both phenotypes and genomic data. THIS OPTION IS NOT RECOMMENDED AT PRESENT – MORE RESEARCH MUST BE DONE TO EVALUATE P-VALUES DETERMINED USING THIS OPTION.

 

-j <genomic_list>: This option is chosen if the user only wants to perform LRTae analysis on a subset of the genomic data provided in the data files (items 2.1.1-2.1.3). For example, suppose that genotype and/or haplotype pair data are available for variables Mkr1, Mkr2, Mkr3, Mkr4, Hap1, Hap2, and Hap3. If this option is not chosen, then the program will compute the LRTae statistic for all genomic data. However, if the user types the option:

            “-j Mkr1,Mkr3,Hap1”

then the LRTae statistic will only be computed for Mkr1, Mkr3, and Hapl variables.

 

Note: With this option, the list of genomic data variables MUST be separated by a comma and there can be no spaces between variables. For example, the program will not run if the user types “-j Mkr1,  Mkr3,  Hapl” with spaces after the commas.

 

-k <phenotype_list>: This option is chosen if the user only wants to perform LRTae analysis on a subset of the phenotype variables provided in the data files (items 2.1.1-2.1.3). For example, suppose that phenotype data are available for variables p1, p2, p3, and p4. If this option is not chosen, then the program will compute the LRTae statistic for all four variables. However, if the user types the option:

            “-k p2,p4”

then the LRTae statistic will only be computed for the p2 and p4 variables.

 

Note: As above (-j option), the list of phenotype variables MUST be separated by a comma and there can be no spaces between variables. For example, the program will not run if the user types “-k p2,  p4” with spaces after the comma.

 

-c <confidence_interval>: This option enables the user to specify the percent confidence interval for the p-value obtained by permutation. For example, if the user types “-c 0.x”, then a x% confidence interval centered about the permutation p-value will be provided. So, if the user types “-c 0.99”, then a 99% confidence interval will be calculated and if the user types “-c 0.90” then a 90% confidence interval will be calculated. This confidence interval is determined using the method implemented in the BINOM program (see http://linkage.rockefeller.edu/ott/linkutil.htm#BINOM for more information). If this option is not selected and permutations are performed, the default confidence interval provided is 95%.  

 

-d: This option allows you to down-code genotypes based on the alleles.  For each allele present, the analysis will be run after recoding the alleles to be 1 if it is not equal to the target allele and 2 if it is equal to the target allele.  Thus if we were considering the allele “1” we would recode a genotype from “1 3” to “2 1” and from “3 2” to “1 1”. This option is provided so that users who wish to test a certain allele as a “risk” allele can reduce the number of genotypes to 3 and thereby reduce the degrees of freedom for testing association. 

 

Note: This option will only work with “A”-type markers and any “C”-type markers that have the allele-code definitions specified in the data description file (see example in Section 2.1.1 above).

 

-o: This option allows the user to specify the name of the output file, containing results of

all analyses.

 

-h: Choosing this option will provide the user with a list of command-line options (Section 2.2.1).

 

-v: This option provides the user with the version of the program being run.

 

2.3 Example files with this distribution

 

We provide two sets of example data with this program. Both are simulated data sets. The first contains double-sample genotype data for a SNP marker. The fallible, infallible, and description files are: sim-snp-fall.txt, sim-snp-infall.txt, and sim-snp-desc.txt, respectively. The output file is labeled sim-snp-lrtae.out.

 The second set of files contains information for two phenotypes (labeled simP1 and simP2) and for coded genotypes (labeled simMicro1; here Micro stands for “Microsatellite” marker). The fallible, infallible, and description files are: sim-micro-fall.txt, sim-micro-infall.txt, and sim-micro-desc.txt, respectively. For the second set of files, data are simulated so that the first phenotype is associated with the coded genotypes and the second phenotype is not. The output file for this data set is labeled sim-micro-lrtae.out.

 

If we type:

 

> lrtae –o sim-snp-lrtae.out –p 10000 –c 0.99 sim-snp-fall.txt sim-snp-infall.txt sim-snp-desc.txt

 

at the command line, then the program will compute the LRTae statistic using the observed phenotype data and the double-sample data for SNP genotypes. The program also computes permutation p-values by randomly permuting case/control status for all individuals. A total of 10,000 permutations are performed and a 99% confidence interval centered around each permutation p-value is computed. The output is presented in a file called “sim-snp-lrtae.out”. We present the contents of this file in the next section.

 

3.0 INTERPRETING RESULTS FROM LRTae OUTPUT

 

As with any statistical analysis program, critical to the success of the analysis is an accurate interpretation of the results. We provide a sample output file, determined by running the LRTae program on one sample data set above (see section 2.3 above).

 

LRTae program version 1.3.1

Written by Chad Haynes

Supervised by Derek Gordon

 

Input files:

sim-snp-desc.txt    (Description file)

sim-snp-fall.txt    (Fallible data file)

sim-snp-infall.txt  (Infallible data file)

 

Command line options selected:

10000 Permutations - Randomly reassign phenotypes

Output file name – sim-snp-lrtae.out

 

Program run Tue Jan 04 17:51:57 2005

 

 

**************************************

**  Genomic Data Marker: SNP1       **

**  Phenotype Marker:    pheno1     **

**************************************

Population Frequencies:

                 1/1     1/2     2/2    

  LRTae  Case    0.00000 0.09843 0.90157

         Control 0.02619 0.22172 0.75209

 

  LRTstd Case    0.02400 0.10800 0.86800

         Control 0.05200 0.21600 0.73200

 

Misclassification:

  Genomic Data   (Obs)

                 1/1     1/2     2/2     N      

  (True) 1/1     1.00000                 1      

         1/2     0.08000 0.92000         25     

         2/2     0.01064 0.02128 0.96809 94     

 

  Phenotype      (Obs)

                 Case    Control N      

  (True) Case    1.00000         0      

         Control         1.00000 0      

 

LRT Statistics:

         Hyp   LogLike LRT      Df Asym-P  Perm-P  99.0% Permutation CI

 LRTae   H1    647.82  18.13652 2  0.00012 0.00000 (0.00000, 0.00046)

         H0    656.89 

 LRTstd  H1    291.47  14.70875 2  0.00064 0.00100 (0.00037, 0.00214)

         H0    298.82 

 

Please cite the following when reporting results from the LRTae program:

 

Gordon D, Yang Y, Haynes C, Finch SJ, Mendell NR, Brown AM, and Haroutunian V (2004) Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat Appl Genet and Mol Biol 3:Article 26.

http://www.bepress.com/sagmb/vol3/iss1/art26/.

 

We first notice that the program output indicates what genomic data (SNP1) and phenotype (pheno1) variables are being considered. Next, the program provides MLEs of the individual SNP genotype frequencies in cases and controls using the LRTae method and the LRTstd method (i.e., the method that does not use the double-sample information.) Notice that the results are different. Using the double-sample information, the MLEs for the 1 1 genotype in cases and controls with the LRTae method are 0.000 and 0.026, respectively (genotypes are written with a “/” symbol to separate the alleles). Using the LRTstd method, the MLEs for the 1/1 genotype in cases and controls are 0.024 and 0.052, respectively. Similarly, for the 2/2 genotype, the LRTae method provides MLEs of 0.902 and 0.752 for cases and controls (respectively) while the LRTstd method provides MLEs of 0.868 and 0.732 (respectively).

What accounts for this difference? One can glean the answer by studying the genotype misclassification table. In this table, we provide MLEs of the misclassification probabilities, namely Pr(observed genotype = a | true genotype = b), where a and b are any of the three genotypes 1/1, 1/2, or 2/2. The last column in this table, with the heading “N” provides the sample size on which these calculations are determined. So for example, we see that, based on the genotype 94 double-sample observations, the 2/2 genotype is misclassified approximately 2% of the time as the heterozygote 1/2 and approximately 1% of the time as the homozygous 1/1 genotype. Similarly, based on 25 observations, the heterozygote 1/2 genotype is misclassified approximately 8% as the 1/1 genotype. Therefore, many of the “observed” 1/1 genotypes are really misclassified heterozygotes and 2/2 genotypes. This accounts for the decreased 1/1 genotype frequency estimates with the LRTae method as opposed to the LRTstd method.

Notice also that, because we have no double-sample data for the phenotype variable pheno1, the classification is assumed to be “perfect”. Also, the sample size N for the phenotype misclassification is 0.        

            The log-likelihoods under the null (H0) and alternative (H1) hypotheses (also see Section 1.1.2) of the LRTae and LRTstd methods are provided below the misclassification tables. For each method, twice the difference of the log-likelihoods provides the value of the test statistic. We compute respective values of 18.137 and 14.709 for the LRTae and LRTstd methods. Since this data set was simulated by randomly simulating errors into genotypes, the results are consistent with the findings of Mote and Anderson (9) and others (10, 11) that the LRTae statistic has greater significance. The added power comes from the correctly classified genotypes, which are not used by the LRTstd method (1). Note also that the log-likelihoods for the LRTae method are larger, indicating that additional information is being used. Also, the output file provides the degrees of freedom (Df) for each likelihood ratio test (LRT). In this example, there are two degrees of freedom for each test, since there are three genotype classification categories possible for the di-allelic locus being tested. The asymptotic p-value is computed assuming that the null distribution is a central chi-square distribution with the indicated degrees of freedom. The asymptotic p-value is the probability of observing the corresponding LRT statistic value assuming the null distribution is correct.

Because the LRTae method estimates the 1/1 genotype frequencies to be small (0 for cases!), we perform a permutation analysis to determine the accuracy of the p-value based on asymptotic theory. Because we have double-sample data only for genotypes, we choose the “-p” option (see above – Sections 1.1.2 and 2.2.1) and perform 10,000 permutations. Results of the permutation analysis indicate that the p-values based on asymptotic theory are consistent with the p-values based on permutation (both asymptotic p-values are within the 99% confidence intervals for the p-values based on permutation).

However, as mentioned above (Section 1.1.2) these results should be viewed with some caution, as more extensive simulations need to be conducted. 

 

3.1       A note about selection of double-samples

 

When using the LRTae statistic, we assume that the double-sample information is obtained by randomly selecting a subset of individuals for classification with an infallible measure. However, this assumption may be violated. For example, with a disease like Alzheimer’s, there may be a bias in obtaining only the “gold-standard” diagnosis of presence/absence of plaques and tangles for observed cases, since observed controls are assumed to have died of other causes and therefore may not have autopsies performed. Similarly, researchers may only scrutinize very rare observed genotypes, because these can be particularly costly in terms of power loss for case/control studies (12).

 

3.2 Usage of LRTae with haplotype pairs

 

To demonstrate usage of LRTae with haplotype pair data, we consider an example.  Suppose we have two SNPs typed on a number of cases and controls. There are several methods available (e.g., SNPHAP from Clayton's lab - http://www-gene.cimr.cam.ac.uk/clayton/software/, PHASE - http://www.stat.washington. edu/stephens/software.html, Clark’s Method (13)) that will statistically infer haplotype pairs based on multi-locus genotype data. It is well documented that these methods are subject to misclassification of the true haplotype pairs (14), particularly in situations where there is low inter-marker LD, SNP allele frequencies are near equal, and/or there is genotype error (15). For examples using haplotype inference based on family data, see (16). The fallible haplotype pairs are those that are determined using one of these software methods. The infallible pairs are usually haplotype pairs based on molecular haplotyping methods (e.g., (17-19)).

One question that arises from this procedure is the observation that the highest posterior probability assigned to a pair of haplotypes for an individual may be significantly less than 1; that is, there may be a number of different haplotype pairs possible for an individual each with similar probabilities. What should one do in that situation?

What some people do is simply assign the haplotype pair that has the highest posterior probability to the individual. While in general this is a statistically unsound procedure (20), we are unaware of the effects that double-sample information may have in terms of maintaining correct type I error rates and power. This is active research.

Regarding coding of haplotypes, let’s consider the example of the two SNP loci mentioned above, each with alleles “1” and “2”. Then the four possible haplotypes are: 1-1, 1-2, 2-1, 2-2. To use LRTae, one can use one of two codings.

 

Using the "A" coding scheme, one may make the assignments:

 

1-1 = 1

1-2 = 2

2-1 = 3

2-2 = 4

 

Then, a person who is homozygous for the 2-1 haplotype, for example, is coded as 3 3 in the fallible and/or infallible files. A person who has the haplotype pairs 1-2/2-2 would be coded as 2 4 in the fallible and/or infallible files.

 

Using the "C" coding scheme, one may make the assignments:

 

1-1/1-1                  = 1

1-1/1-2 = 1-2/1-1 = 2

1-1/2-1 = 2-1/1-1 = 3

1-1/2-2 = 2-2/1-1 = 4

1-2/1-2                 = 5

1-2/2-1 = 2-1/1-2 = 6

1-2/2-2 = 2-2/1-2 = 7

2-1/2-1                 = 8

2-1/2-2 = 2-2/2-1 = 9

2-2/2-2                 = 10

 

Then a person who is homozygous for the 2-1 haplotype is coded as 8 in the fallible and/or infallible files and a person who has the haplotype pair 1-2/2-2 is coded as 7.

 

 

4.0 PROBLEMS? COMMENTS?

 

If there are problems in the execution or compilation of this program or if you would like to provide some feedback, please e-mail lrtae@linkage.rockefeller.edu

 

5.0 ACKNOWLEDGEMENTS

 

The authors of this software gratefully acknowledge grant K01-HG00055 from the

National Institutes of Health.

 

 

6.0 REFERENCES

 

Below are references for this README file. Please cite the first reference when reporting results using the LRTae software.

 

1.         Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., and Haroutunian, V. 2004. Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling. Stat Appl Genet and Mol Biol 3:Article 26.

2.         Tenenbein, A. 1970. A double sampling scheme for estimating from binomial data with misclassifications. J Am Stat Assoc 65:1350-1361.

3.         Tenenbein, A. 1972. A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection. Technometrics 14:187-202.

4.         Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc B 39:1-38.

5.         Cox, D.R., and Hinkley, D.V. 1979. Theoretical Statistics. Boca Raton: CRC Press.

6.         Ott, J. 1999. Analysis of Human Genetic Linkage. Baltimore: Johns Hopkins.

7.         Terwilliger, J.D., and Ott, J. 1994. Handbook of Human Genetic Linkage. Baltimore: Johns Hopkins.

8.         Xie, X., and Ott, J. 1993. Testing linkage disequilibrium between a disease gene and marker loci. Am J Hum Genet 53:1107 (Abstract).

9.         Mote, V.L., and Anderson, R.L. 1965. An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika 52:95-109.

10.       Gordon, D., Finch, S.J., Nothnagel, M., and Ott, J. 2002. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 54:22-33.

11.       Rice, K.M., and Holmans, P. 2003. Allowing for genotyping error in analysis of unmatched cases and controls. Ann Hum Genet 67:165-174.

12.       Kang, S.J., Gordon, D., and Finch, S.J. 2004. What SNP genotyping errors are most costly for genetic association studies? Genet Epidemiol 26:132-141.

13.       Clark, A.G. 1990. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111-122.

14.       Niu, T. 2004. Algorithms for inferring haplotypes. Genet Epidemiol 27:334-347.

15.       Kirk, K.M., and Cardon, L.R. 2002. The impact of genotyping error on haplotype reconstruction and frequency estimation. Eur J Hum Genet 10:616-622.

16.       Lindholm, E., Zhang, J., Hodge, S.E., and Greenberg, D.A. 2004. The reliability of haplotyping inference in nuclear families: misassignment rates for SNPs and microsatellites. Hum Hered 57:117-127.

17.       Proudnikov, D., LaForge, K.S., and Kreek, M.J. 2004. High-throughput molecular haplotype analysis (allelic assignment) of single-nucleotide polymorphisms by fluorescent polymerase chain reaction. Anal Biochem 335:165-167.

18.       Douglas, J.A., Boehnke, M., Gillanders, E., Trent, J.M., and Gruber, S.B. 2001. Experimentally-derived haplotypes substantially increase the efficiency of linkage disequilibrium studies. Nat Genet 28:361-364.

19.       Ding, C., and Cantor, C.R. 2003. Direct molecular haplotyping of long-range genomic DNA with M1-PCR. Proc Natl Acad Sci U S A 100:7449-7453.

20.       Little, R.J.A., and Rubin, D.B. 1987. Statistical Analysis with Missing Data. New York: John Wiley.