PAWE version 1.2 Help File, February 2002

Written by Derek Gordon

 

1.0    Overview and Purpose

 

The name of this program is PAWE, which stands for Power of Association With Errors. It has been previously documented (Mote and Anderson 1965; Gordon et al. 2002) that genotyping errors can substantially decrease the asymptotic power to detect association between a trait locus and a marker locus. The purpose of the PAWE program is therefore twofold: (i) to perform power and sample size calculations for genetic case-control association studies in the presence of genotyping errors, and (ii) to determine quantitatively how much genotyping errors cost the researcher performing genetic association studies with cases and controls, measured either as the decrease in asymptotic power for a fixed sample size or as the increase in sample size needed to maintain constant asymptotic power. Thus, results from the PAWE program will be either asymptotic power or sample size values. We note that the results we obtain for data without errors are identical to the results obtained by other genetic association test power calculators, for example the Genetic Power Calculator for case-control studies of discrete traits developed by Purcell and Sham.

 

This file is designed to explain to you the different values entered into the PAWE program, so that a meaningful answer is obtained. This program is designed to perform asymptotic power and sample size calculations for genetic case-control studies with a di-allelic locus (for example, a SNP) in the presence of errors. The test statistics considered are the standard chi-square statistics for allelic and genotypic association. In what follows, it will be assumed that there is a di-allelic trait locus for a discrete trait with two alleles: a wild-type allele or low risk allele, denoted by +, and a trait or high-risk allele, denoted by d. Also, it will be assumed that there is a marker locus with two alleles, denoted by 1 and 2.

 

1.1 Parameter Settings

 

1.1.1 Asymptotic Power or Sample Size

 

The researcher who has a fixed sample size and wants to know the value of asymptotic power for his/her study in the presence of errors should choose the option "power for a fixed sample size". The researcher who is planning a study, and who wants to know what sample size is necessary, in the presence of errors, to achieve a given asymptotic power level should choose the option "sample size for fixed power".

 

1.1.2 Asymptotic Power for a fixed sample size

 

1.1.2.1 Number of cases

 

In this box, you specify the number of case individuals you have. These individuals are assumed to be both phenotyped and genotyped. The number typed in this box must be a positive integer.

 

 

1.1.2.2 Number of controls

 

In this box, you specify the number of control individuals you have. These individuals are assumed to be both phenotyped and genotyped.  The number typed in this box must be a positive integer.

 

1.1.3 Sample Size for a fixed asymptotic power

 

1.1.3.1 Asymptotic Power level

 

In this box, you specify the asymptotic power you would like for your study. This number must be greater than 0 and less than or equal to 1. It is usually a number close to 1.

 

1.1.3.2 Ratio of Controls to Cases

 

In this box, you specify the ratio of the number of controls to the number of cases that you expect to have for your study. The number entered here must be a positive real number.

 

1.1.4 Genotype Frequency Generation

 

1.1.4.1 Genetic model free method

 

You choose this option if you do not know the genetic model parameters such as penetrances, disease allele frequency, or proportion of linkage disequilibrium for your study. When choosing the genetic model free method, you specify the frequency distribution of genotypes 11, 12, and 22 for cases and controls. You can assume or not assume Hardy Weinberg equilibrium (HWE) for both the case and control population.

 

1.1.4.1.1 Hardy Weinberg equilibrium Assumed

 

If you assume HWE, then the genotype frequency distribution is a function of a single parameter, p, that you enter into this box. The parameter p must be a real, positive number less than 1. The genotype frequencies of 11, 12, and 22 are then p^2, 2p(1-p), and (1-p)^2, respectively.
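
As a minimal sketch of the HWE calculation, assuming that p is the frequency of the 1 allele (the function name is illustrative and is not part of PAWE):

```python
def hwe_genotype_freqs(p):
    """Return the frequencies of genotypes 11, 12, and 22 under HWE.

    Assumption: p is the frequency of the 1 allele.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    q = 1.0 - p  # frequency of the 2 allele
    return {"11": p * p, "12": 2.0 * p * q, "22": q * q}


freqs = hwe_genotype_freqs(0.3)
```

The three frequencies always sum to 1, which is a quick check on any value entered.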

 

1.1.4.1.2 No Assumption of Hardy Weinberg equilibrium

 

If you do not assume HWE, then the genotype frequency distribution is a function of two parameters. The first parameter is the genotype frequency for the 11 genotype, and the second parameter is the genotype frequency for the 22 genotype. The genotype frequency for the 12 genotype is given by 1 minus the sum of these two parameters. Thus, the numbers you enter into these two boxes must be positive real numbers whose sum is less than 1.
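
The same bookkeeping without HWE can be sketched as follows. The function and argument names are hypothetical; the checks mirror the constraints stated above (both entries positive, sum less than 1):

```python
def genotype_freqs_no_hwe(freq_11, freq_22):
    """Return the frequencies of genotypes 11, 12, and 22 without assuming HWE."""
    if freq_11 <= 0.0 or freq_22 <= 0.0:
        raise ValueError("both genotype frequencies must be positive")
    if freq_11 + freq_22 >= 1.0:
        raise ValueError("the two frequencies must sum to less than 1")
    # The heterozygote frequency is whatever probability remains.
    return {"11": freq_11, "12": 1.0 - freq_11 - freq_22, "22": freq_22}


freqs = genotype_freqs_no_hwe(0.5, 0.2)
```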

 

1.1.4.2 Genetic model based method

 

You choose this option if you have estimates for the following 6 parameters: the penetrances Pr(aff | ++), Pr(aff | +d), and Pr(aff | dd), the disease allele frequency, the marker 1-allele frequency, and the proportion of linkage disequilibrium (D'). In the previous sentence, the abbreviation “aff” means “being a case”, and the three penetrances are conditional probabilities, the conditioning being on the genotype at the trait locus. All six of these entered parameters must be positive real numbers that are less than 1. To see how these parameters are translated into genotype frequencies for cases and controls, please click on the link: PAWE1

 

Note: D'=1 means complete disequilibrium, the best case scenario

      D'=0 means no disequilibrium, the null scenario
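
The following is a rough sketch of one standard way to turn the six model parameters into marker genotype frequencies for cases and controls. It is not taken from the PAWE1 document. It assumes random mating, that the d allele is positively associated with the 1 allele, and that controls are unaffected individuals. All names are illustrative.

```python
from itertools import product


def case_control_genotype_freqs(f_pp, f_pd, f_dd, p_d, p_1, d_prime):
    """Marker genotype frequencies in cases and in controls.

    f_pp, f_pd, f_dd: Pr(aff | ++), Pr(aff | +d), Pr(aff | dd)
    p_d: disease allele frequency; p_1: marker 1-allele frequency
    d_prime: proportion of linkage disequilibrium (D')
    """
    # Disequilibrium coefficient D = D' * Dmax (positive-association case).
    d_max = min(p_d * (1.0 - p_1), (1.0 - p_d) * p_1)
    d = d_prime * d_max
    # Haplotype frequencies keyed by (trait allele, marker allele).
    hap = {
        ("d", "1"): p_d * p_1 + d,
        ("d", "2"): p_d * (1.0 - p_1) - d,
        ("+", "1"): (1.0 - p_d) * p_1 - d,
        ("+", "2"): (1.0 - p_d) * (1.0 - p_1) + d,
    }
    penetrance = (f_pp, f_pd, f_dd)  # indexed by the number of d alleles
    affected = dict.fromkeys(("11", "12", "22"), 0.0)
    unaffected = dict.fromkeys(("11", "12", "22"), 0.0)
    # Sum over all ordered pairs of haplotypes (random mating assumed).
    for (h1, q1), (h2, q2) in product(hap.items(), repeat=2):
        n_d = (h1[0] == "d") + (h2[0] == "d")
        marker = "".join(sorted(h1[1] + h2[1]))
        joint = q1 * q2
        affected[marker] += joint * penetrance[n_d]
        unaffected[marker] += joint * (1.0 - penetrance[n_d])
    prevalence = sum(affected.values())
    cases = {g: v / prevalence for g, v in affected.items()}
    controls = {g: v / (1.0 - prevalence) for g, v in unaffected.items()}
    return cases, controls


cases, controls = case_control_genotype_freqs(0.01, 0.02, 0.04, 0.2, 0.2, 1.0)
```

With D'=0 the marker carries no information about the trait locus, so cases and controls share the same genotype frequencies in this sketch.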

 

 

1.1.5 Significance level

 

Here, you specify the significance level of the test. This value is the probability of falsely rejecting a true null hypothesis. This number must be positive and less than 1. Typically, it is chosen to be less than or equal to 0.05.

 

1.1.6 Error models

 

For this option, you have the choice of selecting one of four error models that you think best explains your data. The choices are:

 

Gordon Heath Liu Ott (GHLO) error model (2001)

Douglas Skol Boehnke (DSB) error model (2002)

Sobel Papp Lange (SPL) error model (2002)

Mote and Anderson (MA) error model (1965)

 

Presented here is a brief, but by no means comprehensive, list of some of the differences among the models. The GHLO model introduces errors into alleles as opposed to genotypes. It is described by 2 parameters. The DSB model introduces errors into genotypes, and is the only model for which it is not possible for a homozygous 11 genotype to be incorrectly coded as a homozygous 22 genotype, or vice versa. It is described by 2 parameters. The SPL model is, for di-allelic loci, described by 3 parameters. It is the most general error model possible for di-allelic loci, under the constraint that errors are independent of the particular allele. The MA model, which is the most general error model possible in the sense that it can describe all other error models, is described by 6 parameters. The GHLO, SPL, and MA error models all allow for errors in which one homozygote is miscoded as the other homozygote.

 

1.1.6.1 Gordon Heath Liu Ott (GHLO) error model

 

The two parameter settings for this error model are:

First parameter = Pr(1 allele incorrectly coded as 2 allele)

Second parameter = Pr(2 allele incorrectly coded as 1 allele)

 

Both entries must be positive real numbers less than 1.0.
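
To illustrate how two allele-level error rates induce genotype-level errors, here is a minimal sketch. It assumes that the two alleles of a genotype are miscoded independently of one another. The names `e1`, `e2`, and `ghlo_error_matrix` are illustrative, and the result is indexed as matrix[true genotype][observed genotype].

```python
def ghlo_error_matrix(e1, e2):
    """Genotype error matrix from allele-level error rates.

    e1 = Pr(1 allele incorrectly coded as 2 allele)
    e2 = Pr(2 allele incorrectly coded as 1 allele)
    Assumption: the two alleles of a genotype are miscoded independently.
    """
    return {
        "11": {"11": (1 - e1) ** 2, "12": 2 * e1 * (1 - e1), "22": e1 ** 2},
        "12": {
            "11": (1 - e1) * e2,                  # only the 2 allele flips
            "12": (1 - e1) * (1 - e2) + e1 * e2,  # neither or both flip
            "22": e1 * (1 - e2),                  # only the 1 allele flips
        },
        "22": {"11": e2 ** 2, "12": 2 * e2 * (1 - e2), "22": (1 - e2) ** 2},
    }


matrix = ghlo_error_matrix(0.01, 0.02)
```

Under this assumption a homozygote is miscoded as the other homozygote only when both alleles are in error, which is why those entries are squares of the allele error rates.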

 

For more information, see:

Gordon D., Heath S.C., Liu X., and Ott J. (2001) A transmission disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data.  American Journal of Human Genetics 69:371-380

 

1.1.6.2 Douglas Skol Boehnke (DSB) error model (2002)

 

The two parameter settings for this error model are:

First parameter = Pr(homozygous 11 or 22 genotype incorrectly coded as heterozygote 12)

Second parameter = Pr(heterozygote 12 genotype incorrectly coded as homozygote 11 or 22)

 

Both entries must be positive real numbers less than 1.0.

 

Note: for the heterozygote-to-homozygote parameter, it is assumed that the 12 genotype has an equal probability (0.5) of being incorrectly coded as 11 or 22. Also, the notation used here comes from the Gordon et al. (2002) reference.
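
The DSB description above translates into a genotype error matrix along the following lines. This is a sketch only: the argument names are hypothetical, and the matrix is indexed as matrix[true genotype][observed genotype].

```python
def dsb_error_matrix(hom_to_het, het_to_hom):
    """Genotype error matrix for the DSB model as described above.

    hom_to_het = Pr(homozygote 11 or 22 coded as heterozygote 12)
    het_to_hom = Pr(heterozygote 12 coded as homozygote 11 or 22)
    """
    half = het_to_hom / 2.0  # equal split between 11 and 22
    return {
        "11": {"11": 1.0 - hom_to_het, "12": hom_to_het, "22": 0.0},
        "12": {"11": half, "12": 1.0 - het_to_hom, "22": half},
        "22": {"11": 0.0, "12": hom_to_het, "22": 1.0 - hom_to_het},
    }


matrix = dsb_error_matrix(0.02, 0.04)
```

The zero entries reflect the fact that this model does not allow one homozygote to be coded as the other.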

 

For more information, see:

 

Douglas J.A., Skol A.D., and Boehnke M. (2002) Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. American Journal of Human Genetics 70:487-495

 

1.1.6.3 Sobel Papp Lange (SPL) error model (2002)

 

The parameter settings for this error model are:

V1 = Pr(true homozygote incorrectly coded as heterozygote)

V2 = Pr(one homozygote incorrectly coded as another homozygote)

V3 = Pr(true heterozygote incorrectly coded as a homozygote)

 

Note: This parameterization of the SPL error model is an improvement over the parameterization previously used (Gordon et al. 2002) in that it only requires three parameter settings. The author gratefully acknowledges S. Seaman and P. Holmans for the improvement.

 

All entries must be positive real numbers less than 1.0, subject to the following constraints:

V1 + V2 < 1.0

V3 < 0.5
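
A sketch of the corresponding error matrix and constraint checks follows. It reads V3 as the probability that a heterozygote is coded as one particular homozygote, which is what the constraint V3 < 0.5 suggests; that reading is an assumption, as is the function name. The matrix is indexed as matrix[true genotype][observed genotype].

```python
def spl_error_matrix(v1, v2, v3):
    """Genotype error matrix for the three-parameter SPL model.

    Assumption: v3 applies to each homozygote separately, so a true
    heterozygote is miscoded with total probability 2 * v3.
    """
    if not all(0.0 < v < 1.0 for v in (v1, v2, v3)):
        raise ValueError("V1, V2, V3 must be positive and less than 1.0")
    if v1 + v2 >= 1.0:
        raise ValueError("V1 + V2 must be less than 1.0")
    if v3 >= 0.5:
        raise ValueError("V3 must be less than 0.5")
    return {
        "11": {"11": 1.0 - v1 - v2, "12": v1, "22": v2},
        "12": {"11": v3, "12": 1.0 - 2.0 * v3, "22": v3},
        "22": {"11": v2, "12": v1, "22": 1.0 - v1 - v2},
    }


matrix = spl_error_matrix(0.02, 0.001, 0.01)
```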

 

For more information, see:

Sobel E., Papp J.C., and Lange K. (2002) Detection and integration of genotyping errors in statistical genetics. American Journal of Human Genetics 70:496-508

 

1.1.6.4 Mote and Anderson (MA) error model (1965)

 

The six parameter settings for this error model are the following conditional probabilities:

Pr(12 genotype observed | 11 true)

Pr(22 genotype observed | 11 true)

Pr(11 genotype observed | 12 true)

Pr(22 genotype observed | 12 true)

Pr(11 genotype observed | 22 true)

Pr(12 genotype observed | 22 true)
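
As a sketch of how six misclassification probabilities fill a 3 x 3 error matrix, the function below takes them in the order listed above and derives each probability of a correct call as the remainder. The argument names (observed genotype first, then true genotype) are hypothetical, and the row check encodes only the requirement that each row be a probability distribution.

```python
def ma_error_matrix(p12_11, p22_11, p11_12, p22_12, p11_22, p12_22):
    """Genotype error matrix from six misclassification probabilities.

    Each argument pXX_YY means Pr(XX genotype observed | YY true).
    The result is indexed as matrix[true genotype][observed genotype].
    """
    matrix = {
        "11": {"11": 1.0 - p12_11 - p22_11, "12": p12_11, "22": p22_11},
        "12": {"11": p11_12, "12": 1.0 - p11_12 - p22_12, "22": p22_12},
        "22": {"11": p11_22, "12": p12_22, "22": 1.0 - p11_22 - p12_22},
    }
    for true_genotype, row in matrix.items():
        if any(prob < 0.0 for prob in row.values()):
            raise ValueError(
                "error probabilities for true genotype %s are not valid"
                % true_genotype
            )
    return matrix


matrix = ma_error_matrix(0.02, 0.001, 0.01, 0.015, 0.002, 0.03)
```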

 

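The following constraints are needed for the MA error model: for each true genotype (11, 12, or 22), the two error probabilities listed above must sum to no more than 1.0, since together with the probability of a correct call they form a probability distribution.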

 

For more information, see:

Mote V.L., and Anderson R.L. (1965) An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika 52:95-109

 

2.1 Output

 

The PAWE program reports most of the input parameters that the user enters, as well as the following items:

 

2.1.1 Non-centrality parameters

 

For a given test of association (allelic or genotypic), this parameter completely determines either the asymptotic power or sample size calculations. To see how the asymptotic power calculations are performed, please see Gordon et al. (2002) and PAWE2.  The non-centrality parameters for both errorless data and data with errors are presented.
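
To make the link between the non-centrality parameter and power concrete, here is a minimal sketch for a test with 1 degree of freedom (assumed here to correspond to the allelic test on a di-allelic locus). It uses the fact that a non-central chi-square variable with 1 degree of freedom is the square of a normal variable with mean equal to the square root of the non-centrality parameter. The function name is illustrative, and this is not necessarily how PAWE performs the calculation.

```python
from statistics import NormalDist


def power_1df(ncp, alpha):
    """Asymptotic power of a 1-df chi-square test.

    ncp: non-centrality parameter; alpha: significance level.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)  # sqrt of the critical value
    mu = ncp ** 0.5
    # Pr((Z + mu)^2 > z^2) for a standard normal Z.
    return 1.0 - std_normal.cdf(z - mu) + std_normal.cdf(-z - mu)


power = power_1df(7.849, 0.05)
```

When the non-centrality parameter is 0 the power reduces to the significance level, as expected under the null hypothesis.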

 

2.1.2 Asymptotic power for fixed sample size - power loss

 

Based on the value of the non-centrality parameter for either the allelic or genotypic test of association, the asymptotic power of the test is reported for both errorless data and data assuming the particular error model. Also reported is the percent loss in power due to errors in the data.

 

2.1.3 Sample size increase for fixed asymptotic power

 

Based on the value of the non-centrality parameter for either the allelic or genotypic test of association, the minimum sample size of cases and controls is reported for both errorless data and data assuming the particular error model. Also reported is the percent increase in sample size needed to maintain constant power when errors are present.
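
One way to reason about the percent increase is sketched below. It assumes that the non-centrality parameter grows in direct proportion to the total sample size when the ratio of controls to cases is held fixed, which is a standard property of chi-square tests. The function name is hypothetical, and the exact computation in PAWE may differ.

```python
def percent_sample_size_increase(ncp_errorless, ncp_error):
    """Percent increase in sample size needed to offset genotyping errors.

    Both non-centrality parameters must be computed at the same sample size.
    Assumption: the non-centrality parameter is proportional to sample size.
    """
    if ncp_error <= 0.0 or ncp_errorless <= 0.0:
        raise ValueError("non-centrality parameters must be positive")
    return 100.0 * (ncp_errorless / ncp_error - 1.0)


increase = percent_sample_size_increase(10.0, 8.0)
```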

 

2.1.4 Genotype and allele frequencies for errorless data

 

Based on the parameters entered for genotype frequency generation (Section 1.1.4), the genotype and allele frequencies in cases and controls are computed for errorless data.

 

2.1.5 Matrix of Penetrances

 

The entries of this matrix are the conditional probabilities Pr(observed genotype i | true genotype j) for i and j being one of the genotypes 11, 12, or 22. These conditional probabilities, also called penetrances, are used in calculating the genotype and allele frequencies in the presence of errors (see Section 2.1.6).

 

2.1.6 Genotype and allele frequencies for error data

 

Using the genotype and allele frequencies in cases and controls for errorless data  (Section 2.1.4), and the matrix of penetrances (Section 2.1.5), genotype and allele frequencies in cases and controls are computed for error data. These values are used to compute the non-centrality parameters for error data (Section 2.1.1). For more details on how this computation is performed, please see Gordon et al. (2002).
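
The combination of errorless frequencies and the matrix of penetrances can be sketched as a weighted sum over true genotypes. The function names and the example matrix below are illustrative only; the matrix is indexed as matrix[true genotype][observed genotype].

```python
GENOTYPES = ("11", "12", "22")


def observed_genotype_freqs(true_freqs, error_matrix):
    """Genotype frequencies in the presence of errors.

    Pr(observed i) = sum over true j of Pr(observed i | true j) * Pr(true j).
    """
    return {
        obs: sum(error_matrix[true][obs] * true_freqs[true] for true in GENOTYPES)
        for obs in GENOTYPES
    }


def allele_1_freq(genotype_freqs):
    """Frequency of the 1 allele implied by a set of genotype frequencies."""
    return genotype_freqs["11"] + 0.5 * genotype_freqs["12"]


# Example values chosen for illustration only.
true_freqs = {"11": 0.25, "12": 0.50, "22": 0.25}
error_matrix = {
    "11": {"11": 0.9, "12": 0.1, "22": 0.0},
    "12": {"11": 0.1, "12": 0.8, "22": 0.1},
    "22": {"11": 0.0, "12": 0.1, "22": 0.9},
}
observed = observed_genotype_freqs(true_freqs, error_matrix)
```

The same calculation would be applied separately to the case frequencies and to the control frequencies.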

 

3.1 References

 

Please cite the following two references when reporting results using PAWE:

 

Gordon D., Finch S.J., Nothnagel M., and Ott J. (2002) Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity 54:22-33

 

Gordon D., Levenstien M.A., Finch S.J., and Ott J. (2003) Errors and linkage disequilibrium interact multiplicatively when computing sample sizes for genetic case-control association studies. Pacific Symposium on Biocomputing:490-501

 

Citations for error models:

 

Gordon D., Heath S.C., Liu X., and Ott J. (2001) A transmission disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. American Journal of Human Genetics 69:371-380

 

Douglas J.A., Skol A.D., and Boehnke M. (2002) Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. American Journal of Human Genetics 70:487-495

 

Sobel E., Papp J.C., and Lange K. (2002) Detection and integration of genotyping errors in statistical genetics. American Journal of Human Genetics 70:496-508

 

Mote V.L., and Anderson R.L. (1965) An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika 52:95-109