PAWE – Phenotype Edition

Version 2.0 Help File, March 2005

Written by Derek Gordon

 

1.0    Overview and Purpose

 

The name of this program is PAWE – Phenotype Edition (PAWE-PH). PAWE stands for Power of Association With Errors. The original version of PAWE is concerned with genotype errors. Of equal importance, however (if not greater, given that newer genotyping technologies are reporting smaller error rates [1]), are phenotype misclassification errors [2], which are often referred to as diagnostic errors [3]. Recently published research [3, 4] indicates that phenotype misclassification errors can substantially decrease the asymptotic power to detect genetic association. As with the original PAWE program for genotyping error, the purpose of the PAWE-PH program is twofold: (i) to perform power and sample size calculations for genetic case-control association studies in the presence of phenotype misclassification errors, and (ii) to quantify what these errors cost the researcher performing genetic association studies with cases and controls, in terms of either the decrease in asymptotic power for a fixed sample size or the increase in sample size needed to maintain constant asymptotic power. Thus, results from the PAWE-PH program will be either asymptotic power or sample size values.

 

This file explains the values entered into the PAWE-PH program, so that a meaningful answer is obtained. The program performs asymptotic power and sample size calculations for genetic case-control studies with a di-allelic locus (for example, a SNP) in the presence of errors. The test statistic considered is the standard chi-square test of independence applied to genotypes. In what follows, for the genetic model-free scenario, it is assumed that there is a di-allelic marker locus with two alleles, denoted by 1 and 2. In the genetic model-based framework, we assume there is an (unobserved) trait locus with two alleles: a wild-type or low-risk allele, denoted by +, and a trait or high-risk allele, denoted by d. We also assume that there is a marker locus with two alleles, denoted by 1 and 2.

 

Important note: When discussing case/control studies in the presence of phenotype misclassification, we make a distinction between the terms case and affected, and similarly between the terms control and unaffected. A case is an individual who is diagnosed as affected, whether or not the individual is truly affected. An affected individual is someone who is truly affected. The terms control and unaffected are distinguished in the same way. It is assumed that we collect only cases and controls; that is, some random proportion of the cases are in fact unaffected individuals, and similarly some random proportion of the controls are in fact affected individuals. For more information, see [4].

 

1.1 Parameter Settings

 

1.1.1 Asymptotic Power or Sample Size

 

The researcher who has a fixed sample size and wants to know the asymptotic power of the study in the presence of errors should choose the option "power for a fixed sample size". The researcher who is planning a study and wants to know what sample size is necessary, in the presence of errors, to achieve a given asymptotic power level should choose "sample size for fixed power".

 

1.1.2 Asymptotic Power for a fixed sample size

 

1.1.2.1 Number of cases

 

In this box, you specify the number of case individuals you have. These individuals are assumed to be both phenotyped and genotyped. The number typed in this box must be a positive integer.

 

 

1.1.2.2 Number of controls

 

In this box, you specify the number of control individuals you have. These individuals are assumed to be both phenotyped and genotyped.  The number typed in this box must be a positive integer.

 

1.1.3 Sample Size for a fixed asymptotic power

 

1.1.3.1 Asymptotic Power level

 

In this box, you specify the asymptotic power you would like for your study. This number must be greater than 0 and less than or equal to 1. It is usually a number close to 1.

 

1.1.3.2 Ratio of Controls to Cases

 

In this box, you specify the ratio of the number of controls to the number of cases that you expect to have for your study. The number entered here must be a positive real number.

 

1.1.4 Genotype Frequency Generation

 

1.1.4.1 Genetic model free method

 

You choose this option if you do not know the genetic model parameters such as penetrances, disease allele frequency, or proportion of linkage disequilibrium for your study. When choosing the genetic model free method, you specify the frequency distribution of genotypes 11, 12, and 22 for the affected population and the unaffected population. You can assume or not assume Hardy Weinberg equilibrium (HWE) for both the affected and unaffected population.

 

1.1.4.1.1 Hardy Weinberg equilibrium Assumed

 

If you assume HWE, then the genotype frequency distribution is a function of a single parameter p that you enter into this box. The parameter p must be a positive real number less than 1. The genotype frequencies of 11, 12, and 22 are then p^2, 2p(1 - p), and (1 - p)^2, respectively.
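As an illustration of the HWE case, the following minimal Python sketch computes the three genotype frequencies from a single parameter. It assumes that p is the frequency of allele 1; the function name is hypothetical and is not part of PAWE-PH.

```python
def hwe_genotype_frequencies(p):
    """Return frequencies of genotypes 11, 12, 22 under Hardy-Weinberg equilibrium.

    Assumption (not stated by the program): p is the frequency of allele 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    q = 1.0 - p  # frequency of allele 2
    return {"11": p * p, "12": 2.0 * p * q, "22": q * q}


freqs = hwe_genotype_frequencies(0.3)
print(freqs)
```

The three values always sum to 1, which is a quick check on any hand calculation.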

 

1.1.4.1.2 No Assumption of Hardy Weinberg equilibrium

 

If you do not assume HWE, then the genotype frequency distribution is a function of two parameters, p11 and p22. The parameter p11 is the genotype frequency for the 11 genotype, and the parameter p22 is the genotype frequency for the 22 genotype. The genotype frequency for the 12 genotype is given by 1 - p11 - p22. Thus, the numbers you enter into these two boxes must be positive real numbers whose sum is less than 1.
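The constraint on the two entered values can be checked with a short sketch such as the one below. The names p11 and p22 (for the 11 and 22 genotype frequencies) and the function name are hypothetical labels chosen for this example.

```python
def genotype_frequencies_no_hwe(p11, p22):
    """Return frequencies of genotypes 11, 12, 22 without assuming HWE.

    p11 and p22 are illustrative names for the two values entered by the user.
    """
    if p11 <= 0.0 or p22 <= 0.0:
        raise ValueError("both frequencies must be positive")
    if p11 + p22 >= 1.0:
        raise ValueError("the two frequencies must sum to less than 1")
    # The heterozygote frequency is whatever probability remains.
    return {"11": p11, "12": 1.0 - p11 - p22, "22": p22}


freqs_no_hwe = genotype_frequencies_no_hwe(0.2, 0.5)
print(freqs_no_hwe)
```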

 

1.1.4.2 Genetic model based method

 

You choose this option if you have estimates for the following five parameters: the genotype relative risks R1 and R2 [5], the disease allele frequency, the marker 1-allele frequency, and the proportion of linkage disequilibrium (D') [6]. The genotype relative risks are defined as R1 = f1/f0 and R2 = f2/f0, where fi = Pr(affected | i copies of the d allele). The genotype relative risk parameters must be real numbers greater than 1. The disease allele frequency, the marker 1-allele frequency, and the D' parameters must be positive real numbers that are less than 1. To see how the genotype relative risks, disease allele frequency, and disease prevalence (below, Section 1.1.7) are converted to the usual disease penetrance parameters, click on the following link: PAWE-PH1

 

Note: D' = 1 means complete disequilibrium, the best-case scenario.

D' = 0 means no disequilibrium, the null scenario.
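The conversion from genotype relative risks to penetrances can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact calculation in the linked PAWE-PH1 document: it assumes Hardy-Weinberg proportions at the trait locus in the general population, takes q as the frequency of the d allele, and uses the fact that the prevalence K is the genotype-weighted average of the penetrances. All names are hypothetical.

```python
def penetrances_from_grr(r1, r2, q, prevalence):
    """Return (f0, f1, f2), where fi = Pr(affected | i copies of the d allele).

    Assumptions for this sketch: HWE at the trait locus, q = d-allele frequency,
    and prevalence K = sum over genotypes of (genotype frequency * penetrance).
    """
    if r1 <= 0.0 or r2 <= 0.0:
        raise ValueError("relative risks must be positive")
    if not (0.0 < q < 1.0 and 0.0 < prevalence < 1.0):
        raise ValueError("q and prevalence must lie strictly between 0 and 1")
    g0, g1, g2 = (1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2
    # K = f0 * (g0 + g1 * R1 + g2 * R2), solved for f0.
    f0 = prevalence / (g0 + g1 * r1 + g2 * r2)
    f1, f2 = r1 * f0, r2 * f0
    if max(f0, f1, f2) > 1.0:
        raise ValueError("parameters imply a penetrance greater than 1")
    return f0, f1, f2


f0, f1, f2 = penetrances_from_grr(r1=2.0, r2=4.0, q=0.1, prevalence=0.05)
print(f0, f1, f2)
```

Multiplying the returned penetrances by the genotype frequencies and summing should recover the prevalence that was entered.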

 

 

1.1.5 Error model parameters

 

For phenotype misclassification error in case/control genetic association analyses, we consider two types of misclassification: the probability that a true affected individual is misclassified as an observed control, and the probability that a true unaffected individual is misclassified as an observed case. To our knowledge, Bross [2] was the first to consider these misclassification parameters in case/control association studies.

 

1.1.5.1 Pr(observed control | true affected)

 

For the distinction between affected and case, see above (Section 1.0, Important Note). This value is the probability that a true affected individual is misclassified as a control. It must lie between 0.0 and 1.0. This value may also be thought of as 1.0 minus the sensitivity of the diagnostic instrument used to phenotype an individual.

 

1.1.5.2 Pr(observed case | true unaffected)

 

For the distinction between unaffected and control, see above (Section 1.0, Important Note). This value is the probability that a true unaffected individual is misclassified as a case. It must lie between 0.0 and 1.0. This value may also be thought of as 1.0 minus the specificity of the diagnostic instrument used to phenotype an individual.

 

1.1.6 Significance level

 

Here, you specify the significance level of the test. This value is the probability of falsely rejecting a true null hypothesis. This number must be positive and less than 1. Typically, it is chosen to be less than or equal to 0.05.

 

1.1.7 Disease prevalence

 

This value is the probability that a randomly selected individual from the sample population is truly affected. It is a value that ranges between 0.0 and 1.0.

 

2.1 Output

 

The PAWE-PH program reports most of the input parameters that the user enters, as well as the following items:

 

2.1.1 Non-centrality parameters

 

For a given test of association (allelic or genotypic) and a given significance level, this parameter completely determines either the asymptotic power or the sample size. To see how the asymptotic power calculations are performed, please see our original PAWE paper [7] and our most recent reference [4]. The non-centrality parameters for both errorless data and data with errors are presented.
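To make the role of the non-centrality parameter concrete, the sketch below computes one common form of it, the Pearson chi-square statistic evaluated at the expected genotype frequencies, and then the asymptotic power of the 2 degree-of-freedom genotypic test. This is an illustration and may differ in detail from the program's own calculation; see [4, 7] for the authors' derivation. The function names and example frequencies are hypothetical. The power is obtained from the non-central chi-square tail written as a Poisson mixture of central chi-square tails, which has a closed form for even degrees of freedom.

```python
import math


def noncentrality(case_freqs, control_freqs, n_cases, n_controls):
    """Pearson chi-square statistic evaluated at the expected genotype frequencies."""
    total = 0.0
    for p_case, p_ctrl in zip(case_freqs, control_freqs):
        denom = n_cases * p_case + n_controls * p_ctrl
        if denom > 0.0:
            total += (p_case - p_ctrl) ** 2 / denom
    return n_cases * n_controls * total


def power_2df(ncp, alpha):
    """Asymptotic power of a 2-df chi-square test with non-centrality ncp."""
    if ncp < 0.0 or not 0.0 < alpha < 1.0:
        raise ValueError("need ncp >= 0 and 0 < alpha < 1")
    half_x = -math.log(alpha)  # half the critical value, since Pr(X > x) = exp(-x/2) for 2 df
    half_ncp = ncp / 2.0
    n_terms = int(half_ncp + 10.0 * math.sqrt(half_ncp) + 50.0)
    term = alpha               # exp(-x/2) * (x/2)^j / j! for j = 0
    central_tail = alpha       # tail probability of a central chi-square with 2 + 2i df
    power = 0.0
    for i in range(n_terms):
        if half_ncp > 0.0:
            # Poisson weight, computed on the log scale to avoid underflow.
            weight = math.exp(-half_ncp + i * math.log(half_ncp) - math.lgamma(i + 1))
        else:
            weight = 1.0 if i == 0 else 0.0
        power += weight * central_tail
        term *= half_x / (i + 1)
        central_tail += term
    return min(power, 1.0)


# Hypothetical genotype frequencies (11, 12, 22) in cases and controls.
lam = noncentrality((0.36, 0.48, 0.16), (0.25, 0.50, 0.25), 200, 200)
power = power_2df(lam, alpha=0.05)
print(lam, power)
```

With a non-centrality parameter of 0 the function returns the significance level itself, as expected under the null hypothesis.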

 

2.1.2 Asymptotic power for fixed sample size - power loss

 

Based on the value of the non-centrality parameter for the genotypic test of association, the asymptotic power of the test is reported for both errorless data and data with errors. Also reported is the percent loss in power due to phenotype misclassification errors in the data.

 

2.1.3 Sample size increase for fixed asymptotic power

 

Based on the value of the non-centrality parameter for the genotypic test of association, the minimum sample size (number of cases and number of controls) is reported for both errorless data and data with errors. Also reported is the percent increase in sample size needed to maintain constant power when phenotype misclassification errors are present.
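A sample size calculation of this kind can be sketched as follows. The sketch assumes the non-centrality parameter is the Pearson chi-square statistic evaluated at the expected genotype frequencies, so that it grows linearly with the number of cases when the control-to-case ratio is fixed; it finds the non-centrality value giving the target power by bisection and then solves for the number of cases. It requires a target power strictly below 1. The function names and example frequencies are hypothetical, and the program's own method may differ in detail [4, 7].

```python
import math


def power_2df(ncp, alpha):
    """Asymptotic power of a 2-df chi-square test (Poisson mixture of central tails)."""
    half_x, half_ncp = -math.log(alpha), ncp / 2.0
    term = tail = alpha
    power = 0.0
    for i in range(int(half_ncp + 10.0 * math.sqrt(half_ncp) + 50.0)):
        if half_ncp > 0.0:
            weight = math.exp(-half_ncp + i * math.log(half_ncp) - math.lgamma(i + 1))
        else:
            weight = 1.0 if i == 0 else 0.0
        power += weight * tail
        term *= half_x / (i + 1)
        tail += term
    return min(power, 1.0)


def required_ncp(target_power, alpha, tol=1e-9):
    """Smallest non-centrality parameter (to within tol) reaching the target power."""
    if not alpha < target_power < 1.0:
        raise ValueError("target power must lie strictly between alpha and 1")
    lo, hi = 0.0, 1.0
    while power_2df(hi, alpha) < target_power:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if power_2df(mid, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi


def sample_sizes(case_freqs, control_freqs, ratio, target_power, alpha):
    """Return (n_cases, n_controls) for a given control-to-case ratio."""
    # Non-centrality contributed per case when n_controls = ratio * n_cases.
    unit = ratio * sum(
        (pc - pu) ** 2 / (pc + ratio * pu)
        for pc, pu in zip(case_freqs, control_freqs)
        if pc + ratio * pu > 0.0
    )
    if unit <= 0.0:
        raise ValueError("case and control genotype frequencies are identical")
    n_cases = math.ceil(required_ncp(target_power, alpha) / unit)
    return n_cases, math.ceil(ratio * n_cases), unit


n_cases, n_controls, unit = sample_sizes(
    (0.36, 0.48, 0.16), (0.25, 0.50, 0.25), ratio=1.0, target_power=0.80, alpha=0.05
)
print(n_cases, n_controls)
```

Running the same function once with errorless frequencies and once with frequencies computed under misclassification gives the two sample sizes whose ratio determines the percent increase.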

 

2.1.4 Genotype and allele frequencies for errorless data

 

Based on the parameters entered for genotype frequency generation (Section 1.1.4), the genotype frequencies in cases and controls are computed for errorless data (i.e., assuming no phenotype misclassification).

 

2.1.5 Matrix of Penetrances

 

The entries of this matrix are the conditional probabilities Pr(observed phenotype = i | true phenotype = j) for i being either case or control and j being either affected or unaffected. For more on this terminology, see above (Section 1.0 – Important Note).
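The matrix can be written down directly from the two error probabilities of Section 1.1.5, since each column (one per true phenotype) must sum to 1. The following sketch uses hypothetical names and a dictionary keyed by (observed, true) phenotype; it is an illustration, not the program's internal representation.

```python
def misclassification_matrix(pr_control_given_affected, pr_case_given_unaffected):
    """Return Pr(observed phenotype | true phenotype) keyed by (observed, true)."""
    for value in (pr_control_given_affected, pr_case_given_unaffected):
        if not 0.0 <= value <= 1.0:
            raise ValueError("error probabilities must lie between 0 and 1")
    return {
        ("case", "affected"): 1.0 - pr_control_given_affected,      # sensitivity
        ("control", "affected"): pr_control_given_affected,
        ("case", "unaffected"): pr_case_given_unaffected,
        ("control", "unaffected"): 1.0 - pr_case_given_unaffected,  # specificity
    }


matrix = misclassification_matrix(0.10, 0.05)
for key, value in matrix.items():
    print(key, value)
```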

 

2.1.6 Genotype frequencies for error data

Using the genotype frequencies in the affected and unaffected populations for errorless data (Section 2.1.4), the matrix of penetrances (Section 2.1.5), and the disease prevalence (Section 1.1.7), genotype frequencies in cases and controls are computed for data with phenotype misclassification error. To see how the case and control genotype frequencies are calculated using these parameters, click on the link: PAWE-PH2. These frequencies are used to compute the non-centrality parameters for error data (Section 2.1.1).
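One way to obtain such frequencies is through Bayes' rule, as in the sketch below. It assumes that misclassification depends only on the true phenotype and not on the genotype; the exact formulas used by the program are those in the linked PAWE-PH2 document and in [4]. All names and example values are hypothetical.

```python
def observed_genotype_freqs(aff, unaff, prevalence,
                            pr_control_given_affected, pr_case_given_unaffected):
    """Return genotype frequencies among observed cases and observed controls.

    aff and unaff map genotype labels to frequencies in truly affected and
    truly unaffected individuals. Assumes errors do not depend on genotype.
    """
    k = prevalence
    # Joint probabilities of (observed phenotype, true phenotype).
    case_aff = (1.0 - pr_control_given_affected) * k
    case_unaff = pr_case_given_unaffected * (1.0 - k)
    ctrl_aff = pr_control_given_affected * k
    ctrl_unaff = (1.0 - pr_case_given_unaffected) * (1.0 - k)
    pr_case, pr_ctrl = case_aff + case_unaff, ctrl_aff + ctrl_unaff
    if pr_case <= 0.0 or pr_ctrl <= 0.0:
        raise ValueError("parameters imply that no cases or no controls are observed")
    cases = {g: (case_aff * aff[g] + case_unaff * unaff[g]) / pr_case for g in aff}
    controls = {g: (ctrl_aff * aff[g] + ctrl_unaff * unaff[g]) / pr_ctrl for g in aff}
    return cases, controls


aff = {"11": 0.36, "12": 0.48, "22": 0.16}
unaff = {"11": 0.25, "12": 0.50, "22": 0.25}
cases, controls = observed_genotype_freqs(aff, unaff, 0.10, 0.10, 0.05)
print(cases)
print(controls)
```

Misclassification mixes the two populations, so the case and control frequencies move toward each other. That shrinks the non-centrality parameter and therefore the power.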

 

3.1 References

 

Please cite the following two references when reporting results using PAWE-PH:

 

Gordon D, Finch SJ, Nothnagel M, Ott J: Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 2002, 54(1):22-33. 

 

Edwards BJ, Haynes C, Levenstien MA, Finch SJ, Gordon D: Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Bioinformatics 2005:(under review).

 

References

1.         Hao K, Li C, Rosenow C, Hung Wong W: Estimation of genotype error rate using samples with pedigree information--an application on the GeneChip Mapping 10K array. Genomics 2004, 84(4):623-630.

2.         Bross I: Misclassification in 2 x 2 tables. Biometrics 1954, 10:478-486.

3.         Zheng G, Tian X: The impact of diagnostic error on testing genetic association in case-control studies. Stat Med 2005, 24(6):869-882.

4.         Edwards BJ, Haynes C, Levenstien MA, Finch SJ, Gordon D: Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Bioinformatics 2005:(under review).

5.         Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet 1993, 53(5):1114-1126.

6.         Lewontin RC: The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 1964, 49:49-67.

7.         Gordon D, Finch SJ, Nothnagel M, Ott J: Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 2002, 54(1):22-33.