PAWE – Phenotype Edition
Version 2.0 Help File, March 2005
Written by Derek Gordon
1.0 Overview and Purpose
The name of this program is PAWE – Phenotype Edition (PAWE-PH).
PAWE stands for Power of Association With Errors.
The original version of PAWE
is concerned with genotype errors. However, of equal importance (if not greater
importance given that newer genotyping technologies are reporting smaller error
rates [1]) are phenotype misclassification errors [2]. These errors are often referred to as
diagnostic errors [3]. Recently published research [3, 4] indicates that phenotype
misclassification errors can substantially decrease the asymptotic power to
detect genetic association. As with the original PAWE program for genotyping
error, the purpose of the PAWE-PH program is twofold: (i) to perform power and
sample size calculations for genetic case-control association studies in the
presence of phenotype misclassification errors, and (ii) to determine
quantitatively how much, in terms of decrease in asymptotic power for a fixed
sample size, or increase in sample size to maintain constant asymptotic power,
phenotype misclassification errors cost the researcher performing genetic
association studies with cases and controls. Thus, results from the PAWE-PH
program will be either asymptotic power or sample size values.
This file is designed to explain to you the different values
entered into the PAWE-PH program, so that a meaningful answer is obtained. This
program is designed to perform asymptotic power and sample size calculations
for genetic case-control studies with a di-allelic locus (for example, a SNP)
in the presence of errors. The test statistic considered is the standard
chi-square test of independence applied to genotypes. In what follows, for the
genetic model-free scenario, it will be assumed that there is a di-allelic
marker locus with two alleles, denoted by 1 and 2. In the genetic model-based
framework, we assume there is an (unobserved) trait locus with two alleles:
a wild-type allele or low risk allele, denoted by +, and a trait or high-risk
allele, denoted by d. Also, we assume that there is a marker locus with
two alleles, denoted by 1 and 2.
Important note: When discussing case/control studies in the presence of phenotype
misclassification, we make a distinction between the terms case and affected,
and similarly between the terms control and unaffected. A case
individual is someone who is diagnosed as being affected, whether or not the
individual is affected. An affected individual is someone who is truly
affected. It is assumed that we collect only cases and controls; that is, some
random proportion of the cases are truly unaffected individuals, and similarly
some proportion of the controls are truly affected. For more information, see [4].
1.1 Parameter Settings
1.1.1 Asymptotic Power or Sample Size
The researcher who has a fixed sample size and wants to know the
value of asymptotic power for his/her study in the presence of errors should
choose the option "power for a
fixed sample size". The researcher who is planning a study, and who wants
to know what sample size is necessary, in the presence of errors, to achieve a
given asymptotic power level should choose "sample size for fixed
power".
1.1.2 Asymptotic Power for a fixed sample size
1.1.2.1 Number of cases
In this box, you specify the number of case individuals you have.
These individuals are assumed to be both phenotyped and genotyped. The number
typed in this box must be a positive integer.
1.1.2.2 Number of controls
In this box, you specify the number of control individuals you
have. These individuals are assumed to be both phenotyped and genotyped. The number typed in this box must be a
positive integer.
1.1.3 Sample Size for a fixed asymptotic power
1.1.3.1 Asymptotic Power level
In this box, you specify the asymptotic power you would like for
your study. This number must be greater than 0 and less than or equal to 1. It
is usually a number closer to 1.
1.1.3.2 Ratio of Controls to Cases
In this box, you specify the ratio of the number of controls to
the number of cases that you expect to have for your study. The number entered
here must be a positive real number.
1.1.4 Genotype Frequency Generation
1.1.4.1 Genetic model free method
You choose this option if you do not know the genetic model
parameters such as penetrances, disease allele frequency, or proportion of
linkage disequilibrium for your study. When choosing the genetic model free
method, you specify the frequency distribution of genotypes 11, 12, and 22 for
the affected population and the unaffected population. You can assume or not
assume Hardy Weinberg equilibrium (HWE) for both the affected and unaffected
population.
1.1.4.1.1 Hardy Weinberg equilibrium Assumed
If you assume HWE, then the genotype frequency distribution is a
function of a single parameter p that you enter into this box. The
parameter p must be a real, positive number less than 1. The genotype
frequencies of 11, 12, and 22 are then p^2, 2p(1-p), and (1-p)^2, respectively.
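As an illustration of the HWE option, the following minimal Python sketch computes the three genotype frequencies from a single value p. It assumes that p is the frequency of the 1 allele; the function name is hypothetical and is not part of PAWE-PH.

```python
def hwe_genotype_frequencies(p):
    """Return (freq_11, freq_12, freq_22) under Hardy Weinberg equilibrium.

    Assumption: p is the frequency of the 1 allele, so 1 - p is the
    frequency of the 2 allele.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be a real number strictly between 0 and 1")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


freq_11, freq_12, freq_22 = hwe_genotype_frequencies(0.3)
print(freq_11, freq_12, freq_22)
```

The three returned values always sum to 1, which is a quick check on the input.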
1.1.4.1.2 No Assumption of Hardy Weinberg equilibrium
If you do not assume HWE, then the genotype frequency distribution
is a function of two parameters: the genotype frequency for the 11 genotype
and the genotype frequency for the 22 genotype. The genotype frequency for
the 12 genotype is given by 1 minus the sum of these two parameters. Thus,
the numbers you enter into these two boxes must be positive, real numbers
whose sum is less than 1.
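The constraint on the two entries can be checked with a short Python sketch. The function and argument names are hypothetical; the only logic used is that the 12 genotype frequency is whatever remains after the 11 and 22 frequencies are subtracted from 1.

```python
def genotype_frequencies_no_hwe(freq_11, freq_22):
    """Return (freq_11, freq_12, freq_22) without assuming HWE."""
    if freq_11 <= 0.0 or freq_22 <= 0.0:
        raise ValueError("both frequencies must be positive")
    if freq_11 + freq_22 >= 1.0:
        raise ValueError("the two frequencies must sum to less than 1")
    # The heterozygote frequency is the remainder.
    freq_12 = 1.0 - freq_11 - freq_22
    return freq_11, freq_12, freq_22


print(genotype_frequencies_no_hwe(0.5, 0.1))
```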
1.1.4.2 Genetic model based method
You choose this option if you have estimates for the following
five parameters: the genotype relative risks R1 and R2 [5], the disease
allele frequency, the marker 1-allele frequency, and the proportion of
linkage disequilibrium (D') [6]. The genotype relative risks are defined
as: R1 = f1/f0, R2 = f2/f0, where fi = Pr(affected | i copies of the d
allele). The genotype relative risk parameters must be real numbers greater
than 1. The disease allele frequency, the marker 1-allele frequency, and the
D' parameters must be positive real numbers that are less than 1. To see how
the genotype relative risks, disease allele frequency, and disease prevalence
(below – Section 1.1.7) are converted to the usual disease penetrance
parameters, click on the following link: PAWE-PH1
Note: D'=1 means complete disequilibrium, the best case scenario;
D'=0 means no disequilibrium, the null scenario.
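The conversion from genotype relative risks to penetrances can be sketched in Python as follows. The linked PAWE-PH1 page is not reproduced here, so this is a sketch under stated assumptions rather than the program's own code: it assumes Hardy Weinberg equilibrium at the trait locus and the definition R1 = f1/f0, R2 = f2/f0 of [5], where f0, f1, and f2 are the penetrances of the ++, +d, and dd genotypes. Under those assumptions the prevalence K satisfies K = f0 * [(1-q)^2 + 2q(1-q)R1 + q^2 R2], with q the disease allele frequency. All names are hypothetical.

```python
def penetrances_from_grr(r1, r2, disease_allele_freq, prevalence):
    """Return (f0, f1, f2), the penetrances of genotypes ++, +d, and dd.

    Assumptions: HWE at the trait locus; R1 = f1/f0 and R2 = f2/f0.
    """
    q = disease_allele_freq
    # Prevalence divided by f0, obtained by summing over trait genotypes.
    weight = (1.0 - q) ** 2 + 2.0 * q * (1.0 - q) * r1 + q ** 2 * r2
    f0 = prevalence / weight
    f1 = r1 * f0
    f2 = r2 * f0
    if max(f0, f1, f2) > 1.0:
        raise ValueError("parameters imply a penetrance greater than 1")
    return f0, f1, f2


f0, f1, f2 = penetrances_from_grr(r1=2.0, r2=4.0,
                                  disease_allele_freq=0.1, prevalence=0.05)
print(f0, f1, f2)
```

Averaging the three penetrances over the trait genotype frequencies recovers the prevalence, which gives a simple consistency check.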
1.1.5 Error model parameters
For phenotype misclassification error in case/control genetic
association analyses, we consider two types of misclassification: the
probability that a true affected individual is misclassified as an observed
control, and the probability that a true unaffected individual is misclassified
as an observed case. To our knowledge, Bross [2] was the first to consider these
misclassification parameters in case/control association studies.
1.1.5.1 Pr(observed control | true affected)
For the distinction between affected and case, see above (Section
1.0 – Important Note). This value is the probability that a true affected
individual is misclassified as a control. It must range between 0.0 and 1.0.
This value may also be thought of as 1.0 – Sensitivity of the diagnostic
instrument used to phenotype an individual.
1.1.5.2 Pr(observed case | true unaffected)
For the distinction between unaffected and control, see above
(Section 1.0 – Important Note). This value is the probability that a true
unaffected individual is misclassified as a case. It must range between 0.0 and
1.0. This value may also be thought of as 1.0 – Specificity of the
diagnostic instrument used to phenotype an individual.
1.1.6 Significance level
Here, you specify the significance level of the test. This value
is the probability of falsely rejecting a true null hypothesis. This number
must be positive and less than 1. Typically, it is chosen to be less than or
equal to 0.05.
1.1.7 Disease prevalence
This value is the probability that a randomly selected individual
from the sample population is truly affected. It is a value that ranges between
0.0 and 1.0.
2.1 Output
The PAWE-PH program reports most of the input parameters that the
user enters, as well as the following items:
2.1.1 Non-centrality parameters
For a given test of association (allelic or genotypic), this
parameter completely determines either the asymptotic power or sample size
calculations. To see how the asymptotic power calculations are performed,
please see our original PAWE paper [7] and our most recent reference [4].
The non-centrality parameters for both errorless data and data with
errors are presented.
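For orientation, a non-centrality parameter for the genotypic test can be sketched in Python by substituting the expected genotype frequencies into the Pearson chi-square statistic for a 2 x 3 table. This is an assumption about the form of the calculation, not a transcription of the program; the exact expression used by PAWE-PH is given in [7] and [4] and may differ. The function name is hypothetical.

```python
def ncp_genotypic(n_cases, n_controls, case_freqs, control_freqs):
    """Non-centrality parameter for the 2 x 3 genotypic chi-square test.

    case_freqs and control_freqs are the genotype frequencies (11, 12, 22)
    in cases and controls. The value is the Pearson statistic evaluated at
    the expected cell counts.
    """
    total = 0.0
    for p, q in zip(case_freqs, control_freqs):
        denom = n_cases * p + n_controls * q
        if denom > 0.0:  # skip genotypes that are absent in both groups
            total += (p - q) ** 2 / denom
    return n_cases * n_controls * total


print(ncp_genotypic(100, 100, (0.3, 0.5, 0.2), (0.2, 0.5, 0.3)))
```

With this form, the parameter is zero when cases and controls have identical genotype frequencies, and it grows in proportion to the sample size when the ratio of controls to cases is held fixed.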
2.1.2 Asymptotic power for fixed sample size - power loss
Based on the value of the non-centrality parameter for the
genotypic test of association, the asymptotic power of the test is reported for
both errorless data and data with errors. Also reported is the percent loss in
power due to phenotype misclassification errors in the data.
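The step from a non-centrality parameter to asymptotic power can be sketched in Python using only the standard library. The sketch assumes the genotypic test has 2 degrees of freedom, so that the critical value is -2 ln(alpha) and the non-central chi-square distribution can be written as a Poisson mixture of central chi-square distributions with even degrees of freedom. The function names are hypothetical and this is not the program's own code.

```python
import math


def chi2_even_df_cdf(x, k):
    """CDF at x of a central chi-square with 2k degrees of freedom (k >= 1)."""
    half = x / 2.0
    term, total = 1.0, 0.0
    for m in range(k):
        if m > 0:
            term *= half / m  # term is now (x/2)^m / m!
        total += term
    return 1.0 - math.exp(-half) * total


def power_2df(ncp, alpha):
    """Asymptotic power of a 2 df chi-square test with the given ncp."""
    critical = -2.0 * math.log(alpha)  # upper-alpha point for 2 df
    half = ncp / 2.0
    # Enough Poisson terms to pass well beyond the mean of the mixture.
    n_terms = int(half + 12.0 * math.sqrt(half)) + 50
    weight = math.exp(-half)  # Poisson(ncp/2) probability of j = 0
    cdf = 0.0
    for j in range(n_terms):
        if j > 0:
            weight *= half / j
        cdf += weight * chi2_even_df_cdf(critical, j + 1)
    return 1.0 - cdf


print(power_2df(ncp=9.635, alpha=0.05))
```

For very large non-centrality parameters the initial Poisson weight underflows to zero and the function returns 1.0, which is the practical answer in that range.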
2.1.3 Sample size increase for fixed asymptotic power
Based on the value of the non-centrality parameter for the
genotypic test of association, the minimum sample size of cases and controls is
reported for both errorless data and data with errors. Also reported is the percent increase in
sample size needed to maintain constant power when phenotype misclassification
errors are present.
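The percent increase can be approximated with a short Python sketch. It assumes that, for a fixed ratio of controls to cases, the non-centrality parameter is proportional to the sample size, so the sample size needed for a given power scales inversely with the non-centrality parameter per subject. The program reports whole-number minimum sample sizes, so its figures may differ slightly from this unrounded ratio. The function name is hypothetical.

```python
def percent_sample_size_increase(ncp_errorless, ncp_with_errors):
    """Percent increase in sample size needed to offset misclassification.

    Both non-centrality parameters must be computed at the same sample
    size and the same ratio of controls to cases.
    """
    if ncp_errorless <= 0.0 or ncp_with_errors <= 0.0:
        raise ValueError("non-centrality parameters must be positive")
    return 100.0 * (ncp_errorless / ncp_with_errors - 1.0)


print(percent_sample_size_increase(10.0, 8.0))
```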
2.1.4 Genotype and allele frequencies for errorless data
Based on the parameters entered for genotype frequency generation
(Section 1.1.4), the genotype frequencies in cases and controls are computed for
errorless data (i.e., assuming no phenotype misclassification).
2.1.5 Matrix of Penetrances
The entries of this matrix are the conditional probabilities
Pr(observed phenotype = i | true phenotype = j) for i
being either case or control and j being either affected or unaffected.
For more on this terminology, see above (Section 1.0 – Important Note).
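The four entries of this matrix follow directly from the two error model parameters of Section 1.1.5, as the following Python sketch shows. The function name and the dictionary layout are hypothetical choices made for illustration.

```python
def misclassification_matrix(pr_control_given_affected,
                             pr_case_given_unaffected):
    """Return Pr(observed phenotype | true phenotype) keyed by (observed, true)."""
    for value in (pr_control_given_affected, pr_case_given_unaffected):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie between 0.0 and 1.0")
    return {
        # 1 - Pr(observed control | true affected) is the sensitivity.
        ("case", "affected"): 1.0 - pr_control_given_affected,
        ("control", "affected"): pr_control_given_affected,
        ("case", "unaffected"): pr_case_given_unaffected,
        # 1 - Pr(observed case | true unaffected) is the specificity.
        ("control", "unaffected"): 1.0 - pr_case_given_unaffected,
    }


matrix = misclassification_matrix(0.10, 0.05)
print(matrix[("case", "affected")], matrix[("control", "unaffected")])
```

For each true phenotype, the two conditional probabilities sum to 1.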
2.1.6 Genotype frequencies for error data
Using the genotype frequencies in the affected and unaffected
populations for errorless data (Section 2.1.4), the matrix of penetrances
(Section 2.1.5), and the disease prevalence (Section 1.1.7), genotype
frequencies in cases and controls are computed for data with phenotype
misclassification error. To see
how the case and control genotype frequencies are calculated using these
parameters, click on the link: PAWE-PH2. These
frequencies are used to compute the non-centrality parameters for error data
(Section 2.1.1).
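One way to picture this calculation is as an application of Bayes' theorem, sketched in Python below. The linked PAWE-PH2 page is not reproduced here, so this sketch rests on an assumption: misclassification does not depend on genotype, so the observed case group is a mixture of truly affected and truly unaffected individuals (and likewise for controls). The function and argument names are hypothetical.

```python
def observed_genotype_frequencies(affected_freqs, unaffected_freqs,
                                  prevalence,
                                  pr_control_given_affected,
                                  pr_case_given_unaffected):
    """Return (case_freqs, control_freqs) in the presence of misclassification.

    Assumption: misclassification is independent of genotype.
    """
    k = prevalence
    theta = pr_control_given_affected  # 1 - sensitivity
    phi = pr_case_given_unaffected     # 1 - specificity
    # Marginal probabilities of being observed as a case or as a control.
    pr_case = (1.0 - theta) * k + phi * (1.0 - k)
    pr_control = theta * k + (1.0 - phi) * (1.0 - k)
    if pr_case <= 0.0 or pr_control <= 0.0:
        raise ValueError("parameters leave no observed cases or no controls")
    pairs = list(zip(affected_freqs, unaffected_freqs))
    case_freqs = [((1.0 - theta) * k * a + phi * (1.0 - k) * u) / pr_case
                  for a, u in pairs]
    control_freqs = [(theta * k * a + (1.0 - phi) * (1.0 - k) * u) / pr_control
                     for a, u in pairs]
    return case_freqs, control_freqs


cases, controls = observed_genotype_frequencies(
    (0.2, 0.5, 0.3), (0.1, 0.4, 0.5),
    prevalence=0.1,
    pr_control_given_affected=0.10,
    pr_case_given_unaffected=0.05)
print(cases, controls)
```

When both error probabilities are zero, the case frequencies equal the affected frequencies and the control frequencies equal the unaffected frequencies. With errors, each observed group is pulled toward the other, which is why the non-centrality parameter for error data is smaller.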
3.1 References
Please cite the following two references when reporting results
using PAWE-PH:
Gordon D, Finch SJ, Nothnagel M, Ott J: Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 2002, 54(1):22-33.
Edwards BJ, Haynes C, Levenstien MA, Finch SJ, Gordon D: Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Bioinformatics 2005:(under review).
1. Hao K, Li C, Rosenow C, Hung Wong W: Estimation of genotype error rate using samples with pedigree information--an application on the GeneChip Mapping 10K array. Genomics 2004, 84(4):623-630.
2. Bross I: Misclassification in 2 x 2 tables. Biometrics 1954, 10:478-486.
3. Zheng G, Tian X: The impact of diagnostic error on testing genetic association in case-control studies. Stat Med 2005, 24(6):869-882.
4. Edwards BJ, Haynes C, Levenstien MA, Finch SJ, Gordon D: Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Bioinformatics 2005:(under review).
5. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet 1993, 53(5):1114-1126.
6. Lewontin RC: The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 1964, 49:49-67.
7. Gordon D, Finch SJ, Nothnagel M, Ott J: Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 2002, 54(1):22-33.