Josephine Hoh
|
Jurg Ott
|
Yale University, New Haven
|
Institute of Psychology CAS, Beijing
|
josephine.hoh@yale.edu
|
jurg.ott@gmail.com
|
27 March 2011
|
http://linkage.rockefeller.edu/ott/
|
S statistic in gene mapping
(Sumstat and
Statpval
computer programs)
Methodology
The general approach is described in Hoh et al. (2001). Briefly,
consider a number of marker loci in the genome. At each marker,
genotypes are available for two types of observations (but see below
for quantitative trait phenotypes), for example:
- Case and control individuals, or
- Two types of sibling pairs, affected-affected (AA) and
affected-unaffected (AU), or
- Families for which two types of genotypes are available, (1)
observed genotypes and (2) genotypes generated under the hypothesis of
no linkage, for example, with the simulate2
computer program.
In the current implementation, marker loci must be SNPs (two alleles each). The programs come in different versions
(click to download):
- Regular version, sumstatR: The general idea is to find a set of SNPs (variables, generally)
that jointly are associated with disease. We do this by initially
computing an association statistic for each SNP, for example, χ2
for a 2 × 3 table, where the two rows correspond to
cases and controls, and the three columns refer to SNP genotypes. Then
markers are ordered by the size of their test statistics, and sums, S, are formed sequentially,
starting with the largest test statistic and gradually adding one after
another SNP up to a maximum number of terms in S. For example, S3 will be the sum of
the three largest test statistics for SNPs wherever they are in the
genome. For each Si,
an associated significance level is computed with permutation testing.
Most of the outline below refers to this standard version.
- S version, sumstatS:
The program can accommodate larger numbers of SNPs and
observations than the standard version because it writes to and reads
from disk (unformatted, fast reads) what it needs in each interation while the regular version
keeps all data in memory.
- Q version, sumstatQ:
Output variables (phenotypes) are quantitative traits, QTLs. This
program version is different from the other two in the test statistics it allows.
The sumstatR and sumstatS programs come in one package.
The problem handled here is discussed in Manly (2007) under the
heading
of Comparison of Sample Mean Vectors.
The use of a sum of univariate test
statistics (which is what we do here) as a multivariate test statistic
for a randomization test seems to have been first proposed by Chung and
Fraser (1958).
Permutation tests
For each of our sum statistics, determining the associated
empirical significance level
is mathematically intractable, which
is why we perform permutation (randomization) tests (Manly 2007). Under
the hypothesis of no association, any permutation of the labels for the
two outcome types (case, control) is expected to be equally likely.
Because of the
potentially very large number of possible permutations, we take a
random sample of permutations and approximate empirical significance
levels. Based on the given array of case and control labels in the
observed data, random permutations of labels are generated sequentially
according to an algorithm by Nijenhuis and Wilf (1978). Specifically,
let Sobs be the
sum statistic obtained in the observed data for a given number of SNPs.
In each permutation sample (same genotypes as observed but labels
"affected" and "unaffected" permuted), we compute a sum statistic, Sperm, in the same way
as in the observed data. The proportion of permutation samples for
which Sperm ≥ Sobs is then an estimate
for the empirical significance level associated with Sobs.
A second level of permutations is performed (in the statpval program) as follows.
Assume that we
evaluate
N sum statistics, that is, the
N-th sum statistic contains
the test statistics summed over N
SNPs (N = 15 is generally a
suitable
number). For each sum statistic with a given number of terms, the
associated
empirical significance level is determined. We then take the smallest
of
these as our single overall statistic of interest and must determine
its associated (experiment-wise) significance level. This
is done on the basis of the permutation samples obtained in level (1)
(Manly 2007, section 6.8): Each of these samples in turn is taken as an
"observed" (test) dataset
and the remaining permutation samples are used to evaluate the
significance level of the smallest p-value
in the "observed" permutation sample.
Some authors argue that under the null hypothesis the observed data
should be included in the randomized data but we don't do this here.
With missing observations, it may happen for
one or more variables that a permutation sample
contains observations on only one class (either cases or controls).
Such ill-conditioned permutation samples will be skipped.
Implementation
The approach described above has been implemented in two computer
programs, sumstat and statpval, where sumstat computes sum
statistics and their associated p-values,
and statpval
evaluates the (experiment-wise) significance level associated with the
smallest of these p-values.
All p-values are evaluated
on a genome-wide basis (i.e. for as many input variables as there are
in the study). The user may restrict attention to a subset of
variables, for example, the markers on a given chromosome.
Input and output for Sumstat programs
This section refers to sumstatR and sumstatS. The sumstat program works with the following two files, both in text
(ASCII)
format. Note new
input format.
1. Data file
This input file (named in the parameter file below)
should consist of a matrix with K
+ 1 rows and N
columns, where N = number of
observations
(e.g., case and control individuals) and K = number of input variables (eg.
SNPs).
Each cell in the matrix contains a genotype code. The last row contains the
code for the type of observation (e.g., 2 = case and 1 = control). Optionally,
the N input observations
on each row (for a given marker) may be followed by three
quantities: (1) Chromosome number, (2) position in base pairs of the
marker on the chromosome, and (3) marker name. These items will allow
for an easy ordering of markers by their chromosomal positions, for
example, with the sort
command in Windows (not really relevant for sumstat,
but for scanstat).
On the last input line, you may use the
following dummy values: (1) 99, (2) 99, (3) xx.
NOTE:
Do not use marker names starting with x, &, %, $, followed by
digits or else the program will interpret these as numbers to a base
other than 10. For example, x11 will be interpreted as a hexadecimal
number equal to (decimal) 17. If you must use x plus digits then
use xx plus digits, for example, xx11. The program will recognize this
as text, which it must do to properly count individuals.
Your data may be in plink format, in which case you may use the included program, p2s, to convert the plink files to sumstat format. Using plink and assuming you have files called data.ped and data.map, you first must produce transposed datasets with the command,
plink --file data --transpose --recode12
Then run the conversion program, p2s, and follow instructions.
2. Parameter file
The sumstat program reads
parameter values interactively from the
standard input (i.e. the keyboard). In practice, it is convenient
to collect these values in a file named inputxx.txt and
call the program with the command, gosum
xx (see below). An example for such a parameter file is as
follows:
sumstat: Sample data, generated. Heritability 0.50, threshold 2.32
1 0 codes for genotypes and missing
1 1 code for test statistic, lambda
15 max # terms in sum
2000 number of permutation samples
0 list number of test-SNP for interaction. Below: Input file name
sumstatSample.dat
The various lines of input specify the following parameter values (this text is also contained in the inputSample.txt file):
Line 1: Any text up to 250 characters.
Line 2:
Code for genotypes given in the data:
-2 = any observed values (no adjustments will be made)
-1 = the codes are -1, 0, 1 (genotype codes will
0 = the codes are 0, 1, 2 (be adjusted to 1, 2, 3
1 = the codes are 1, 2, 3 (internally.
Value for "missing"
Line 3:
Code for single-locus test statistic (for details, see below)
Lambda value for genomic control (if statistics code=1 or code=12)
Line 4:
Maximum number of terms in sum
Line 5:
Number of permutation samples
Line 6:
List number of test SNP, which will be paired with all other SNPs in
turn (statistics 15, 16, 17, 19, 20). For statistic code 16, if
SNP number is given as a positive number then the maximum of the 3
chi-squares serves as test statistic, otherwise it's the sum of the 3
chi-squares.
Line 7:
Name of input file. No other characters before or after the name.
----------------------------------------
Codes on line 4:
1 = chi-square for 2 x 3 table, case-control versus 3 genotypes
2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW.
10 = Armitage test for trend
11 = t-test. SEE NOTE BELOW.
12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles
14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW.
15
= chi-square for interaction between given SNP and all other SNPs.
Homogeneity of odds ratio between cases and controls for largest of 4
independent 1-df chi-squares.
16 = chi-squares for 2 x 3 table of
genotypes conditioned on genotypes of test SNP. If SNP number on line 8
is given as a positive number (default) then the maximum of the 3
chi-squares serves as test statistic, otherwise it's the sum of the 3
chi-squares.
17 = chi-square for interaction between given SNP and
all other SNPs. Homogeneity of Delta between cases and controls for
largest of 4 independent Pearson chi-squares (1 df) of interaction.
18 = Max2 test: maximum chi-square for dominant and recessive genotype action.
19 = Homogeneity of correlation coef (via LR chi-square diff.) in cases and controls
20
= chi-squares for 2 x 3 table of genotypes conditioned on genotypes of
test SNP. Test statistic is the max. of -ln(p) for a) sum of 3
chi-squares and b) max. of 3 chi-squares (see code 16). This, this
statistic represents the combination of the two versions of test
statistic #16.
NOTE (statistic codes 2, 11, and 14)
When
applicable, absolute differences between the two groups are computed
(two-sided test). But when the statistic code is given as a negative
number then the difference type 2 minus type 1 observations will be
computed. For example, the test statistic with a code of 11 is
|t|. Also, a test code of -14 will test whether Fcase>Fctrl.
NOTE
Chi-square
is generally computed as a likelihood ratio (LR) statistic. If the
Pearson chi-square is desired, the appropriate program constant needs
to be changed and the program recompiled. Initially, cells in
contingency tables contain the value "empty", which currently is set
equal to 1/2.
3. Seed for random number generator
An input file called seed.txt holds a seed for the random
number
generator. This seed must be a positive integer number. At program
termination, this file will be overwritten with the ending seed so
that seeds are always updated from one to the next program run. We
implemented in our Pascal programs the newest
random
number generator, ran (int64), highly recommended by Press et
al
(2007), which has a period of approximately 1057. We are
grateful to Dr. Quan Long for helping us understand the C-code of this
random number generator.
4. Output file
The file, sumstat.out, will contain results. Most of the output
is
self-explanatory. The main table contains the following items:
- i = rank, consecutive number of sum formed
- Var# = consecutive number of given variable (SNP) in input
file. For example, 7
refers to the 7th variable.
- Stat = statistic used, for example, chi-square with 2 df.
- Sum = all the statistics summed up for SNPs listed in i
= 1 through the current value of i for the given SNP.
- pStat = p-value for given statistic, adjusted for multiple
testing
- p0Stat = nominal p-value, not adjusted for multiple testing
- pStat = p-value for given statistic, adjusted for multiple
testing
- pSum = p-value for the given sum, adjusted for multiple
testing, not yet adjusted for testing different sets of SNPs.
- The final p-value at the bottom is generally different from
any p-value listed in the table. It represents the significance level
associated with the smallest pSum value, which is taken to be the
single experiment-wise test statistic.
The file, statout.txt, will
contain values of permutation samples. This
file can potentially be very large. It is required as input to the
statpval program (see below)
but will not
generally be of interest to the user.
Input and output for the statpval program
This program works with two text files:
1. Input file, statout.txt, as
created by the sumstat program. It consists of a text file with (s
+ 1) rows and m columns, where s = number of
permutation samples and m
= maximum number of terms in the sum of test statistics. The table
entries are sum statistics except that the last row contains observed
p-values associated with sum statistics of given numbers of
terms.
2. Output file, statpval.out,
reporting the resulting significance
level associated with pmin.
Compiling
Source code may be compiled with the Free Pascal compiler, FPC. A
batch file, freepc.bat, is
provided to compile these programs so they reserved a stack space of 8
MB.
For example, issue the command (in a Command Window), freepc sumstat.
Running the programs
Initially you need to decide which of sumstatR and sumstatS you want to use. Assume this is sumstatS; then you need to copy sumstatS.exe to sumstat.exe.
You cannot simply click on program names. Instead you need to open
a command (DOS) box, for example, by clicking on Start and then on Command Prompt (in Windows XP). In
earlier Windows versions, click on Start,
then on Run... and type cmd.
It is also a good idea to change one of the standard features in
Windows: Make sure you see extensions of known file types on your
screen; for example, you should see sumstatS.exe not just sumstatS. In
Windows, the batch file, gosum.bat,
may be called with the command,
gosum xx, in which case sumstat will be run with a parameter
file
called inputxx.txt (previously
prepared by user), and the statpval
program will immediately
afterwards compute overall significance levels and append them to the
output. The resulting output file is sumstatxx.out
and intermediate files such as statout.txt
will automatically be deleted.
In Unix or Linux, a convenient way of running sumstat and statpval one
after the
other and in the background is to prepare a file with the following
lines, for example:
sumstat <input.txt >out
statpval >>out
cat statpval.out >>
sumstat.out
Make this file executable with the command, chmod +x gosum.
Issue the command, gosum&.
After the program terminates, the out
file will contain information on program progress (you may cat the out
file while the program is running) and may be discarded. Results are
found in the output file called sumstat.out.
Be sure to copy it to
some other file (name) so it won't be overwritten the next time you run
the programs.
The programs, written in Free
Pascal, are available for Windows. They may also be compiled under
Linux.
Sample data
Generated Data
The following data set (sumstatSample.dat)
was generated on the computer: 500
SNP markers, 250 cases, 250 control individuals, 10 susceptibility loci
at SNPs 1 through 10 obtained via Hartl and Clark's liability threshold
model, heritability for all 10 disease loci combined = 0.50, population
trait prevalence = 0.01.
An input parameter file, inputsample.txt,
is provided. The sample data may be run with the command,
gosum Sample.
The corresponding output file is sumstatSample.out.
sumstatQ program
This program version is patterned after an earlier version of sumstatS
and allows for quantitative traits as observed phenotypes.
For example, on each individual, hypertension may be observed as a QTL
(blood pressure). Multiple such measurements can be handled by the
program. For a given QTL and SNP, the following test statistics are
available:
- 19 = 1-way ANOVA for differences of QTL means among the
three SNP genotypes
- 20
= MAX2 test, the larger of two t-tests for the mean difference between
two genotype groups, (AA + AB) versus BB and AA versus (AB + BB).
If m QTLs are
observed on an individual, there must be corresponding m lines of observations appended to
the genotypes in the sumstat.dat
file. The program will take the largest QTL-specific test statistic for
a given SNP as that SNP's test statistic (Manly 2007). No missing
observations are allowed for QTLs (but genotypes may be missing as in
all program versions).
The parameter input file is slightly different for sumstatQ than for the other sumstat programs. Here is a sample
parameter file with briefexplanations. More detailed explanations may be found in the inputSample.txt file that comes with this program.
sumstatQ: Sample data, generated. QTL phenotype is 0/1
0 user-defined subset of variables
1 0 codes for genotypes, missing
19 1 code for test statistic, number of QTLs
0 subsets
15 max # terms in sum
10000 number of permutation samples
0 list number of test-SNP for interaction. Below: Input file name
sumstatSample.dat
0 code for output of observed and randomized test statistics
QC (QCQ) program
This quality control program takes as input a file in sumstat format, that is, a file
with rows as SNPs and columns as individuals (QC is the program for sumstat, QCQ is specifically for sumstatQ). The program filters out
SNPs according to the following (user-defined) criteria:
- Too high a proportion of missing individuals
- Only one genotype present
- MAF (minor allele frequency) too low.
It
also disregards individuals with too high a proportion of missing SNPs
and will write a new file with a name to be determined by the user. If
you initially keep your data in plink format it may be preferable to do
your QC in that program, which allows for steps of QC not found here,
for example, test for relatedness. After QC you may then want to
convert your file to sumstat format to take advantage of the features
of this program. The QC program is easiest to use with the command, QC <QCparam.txt.
References
Chung
JH, Fraser
DAS (1958) Randomization tests for a multivariate two sample problem.
Journal
of the American Statistical Association
53:729-735
de
Quervain DJ,
Poirier R, Wollmer MA, Grimaldi LM, Tsolaki M, Streffer JR, Hock C,
Nitsch RM,
Mohajeri MH, Papassotiropoulos A (2004) Glucocorticoid-related genetic
susceptibility for Alzheimer's disease. Hum Mol Genet 13:47-52 [an
example of the use of sum statistics]
Edgington ES, Onghena P (2007) Randomization Tests, 4th edition.
Chapman & Hall/CRC, Boca Raton
Hoh
J, Ott J
(2003) Mathematical multi-locus approaches to localizing complex human
trait
genes. Nat Rev Genet 4:701-709.
Hoh
J, Wille A,
Ott J (2001) Trimming, weighting, and grouping SNPs in human
case-control
association studies. Genome Res 11:2115-2119. [main reference for the
method described here]
Hoh
J, Wille A,
Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J (2000) Selecting SNPs
in
two-stage analysis of disease association data: a model-free approach.
Ann Hum
Genet 64:413-417. [a nested bootstrap approach to SNP selection]
Kim
S, Zhang K,
Sun F (2003) Detecting susceptibility genes in case-control studies
using set
association. BMC Genet 4 Suppl 1:S9 [documents increased power of sum
statistics]
Klein
RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK,
Sangiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J,
Barnstable C, Hoh J (2005) Complement Factor H Polymorphism in
Age-Related Macular Degeneration. Science 308:385-389
Manly
BFJ (2006)
Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman
&
Hall/CRC, New York
Nijenhuis
A, Wilf
HS (1978) Combinatorial algorithms for computers and calculators.
Academic
Press, New York
Ott
J, Hoh J
(2003) Set association analysis of SNP case-control and microarray
data. J
Comput Biol 10:569-574
Papassotiropoulos
A, Wollmer MA, Tsolaki M, Brunner F, Molyva D, Lutjohann D, Nitsch RM,
Hock C
(2005) A cluster of cholesterol-related genes confers susceptibility
for
Alzheimer's disease. J Clin Psychiatry 66:940-947 [an example of the
use of sum statistics]
Press
WH,
Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd
edition:
The art of scientific computing. Cambridge University Press, Cambridge,
UK; New
York
Weir
BS (1996)
Genetic data analysis II : methods for discrete population genetic
data.
Sinauer Associates, Sunderland, Mass.
Wille
A, Hoh J,
Ott J (2003) Sum statistics for the joint detection of multiple disease
loci in
case-control association studies with SNP markers. Genet Epidemiol
25:350-359.
Zee
RY, Hoh J,
Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L,
Zangenberg G,
Fernandez-Ortiz A, Macaya C, Pintor E, Fernandez-Cruz A, Ott J,
Lindpainter K
(2002) Multi-locus interactions predict risk for post-PTCA restenosis:
an
approach to the genetic analysis of common complex disease.
Pharmacogenomics J
2:197-201
Zhang Q, Wang S, Ott J (2008) Combining identity by descent and
association in genetic case-control studies. BMC Genet 9:42