Josephine Hoh
Jurg Ott
Yale University, New Haven
Institute of Psychology CAS, Beijing
josephine.hoh@yale.edu
jurg.ott@gmail.com
27 March 2011
http://linkage.rockefeller.edu/ott/

S statistic in gene mapping

(Sumstat and Statpval computer programs)

Methodology

The general approach is described in Hoh et al. (2001). Briefly, consider a number of marker loci in the genome. At each marker, genotypes are available for two types of observations (but see below for quantitative trait phenotypes), for example:
In the current implementation, marker loci must be SNPs (two alleles each). The programs come in different versions (click to download):
The sumstatR and sumstatS programs come in one package.
The problem handled here is discussed in Manly (2007) under the heading of Comparison of Sample Mean Vectors. The use of a sum of univariate test statistics (which is what we do here) as a multivariate test statistic for a randomization test seems to have been first proposed by Chung and Fraser (1958).

Permutation tests

For each of our sum statistics, determining the associated empirical significance level is mathematically intractable, which is why we perform permutation (randomization) tests (Manly 2007). Under the hypothesis of no association, any permutation of the labels for the two outcome types (case, control) is expected to be equally likely. Because of the potentially very large number of possible permutations, we take a random sample of permutations and approximate empirical significance levels. Based on the given array of case and control labels in the observed data, random permutations of labels are generated sequentially according to an algorithm by Nijenhuis and Wilf (1978). Specifically, let Sobs be the sum statistic obtained in the observed data for a given number of SNPs. In each permutation sample (same genotypes as observed but labels "affected" and "unaffected" permuted), we compute a sum statistic, Sperm, in the same way as in the observed data. The proportion of permutation samples for which SpermSobs is then an estimate for the empirical significance level associated with Sobs.

A second level of permutations is performed (in the statpval program) as follows. Assume that we evaluate N sum statistics, that is, the N-th sum statistic contains the test statistics summed over N SNPs (N = 15 is generally a suitable number). For each sum statistic with a given number of terms, the associated empirical significance level is determined. We then take the smallest of these as our single overall statistic of interest and must determine its associated (experiment-wise) significance level. This is done on the basis of the permutation samples obtained in level (1) (Manly 2007, section 6.8): Each of these samples in turn is taken as an "observed" (test) dataset and the remaining permutation samples are used to evaluate the significance level of the smallest p-value in the "observed" permutation sample.

Some authors argue that under the null hypothesis the observed data should be included in the randomized data but we don't do this here.

With missing observations, it may happen for one or more variables that a permutation sample contains observations on only one class (either cases or controls). Such ill-conditioned permutation samples will be skipped.

Implementation

The approach described above has been implemented in two computer programs, sumstat and statpval, where sumstat computes sum statistics and their associated p-values, and statpval evaluates the (experiment-wise) significance level associated with the smallest of these p-values. All p-values are evaluated on a genome-wide basis (i.e. for as many input variables as there are in the study). The user may restrict attention to a subset of variables, for example, the markers on a given chromosome.

Input and output for Sumstat programs

This section refers to sumstatR and sumstatS. The sumstat program works with the following two files, both in text (ASCII) format. Note new input format.

1. Data file

This input file (named in the parameter file below) should consist of a matrix with K + 1 rows and N columns, where N = number of observations (e.g., case and control individuals) and K = number of input variables (eg. SNPs). Each cell in the matrix contains a genotype code. The last row contains the code for the type of observation (e.g., 2 = case and 1 = control).
Optionally, the N input observations on each row (for a given marker) may be followed by three quantities: (1) Chromosome number, (2) position in base pairs of the marker on the chromosome, and (3) marker name. These items will allow for an easy ordering of markers by their chromosomal positions, for example, with the sort command in Windows (not really relevant for sumstat, but for scanstat). On the last input line, you may use the following dummy values: (1) 99, (2) 99, (3) xx.

NOTE: Do not use marker names starting with x, &, %, $, followed by digits or else the program will interpret these as numbers to a base other than 10. For example, x11 will be interpreted as a hexadecimal number equal to (decimal) 17. If you must use x plus digits then use xx plus digits, for example, xx11. The program will recognize this as text, which it must do to properly count individuals.

Your data may be in plink format, in which case you may use the included program, p2s, to convert the plink files to sumstat format. Using plink and assuming you have files called data.ped and data.map, you first must produce transposed datasets with the command,
  plink --file data --transpose --recode12
Then run the conversion program, p2s, and follow instructions.

2. Parameter file

The sumstat program reads parameter values interactively from the standard input (i.e. the keyboard). In practice, it is convenient to collect these values in a file named inputxx.txt and call the program with the command, gosum xx (see below). An example for such a parameter file is as follows:

sumstat: Sample data, generated. Heritability 0.50, threshold 2.32
1 0    codes for genotypes and missing
1 1    code for test statistic, lambda
15     max # terms in sum
2000   number of permutation samples
0      list number of test-SNP for interaction. Below: Input file name
sumstatSample.dat

The various lines of input specify the following parameter values (this text is also contained in the inputSample.txt file):

Line 1: Any text up to 250 characters.

Line 2:
  Code for genotypes given in the data:
     -2 = any observed values     (no adjustments will be made)
     -1 = the codes are -1, 0, 1     (genotype codes will
      0 = the codes are 0, 1, 2      (be adjusted to 1, 2, 3
      1 = the codes are 1, 2, 3      (internally.
  Value for "missing"

Line 3:
  Code for single-locus test statistic (for details, see below)
  Lambda value for genomic control (if statistics code=1 or code=12)

Line 4:
  Maximum number of terms in sum

Line 5:
  Number of permutation samples

Line 6:
  List number of test SNP, which will be paired with all other SNPs in turn (statistics 15, 16, 17, 19, 20). For statistic code 16, if SNP number is given as a positive number then the maximum of the 3 chi-squares serves as test statistic, otherwise it's the sum of the 3 chi-squares.

Line 7:
  Name of input file. No other characters before or after the name.

----------------------------------------

Codes on line 4:

 1 = chi-square for 2 x 3 table, case-control versus 3 genotypes
 2 = difference in mean codes, group 2 vs. group 1. SEE NOTE BELOW.
10 = Armitage test for trend
11 = t-test. SEE NOTE BELOW.
12 = chi-square for 2 x 2 table of alleles, case-control versus 2 alleles
14 = chi-square for differences in allele frequencies and F values. SEE NOTE BELOW.
15 = chi-square for interaction between given SNP and all other SNPs. Homogeneity of odds ratio between cases and controls for largest of 4 independent 1-df chi-squares.
16 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP. If SNP number on line 8 is given as a positive number (default) then the maximum of the 3 chi-squares serves as test statistic, otherwise it's the sum of the 3 chi-squares.
17 = chi-square for interaction between given SNP and all other SNPs. Homogeneity of Delta between cases and controls for largest of 4 independent Pearson chi-squares (1 df) of interaction.
18 = Max2 test: maximum chi-square for dominant and recessive genotype action.
19 = Homogeneity of correlation coef (via LR chi-square diff.) in cases and controls
20 = chi-squares for 2 x 3 table of genotypes conditioned on genotypes of test SNP. Test statistic is the max. of -ln(p) for a) sum of 3 chi-squares and b) max. of 3 chi-squares (see code 16). This, this statistic represents the combination of the two versions of test statistic #16.

NOTE (statistic codes 2, 11, and 14)
When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 minus type 1 observations will be computed. For example, the test statistic with a code of 11 is |t|. Also, a test code of -14 will test whether Fcase>Fctrl.

NOTE
Chi-square is generally computed as a likelihood ratio (LR) statistic. If the Pearson chi-square is desired, the appropriate program constant needs to be changed and the program recompiled. Initially, cells in contingency tables contain the value "empty", which currently is set equal to 1/2.

3. Seed for random number generator

An input file called seed.txt holds a seed for the random number generator. This seed must be a positive integer number. At program termination, this file will be overwritten with the ending seed so that seeds are always updated from one to the next program run. We implemented in our Pascal programs the newest random number generator, ran (int64), highly recommended by Press et al (2007), which has a period of approximately 1057. We are grateful to Dr. Quan Long for helping us understand the C-code of this random number generator.

4. Output file

The file, sumstat.out, will contain results. Most of the output is self-explanatory. The main table contains the following items:
The file, statout.txt, will contain values of permutation samples. This file can potentially be very large. It is required as input to the statpval program (see below) but will not generally be of interest to the user.

Input and output for the statpval program

This program works with two text files:
1. Input file, statout.txt, as created by the sumstat program. It consists of a text file with (s + 1) rows and m columns, where s = number of permutation samples and m = maximum number of terms in the sum of test statistics. The table entries are sum statistics except that the last row contains observed p-values associated with sum statistics of given numbers of terms.
2. Output file, statpval.out, reporting the resulting significance level associated with pmin.

Compiling

Source code may be compiled with the Free Pascal compiler, FPC. A batch file, freepc.bat, is provided to compile these programs so they reserved a stack space of 8 MB. For example, issue the command (in a Command Window), freepc sumstat.

Running the programs

Initially you need to decide which of sumstatR and sumstatS you want to use. Assume this is sumstatS; then you need to copy sumstatS.exe to sumstat.exe.

You cannot simply click on program names. Instead you need to open a command (DOS) box, for example, by clicking on Start and then on Command Prompt (in Windows XP). In earlier Windows versions, click on Start, then on Run... and type cmd. It is also a good idea to change one of the standard features in Windows: Make sure you see extensions of known file types on your screen; for example, you should see sumstatS.exe not just sumstatS. In Windows, the batch file, gosum.bat, may be called with the command, gosum xx, in which case sumstat will be run with a parameter file called inputxx.txt (previously prepared by user), and the statpval program will immediately afterwards compute overall significance levels and append them to the output. The resulting output file is sumstatxx.out and intermediate files such as statout.txt will automatically be deleted.

In Unix or Linux, a convenient way of running sumstat and statpval one after the other and in the background is to prepare a file with the following lines, for example:
sumstat <input.txt >out
statpval >>out
cat statpval.out >> sumstat.out
Make this file executable with the command, chmod +x gosum.
Issue the command, gosum&. After the program terminates, the out file will contain information on program progress (you may cat the out file while the program is running) and may be discarded. Results are found in the output file called sumstat.out. Be sure to copy it to some other file (name) so it won't be overwritten the next time you run the programs.

The programs, written in Free Pascal, are available for Windows. They may also be compiled under Linux.

Sample data

    Generated Data

The following data set (sumstatSample.dat) was generated on the computer: 500 SNP markers, 250 cases, 250 control individuals, 10 susceptibility loci at SNPs 1 through 10 obtained via Hartl and Clark's liability threshold model, heritability for all 10 disease loci combined = 0.50, population trait prevalence = 0.01. An input parameter file, inputsample.txt, is provided. The sample data may be run with the command, gosum Sample. The corresponding output file is sumstatSample.out.

sumstatQ program

This program version is patterned after an earlier version of sumstatS and allows for quantitative traits as observed phenotypes. For example, on each individual, hypertension may be observed as a QTL (blood pressure). Multiple such measurements can be handled by the program. For a given QTL and SNP, the following test statistics are available:
If m QTLs are observed on an individual, there must be corresponding m lines of observations appended to the genotypes in the sumstat.dat file. The program will take the largest QTL-specific test statistic for a given SNP as that SNP's test statistic (Manly 2007). No missing observations are allowed for QTLs (but genotypes may be missing as in all program versions).

The parameter input file is slightly different for sumstatQ than for the other sumstat programs. Here is a sample parameter file with briefexplanations. More detailed explanations may be found in the inputSample.txt file that comes with this program.

sumstatQ: Sample data, generated. QTL phenotype is 0/1
0      user-defined subset of variables
1 0    codes for genotypes, missing
19 1   code for test statistic, number of QTLs
0      subsets
15     max # terms in sum
10000  number of permutation samples
0      list number of test-SNP for interaction. Below: Input file name
sumstatSample.dat
0      code for output of observed and randomized test statistics

QC (QCQ) program

This quality control program takes as input a file in sumstat format, that is, a file with rows as SNPs and columns as individuals (QC is the program for sumstat, QCQ is specifically for sumstatQ). The program filters out SNPs according to the following (user-defined) criteria:
It also disregards individuals with too high a proportion of missing SNPs and will write a new file with a name to be determined by the user. If you initially keep your data in plink format it may be preferable to do your QC in that program, which allows for steps of QC not found here, for example, test for relatedness. After QC you may then want to convert your file to sumstat format to take advantage of the features of this program. The QC program is easiest to use with the command, QC <QCparam.txt.

References

Chung JH, Fraser DAS (1958) Randomization tests for a multivariate two sample problem. Journal of  the American Statistical Association 53:729-735

de Quervain DJ, Poirier R, Wollmer MA, Grimaldi LM, Tsolaki M, Streffer JR, Hock C, Nitsch RM, Mohajeri MH, Papassotiropoulos A (2004) Glucocorticoid-related genetic susceptibility for Alzheimer's disease. Hum Mol Genet 13:47-52 [an example of the use of sum statistics]

Edgington ES, Onghena P (2007) Randomization Tests, 4th edition. Chapman & Hall/CRC, Boca Raton

Hoh J, Ott J (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4:701-709.

Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11:2115-2119. [main reference for the method described here]

Hoh J, Wille A, Zee R, Cheng S, Reynolds R, Lindpaintner K, Ott J (2000) Selecting SNPs in two-stage analysis of disease association data: a model-free approach. Ann Hum Genet 64:413-417. [a nested bootstrap approach to SNP selection]

Kim S, Zhang K, Sun F (2003) Detecting susceptibility genes in case-control studies using set association. BMC Genet 4 Suppl 1:S9 [documents increased power of sum statistics]

Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308:385-389

Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York

Nijenhuis A, Wilf HS (1978) Combinatorial algorithms for computers and calculators. Academic Press, New York

Ott J, Hoh J (2003) Set association analysis of SNP case-control and microarray data. J Comput Biol 10:569-574

Papassotiropoulos A, Wollmer MA, Tsolaki M, Brunner F, Molyva D, Lutjohann D, Nitsch RM, Hock C (2005) A cluster of cholesterol-related genes confers susceptibility for Alzheimer's disease. J Clin Psychiatry 66:940-947 [an example of the use of sum statistics]

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York

Weir BS (1996) Genetic data analysis II : methods for discrete population genetic data. Sinauer Associates, Sunderland, Mass.

Wille A, Hoh J, Ott J (2003) Sum statistics for the joint detection of multiple disease loci in case-control association studies with SNP markers. Genet Epidemiol 25:350-359.

Zee RY, Hoh J, Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L, Zangenberg G, Fernandez-Ortiz A, Macaya C, Pintor E, Fernandez-Cruz A, Ott J, Lindpainter K (2002) Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease. Pharmacogenomics J 2:197-201

Zhang Q, Wang S, Ott J (2008) Combining identity by descent and association in genetic case-control studies. BMC Genet 9:42