Josephine Hoh
Jurg Ott
Yale University, New Haven
Rockefeller University, New York
josephine.hoh@yale.edu
ott@rockefeller.edu
4 March 2010
http://linkage.rockefeller.edu/ott/

Scan statistics in gene mapping

This manual describes the scanstat program, available below in source and executable code for Windows. The programs are copyrighted and covered by the GNU General Public License, a copy of which is furnished with the distribution package. They come in three versions described further below.

Methodology

The general approach is described in Hoh and Ott (2000). Briefly, consider a number N of marker loci on the genome. At each marker, genotypes are available for two types of observations, for example:
For the j-th marker, a (single-locus) statistic, Xj, is defined that should discriminate between the two types of observations. For example, Xj might be the difference in lod scores between AA and AU sib pairs.

Let YL(t) be a moving sum of single-locus statistics over L consecutive markers, with the first item in the sum being at the t-th marker and the last element at the (t + L – 1)-st marker. For example, for L = 5 and t = 7, YL(t) is the sum of elements Xi for markers 7 through 11. The linear unconditional scan statistic of length L (Glaz and Balakrishnan 1999) is then defined as the maximum of all possible such moving sums over the genome, that is, as

SL = max{YL(1), YL(2), ..., YL(NL + 1)}.

What we want to do is determine scan statistics of various lengths for a given body of observations and find that scan statistic that is most significant.

The scan statistic of length L = 1 is just the largest single-locus statistic over the genome, that is, the statistic conventionally used in genome screens. Scan statistics of lengths L > 1 make use of the most significant marker and markers around it. In this sense, scan statistics are reminiscent of Bayesian analysis. However, we apply scan statistics in a strict likelihood framework.

Permutation tests

When the maximum lod score in a genome screen exceeds a critical value such as 3.3, such a result is considered significant at the α = 0.05 level even though the empirical significance level for the data at hand has not been determined. For our scan statistics, determining empirical significance levels, p, is mathematically intractable, which is why we resort to permutation (randomization) tests (Manly 1999). Under the hypothesis of no linkage or no association, any permutation of the labels for the two outcome types is expected to be equally likely. Because of the potentially very large number of possible permutations, we take a random sample of permutations and approximate empirical significance levels.

Two levels of permutations are performed. Assume that we evaluate scan statistics of lengths L = 1 through L = Lmax, where we might choose Lmax = 10. (1) For each scan statistic of a given length, the associated empirical significance level, pL, is determined. The smallest of these, pmin = min{p1, p2, ..., pLmax}, is then our statistic of interest. (2) Because we deliberately look for the smallest p-value, we must determine the empirical significance level associated with pmin, which is done on the basis of the permutations samples obtained in level (1) (Manly 1999, page 109) (for details see description of our sumstat program).

This approach has been implemented in two computer programs, scanstat and statpval, where scanstat computes scan statistics of given lengths and their associated significance levels and statpval evaluates the significance level of pmin. The computed p-values represent experiment-wise significance levels.

Program versions

Previously, several versions of this program were made available. These versions are now being made more consistent with each other; currently only the scanstatS version is available, which reads data from disk in a fast manner for each iteration and allows the program to handle large numbers of SNPs and individuals. Also, genetic markers must be SNPs.

Input and output for scanstat programs

(1) Input data file

This file, to be named in the parameter file described below, consists of a matrix with M + 1 rows and N columns, where N = number of observations and M = number of markers. Each cell in the matrix contains an observation, for example, a genotype code. The (M + 1)st row must contain codes of 0 and 1 (or 1 and 2) indicating the two types of observations, where the higher number corresponds to the category of interest, for example, 0 = AU sib pair and 1 = AA sib pair, or 0 = controls and 1 = cases (below, the two categories are numbered 1 and 2). Optionally, the N input variables on each row (for each marker) may be followed by the three quantities: (1) Chromosome number, (2) position in base pairs of the marker on the chromosome, and (3) marker name. These items will allow for an easy ordering of markers by their chromosomal positions, for example, with the sort command in Windows. On the last input line, use the following dummy values: (1) 99, (2) 99, (3) xx, or something analogous.

Based on the observations, the scanstat program will compute single-locus statistics for each marker. The program will then compute scan statistics from these single-locus statistics for the observed data and for each permutation sample. The empirical significance level will be estimated by the proportion of permutation samples exhibiting a scan statistic as large or larger than the corresponding scan statistic in the observed data.

Your data may be in plink format, in which case you may use the included program, p2s, to convert the plink files to scanstat format. Using plink and assuming you have files called data.ped and data.map, you first must produce transposed datasets with the command,
  plink --file data --transpose --recode12
Then run the conversion program, p2s, and follow instructions.

(2) Input parameter file

This file lets the user choose different program options such as what test statistic to apply. These parameter files have a slightly different format for the scanstatR and scanstatS programs. Explanations are furnished in each sample parameter file furnished with the programs. Here, the parameter file for the scanstatS program is discussed. This file must have the name inputX.txt, where X is a name chosen by the user. For example, the parameter file may be called inputSample.txt and contain the following input lines:

scanstatS: Sample data (no chromosomes)
n    Subset of variables (y)? (if so, enter  y  mbegin  mend)
8    Statistic code
10   max. length of scan statistic
2000 number of random samples
0    code for "missing". Below: Input file name (no blanks preceding and following)
Sample.dat

Detailed explanations for each of these input lines is given in the inputSample.txt file furnished with the scanstat program. A list of test statistics available is shown in the next paragraph.

(3) Test Statistics

Below, the currently available test statistics and their codes are provided. Observations (phenotypes) are of type 1 (e.g., controls) or type 2 (e.g., cases).

 1. Mean (against value of 0) in observation type 2 (cases)

 2. Difference between mean in type 2 minus mean in type 1 observations. SEE NOTE BELOW.

 5. t-test. For computational efficiency, the factor (n1 + n2)/(n1 + n2 - 2) is taken to be equal to 1.  SEE NOTE BELOW.

 6. Mean divided by standard error in type 2 data (against  value of0)

 7. Differences in allele frequencies and inbreeding coefficient (FP test) between cases and controls. SEE NOTE BELOW.

 8. Difference of homozygote frequencies cases-controls. SEE NOTE BELOW.

 9. Chi-square for 2 x 3 table of genotypes

10. Chi-square for 2 x 2 table of alleles

NOTE (statistic codes 2, 5, 7, and 8): When applicable, absolute differences between the two groups are computed (two-sided test). But when the statistic code is given as a negative number then the difference type 2 minus type 1 observations will be computed. For example, the test statistic with a code of 5 is |t|. A code of -8 will test whether the proportion of homozygotes is larger in cases than in controls, and a code of 8 will compute chi-square for the difference in homozygote frequencies between cases and controls. Also, a test code of -7 will test whether Fcase > Fctrl.

(3) Input seed file

An input file called seed.txt holds the seed for the random number generator. This seed must be a positive integer number. At program termination, this file will be overwritten with the ending seed so that seeds are always updated from one to the next program run. We implemented in our Pascal programs the newest random number generator, ran (int64), highly recommended by Press et al (2007), which has a period of approximately 1057. We are grateful to Dr. Quan Long for helping us understand the C-code of this random number generator.

(4) Output files

This file, for example, called scanstatSample.out, will contain results.
The file, statout.txt, will contain values of permutation samples. This file can potentially be very large. It is required as input to the statpval program (see below) but will not generally be of interest to the user.

Input and output for the statpval program

This program works with two text files:
 1. Input file, statout.txt, as created by the scastat program. It consists of a text file with (s + 1) rows and m columns, where s = number of permutation samples and m = maximum length of scan statistics. The table entries are scan statistics except that the last row contains observed p-values associated with scan statistics of given lengths.
 2. Output file, statpval.out, reporting the resulting significance level associated with pmin.
The user does not need to know how this program works. The following section describes how to run scanstat and statpval programs in a single command.

Running the programs

The two programs, written in Pascal, are available for Windows and could easily be compiled to run in Linux although some output modifications would have to be made. The programs will run in 64 bit versions of Windows but may also be compiled to run as 64 bit programs. In my experience, however, they are no faster in 64 bit mode than in 32 bit mode but they can potentially address more memory on a 64 bit machine. Send me email if you are interested in 64 bit program versions.

Assume that program and data files reside in the folder (directory), c:\scanstat. Open a command window and issue the commands, c:, and cd \scanstat. Assume you want to run the scanstatS program. Copy its executable file with the command, copy scanstatS.exe scanstat.exe. To run this program, type goscan xx, where xx indicates that the parameter file is called inputxx.txt. For example, with the parameter input file called inputSample.txt, the appropriate command is goscan Sample.

Practical considerations

Depending on the size of the data, the total number of possible permutations can be astronomical. It is then important that the number of random samples be large enough to ensure a somewhat good coverage of the space of all permutations. In our experience, 5,000 or 10,000 are good numbers but 20,000 to 50,000 is better.

There is also a question of what maximum length, Lmax, should be chosen. A large value of Lmax tends to increase the overall p-value. On the other hand, too small a value of Lmax may not find the scan statistic with associated smallest p-value. One of our original recommendations was to choose Lmax at the point of L where the p-value starts increasing. However, this no longer seems to be a beneficial approach. For linkage analyses with a marker spacing of 10 cM, Lmax = 10 appears useful. For smaller marker spacings, Lmax may have to be increased.

References

Glaz J, Balakrishnan N (1999) Introduction to scan statistics. In: Glaz J, Balakrishnan N (eds) Scan statistics and applications: Statistics for Industry and Technology. Birkhäuser, Boston, pp 3-26

Hoh J, Ott J (2000) Scan statistics to scan markers for susceptibility genes. Proc Natl Acad Sci U S A 97:9615-9617.

Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall/CRC, New York

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York