This manual describes the scanstat
program, available below in source and executable code
for Windows. The programs are copyrighted and covered by the GNU
General Public License, a copy of which is furnished with the
distribution package. They come in three versions described further
below.
Methodology
The general approach is described in Hoh and Ott (2000). Briefly,
consider a number N
of marker
loci on the genome. At each marker, genotypes are available for two
types of observations, for example:
Two types of sibling pairs, affected-affected (AA) and
affected-unaffected (AU)
Case and control individuals
For the j-th
marker, a
(single-locus) statistic, Xj, is defined that
should
discriminate between the two types of observations. For example, Xj
might be the difference in lod scores between AA and AU sib pairs.
Let YL(t) be a moving sum of single-locus
statistics over L
consecutive markers, with the
first item in the sum being at the t-th
marker and the last element at the (t
+ L –
1)-st marker. For
example, for L
= 5 and t = 7,
YL(t) is the sum of elements Xi for markers 7
through 11. The
linear unconditional scan statistic of length L (Glaz and
Balakrishnan 1999) is
then defined as the maximum of all possible such moving sums over the
genome, that is, as
SL = max{YL(1), YL(2),
..., YL(N
– L + 1)}.
What we want to do is determine scan statistics of various lengths for
a given body of observations and find that scan statistic that is most
significant.
The scan statistic of length L
= 1 is just the largest single-locus statistic over the genome, that
is, the statistic conventionally used in genome screens. Scan
statistics of lengths L
>
1 make use of the most significant marker and markers around it. In
this sense, scan statistics are reminiscent of Bayesian analysis.
However, we apply scan statistics in a strict likelihood framework.
Permutation tests
When the maximum lod score in a genome screen exceeds a critical value
such as 3.3, such a result is considered significant at the α = 0.05
level even though the empirical significance level for the data at hand
has not been determined. For our scan statistics, determining empirical
significance levels, p,
is
mathematically intractable, which is why we resort to permutation
(randomization) tests (Manly 1999). Under the hypothesis of no linkage
or no association, any permutation of the labels for the two outcome
types is expected to be equally likely. Because of the potentially very
large number of possible permutations, we take a random sample of
permutations and approximate empirical significance levels.
Two levels of permutations are performed. Assume that we evaluate scan
statistics of lengths L
= 1
through L
= Lmax,
where we might
choose Lmax
= 10.
(1) For each scan statistic of a given length, the associated empirical
significance level, pL,
is determined. The smallest of these, pmin
= min{p1,
p2,
..., pLmax},
is then our
statistic of
interest. (2) Because we deliberately look for the smallest p-value, we must
determine the
empirical significance level associated with pmin,
which is
done on the basis of the permutations samples obtained in level (1)
(Manly 1999, page 109) (for
details see description
of our sumstat
program).
This approach has been implemented in two
computer
programs, scanstat and
statpval,
where scanstat
computes scan statistics
of given lengths and their associated significance levels and statpval evaluates
the significance
level of pmin.
The
computed p-values
represent
experiment-wise significance levels.
Program versions
Previously, several versions of this program
were made available. These versions are now being made more consistent
with each other; currently only the scanstatS
version is available, which reads data from disk in a fast manner for
each iteration and allows the program to handle large numbers of SNPs
and individuals. Also, genetic markers must be SNPs.
Input and output for scanstat
programs
(1) Input data file
This file, to be named in the parameter file described below,
consists of a matrix with M
+ 1 rows and N
columns,
where N =
number of
observations and M
=
number of markers. Each cell in the matrix contains an observation, for
example, a genotype code. The (M
+ 1)st
row must
contain codes of 0 and 1 (or 1 and 2) indicating the two types of
observations, where the higher number corresponds to the category of
interest, for example, 0 = AU sib pair and 1 = AA sib pair, or 0 =
controls and 1 = cases (below, the
two categories are numbered 1 and 2). Optionally, the N input variables
on each row (for each marker) may be followed by the three
quantities: (1) Chromosome number, (2) position in base pairs of the
marker on the chromosome, and (3) marker name. These items will allow
for an easy ordering of markers by their chromosomal positions, for
example, with the sort
command in Windows. On the last input line, use the following dummy
values: (1) 99, (2) 99, (3) xx, or something analogous.
Based on the observations, the scanstat
program
will compute
single-locus statistics for each marker. The program will then compute
scan
statistics from these single-locus statistics for the observed data and
for each permutation sample. The
empirical significance level will be estimated by the proportion of
permutation samples exhibiting a scan statistic as large or larger than
the corresponding scan statistic in the observed data.
Your data may be in plink
format, in which case you may use the included program, p2s, to convert the
plink files
to scanstat
format. Using plink
and assuming you have files called data.ped
and data.map,
you first must produce transposed datasets with the command, plink
--file data --transpose --recode12
Then run the conversion program, p2s,
and follow instructions.
(2) Input parameter file
This file lets the user choose different program options
such as what
test statistic to apply. These parameter files have a slightly
different format for the scanstatR
and scanstatS
programs. Explanations are furnished in each sample parameter file
furnished with the programs. Here, the parameter file for the scanstatS program
is discussed. This file must have the name inputX.txt, where X is a name chosen
by the user. For example, the parameter file may be called inputSample.txt and
contain the following input lines:
scanstatS: Sample
data (no chromosomes) n
Subset of variables (y)? (if so, enter y
mbegin mend) 8
Statistic code 10 max.
length of scan statistic 2000 number
of random samples 0
code for "missing". Below: Input file name (no blanks preceding and
following) Sample.dat
Detailed explanations for each of these input lines is given in the inputSample.txt
file furnished with the scanstat
program. A list of test statistics available is shown in the next
paragraph.
(3) Test Statistics
Below, the currently available test statistics and their codes are
provided. Observations (phenotypes) are of type 1 (e.g., controls) or
type 2 (e.g., cases).
1. Mean (against value of 0) in observation type 2 (cases)
2. Difference between mean in type 2 minus mean in type 1 observations. SEE NOTE BELOW.
5.
t-test. For computational efficiency, the factor (n1 + n2)/(n1 + n2 -
2) is taken to be equal to 1. SEE NOTE BELOW.
6. Mean divided by standard error in type 2 data (against value of0)
7. Differences in allele frequencies and inbreeding coefficient (FP test) between cases and controls. SEE NOTE BELOW.
8. Difference of homozygote frequencies cases-controls. SEE NOTE BELOW.
9. Chi-square for 2 x 3 table of genotypes
10. Chi-square for 2 x 2 table of alleles
NOTE
(statistic codes 2, 5, 7, and 8): When applicable, absolute differences
between the two groups are computed (two-sided test). But when the
statistic code is given as a negative number then the difference type 2
minus type 1 observations will be computed. For example, the test
statistic with a code of 5 is |t|. A code of -8 will test whether the
proportion of homozygotes is larger in cases than in controls, and a
code of 8 will compute chi-square for the difference in homozygote
frequencies between cases and controls. Also, a test code of -7 will
test whether Fcase > Fctrl.
(3) Input seed file
An input file called seed.txt
holds the seed for the
random
number
generator. This seed must be a positive integer number. At program
termination, this file will be overwritten with the ending seed so
that seeds are always updated from one to the next program run. We
implemented in our Pascal programs the newest
random
number generator, ran (int64), highly recommended
by Press et
al
(2007), which has a period of approximately 1057.
We are
grateful to Dr. Quan Long for helping us understand the C-code of this
random number generator.
(4) Output files
This file, for example, called scanstatSample.out,
will contain results.
The file, statout.txt,
will
contain values of permutation samples. This file can potentially be
very large. It is required as input to the statpval program
(see below) but
will not
generally be of interest to the user.
Input and output for the statpval program
This program works with two text files:
1. Input file, statout.txt,
as
created by the scastat program. It consists of a text file with (s
+ 1) rows and m columns, where s
= number of
permutation samples and m
= maximum length of scan statistics. The table
entries are scan statistics except that the last row contains observed
p-values associated with scan statistics of given
lengths.
2. Output file, statpval.out,
reporting the resulting significance
level associated with pmin.
The user does not need to know how this program works. The following
section describes how to run scanstat
and statpval
programs in a single command.
Running the programs
The two programs, written in Pascal,
are available for Windows and could easily be compiled to run in Linux
although some output modifications would have to be made. The programs
will run in 64 bit versions of
Windows but may also be compiled to run as 64 bit programs. In my
experience, however, they are no faster in 64 bit mode than in 32 bit
mode but they can potentially address more memory on a 64 bit machine.
Send me email if you are interested in 64 bit program versions.
Assume that program and data files reside in the folder (directory), c:\scanstat.
Open a
command window and issue the commands, c:, and cd \scanstat.
Assume you want to run the scanstatS
program. Copy its executable file with the command, copy scanstatS.exe
scanstat.exe. To run this
program, type
goscan xx,
where xx indicates that the parameter file is called inputxx.txt.
For example, with
the parameter input file called inputSample.txt,
the appropriate command is
goscan Sample.
Practical considerations
Depending on the size of the data, the total number of possible
permutations can be astronomical. It is then important that the number
of random samples be large enough to ensure a somewhat good coverage of
the space of all permutations. In our experience, 5,000 or 10,000 are
good
numbers but 20,000 to 50,000 is better.
There is also a question of what maximum length, Lmax,
should be chosen. A large value of Lmax tends to
increase
the overall p-value.
On the
other hand, too small a value of Lmax may not
find the scan
statistic with associated smallest p-value.
One of our original recommendations was to choose Lmax
at
the point of L
where the p-value
starts increasing. However,
this no longer seems to be a beneficial approach. For linkage analyses
with a marker spacing of 10 cM, Lmax = 10
appears useful.
For smaller marker spacings, Lmax may have to be
increased.
References
Glaz J, Balakrishnan N (1999) Introduction to
scan
statistics. In: Glaz J, Balakrishnan N (eds) Scan statistics and
applications: Statistics for Industry and Technology. Birkhäuser,
Boston, pp 3-26
Hoh J, Ott J (2000) Scan statistics to scan markers for susceptibility
genes. Proc Natl Acad Sci U S A 97:9615-9617.
Manly BFJ (2006) Randomization, Bootstrap and Monte Carlo Methods in
Biology. Chapman & Hall/CRC, New York
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical
recipes 3rd edition: The art of scientific computing. Cambridge
University Press, Cambridge, UK; New York