Jurg Ott / 20 July 2006
Rockefeller University New York
http://www.genemapping.cn
Email: ott@rockefeller.edu
Documentation for the SIMULATE program
Originally written by Joe Terwilliger, SIMULATE is a computer program
to simulate genotypes in family members for a map of linked markers
unlinked to a given affection status locus. Output from this program is
in SLINK format and is ready for analysis with UNKNOWN, ISIM, LSIM, or
MSIM of the SLINK package. The program is written in Free Pascal for Windows and
Linux and comes in the following two versions:
(1) SIMULATE is essentially the original version, in which all marker
genotypes are generated based on population marker characteristics
(allele frequencies, etc.) and recombination fractions between them.
(2) SIMULATE2 assumes that for founder individuals with known genotypes
(i.e., they are "typed") the original (observed) genotypes are provided.
These genotypes will not be modified (generated) in the course of the
simulations. For founders with unknown genotypes ("untyped"), marker
genotypes will be generated as in the SIMULATE program. However, prior to
running SIMULATE2, users may want to impute such genotypes from the
original data so that few founders have unknown genotypes. The two
program versions have somewhat different input requirements. For
example, the random number generator is different -- SIMULATE requires
3 seeds and SIMULATE2 requires only one.
LOCUS TYPES
All locus types from the LINKAGE programs can be simulated with the
following exceptions:
- At affection status loci to be simulated, only one liability
class is permitted (at affection status loci that are not to be
simulated, any number of liability classes is permitted).
- At quantitative trait loci, only one trait per locus is permitted.
- If the first locus is an affection status type of locus (common
situation), the program will expect an affection status phenotype in
the pedigree file and simulate the marker map independent of this
affection status locus. However, if the first locus is not affection
status, then it will be simulated as well and the program will just
simulate a linked map of markers with no disease present.
The SIMULATE2 program can only handle marker
loci and optionally a trait locus as the first locus.
INPUT FILES
This program requires three input files as follows:
1. SIMDATA.DAT: a standard LINKAGE parameter file (datafile),
specifying the map of markers (chromosome order = file order). Note
that, as in SLINK, the last input line is not relevant. However, the
input line before last specifies recombination fractions between loci.
2. SIMPED.DAT: a post-MAKEPED LINKAGE pedigree file with an additional
line at the top specifying the following numbers:
- Number of pedigrees
- Number of individuals in pedigree 1, 2, etc. Terminate this input
line when you do not want to use the option described in the next
bullet. There must be no trailing characters after this list of numbers
except when the optional numbers below are given. The number of
individuals specified must be identical with the number of input lines
per pedigree in the LINKAGE pedigree file, that is, a duplicated
individual in a loop is counted as two individuals. In the pedigree
file, the individuals must be listed sequentially (in the order of
increasing ID number). When a loop is broken by the MAKEPED program,
individuals are not usually in sequential order, but they must be
brought into sequential order before the pedigree file is acceptable to
the SIMULATE program.
- (optional, only for SIMULATE, not for SIMULATE2) A sequence of
0's and 1's, one such number for each locus, where a 1 indicates that
all founder individuals will be assigned genotypes with consecutive
allele numbers, 1/2, 3/4, 5/6, etc. (thus making all founders
heterozygous), and a 0 indicates that founders' genotypes will be
simulated as usual. Warning: if this option is chosen, sequential
numbers will be assigned to all founders at a given locus, irrespective
of the actual numbers of alleles at this locus. Thus, it is the
responsibility of the user to ensure that no allele numbers will be
assigned that exceed the number of alleles at a given locus. Do not use
this option unless you have a specific reason for doing so.
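The rules for the first line of SIMPED.DAT can be sketched in Python as
follows. This is only an illustration, not part of the SIMULATE package:
the function name simped_header is hypothetical, and the sketch assumes
whitespace-separated pedigree lines with the pedigree ID in column 1 and
an integer individual ID in column 2.

```python
def simped_header(pedigree_lines, locus_flags=None):
    """Build the extra first line of SIMPED.DAT from post-MAKEPED lines.

    Assumption: column 1 is the pedigree ID, column 2 an integer individual ID.
    locus_flags is the optional 0/1 sequence (SIMULATE only), one per locus.
    """
    counts = {}   # pedigree ID -> number of input lines (insertion-ordered)
    last_id = {}  # pedigree ID -> last individual ID seen
    for line in pedigree_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ped, ind = fields[0], int(fields[1])
        # Individuals must be listed in order of increasing ID number.
        if ped in last_id and ind < last_id[ped]:
            raise ValueError(f"pedigree {ped}: individual {ind} is out of order")
        last_id[ped] = ind
        # Every input line counts, so a duplicated loop individual counts twice.
        counts[ped] = counts.get(ped, 0) + 1
    numbers = [len(counts), *counts.values()]
    if locus_flags is not None:
        numbers.extend(locus_flags)
    return " ".join(str(n) for n in numbers)
```

For two pedigrees with 3 and 2 input lines, the function returns the
line "2 3 2", with no trailing characters.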
For subsequent input lines (one line per individual), different rules
apply to the SIMULATE and SIMULATE2 programs:
SIMULATE. The fields for IDs, sex, and probands are as in standard
LINKAGE pedigree files. However, since this program simulates all marker
loci, you must have only one digit per marker in the pedigree file (no
marker genotypes or 0 0 codes as in SLINK; see the exception for
SIMULATE2 below) to tell whether that marker is to be simulated or left
unknown in that individual. A 0 means that the marker should be untyped,
and a 1 means it should be typed (simulated). See LOCUS TYPES above for
locus order and types of loci. That is, each marker may be designated as
being known or unknown, not only each individual as in SLINK.
SIMULATE2. Depending on whether an individual is a founder (no parents
in the pedigree) or a non-founder, and whether genotypes are known or not,
marker genotypes are coded as follows (after an initial optional
affection status locus, only marker genotypes are permitted). Note that
the coding scheme below is analogous to that in SLINK but different
from that in SIMULATE.
- Founder, known (observed) genotypes: The actual genotypes (two
alleles) must be provided. Missing genotypes at some markers are
indicated by 0 0.
- Founder, unknown genotypes: Enter 0 0 for each unknown genotype.
Alternatively (when all genotypes are unknown), the first (and only)
genotype code may be -1. Any numbers following an initial -1 will be
ignored.
- Non-founder, known (simulated) genotypes: A code of 1 1 for each known genotype.
Alternatively (when all genotypes are known), the first (and only)
genotype code may be -2.
- Non-founder, unknown genotypes: A code of 0 0 for each unknown genotype.
Alternatively (when all genotypes are unknown), the first (and only)
genotype code may be -1.
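The SIMULATE2 coding scheme above can be summarized in a short Python
sketch. The function names founder_fields and nonfounder_fields are
hypothetical helpers for illustration, not part of the package. The
sketch uses the optional shorthand codes (-1, -2) whenever they apply,
which is a choice made here; writing out 0 0 or 1 1 for every marker is
equally valid according to the rules above.

```python
def founder_fields(genotypes):
    """Marker columns for a founder.

    genotypes: one entry per marker, either an (allele1, allele2) tuple for
    an observed genotype or None for an unknown genotype.
    """
    if all(g is None for g in genotypes):
        return "-1"  # shorthand: all genotypes unknown
    # Observed genotypes are written as is; unknown ones become 0 0.
    return " ".join("0 0" if g is None else f"{g[0]} {g[1]}" for g in genotypes)


def nonfounder_fields(typed):
    """Marker columns for a non-founder.

    typed: one boolean per marker; True means the genotype is to be
    simulated (known), False means it is left unknown.
    """
    if all(typed):
        return "-2"  # shorthand: all genotypes known (simulated)
    if not any(typed):
        return "-1"  # shorthand: all genotypes unknown
    return " ".join("1 1" if t else "0 0" for t in typed)
```

For example, a founder typed at markers 1 and 3 but not at marker 2 would
be coded with the actual alleles at markers 1 and 3 and 0 0 at marker 2.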
3. PROBLEM.DAT: a file containing the following numbers:
- 3 integer numbers between 1 and 30323 as seeds for the random
number generator (for SIMULATE; SIMULATE2 requires only one seed)
- The desired number of replicates of your pedigree set. There must
be no trailing characters after this number except when the optional
item mentioned below is furnished.
- (optional) The number of runs carried out with SIMULATE or
SIMULATE2 in a batch application (see Batch Runs below).
At the end of the simulation, the program writes a new seed to
PROBLEM.DAT such that SIMULATE/2 may be called repeatedly and each time
continue with an updated seed.
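As an illustration of the PROBLEM.DAT contents for SIMULATE, the Python
sketch below validates the numbers and assembles the file text. The
function name problem_dat_text is hypothetical. The layout (seeds on the
first line, number of replicates and optional run count on the second)
is an assumption; the documentation only states that the run count
follows the number of replicates on the same line.

```python
def problem_dat_text(seeds, replicates, runs=None):
    """Return PROBLEM.DAT contents for SIMULATE (3 seeds).

    Assumed layout: line 1 holds the seeds, line 2 the number of
    replicates, optionally followed by the batch run count.
    """
    if len(seeds) != 3:
        raise ValueError("SIMULATE requires exactly 3 seeds")
    for seed in seeds:
        if not 1 <= seed <= 30323:
            raise ValueError(f"seed {seed} is outside the range 1..30323")
    if replicates < 1:
        raise ValueError("the number of replicates must be at least 1")
    second_line = str(replicates)
    if runs is not None:
        # The optional run count is separated by at least one space.
        second_line += f" {runs}"
    return " ".join(str(seed) for seed in seeds) + "\n" + second_line + "\n"
```

Writing the returned string to PROBLEM.DAT with seeds 1, 2, 3, one
replicate, and a run count of 0 gives a file suitable for a batch
application under the layout assumed here.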
OUTPUT FILES
The program creates the following output files:
- PEDFILE.DAT for analysis by the companion programs to SLINK
- SIMOUT.DAT summarizes the parameters used
- PROBLEM.DAT is rewritten and contains an updated seed and,
optionally, the number of runs incremented by 1 (see Batch Runs below).
LIMITATIONS AND COMPILATION
To compile the program with the Free Pascal compiler, a batch file is
provided that will invoke several compiler options (low level of
optimization) that are relevant for SIMULATE. Use it by giving the
command COMPILE SIMULATE in a command window. Sample input files are
provided.
In the LINKAGE programs, allele frequencies at a given locus may or may
not sum to 1. In SIMULATE, however, they must sum to 1 for proper
allocation of alleles in founder individuals. Thus, a check has been
implemented, and the program will issue a warning when allele
frequencies do not sum to 1.
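Users may wish to check allele frequencies before running the program.
The Python sketch below shows one way to do this. It is not the check
implemented in SIMULATE: the function names and the tolerance of 1e-6
are assumptions made for illustration.

```python
import math


def frequencies_sum_to_one(freqs, tol=1e-6):
    """Return True if the allele frequencies sum to 1 within tol.

    The tolerance is an assumption; SIMULATE's own threshold is not
    documented here.
    """
    return math.isclose(math.fsum(freqs), 1.0, rel_tol=0.0, abs_tol=tol)


def rescale_frequencies(freqs):
    """Rescale allele frequencies proportionally so that they sum to 1."""
    total = math.fsum(freqs)
    if total <= 0:
        raise ValueError("allele frequencies must have a positive sum")
    return [f / total for f in freqs]
```

Whether proportional rescaling is appropriate depends on why the
frequencies did not sum to 1 in the first place, for example rounding
versus an omitted allele.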
BATCH RUNS
Occasionally a user may want to run SIMULATE repeatedly, each time
producing only one replicate of simulations, and each time analyzing
that replicate and storing some statistic obtained in the analysis. A
typical case would be, for example, that in some pedigrees a sib pair
analysis (SPA) was carried out resulting in a formal p-value, Pobs,
that was obtained according to the asymptotic distribution of the
particular test statistic in the SPA. The user wants to know the
empirical p-value associated with the formal p-value, that is, the
probability, Pval, that under no linkage the analysis
furnishes a formal p-value as small as or smaller than Pobs.
This probability may be approximated by computer simulation in that
replicates of the analysis as originally done are carried out n times (marker genotypes are
generated by SIMULATE, with allele frequencies being the same as in the
original analysis). Each replicate then furnishes a formal p-value. The
proportion, Prepl, of those replicates with a formal p-value
as small as or smaller than Pobs is an approximation to the
true empirical significance level, Pval. With the
proportion, Prepl, and the number, n, of replicates, a confidence
interval for Pval may be obtained with the aid of the BINOM
program (one of the Linkage Utility programs).
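The calculation of Prepl and an approximate confidence interval can be
sketched in Python as follows. Note that this sketch uses the Wilson
score interval as a stand-in; it is not the method of the BINOM program,
which may compute a different (e.g., exact) interval. The function name
empirical_p_value is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def empirical_p_value(replicate_p_values, p_obs, confidence=0.95):
    """Return (Prepl, lower, upper) for the empirical significance level.

    Prepl is the proportion of replicates whose formal p-value is as small
    as or smaller than p_obs. The interval is a Wilson score interval,
    used here as an assumption in place of the BINOM program.
    """
    n = len(replicate_p_values)
    if n == 0:
        raise ValueError("at least one replicate is required")
    k = sum(1 for p in replicate_p_values if p <= p_obs)
    p_repl = k / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95%
    denom = 1 + z * z / n
    center = (p_repl + z * z / (2 * n)) / denom
    half = z * sqrt(p_repl * (1 - p_repl) / n + z * z / (4 * n * n)) / denom
    return p_repl, max(0.0, center - half), min(1.0, center + half)
```

With small numbers of replicates the interval is wide, so n should be
chosen large enough to resolve p-values of the size of Pobs.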
The user must write a computer program to extract, for a given
replicate, the relevant quantity/ies (e.g., p-values) from the output
of the SPA and append those values to a file, which contains analogous
values from previous replicates. Let's call this program READPAR.
Repeated simulations may then be carried out as follows: The
PROBLEM.DAT file is prepared specifying that 1 replicate should be
generated by SIMULATE. After that number 1, on the same line, a number
0 is entered (separated from the 1 by at least one space). In the file
called LASTRUN.TXT, on line 1, the desired number of times that
SIMULATE should be run is indicated. The sequence of runs is started by
entering the command RUNS, which invokes the RUNS.BAT batch program.
This batch program (furnished with SIMULATE) contains the following
lines:
@echo off
echo ***
echo *** This batch program runs SIMULATE and analysis programs many
echo *** times, as often as is specified in the file, LAST.RUN
echo ***
echo *** To suppress screen output generated by programs, run these
echo *** programs with redirection of standard output to NUL.
echo ***
pause
if exist last.fil erase last.fil
:NEW
simulate
rem The line below shows how to run SIMULATE without screen output
rem simulate > nul
rem Here, run analysis of simulated data *)
rem Here, run program to read analysis output and write appropriate table **)
testlast
if exist last.fil goto END
goto NEW
:END
The two places to be changed by the user are marked as follows:
*) This line should contain the name of the program carrying out
the SPA, e.g., SIBPAL.
**) This line should contain the name of the program, e.g. READPAR,
that evaluates the output file generated by the SPA program.
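A READPAR-type program might look like the Python sketch below. It is
purely illustrative: the output format of the SPA program is not
described here, so the pattern "p-value = <number>" is a hypothetical
format that must be adapted to the actual analysis output. The function
names are likewise hypothetical.

```python
import re

# Hypothetical report format: lines containing "p-value = <number>".
P_VALUE = re.compile(
    r"p-value\s*=\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", re.IGNORECASE
)


def extract_p_values(report_text):
    """Return all p-values found in the analysis output of one replicate."""
    return [float(match) for match in P_VALUE.findall(report_text)]


def append_replicate(table_path, values):
    """Append one line with this replicate's values to the results table."""
    with open(table_path, "a") as table:
        table.write(" ".join(f"{value:g}" for value in values) + "\n")
```

Opening the table in append mode keeps the values from previous
replicates, so that after all runs the file holds one line per replicate.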
REFERENCES
Terwilliger JD, Speer M, Ott J (1993) Chromosome-based method for rapid
computer simulation in human genetic linkage analysis. Genet Epidemiol
10, 217-224
Terwilliger JD, Ott J (1994) Handbook of Human Genetic Linkage. Johns
Hopkins University Press, Baltimore