Jurg Ott / 20 July 2006
Rockefeller University New York
http://www.genemapping.cn
Email: ott@rockefeller.edu

Documentation for the SIMULATE program

Originally written by Joe Terwilliger, SIMULATE is a computer program to simulate genotypes in family members for a map of linked markers unlinked to a given affection status locus. Output from this program is in SLINK format and is ready for analysis with UNKNOWN, ISIM, LSIM, or MSIM of the SLINK package. The program is written in Free Pascal for Windows and Linux and comes in the following two versions:

(1) SIMULATE is essentially the original version, in which all marker genotypes are generated based on population marker characteristics (allele frequencies, etc.) and recombination fractions between them.

(2) SIMULATE2 assumes that for founder individuals with known genotypes (i.e., they are "typed") the original (observed) genotypes are provided. These genotypes will not be modified (generated) in the course of the simulations. For founders with unknown genotypes ("untyped"), marker genotypes will be generated as in the SIMULATE program. However, prior to running SIMULATE2, users may want to impute such genotypes from the original data so that few founders have unknown genotypes.

The two program versions have somewhat different input requirements. For example, the random number generators differ: SIMULATE requires three seeds, whereas SIMULATE2 requires only one.

LOCUS TYPES

All locus types from the LINKAGE programs can be simulated, with the following exception:
The SIMULATE2 program can handle only marker loci and, optionally, a trait locus as the first locus.

INPUT FILES

This program requires three input files as follows:

1. SIMDATA.DAT - a standard LINKAGE parameter file (datafile), specifying the map of markers (chromosome order = file order). Note that, as in SLINK, the last input line is not relevant. However, the second-to-last input line specifies the recombination fractions between loci.

2. SIMPED.DAT - a post-MAKEPED LINKAGE pedigree file with an additional line at the top specifying the following numbers:
For subsequent input lines (one line per individual), different rules apply to the SIMULATE and SIMULATE2 programs:

SIMULATE. The fields for IDs, sex, and probands are as in standard LINKAGE pedigree files. However, since this program simulates all marker loci, the pedigree file must contain exactly one digit per marker (not a marker genotype or 0 0, as in SLINK; see the exception for SIMULATE2 below), indicating whether that marker is to be simulated or left unknown in that individual. A 0 means the marker should be untyped, and a 1 means it should be typed (simulated). See below for locus order and types of loci. Thus, each marker, not just each individual as in SLINK, may be designated as known or unknown.

SIMULATE2. Depending on whether an individual is a founder (no parents in pedigree) or non-founder, and whether genotypes are known or not, marker genotypes are coded as follows (after an initial optional affection status locus, only marker genotypes are permitted). Note that the coding scheme below is analogous to that in SLINK but different from that in SIMULATE.

3. PROBLEM.DAT - a file containing the following numbers:
At the end of the simulation, the program writes a new seed to PROBLEM.DAT such that SIMULATE/2 may be called repeatedly and each time continue with an updated seed.

OUTPUT FILES

The program creates the following output files:

LIMITATIONS AND COMPILATION

To compile the program with the Free Pascal compiler, a batch file is provided that will invoke several compiler options (low level of optimization) that are relevant for SIMULATE. Use it by giving the command COMPILE SIMULATE in a command window. Sample input files are provided.

In the LINKAGE programs, allele frequencies at a given locus may or may not sum to 1. In SIMULATE, however, they must sum to 1 for proper allocation of alleles in founder individuals. Thus, a check has been implemented, and the program will issue a warning when allele frequencies do not sum to 1.
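
The check described above can be sketched as follows. This is a minimal Python illustration, not the Pascal code used in SIMULATE; the function name check_allele_frequencies and the tolerance of 1e-6 are assumptions made for this example only.

```python
import math


def check_allele_frequencies(freqs, tol=1e-6):
    """Return True if the allele frequencies at one locus sum to 1 within tol.

    The tolerance is an assumption for this sketch; the threshold used by
    SIMULATE itself is not stated in this documentation.
    """
    total = math.fsum(freqs)  # fsum limits floating-point rounding error
    ok = abs(total - 1.0) <= tol
    if not ok:
        # Mirror the documented behaviour: warn, do not abort.
        print(f"WARNING: allele frequencies sum to {total:.6f}, not 1")
    return ok
```

Running such a check on each locus before starting a long series of simulations may help catch frequency lists that were valid for the LINKAGE programs but would trigger the warning in SIMULATE.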

BATCH RUNS

Occasionally a user may want to run SIMULATE repeatedly, each time producing only one replicate, analyzing that replicate, and storing some statistic obtained in the analysis. A typical case: in some pedigrees a sib pair analysis (SPA) was carried out, resulting in a formal p-value, Pobs, obtained from the asymptotic distribution of the particular test statistic in the SPA. The user wants to know the empirical p-value associated with this formal p-value, that is, the probability, Pval, that under no linkage the analysis furnishes a formal p-value as small as or smaller than Pobs.

This probability may be approximated by computer simulation: the analysis as originally done is repeated on n replicates (marker genotypes are generated by SIMULATE, with the same allele frequencies as in the original analysis). Each replicate furnishes a formal p-value. The proportion, Prepl, of replicates with a formal p-value as small as or smaller than Pobs approximates the true empirical significance level, Pval. From Prepl and the number of replicates, n, a confidence interval for Pval may be obtained with the aid of the BINOM program (one of the Linkage Utility programs).
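
As an illustration of the calculation just described, the following Python sketch computes Prepl from a list of replicate p-values and an approximate 95% confidence interval for Pval. The method used by the BINOM program is not described here, so the Wilson score interval below is a substitute chosen for this example, not a reproduction of BINOM. The function names are hypothetical.

```python
import math


def empirical_p_value(replicate_p_values, p_obs):
    """Prepl: proportion of replicates with a formal p-value <= p_obs."""
    n = len(replicate_p_values)
    if n == 0:
        raise ValueError("at least one replicate is required")
    hits = sum(1 for p in replicate_p_values if p <= p_obs)
    return hits / n


def wilson_interval(p_repl, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    This is an assumed stand-in for BINOM, whose method is not given here.
    """
    denom = 1.0 + z * z / n
    centre = (p_repl + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_repl * (1 - p_repl) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against rounding just outside the range.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, with Pobs = 0.03 and replicate p-values 0.01, 0.2, 0.5, and 0.03, two of the four replicates are as small as or smaller than Pobs, so Prepl = 0.5. In practice n would be much larger, and the interval narrows roughly in proportion to 1/sqrt(n).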

The user must write a computer program that extracts, for a given replicate, the relevant quantity or quantities (e.g., p-values) from the output of the SPA and appends them to a file containing the analogous values from previous replicates. Let's call this program READPAR. Repeated simulations may then be carried out as follows. The PROBLEM.DAT file is prepared specifying that 1 replicate should be generated by SIMULATE. After that number 1, on the same line, a number 0 is entered (separated from the 1 by at least one space). Line 1 of the file LASTRUN.TXT gives the desired number of times that SIMULATE should be run. The sequence of runs is started by entering the command RUNS, which invokes the RUNS.BAT batch program. This batch program (furnished with SIMULATE) contains the following lines:

@echo off
echo ***
echo *** This batch program runs SIMULATE and analysis programs many
echo *** times, as often as is specified in the file, LAST.RUN
echo ***
echo *** To suppress screen output generated by programs, run these
echo *** programs with redirection of standard output to NUL.
echo ***
pause
if exist last.fil erase last.fil
:NEW
simulate
rem The line below shows how to run SIMULATE without screen output
rem simulate > nul
rem Here, run analysis of simulated data *)
rem Here, run program to read analysis output and write appropriate table **)
testlast
if exist last.fil goto END
goto NEW
:END

The two places to be changed by the user are marked as follows:
 *) This line should contain the name of the program carrying out the SPA, e.g., SIBPAL.
**) This line should contain the name of the program, e.g. READPAR, that evaluates the output file generated by the SPA program.
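
A READPAR-style program might look like the following Python sketch. Everything specific in it is an assumption: the analysis output is taken to contain a line of the form "p-value = 0.0123", which will differ for real SPA programs, and the function names are hypothetical. The pattern must be adapted to the actual output of the analysis program used.

```python
import re

# Hypothetical pattern: assumes the analysis output contains a line such as
# "p-value = 0.0123". Adapt it to the real output of the SPA program.
P_VALUE_PATTERN = re.compile(
    r"p-value\s*=\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", re.IGNORECASE
)


def extract_p_value(text):
    """Return the first p-value found in the analysis output text."""
    match = P_VALUE_PATTERN.search(text)
    if match is None:
        raise ValueError("no p-value found in analysis output")
    return float(match.group(1))


def readpar(analysis_path, table_path):
    """Append the p-value of the current replicate to the results table."""
    with open(analysis_path) as analysis_output:
        value = extract_p_value(analysis_output.read())
    # Mode "a" appends, so values from earlier replicates are preserved.
    with open(table_path, "a") as table:
        table.write(f"{value}\n")
    return value
```

Called once per pass through the batch loop, readpar adds one line to the table, so that after the last run the table holds one p-value per replicate.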

REFERENCES

Terwilliger JD, Speer M, Ott J (1993) Chromosome-based method for rapid computer simulation in human genetic linkage analysis. Genet Epidemiol 10, 217-224

Terwilliger JD, Ott J (1994) Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore