25 August 2006
Jurg Ott
ott@rockefeller.edu
http://www.genemapping.cn
Manual for Simnuclear program
Version 1.01 (test version) 13 May 1999
Version 1.02, 16 Feb 2000 (bug eliminated: at the start
of the genome, inheritance of grandparental alleles to offspring was
not random and could lead to wrong evidence for linkage)
Version for Windows, Simnuc program:
Long file names not supported by
the gcc compiler (djgpp)
Program developed and written by Harald
Göring
1. Description of program
The program simulates a data set consisting of nuclear pedigrees
ascertained (generated) in such a way that some family members are
affected by a multi-locus trait. The genetic model underlying the trait
is an oligogenic threshold model. As outlined below, the user can vary
the proportion of the overall variance due to each of the underlying
susceptibility loci.
The data set is output in a pre-MAKEPED LINKAGE-format
pedigree file
(with trait phenotypes and marker genotypes for autosomal data). The
user has control over many input parameters such as size of data set,
pedigree size distribution, trait characteristics, ascertainment
scheme, number and size of chromosomes, QTL characteristics, marker
characteristics, and some output options. Input is via the command line
(but can be redirected to a file) and is not checked for input errors.
The initial seed for the random number generator is in file "seed.txt",
which at program termination
will be updated with the final seed. Program output is in an output
file
("out.txt").
2. Algorithm of program
Trait model: A random pedigree is simulated; it is then ascertained or
discarded. The program assumes a quantitative trait, which is taken to
be ~N(0,1) (with a small number of QTLs, the resulting quantitative
phenotype will not be very "normal"). The variance is contributed by
diallelic QTLs (genes underlying the trait) and individual-specific
(VarE.non-shared) and family-specific (VarE.shared)
environmental
factors. The proportion of the variance due to genetic factors, the
heritability, is h2 = ∑i(VarGi), where
VarGi is the
proportion of
variance due to the i-th QTL. Thus, the overall model for the trait
variance is Vartotal = h2 + VarE.shared
+ VarE.non-shared ≡
1. For more
details on such a model, see Hartl DL (1987) A primer of population
genetics, 2nd ed. (paperback) Sinauer Associates, p. 232.
Memory usage: The program may require much memory. All marker genotypes
for all pedigree members in all pedigrees in the data set are stored
simultaneously. For example, for a genome of size 3000 cM, marker
spacing of 1 cM, up to 11 children per family, and a data set
consisting of 100 families, the program stores 3000 markers × 13
individuals × 100 pedigrees × 2 alleles/genotype >
6,000,000 alleles! The memory requirements could be reduced drastically
by just storing and outputting one pedigree at a time.
Program steps
- Simulation of the number of children, based on family
size distribution.
- Trait-locus alleles are simulated randomly for parents,
i.e. absence of LD and presence of HWE is assumed.
- The genome-wide cross-over point processes for all
meioses are simulated. For details, see Terwilliger JD, Speer M, Ott J
(1993) Chromosome-based methods for rapid computer simulation in
human
genetic linkage analysis, Genet Epidemiol 10:217-224. Absence of
interference is assumed: On each gamete, crossovers are placed
according to the Poisson law, that is, with an average distance between
adjacent crossovers of 100 cM, and chromosomes are formed according to
their genetic lengths. — The program internally treats the genome as
one giant entity. The independent segregation of chromosomes is
superimposed on the cross-over point process. For this reason, a small
region at the end of each chromosome (constant empty, set to 1
cM or
less) has no markers and no cross-overs. A switch in parental origin
between 2 consecutive chromosomes (from end of chromosome A to start of
chromosome B) is stored as a "cross-over" in the middle of this empty
region at the end of each chromosome.
- Transmission of trait-locus alleles from parents to
offspring. This is fully determined by the cross-over point process.
- Simulation of family-specific (shared) and
person-specific (non-shared) environmental contributions.
- Calculation of quantitative and qualitative trait
phenotypes for all pedigree members.
- Pedigree ascertainment. If not ascertained, start over
with step 1.
- Simulation of marker alleles in parents. This is done
assuming absence of linkage disequilibrium (between markers and between
markers and QTLs) and presence of HWE.
- Transmission of marker alleles from parents to
offspring. This is fully determined by the cross-over point process.
- Output, including simulations whether a parent is phenotyped
and genotyped and whether, for example, a particular PCR reaction (one
marker in one person) did not work.
3. Input
All input (except for initial seed) is via the command line, but can be
redirected to a file (see section 6, below). Note that the input is not
checked for errors. Notation used for ranges: [a, b] includes the
boundary values; (a, b) excludes them. The user is prompted for the
following input parameters:
- Number of chromosomes
- Lengths of all chromosomes in cM (as many numbers as
there are chromosomes)
- Number of QTLs
- In turn for each QTL:
- chromosome
number where the QTL is located
- position on
chromosome in cM. Must be less than the chromosome length. Note: The
positions of QTLs must be in increasing order in the genome, where the
genome is assumed to start at position 0 on chromosome 1. QTLs may be
linked. Example: Assume two chromosomes of length 100 cM each, and each
containing a trait locus (QTL) at map position 50 cM. Because the first
marker is at position 0, it is the 6th (not 5th)
marker locus that is
at position 50 cM on each chromosome.
- proportion of
variance explained by QTL (VarGi); real (0, 1). It is
necessary that
∑i(Gi) = h2 < 1.
- P(allele
increasing quantitative phenotype); real in (0, 1)
- dominance;
real [ −1, 1]. The three genotypes (++, t+, tt) of each QTL are
associated with respective means of a
− e, a + ed, a + e, where d is
the dominance. For example, d
= 1 indicates a dominant QTL, d
= −1
indicates a recessive QTL.
- VarE.shared. Must have h2 + VarE.shared
≤ 1.
VarE.non-shared is computed in the program from h2
+
VarE.shared +
VarE.non-shared = 1.
- Prevalence of affected phenotype; real (0, 1). This
value is used for ascertainment of families, even when the phenotype is
quantitative (see item 17 below).
- Marker density in cM (i.e. distance between adjacent
markers), in useful relation to chromosome sizes.
- Number n of alleles of
markers (the same for all
markers)
- Marker allele frequencies (furnish n − 1 values). The
sum of these frequencies must be < 1! The frequency of the nth
allele will be computed by the program.
- Maximum number m of
children in a family
- Population frequencies of families with 1, 2, ..., m − 1
children (irrespective of trait). Reals in [0,1]. The sum of these
numbers must be ≤ 1. For example, if there should be exactly 2
children per family, this line should contain only one number, i.e. 0.
- Number of pedigrees in data set
- Minimum number of "affected" sibs for pedigree ascertainment
(positive int.) (this must be compatible with the family size
distribution)
- P(parent phenotyped); real [0, 1], i.e. the expected
proportion of parents with known trait phenotypes. All sibs are assumed
to be phenotyped.
- P(parent genotyped | phenotyped) real [0, 1], i.e. the
expected proportion of phenotyped parents who have known genotypes.
Unphenotyped parents are assumed to be ungenotyped.
- P(a single marker in an individual is NOT genotyped); real
[0, 1].
- Option for phenotype output: 1 = qualitative, 0 =
quantitative
- Option for output of sibs: 1 = only affected sibs, 0 = all
sibs
The initial seed for the random number generator is contained in an
input file (seed.txt)
(positive long integer).
4. Output
The key output (pedigree file) is in an output file (out.txt). This
file is in pre-MAKEPED LINKAGE pedigree file format for autosomal data.
The phenotype is either given as qualitative or quantitative (0
indicates unknown in either case). Unknown marker genotypes are
indicated as "0 0".
The updated seed is output to the file seed.txt, replacing the
original
seed.
A summary of the input parameters is output to the screen and may be
captured into a file by redirection of standard output (see below).
5. Program constants
The program has no limiting constants, and memory is allocated as
needed. If not enough memory can be allocated, the program gives an
error message and aborts.
There are 3 constants (all defined in the beginning of file simnuc.c),
which the user may want to change:
FNSEED: name of file containing seed (currently seed.txt)
FNOUT: name of output file (currently out.txt)
empty: length of region at the end of each chromosome w/o markers
and cross-overs (see above) (currently set to 1 cM)
6. Compiling and running the program
To compile the simnuc program
with the gcc compiler (djgpp), simply
type make.
To run the program, either answer all the queries it issues or prepare
an input in a file. A sample input file, samplein.txt, is provided.
Then type simnuc <
samplein.txt. The output resulting from this
input file is provided as the file, out.txt. To capture the
parameters
used in a file called sampleout.txt,
issue the command simnuc
<
samplein.txt > sampleout.txt.