Jurg Ott / 10 May 2006
ott@rockefeller.edu
http://www.genemapping.cn
This program package estimates haplotype frequencies in pools of DNA from
two individuals each [1]. This is accomplished with the following steps:
The following files are contained in the “pools2.zip” file:

add2data.txt       Original dataset for add2 gene
GNUlicense.txt     GNU general license for use of POOLS2 program package
hapinf.exe         Executable for hapinf program
hapinf.for         Source code for hapinf program
hapinf.out         Sample output file, hapinf program (add2 data)
hapinf.txt         Sample input file for hapinf program
htdata.out         Sample output, htSNP program
htdata.txt         Sample input for htSNP program (add2 data)
htSNP.py           Code for htSNP program
Readme.txt         Mentions web site with detailed explanations
ReadmeHapinf.txt   Explanations to hapinf program
Readme-htSNP.txt   Explanations to htSNP program
seed.txt           Contains seed for random number generator
snp.exe            Executable for snp program
snp.pas            Source code for snp program
snpdata.txt        Sample input file to snp program
snpeh.txt          Sample output from snp program
snpfinal.txt       Sample output from snp program
snpfreepas.pas     Part of source code of snp program
snpout.txt         Sample output from snp program
The user must prepare an input file called “snpdata.txt” with the following
specifications:

Line 1: Number of pools used (must be the same for all data in this file).
        Optionally this number may be followed by arbitrary text.
Line 2: Gene ID (an integer number)
        SNP ID (an integer number)
        Phenotype code for pool 1 (integer)
        Phenotype code for pool 2, etc.
Repeat line 2 for each new SNP and for as many genes as desired. SNPs in
the same gene must have the same gene ID. The same gene ID may be used at
different places in the file, but each separate run of that ID is then
treated as its own gene. For example, if gene ID 335 is followed by gene
ID 638 and then again by gene ID 335, the program will treat these as
three different genes.
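
The grouping rule above can be illustrated with a short Python sketch. It is not part of the POOLS2 package: the function name parse_snpdata, the assumption that columns are separated by whitespace, and the sample data are all hypothetical.

```python
from itertools import groupby

def parse_snpdata(text):
    """Parse snpdata.txt-style text into (n_pools, gene blocks).

    Assumption: columns are whitespace-separated integers.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Line 1: number of pools, optionally followed by arbitrary text.
    n_pools = int(lines[0].split()[0])
    rows = []
    for ln in lines[1:]:
        fields = [int(x) for x in ln.split()]
        gene_id, snp_id, codes = fields[0], fields[1], fields[2:]
        if len(codes) != n_pools:
            raise ValueError(f"SNP {snp_id}: expected {n_pools} pool codes")
        rows.append((gene_id, snp_id, codes))
    # Consecutive rows with the same gene ID form one gene; the same ID
    # appearing again later in the file starts a new gene.
    blocks = [(gid, list(grp)) for gid, grp in groupby(rows, key=lambda r: r[0])]
    return n_pools, blocks

# Made-up data: gene 335, then gene 638, then gene 335 again.
example = """3 pools, made-up data
335 1 0 2 1
335 2 3 0 -1
638 1 1 1 4
335 3 0 0 2
"""
n_pools, blocks = parse_snpdata(example)
print(n_pools, [(gid, len(snps)) for gid, snps in blocks])
# prints: 3 [(335, 2), (638, 1), (335, 1)]
```

The output shows three gene blocks even though only two distinct gene IDs occur, which mirrors the behavior described for the snp program.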
The data matrix now consists of a number of rows (= SNPs) and columns
(= pools). Pool phenotype codes are as follows, with X and Y designating
two SNP alleles:

| Pool phenotype | Code | Explanation                                       |
| XXXX           |  0   | All alleles are of the X type                     |
| YYYY           |  1   | All alleles Y type                                |
| XXYY           |  2   | Equal number of X and Y alleles                   |
| XXXY           |  3   | 3 X alleles, 1 Y allele                           |
| XYYY           |  4   | 1 X allele, 3 Y alleles                           |
| XY?            |  5   | X and Y alleles both present but in unknown numbers |
| XYZ            |  6   | 3 alleles, will be treated as N                   |
| N              | -1   | No data (unknown)                                 |
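
For readers who preprocess their own data, the code table can be written as a small Python lookup. This is only an illustrative sketch; the names ALLELE_COUNTS, normalise and allele_counts are hypothetical and do not appear in the snp program.

```python
# code -> (number of X alleles, number of Y alleles) among the four
# alleles of a pool of two individuals, as given in the table above.
ALLELE_COUNTS = {0: (4, 0), 1: (0, 4), 2: (2, 2), 3: (3, 1), 4: (1, 3)}
VALID_CODES = {-1, 0, 1, 2, 3, 4, 5, 6}
MISSING = -1

def normalise(code):
    """Map code 6 (three alleles) to N, i.e. -1, as stated in the table."""
    if code not in VALID_CODES:
        raise ValueError(f"unknown pool phenotype code: {code}")
    return MISSING if code == 6 else code

def allele_counts(code):
    """Return (n_X, n_Y), or None when the counts are not determined
    (code 5: numbers unknown; codes 6 and -1: no usable data)."""
    return ALLELE_COUNTS.get(normalise(code))

print(allele_counts(3))   # prints: (3, 1)
print(allele_counts(5))   # prints: None
```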
For the random number generator, the user must prepare a file called
“seed.txt” containing a negative integer number, i.e., the random number
seed. This file will be updated with each successive run of the snp
program.
Open a Windows command window (“DOS box”) and change directories until you
are in the directory containing the snp program and the input files. Then
simply type snp. The program will then impute genotypes as described in
[1]. A given SNP will be considered missing entirely (and will not be
processed further) if the number of missing pool phenotypes is larger than
the highest frequency of any of the known phenotypes at that SNP.
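
The rule for discarding a SNP can be sketched in Python as follows. This is one reading of the sentence above, not the code in snp.pas: it assumes that code 6 counts as missing (the code table says it is treated as N) and that codes 0 through 5 count as known phenotypes. The function name snp_is_missing is hypothetical.

```python
from collections import Counter

MISSING_CODES = (-1, 6)  # assumption: code 6 is counted as N (missing)

def snp_is_missing(codes):
    """True if the number of missing pool phenotypes is larger than the
    count of the most frequent known phenotype at this SNP."""
    n_missing = sum(1 for c in codes if c in MISSING_CODES)
    known = Counter(c for c in codes if c not in MISSING_CODES)
    highest = max(known.values(), default=0)
    return n_missing > highest

print(snp_is_missing([0, 0, 0, -1, -1]))  # prints: False (2 missing, code 0 seen 3 times)
print(snp_is_missing([0, 1, 2, -1, -1]))  # prints: True  (2 missing, each known code seen once)
```

Note that a tie (as many missing phenotypes as the most frequent known phenotype) keeps the SNP, because the rule says “larger than”.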
The program then displays its program constants, which are currently set
as follows:

Max. number of pools = 100
Max. number of SNPs per gene = 200
Max. number of genotype vectors per gene = 30,000
Max. number of iterations = 99. If successive iterations differ by
ss < 10^-11, then iterations will stop early, where ss = sum of squared
deviations between old and new genotype vector frequencies.
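
The stopping rule can be sketched in Python. Only the two constants (99 iterations, threshold 10^-11) and the definition of ss come from the text above; the update step toy_update is a made-up stand-in, since the actual estimation step is described in [1] and not reproduced here.

```python
def sum_sq_dev(old, new):
    """ss: sum of squared deviations between old and new frequencies."""
    return sum((n - o) ** 2 for o, n in zip(old, new))

def iterate(update, freqs, max_iter=99, tol=1e-11):
    """Apply `update` until ss < tol or max_iter iterations are done."""
    it, ss = 0, float("inf")
    for it in range(1, max_iter + 1):
        new = update(freqs)
        ss = sum_sq_dev(freqs, new)
        freqs = new
        if ss < tol:
            break  # successive iterations are close enough: stop early
    return freqs, it, ss

# Hypothetical update step for demonstration only: move halfway toward
# a fixed target vector.
TARGET = [0.5, 0.3, 0.2]

def toy_update(freqs):
    return [(f + t) / 2 for f, t in zip(freqs, TARGET)]

freqs, n_iter, ss = iterate(toy_update, [1 / 3, 1 / 3, 1 / 3])
print(n_iter, ss)
```

With this toy update the loop stops well before the 99th iteration, because ss shrinks by a factor of four at every step.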
As the program is running it announces each gene and the number of valid
(non-missing) SNPs at that gene, as well as the total number of genotype
vectors compatible with the pool phenotypes. If that number exceeds the
maximum set by the corresponding program constant, the given SNP will be
skipped (it would take too long to run anyway). Then the program reports
progress at each iteration and displays the sum of squared deviations (ss)
of genotype vector frequencies between the current and the previous
iteration.
The snp program writes three output files: “snpout.txt”, “snpfinal.txt”,
and “snpeh.txt”.
1. snpout.txt
This file contains “cleaned” data with imputed pool phenotypes and an
indication of which SNPs are missing.
2. snpfinal.txt
This file contains, for each gene, a list of estimated individual
genotypes. The following items (columns) are provided for each individual
(output line):
Gene number
Genotype code for SNP 1
Genotype code for SNP 2, etc.,
where codes 1, 2, and 3 represent genotypes 1/1, 1/2, and 2/2,
respectively. After some modification, this file is suitable for input to
the hapinf program (see below).
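
As an illustration of the genotype codes, the following Python sketch decodes one output line into allele pairs. It assumes whitespace-separated integer columns, which the text above does not state, and the names GENOTYPES and decode_line are hypothetical. Any code other than 1, 2, or 3 raises a KeyError, because no other genotype codes are described here.

```python
# Genotype code -> allele pair, as stated above: 1 = 1/1, 2 = 1/2, 3 = 2/2.
GENOTYPES = {1: (1, 1), 2: (1, 2), 3: (2, 2)}

def decode_line(line):
    """Split one line into the gene number and a list of allele pairs.

    Assumption: columns are whitespace-separated integers.
    """
    fields = [int(x) for x in line.split()]
    gene, codes = fields[0], fields[1:]
    return gene, [GENOTYPES[c] for c in codes]

print(decode_line("119 1 2 3"))
# prints: (119, [(1, 1), (1, 2), (2, 2)])
```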
3. snpeh.txt
This file is analogous to the previous file but is formatted for input to
the EHplus program [2]. The following items (columns) are provided for
each individual (output line):
ID for individual. Example: 3.119 (meaning individual 3 at gene 119)
A fixed code of 0 meaning “control individual”
Allele 1 at SNP 1
Allele 2 at SNP 1
Allele 1 at SNP 2
Allele 2 at SNP 2, etc.
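
The column layout above can be read back with a few lines of Python. This sketch assumes whitespace-separated columns and keeps allele symbols as strings, because the allele coding in this file is not specified here. The function name parse_eh_row and the sample line are hypothetical.

```python
def parse_eh_row(line):
    """Split one line into individual, gene, status code, allele pairs.

    Assumption: whitespace-separated columns, ID written as
    <individual>.<gene>, for example 3.119.
    """
    fields = line.split()
    individual, gene = fields[0].split(".")
    status = int(fields[1])          # fixed code 0 = control individual
    alleles = fields[2:]
    if len(alleles) % 2:
        raise ValueError("expected two alleles per SNP")
    pairs = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return int(individual), int(gene), status, pairs

print(parse_eh_row("3.119 0 1 1 1 2"))
# prints: (3, 119, 0, [('1', '1'), ('1', '2')])
```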
The hapinf program was developed by Dr. Andrew Clark [4], who graciously
allowed me to include it with my programs in this package.
The program requires an input file called “hapinf.txt” (for an example see
Analysis of sample data, below). Then run the hapinf program by typing
hapinf in the command window. The program will produce an output file
called “hapinf.out”.
The htSNP program is written in the Python language. To run it you must
have Python installed, for example, on your PC under Windows. See
http://www.python.org/. For a brief list of commands, type htSNP.py. The
input file consists of a number of rows (haplotypes) and columns (SNPs),
with an optional header line containing any text. Entries in the input
data matrix are alleles coded as 0 and 1.
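
A file in this layout could be read as sketched below. This is not taken from htSNP.py: it assumes that alleles are separated by whitespace, and it treats the first line as a header only if it contains anything other than 0 and 1 entries. The function name read_haplotypes is hypothetical.

```python
def read_haplotypes(text):
    """Return the haplotype matrix as a list of rows of 0/1 integers.

    Assumption: entries are whitespace-separated; an optional first
    line with any other text is skipped as a header.
    """
    def is_data(line):
        return all(tok in ("0", "1") for tok in line.split())

    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines and not is_data(lines[0]):
        lines = lines[1:]            # optional header line
    rows = []
    for ln in lines:
        if not is_data(ln):
            raise ValueError(f"not a 0/1 haplotype row: {ln!r}")
        rows.append([int(tok) for tok in ln.split()])
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have different numbers of SNPs")
    return rows

print(read_haplotypes("made-up header\n0 1 1\n1 0 1\n"))
# prints: [[0, 1, 1], [1, 0, 1]]
```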
This sample dataset consists of 16 pools genotyped at 25 SNPs
(“add2data.txt” file). In [1] we use only the first 15 SNPs, resulting in
the “snpdata.txt” file (input to snp program). Run the snp program by
typing snp. The resulting output file, “snpfinal.txt”, is then modified
(first 7 lines deleted) and saved as “hapinf.txt”, which serves as input
to the hapinf program. Run it by typing hapinf.
The resulting output file, “hapinf.out”, contains haplotypes assigned to
individuals after the title line “SampleID HapCode Haplotype”. In
ambiguous situations more than two haplotypes may be shown for a given
individual (this will be remedied in a future program version), in which
case the user may want to use their judgment as to which haplotypes to
assign. For the given dataset, no ambiguities exist. The last section
contains the list of unambiguously assigned haplotypes. To determine
haplotype-tagging SNPs, save that section of text, for example, as
“htdata.txt”, but delete the first two columns (Code and Number). Then
type the command

htSNP.py htdata.txt htdata.out

The “htdata.out” file then shows that 6 is the smallest number of
haplotype-tagging SNPs, and that there are 10 different such sets of SNPs.
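
To make the idea of a smallest tagging set concrete, here is a brute-force Python sketch on made-up data. It is not the algorithm in htSNP.py: it assumes that a set of SNPs “tags” the haplotypes when those SNPs alone distinguish all distinct haplotypes, and the function name minimal_tagging_sets is hypothetical. The exhaustive search is only practical for a small number of SNPs.

```python
from itertools import combinations

def minimal_tagging_sets(haplotypes):
    """Return (k, sets): the smallest number k of SNP columns that
    distinguishes all distinct haplotypes, and every such set of k
    column indices (0-based)."""
    distinct = sorted(set(map(tuple, haplotypes)))
    n_snps = len(distinct[0])
    for k in range(n_snps + 1):
        hits = [cols for cols in combinations(range(n_snps), k)
                if len({tuple(h[c] for c in cols) for h in distinct})
                == len(distinct)]
        if hits:
            return k, hits

# Made-up haplotypes (rows) at three SNPs (columns).
toy = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(minimal_tagging_sets(toy))
# prints: (2, [(0, 1), (0, 2), (1, 2)])
```

In the toy data any two of the three SNPs separate the four haplotypes, so the sketch reports a minimum of 2 and three alternative sets, analogous to the “6 SNPs, 10 sets” summary for the add2 data.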
[1] Hoh J, Matsuda F, Peng X, Markovic D, Lathrop MG, Ott J: SNP haplotype
tagging from DNA pools of two individuals. BMC Bioinformatics 2003, 4:14
[2] Zhao JH, Curtis D, Sham PC: Model-free analysis and permutation tests
for allelic association. Hum Hered 2000, 50:133-139
[3] http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt
[4] Clark AG: Inference of haplotypes from PCR-amplified samples of
diploid populations. Mol Biol Evol 1990, 7:111-122