Jurg Ott / 10 May 2006

ott@rockefeller.edu

http://www.genemapping.cn

User's Guide to the pools2 package

 

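This program package estimates haplotype frequencies in pools of DNA from two individuals each [1]. This is accomplished with three programs that are run in sequence: snp (imputes individual genotypes from pool phenotypes), hapinf (assigns haplotypes to individuals), and htSNP (determines haplotype-tagging SNPs).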

 

Contents of pools2 package

The following files are contained in the “pools2.zip” file:

 

add2data.txt      Original dataset for add2 gene

GNUlicense.txt    GNU general license for use of POOLS2 program package

hapinf.exe        Executable for hapinf program

hapinf.for        Source code for hapinf program

hapinf.out        Sample output file, hapinf program (add2 data)

hapinf.txt        Sample input file for hapinf program

htdata.out        Sample output, htSNP program

htdata.txt        Sample input for htSNP program (add2 data)

htSNP.py          Code for htSNP program

Readme.txt        Mentions web site with detailed explanations

ReadmeHapinf.txt  Explanations to hapinf program

Readme-htSNP.txt  Explanations to htSNP program

seed.txt          Contains seed for random number generator

snp.exe           Executable for snp program

snp.pas           Source code for snp program

snpdata.txt       Sample input file to snp program

snpeh.txt         Sample output from snp program

snpfinal.txt      Sample output from snp program

snpfreepas.pas    Part of source code of snp program

snpout.txt        Sample output from snp program

 

Input files for snp program

 

1. Pool phenotypes

 

The user must prepare an input file called “snpdata.txt” with the following specifications:

 

Line 1

Number of pools used (must be the same for all data in this file). Optionally this number may be followed by arbitrary text.

 

Line 2

Gene ID (an integer number)

SNP ID (an integer number)

Phenotype code for pool 1 (integer)

Phenotype code for pool 2, etc.

 

Repeat line 2 for each new SNP and for as many genes as desired. SNPs in the same gene must have the same gene ID. The same gene ID may appear at different places in the file, but each consecutive run of lines with that ID is treated as a separate gene. For example, if gene ID 335 is followed by gene ID 638 and then again by gene ID 335, the program will treat these as three different genes.

 

The data matrix now consists of a number of rows (= SNPs) and columns (= pools).
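
To illustrate the layout described above, the following Python sketch parses text in the “snpdata.txt” format and groups consecutive SNP lines by gene ID. It is not part of the pools2 package: the function name parse_snpdata, the assumption of whitespace-separated fields, and the sample values are made up for illustration.

```python
from itertools import groupby

def parse_snpdata(text):
    """Parse snpdata.txt-style text into (n_pools, genes).

    genes is a list of (gene_id, snps); snps is a list of
    (snp_id, [phenotype codes]). Assumes whitespace-separated fields.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Line 1: number of pools, optionally followed by arbitrary text.
    n_pools = int(lines[0].split()[0])
    rows = []
    for ln in lines[1:]:
        fields = [int(f) for f in ln.split()]
        gene_id, snp_id, codes = fields[0], fields[1], fields[2:]
        if len(codes) != n_pools:
            raise ValueError(
                f"SNP {snp_id}: expected {n_pools} codes, got {len(codes)}")
        rows.append((gene_id, snp_id, codes))
    # groupby merges only *consecutive* rows, so a gene ID that reappears
    # later in the file starts a new gene, as described in the text.
    genes = [(gid, [(s, c) for _, s, c in grp])
             for gid, grp in groupby(rows, key=lambda r: r[0])]
    return n_pools, genes

# Made-up sample: 4 pools, gene IDs 335, 638, 335.
sample = """4 pools, made-up example
335 1 0 2 3 -1
335 2 1 1 4 2
638 1 0 0 2 5
335 3 2 2 0 0
"""
n_pools, genes = parse_snpdata(sample)
print(n_pools, [gid for gid, _ in genes])
```

In this sample the second block with gene ID 335 is reported as a third gene rather than being merged with the first.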

 

Pool phenotype codes are as follows, with X and Y designating two SNP alleles:

 

Pool phenotype   Code   Explanation

XXXX              0     All alleles are of the X type
YYYY              1     All alleles Y type
XXYY              2     Equal number of X and Y alleles
XXXY              3     3 X alleles, 1 Y allele
XYYY              4     1 X allele, 3 Y alleles
XY?               5     X and Y alleles both present but in unknown numbers
XYZ               6     3 alleles, will be treated as N
N                -1     No data (unknown)
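
Because each pool holds DNA from two diploid individuals, every phenotype code corresponds to one or more pairs of individual genotypes. The Python sketch below enumerates those pairs from the allele counts in the code table. It only illustrates the arithmetic; how the snp program itself builds genotype vectors is described in [1], and the names Y_COUNT, pool_code, and compatible_pairs are hypothetical.

```python
from itertools import combinations_with_replacement

# Number of Y alleles carried by each diploid genotype.
Y_COUNT = {"XX": 0, "XY": 1, "YY": 2}

def pool_code(g1, g2):
    """Phenotype code (0-4) of a pool made from genotypes g1 and g2."""
    y = Y_COUNT[g1] + Y_COUNT[g2]       # Y alleles among the 4 in the pool
    return {0: 0, 4: 1, 2: 2, 1: 3, 3: 4}[y]

def compatible_pairs(code):
    """Unordered genotype pairs compatible with a pool phenotype code."""
    pairs = list(combinations_with_replacement(["XX", "XY", "YY"], 2))
    if code in (-1, 6):                 # no data; code 6 is treated as N
        return pairs
    if code == 5:                       # X and Y both present, counts unknown
        return [p for p in pairs if pool_code(*p) in (2, 3, 4)]
    return [p for p in pairs if pool_code(*p) == code]

for code in (0, 1, 2, 3, 4, 5):
    print(code, compatible_pairs(code))
```

Code 2 (XXYY), for example, is compatible with two pairs: one XX and one YY individual, or two XY individuals.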

 

 

2. Random seed

 

For the random number generator, the user must prepare a file called “seed.txt” containing a negative integer number, i.e., the random number seed. This file will be updated with each successive run of the snp program.

 

Running the snp program

Open a Windows command window (“DOS box”) and change directories until you are in the directory containing the snp program and the input files. Then simply type snp. The program will then impute genotypes as described in [1]. A given SNP will be considered missing entirely (and will not be processed further) if the number of missing pool phenotypes is larger than the highest frequency of any of the known phenotypes at that SNP.
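
The rule for treating a SNP as entirely missing can be written out as a short Python sketch. This is an illustration of the rule as stated, not the code of the snp program; the function name snp_is_missing is hypothetical, and it is assumed here that codes -1 (N) and 6 (XYZ, treated as N) count as missing while all other codes count as known phenotypes.

```python
from collections import Counter

def snp_is_missing(codes):
    """Apply the missing-SNP rule to the phenotype codes of one SNP.

    Assumption: codes -1 and 6 are missing; every other code is known.
    """
    missing = sum(1 for c in codes if c in (-1, 6))
    known = Counter(c for c in codes if c not in (-1, 6))
    highest = max(known.values(), default=0)   # most frequent known phenotype
    return missing > highest

print(snp_is_missing([0, 0, 0, 2, -1, -1]))    # 2 missing, code 0 seen 3 times
print(snp_is_missing([0, 2, 3, -1, -1, 6]))    # 3 missing, each code seen once
```

The first SNP is kept (2 is not larger than 3); the second is considered missing (3 is larger than 1).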

           The program then displays program constants, which are currently set as follows:

Max. number of pools = 100

Max. number of SNPs per gene = 200

Max. number of genotype vectors per gene = 30,000

Max. number of iterations = 99. Iterations stop early if successive iterations differ by ss < 10^-11, where

           ss = sum of squared deviations between old and new genotype vector frequencies.
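
The stopping rule can be sketched in Python as follows. The update function stands in for whatever re-estimation step the snp program performs (see [1]); the toy update used here is purely hypothetical and serves only to make the loop runnable. The names MAX_ITER, TOL, sum_sq_dev, and iterate are likewise assumptions.

```python
MAX_ITER = 99      # maximum number of iterations
TOL = 1e-11        # early-stop threshold for ss

def sum_sq_dev(old, new):
    """ss: sum of squared deviations between old and new frequencies."""
    return sum((o - n) ** 2 for o, n in zip(old, new))

def iterate(freqs, update):
    """Apply update repeatedly until ss < TOL or MAX_ITER is reached."""
    ss = float("inf")
    for it in range(1, MAX_ITER + 1):
        new = update(freqs)
        ss = sum_sq_dev(freqs, new)
        freqs = new
        if ss < TOL:
            break
    return freqs, it, ss

# Toy update (hypothetical): move each frequency halfway toward 0.25.
def toy_update(f):
    return [0.5 * x + 0.125 for x in f]

final, n_iter, ss = iterate([0.7, 0.1, 0.1, 0.1], toy_update)
print(n_iter, ss)
```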

 

As the program runs, it announces each gene, the number of valid (non-missing) SNPs at that gene, and the total number of genotype vectors compatible with the pool phenotypes. If that number exceeds the maximum set by the corresponding program constant, the given SNP will be skipped (it would take too long to run anyway). The program then reports progress at each iteration and displays the sum of squared deviations (ss) of genotype vector frequencies between the current and the previous iteration.

Output files

The snp program writes three output files, “snpout.txt”, “snpfinal.txt”, and “snpeh.txt”.

 

1. snpout.txt

This file will contain “cleaned” data with imputed pool phenotypes and an indication of which SNPs are missing.

 

2. snpfinal.txt

This file contains, for each gene, a list of estimated individual genotypes. The following items (columns) will be provided for each individual (output line):

           Gene number

           Genotype code for SNP 1

           Genotype code for SNP 2, etc.,

where codes 1, 2, and 3 represent genotypes 1/1, 1/2, and 2/2, respectively. After some modification as indicated, this file is suitable for input to the hapinf program (see below).

 

3. snpeh.txt

This file is analogous to the previous file but is formatted for input to the EHplus program [2]. The following items (columns) are provided for each individual (output line):

           ID for individual. Example: 3.119 (meaning individual 3 at gene 119)

           A fixed code of 0 meaning “control individual”

           Allele 1 at SNP 1

           Allele 2 at SNP 1

           Allele 1 at SNP 2

           Allele 2 at SNP 2, etc.
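
The relation between the genotype codes in “snpfinal.txt” and the allele pairs in “snpeh.txt” can be sketched in Python. This is an illustration only: it assumes that alleles are written as 1 and 2 and that fields are separated by single spaces, which should be checked against the sample “snpeh.txt” file. The names ALLELES and eh_line are hypothetical.

```python
# Genotype codes as used in snpfinal.txt: 1 -> 1/1, 2 -> 1/2, 3 -> 2/2.
ALLELES = {1: (1, 1), 2: (1, 2), 3: (2, 2)}

def eh_line(individual, gene, genotype_codes):
    """Build one output line: ID, fixed code 0, then two alleles per SNP."""
    fields = [f"{individual}.{gene}", "0"]    # e.g. 3.119, then control code 0
    for code in genotype_codes:
        a1, a2 = ALLELES[code]
        fields += [str(a1), str(a2)]
    return " ".join(fields)

print(eh_line(3, 119, [1, 2, 3]))   # prints: 3.119 0 1 1 1 2 2 2
```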

 

Running the hapinf program

The hapinf program was developed by Dr. Andrew Clark [4], who graciously allowed me to include it with my programs in this package.

The program requires an input file called “hapinf.txt” (for an example see Analysis of sample data, below). Then run the hapinf program by typing hapinf in the command window. The program will produce an output file called “hapinf.out”.

 

Running the htSNP program

This program is written in the Python language. To run it you must have the Python system installed, for example, on your PC under Windows. See http://www.python.org/. For a brief list of commands, type htSNP.py. The input file consists of a number of rows (haplotypes) and columns (SNPs), with an optional header line containing any text. Entries in the input data matrix are alleles coded as 0 and 1.
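
As an illustration of the input layout, the following Python sketch reads a haplotype matrix with an optional header line and checks that all alleles are coded 0 or 1. It is not taken from htSNP.py: the function name read_haplotypes, the explicit has_header flag, and the assumption of whitespace-separated alleles are made up for this example (see “htdata.txt” for the actual layout).

```python
def read_haplotypes(text, has_header=False):
    """Return a list of haplotypes, each a list of 0/1 alleles."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if has_header:
        lines = lines[1:]               # optional header line with any text
    haplotypes = []
    for ln in lines:
        alleles = ln.split()
        if any(a not in ("0", "1") for a in alleles):
            raise ValueError(f"alleles must be coded 0 or 1: {ln!r}")
        haplotypes.append([int(a) for a in alleles])
    if len({len(h) for h in haplotypes}) > 1:
        raise ValueError("all haplotypes must cover the same number of SNPs")
    return haplotypes

# Made-up sample: 3 haplotypes (rows) by 4 SNPs (columns).
sample = """made-up example, 3 haplotypes x 4 SNPs
0 0 1 0
0 1 1 0
1 1 0 1
"""
haps = read_haplotypes(sample, has_header=True)
print(len(haps), "haplotypes,", len(haps[0]), "SNPs")
```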

 

Analysis of sample data (add2 gene)

This sample dataset consists of 16 pools genotyped at 25 SNPs (“add2data.txt” file). In [1] we use only the first 15 SNPs, resulting in the “snpdata.txt” file (input to snp program). Run the snp program by typing snp.

The resulting output file, “snpfinal.txt”, is then modified (first 7 lines deleted) and saved as “hapinf.txt”, which serves as input to the hapinf program. Run it by typing hapinf.
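
The file modification described above can also be done with a few lines of Python instead of a text editor. This is a minimal sketch; the function names drop_header_lines and make_hapinf_input are hypothetical, and the default of 7 lines is the number given above for the sample data.

```python
def drop_header_lines(text, n=7):
    """Return text without its first n lines."""
    return "".join(text.splitlines(keepends=True)[n:])

def make_hapinf_input(src="snpfinal.txt", dst="hapinf.txt", n=7):
    """Copy src to dst, leaving out the first n lines."""
    with open(src) as f:
        body = drop_header_lines(f.read(), n)
    with open(dst, "w") as f:
        f.write(body)

# Usage (run in the directory containing snpfinal.txt):
#   make_hapinf_input()
```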

The resulting output file, “hapinf.out”, after the title SampleID HapCode Haplotype, contains haplotypes assigned to individuals. In ambiguous situations more than two haplotypes may be shown for a given individual (this will be remedied in a future program version), in which case the user may want to use their own judgment as to which haplotypes to assign. For the given dataset, no ambiguities exist. The last section contains the list of unambiguously assigned haplotypes. To determine haplotype-tagging SNPs, save that section of text, for example, as “htdata.txt”, but delete the first two columns (Code and Number). Then type the command htSNP.py htdata.txt htdata.out. The “htdata.out” file then shows that 6 is the smallest number of haplotype-tagging SNPs, and that there are 10 different such sets of SNPs.
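
To make the notion of a smallest set of haplotype-tagging SNPs concrete, the Python sketch below searches exhaustively for the smallest subsets of SNP columns that still distinguish all distinct haplotypes. This is a brute-force illustration under that common definition of tagging; whether htSNP.py works this way is not stated here, and the function name minimal_tagging_sets and the toy data are hypothetical. Column indices are 0-based.

```python
from itertools import combinations

def minimal_tagging_sets(haplotypes):
    """Smallest sets of SNP columns that distinguish all distinct haplotypes."""
    distinct = sorted(set(map(tuple, haplotypes)))
    n_snps = len(distinct[0])
    for size in range(n_snps + 1):
        hits = [cols for cols in combinations(range(n_snps), size)
                if len({tuple(h[c] for c in cols) for h in distinct})
                == len(distinct)]
        if hits:                # first size with any hit is the minimum
            return hits
    return []

# Toy data (made up): 4 haplotypes at 4 SNPs; the last SNP never varies.
toy = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
sets = minimal_tagging_sets(toy)
print(len(sets[0]), "tagging SNPs;", len(sets), "such sets:", sets)
```

For the toy data, two SNPs suffice, and three different pairs of columns do the job. Exhaustive search grows quickly with the number of SNPs, so this sketch is suitable only for small examples.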

 

References

[1] Hoh J, Matsuda F, Peng X, Markovic D, Lathrop MG, Ott J: SNP haplotype tagging from DNA pools of two individuals. BMC Bioinformatics 2003, 4:14

[2] Zhao JH, Curtis D, Sham PC: Model-free analysis and permutation tests for allelic association. Hum Hered 2000, 50:133-139

[3] http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt

[4] Clark AG: Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 1990, 7:111-122