Estimate Haplotype Frequency from Mixed Sample of Pooled and Individual DNA DESCRIPTION: Observations are genotypes of individual DNA samples and pooled DNA samples. Each observation is genotypes of K- pooled DNA sample (K=1 corresponds to non-pooling), different pools may have different pool size K. Genotypes are the total number of major variant (allele) in the pool for each locus. At each locus there is two possible alleles (biallelic), one is major allele of interests (say, mutation), another is minor allele. The two alleles at each locus is denoted by 0 and 1, genotype is therefore the number of allele 1 and haplotypes are binary sequences.Missing information is allowed in the algorithm. There are two types of missing values, one is complete missing, another is partial missing. Partial missing means that the exact genotype at one specific locus was not observed, but instead, it is known that at this locus, there are at least one major allele and at least one minor allele. USAGE: ehv(geno, K=2, pool.size) REQUIRED ARGUMENTS: geno: n by m matrix. Genotype observations with rows being pools and columns loci. Columns are corrsponding to the loci considered. There are n pools/individuals and m loci. The number of possible haplotypes is 2^m. The (i,j) entry is the genotype for the i-th pool at locus j. Genotype is defined as the number of major allele in the pool. For example, if we observed AGGG and define arbitrarily allele A as major allele, then the genotype is 1, otherwise if allele G is defined as major allele, the genotype is 3 at this locus. Complete missing genotypes are coded as -1, partial missing values are coded as -2. K: pool size when all pools are of same size. Default value is 1 (for individual DNA sample). pool.size: vector of pool size (K1, K2, ..., Km). The i-th component, Ki, is the pool size for pool i. When all the components are 1, there is no pooling of DNA samples. Default values are (1,1,...,1), i.e., each pool is in fact genotype for one individual. VALUE: h: estimated haplotype relative frequencies. logL: the logarithm maximum likelihood. V: Variance matrix of the estimated haplotype frequencies. config.h: estimated haplotype relative frequencies and their variance estimates (the last two columns). Other columns are their corresponding haplotype configurations of binary (0/1) sequences. SIDE EFFECTS: the program is slow when the loci number exceeds 10, or pool size exceeds 8 or there are too many missing information in the data set. DETAILS: The maximum likelihood estimates of the haplotype frequencis are obtained by EM algorithm under the assumption of Hardy-Weinberg equilibrium and random mating. Ideally the method can be applied to cases with any arbitrary pool sizes (K) and locus numbers (m), but the compuatation burden is very heavy even for moderate K and m. If the sample size (number of pools), n (number of columns of genotype data), is small, the variance estimates may not be accurate. In this case, a simplified version of ehv, eh may be used and the bootstrapping method is recommended for estimating the variances. Logarithm of maximum likelihood (logL) can be used as in constructing likelihood ratio test for, for example, case-control studies. REFERENCES: T.A. Louis, 1982, Finding the observed information matrix when using EM algorithm, J.R.Statisti. B, 44:226-233. K. Lange, 1999, Numerical Analysis for Statisticians, Springer. SEE ALSO: eh, eh1 (for individual sample only). EXAMPLES: Example 1. If the genotype data are on 2 SNPs (L2 and L2) for 5 pools, each of the pools are respectively pools of size 2,1,2,1,3: L1 L2 pool-1 1 4 pool-2 -1 0 pool-3 2 3 pool-4 1 2 pool-5 2 5 ## input data: > geno.data <-matrix(c(1,4,-1,0,2,3,1,2,2,5),5,2,byrow=T) > p.size_c(2,1,2,1,3) #pool sizes ## run the program: > ehv(geno=geno.data, pool.size=p.size) ## if all the pool sizes are equal, then need only to specify K. > ehv(geno=geno.data, K=2) ## K=2 > eh(geno=geno.data, K=3) ##K=3 Example 2. Data from files. Two files are needed. The first is the genotype file, in which there are n rows (pools) and m columns (SNPs or loci). Each entry is a integer standing for the total number of major allele. Another file contains a single column of the pool sizes for each pool. For example, there are 4 pools and 5 SNPs, the 4 pools are pooled DNA from 2,1,2,3 individuals, then the genotype data file, "geno.txt" should look like this 4 2 3 4 0 1 2 0 0 2 3 2 4 4 1 6 4 5 0 1 and the file, "sizes.txt" of pool sizes should be 2 1 2 3 In R, input commands: > geno <- as.matrix(read.table("geno.txt")) # read in genotype data. > size <- scan("sizes.txt") # read in the pool sizes. > eh.out <- ehv(geno, size)