1. INSTALLATION OF SOFTWARE

Step 1. Install R

Download R from http://cran.r-project.org/: click "R binaries", then "windows", then "base", then "rw1061.exe", and install it on Windows.

Step 2. Source the code into R

Run R; you will see a command window with the prompt "> ". In the menu "File", select "source R code", then select the file EHM.R. The functions in EHM.R are then sourced into R. At the command window, type

> ls()

and you will see:

[1] "add.locus" "bin"       "eh"        "eh1"       "ehv"       "factorial"
[7] "get.haplo" "haplo"     "p2h"

/* These are the functions needed to calculate the haplotype frequencies. The main function is "eh". "ehv" is almost equivalent to "eh", except that it also provides variance estimates ("eh" does not). "eh1" is for the non-pooling case only. */

Step 3. Import data from files

/* Basically, two data files are needed: one holds the genotype data, the other lists the pool size of every pool. If all the pools are of the same size, say 3, then only the genotype file is needed, and when calling the function eh you specify K=3 (see Step 4). */

Suppose you have the two data files "geno.txt" and "size.txt" (see below for an example of these two data files) in the directory C:\snp\haplo. Import the data into R and assign them to R objects, say geno and p.size (in R or Splus, data and functions are called objects):

> geno <- as.matrix( read.table("C:/snp/haplo/geno.txt") )
> p.size <- scan("C:/snp/haplo/size.txt")

To look at the data, type the names of the objects:

> geno
> p.size

Step 4. Run function "eh"

> eh(geno, p.size)

This produces the haplotype frequency estimates and the log-likelihood. If you need the variances of these estimates, run function "ehv":

> ehv(geno, p.size)

If all pools are of the same size, you only need to specify K when calling eh:

> eh(geno, K=3)

2. GENOTYPE DATA FORMAT

--Two data sets (files) for genotypes and pool sizes.
Genotype data is a matrix of genotypes with n rows and m columns, where n is the number of pools and/or individuals and m is the number of loci (SNPs). Each entry is the genotype converted to a number. The file of pool sizes is a vector giving the pool size for each row (pool or individual).

--Arguments of functions "ehv" and "eh": geno, K, pool.size

(1) geno: genotype matrix; rows are pools, columns are SNPs.
    non-pooling: genotypes AA, AG, GG are converted to 0, 1, 2 (number of G)
    2-pooling: genotypes AAAA, AAAG, AAGG, AGGG, GGGG are converted to 0, 1, 2, 3, 4 (number of G)
    A completely missing value is indicated by -1, a partially missing value by -2.

(2) K: the pool size, if all the pools are of equal size.

(3) pool.size: vector whose i-th component is the pool size for the i-th pool. For example, pool.size=c(1,4,2,1,1) means that the 1st row of the data "geno" is an individual sample, the 2nd row is a pool of 4 individuals, the 3rd row is a pool of 2 individuals, and rows 4 and 5 are individuals.

--The two data files should be read into R as R objects (see Step 3).

Example data files:

geno.txt:
 4  2  3  4  0
 1  2  0  0  2
-1  2  4  4  1
 6  4  5  0  1
 1  0  1  2  1
 0  0  1  0  2

size.txt:
2 1 2 3 1 2

There are 6 pools/rows and 5 SNPs/columns, and the pool sizes are 2, 1, 2, 3, 1, 2 respectively. For example, pool 1 is a pool of DNA from two individuals, and its genotypes at the 5 SNPs are 4 2 3 4 0.

3. OTHER FEATURES/FUNCTIONS: eh1 and LD.pair

eh1: Function eh1 can be used for individual genotype data (K=1, no pooling). Running eh1(geno) produces the same results as running eh(geno, K=1) or ehv(geno, K=1), but eh1 is much faster.

LD.pair: Function LD.pair can be used to calculate the pairwise LD coefficients from the haplotype frequencies (the output from eh/ehv).
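
/* The following is only a sketch, not part of EHM.R. It writes the example geno.txt and size.txt contents to temporary files, reads them back as in Step 3, and checks that the data are consistent with the coding described in section 2. The helpers count_allele and check_geno are hypothetical names introduced here for illustration. The check assumes that a pool of size k can hold at most 2*k copies of an allele and that -1 and -2 are the only missing-value codes. */

```r
# Write the example data to temporary files (the paths are illustrative only).
geno_file <- tempfile(fileext = ".txt")
size_file <- tempfile(fileext = ".txt")
writeLines(c("4 2 3 4 0",
             "1 2 0 0 2",
             "-1 2 4 4 1",
             "6 4 5 0 1",
             "1 0 1 2 1",
             "0 0 1 0 2"), geno_file)
writeLines("2 1 2 3 1 2", size_file)

# Read them back in the same way as Step 3.
geno   <- as.matrix(read.table(geno_file))
p.size <- scan(size_file, quiet = TRUE)

# Hypothetical helper: convert genotype strings such as "AG" or "AAGG"
# to the number of copies of one allele (G by default).
count_allele <- function(genotype, allele = "G") {
  vapply(strsplit(genotype, ""),
         function(x) sum(x == allele),
         integer(1))
}

# Hypothetical helper: TRUE if every entry is a missing code (-1 or -2)
# or lies between 0 and 2 * (pool size of its row).
check_geno <- function(geno, pool.size) {
  stopifnot(nrow(geno) == length(pool.size))
  max.count <- 2 * pool.size   # recycled down each column, so row i uses pool.size[i]
  is.missing <- geno == -1 | geno == -2
  all(is.missing | (geno >= 0 & geno <= max.count))
}

count_allele(c("AA", "AG", "GG"))   # non-pooling coding
check_geno(geno, p.size)            # consistency check on the example data
```

/* If check_geno returns FALSE, a likely cause is that the rows of the genotype file and the entries of the pool size file are not in the same order. */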