Jurg Ott / 11 July 2011
Beijing Institute of Genomics

Disease-Associated Genotype Patterns

This document refers to our publication (Long et al 2009) on estimating genotype patterns (diplotypes) and testing for frequency differences between case and control individuals. Here is a brief description of how to use our software, called randompat (RP). It is currently available only for Windows PCs, but the source code (available from me) may be compiled for Linux PCs with the Free Pascal compiler. A Ukrainian (Belarusian?) translation of this document is available here.

Installation

Input Files

As mentioned above, there are two input files, a parameter file and a datafile. The sample parameter file provides a brief description of how to set up this file. The datafile must have the structure of a sumstat file and may or may not contain chromosomal information for the SNPs. Briefly, rows in the datafile correspond to SNPs and columns represent individuals, while the body of the file contains genotype codes, for example, 1 = AA, 2 = AB, 3 = BB, 0 = unknown. The last row contains indicator codes for disease status, for example, 1 = control, 2 = case (affected). The last three columns are optional and may specify chromosome number, position, and marker identifier.
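The layout just described can be illustrated with a short Python sketch. This is not part of randompat; the function name parse_sumstat is hypothetical, and the sketch assumes whitespace-separated fields, no header line, and a disease-status row that carries no optional columns. The actual sumstat format may differ in these details.

```python
def parse_sumstat(text):
    """Split a sumstat-like text into genotype rows, status codes, annotations.

    Assumptions (not confirmed by the program documentation):
    fields are whitespace-separated, there is no header line, and the
    last row holds exactly one status code per individual.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        raise ValueError("need at least one SNP row and one status row")

    # Last row: disease status, e.g. 1 = control, 2 = case
    status = [int(code) for code in rows[-1]]
    n_individuals = len(status)

    genotypes, annotations = [], []
    for fields in rows[:-1]:
        if len(fields) < n_individuals:
            raise ValueError("SNP row is shorter than the status row")
        # Genotype codes, e.g. 1 = AA, 2 = AB, 3 = BB, 0 = unknown
        genotypes.append([int(code) for code in fields[:n_individuals]])
        # Optional trailing columns: chromosome, position, marker identifier
        annotations.append(fields[n_individuals:])
    return genotypes, status, annotations


# Tiny made-up example: 2 SNPs, 4 individuals, with the optional columns
example = """\
1 2 3 0 5 231 CD14_05q31
2 2 1 3 17 113 TP53-2_17p13
1 2 2 1
"""
genotypes, status, annotations = parse_sumstat(example)
print(genotypes)
print(status)
print(annotations)
```

Taking the number of individuals from the status row lets the same code handle files with or without the three optional columns.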

Your data may be in plink format, in which case you may use the included program, p2s, to convert the plink files to sumstat format. Assuming you have files called data.ped and data.map, you must first use plink to produce transposed datasets with the command
  plink --file data --transpose --recode12
Then run the conversion program, p2s, and follow its instructions.

Interpretation of Output

As described in our publication (Long et al 2009), the randompat program picks m SNPs on the basis of their individual significance for association. This can be done with the allele test (based on 2x2 tables of alleles) or the genotype test. Naturally, the order of SNPs picked is generally different depending on which test statistic is used to pick SNPs. The number m of SNPs for which genotype patterns are formed is an input quantity. Two parameter files are included in this package, RPparamZee2.txt and RPparamZee3.txt. They differ in the test type used to pick SNPs. The sample dataset provided here was previously described (Hoh et al 2001).
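To make the difference between the two single-SNP tests concrete, here is a minimal Python sketch of a standard Pearson chi-square genotype test (2x3 table, 2 df) and allele test (2x2 table, 1 df). It is an illustration under textbook assumptions, not randompat's own code; the function names and the genotype counts are hypothetical.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows/columns
                stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_value(stat, df):
    """Upper-tail chi-square probability, closed forms for 1 and 2 df only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


def genotype_test(case_counts, control_counts):
    """2x3 test on (AA, AB, BB) counts; 2 df if all genotypes occur."""
    stat = pearson_chi_square([case_counts, control_counts])
    return stat, chi_square_p_value(stat, 2)


def allele_test(case_counts, control_counts):
    """2x2 test on allele counts derived from (AA, AB, BB) counts; 1 df."""
    def alleles(counts):
        aa, ab, bb = counts
        return [2 * aa + ab, 2 * bb + ab]  # numbers of A and B alleles
    stat = pearson_chi_square([alleles(case_counts), alleles(control_counts)])
    return stat, chi_square_p_value(stat, 1)


# Hypothetical genotype counts (AA, AB, BB) for one SNP
cases = [30, 50, 20]
controls = [45, 40, 15]
print("genotype test:", genotype_test(cases, controls))
print("allele test:  ", allele_test(cases, controls))
```

Because the two statistics summarize the same genotype counts differently, ranking SNPs by one or the other can produce a different order.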

Running the RPparamZee3.txt file produces the following output:

Program RANDOMPAT version 11 Nov 2009

Sample data
Input file = ZeeData.txt
Number of observations = 779
Number of SNPs = 88
Pattern is rare when exp #obs <  1.00 in cases or controls
Number of permutations = 10000
SNPs picked by genotype test. Lambda = 1.0000

=== Observed data, best 2 SNPs ===
TestSNP   seq.#      chi-sq     p-value chr   position  name
      1      24      9.5254 8.5425E-003   5        231 CD14_05q31
      2      75      7.8312 1.9928E-002  17        113 TP53-2_17p13
779 of 779 individuals showed complete patterns

Observed table of genotype patterns and odds ratios
Controls   Cases  Pattern  OR
      95     112  2 3      1.753
     101      73  2 2      0.903
      59      27  1 2      0.549
      63      30  3 3      0.571
      55      59  1 3      1.448
      40      27  3 2      0.851
      11       8  2 1      0.928
       8       1  3 1      0.157
       5       5  1 1      1.282
sum  437     342  Total = 779
p = 5.0810E-004 for table of genotype patterns

=== Randomized adjusted p-values ===
Test SNP 1    5123/10000 = 5.1230E-001 = 0.5123
Test SNP 2    5118/10000 = 5.1180E-001 = 0.5118
     Table    1272/10000 = 1.2720E-001 = 0.1272

Initial seed = 6668993751905812589
*** Sun 15 Nov 2009   4:36:47h  ***
*** Sun 15 Nov 2009   4:36:55h  ***

For this run, the program picked the two SNPs with the smallest p-values in the genotype association test (chi-square with 2 df), then formed genotype patterns and listed all patterns with an expected number of observations of at least 1 in both cases and controls. It turns out that all possible 3 x 3 = 9 patterns occur in this dataset. The pattern with the strongest disease association, judged by the odds ratio, OR = 1.753, is 2-3 (AB-BB) at test SNPs 1 and 2, respectively. On the other hand, the pattern showing the strongest association with absence of disease is 3-1 (BB-AA), with OR = 0.157 (1/OR = 6.37). Based on the numbers of observations, one should now compute confidence intervals for these ORs in order to interpret them reasonably. For example, for the 2-3 pattern, numbers of observations may be displayed as follows:

                  cases  controls
pattern 2-3         112        95
other patterns      230       342
total               342       437

The 2BY2 program furnishes OR = 1.753 and an associated 95% confidence interval of (1.272, 2.415).
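For readers without the 2BY2 program, the following Python sketch computes the odds ratio and a Wald (log-scale) 95% confidence interval from the 2 x 2 counts. Whether 2BY2 uses exactly this method is an assumption; the function names are hypothetical. The Wald interval does reproduce the figures quoted above.

```python
import math


def odds_ratio_ci(cases_with, controls_with, cases_without, controls_without,
                  z=1.959964):
    """Odds ratio and Wald confidence interval (default 95%) for a 2x2 table.

    Assumes all four counts are positive; the Wald method is an assumed
    stand-in for whatever the 2BY2 program does internally.
    """
    odds_ratio = (cases_with * controls_without) / (controls_with * cases_without)
    se_log_or = math.sqrt(1.0 / cases_with + 1.0 / controls_with
                          + 1.0 / cases_without + 1.0 / controls_without)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def pattern_odds_ratio(cases, controls, total_cases=342, total_controls=437):
    """OR of one pattern versus all other patterns combined."""
    return odds_ratio_ci(cases, controls,
                         total_cases - cases, total_controls - controls)


# Pattern 2-3: 112 cases, 95 controls
or_23, lo_23, hi_23 = pattern_odds_ratio(112, 95)
print(f"pattern 2-3: OR = {or_23:.3f}, 95% CI = ({lo_23:.3f}, {hi_23:.3f})")

# Pattern 3-1: 1 case, 8 controls (the interval is wide for such small counts)
or_31, lo_31, hi_31 = pattern_odds_ratio(1, 8)
print(f"pattern 3-1: OR = {or_31:.3f}, 95% CI = ({lo_31:.3f}, {hi_31:.3f})")
```

The Wald interval relies on a large-sample approximation, so for sparse patterns such as 3-1 an exact method may be preferable.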

References

Hoh, J., Wille, A., and Ott, J. 2001. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11(12): 2115-2119.

Long, Q., Zhang, Q., and Ott, J. 2009. Detecting disease-associated genotype patterns. BMC Bioinformatics 10(Suppl 1): S75.