Jurg Ott / 11 July 2011
Beijing Institute of Genomics
Disease-Associated Genotype Patterns
This document refers to our publication (Long et al 2009) on estimating genotype patterns (diplotypes) and testing for frequency differences between case and control individuals. Here is a brief description of how to use our software, called randompat (RP). It is currently available only for Windows PCs, but the source code (available from me) may be compiled for Linux PCs with the Free Pascal compiler. A Ukrainian (Belarusian?) translation of this document is available here.
Installation
- Download the software package and extract all files into a suitable folder. One of these files is a sample dataset, ZeeData.txt, with 88 SNPs and 779 individuals (cases plus controls). While this dataset is small, the RP program is preferably run on genome-wide case-control datasets (SNP markers).
- Open a command window ("DOS box"), cmd. Change directories (folders) until you are in the folder containing the randompat files.
- The program requires two input files: a parameter file (for example, RPparamZee3.txt) and a datafile (for example, ZeeData.txt).
- To run the program with these two input files, type the command goRP Zee3 (note the space between goRP and Zee3). You should then see various intermediate program output and, at the end, a note saying that the final output has been stored in the file RPresultsZee3.txt.
- In the current program design, the number m of SNPs for which genotype patterns can be constructed is limited to m < 10.
Input Files
As mentioned above, there are two input files, a parameter file and a datafile. The sample parameter file provides a brief description of how to set up this file. The datafile must have the structure of a sumstat file and may or may not contain chromosomal information for the SNPs. Briefly, rows in the datafile correspond to SNPs and columns represent individuals, while the body of the file contains genotype codes, for example, 1 = AA, 2 = AB, 3 = BB, 0 = unknown. The last row contains indicator codes for disease status, for example, 1 = control, 2 = case (affected). The last three columns are optional and may specify chromosome number, position, and marker identifier.
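To make the layout described above concrete, here is a minimal Python sketch of a reader for such a file. It is not part of the RP package. It assumes whitespace-delimited fields and that the caller knows whether the three optional trailing columns are present (the hypothetical parameter n_annot, 0 or 3); how the status row behaves when those columns exist is also an assumption, so the sketch simply takes the first n_ind codes of the last row. The genotype values in the toy example are made up.

```python
def parse_sumstat(lines, n_annot=0):
    """Split sumstat-style text into genotypes, status codes and annotations.

    Assumptions (not confirmed by the RP documentation): fields are
    whitespace-delimited, and n_annot (0 or 3) trailing columns hold
    chromosome, position and marker name on every SNP row.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    if len(rows) < 2:
        raise ValueError("need at least one SNP row plus the status row")
    n_ind = len(rows[0]) - n_annot          # number of individuals
    genotypes, annotations = [], []
    for fields in rows[:-1]:                # every row except the last is a SNP
        genotypes.append([int(x) for x in fields[:n_ind]])
        annotations.append(tuple(fields[n_ind:n_ind + n_annot]))
    status = [int(x) for x in rows[-1][:n_ind]]   # last row: 1 = control, 2 = case
    return genotypes, status, annotations


# Toy input: 2 SNPs, 4 individuals, 3 annotation columns (genotypes invented).
example = [
    "1 2 3 0 5 231 CD14_05q31",
    "2 2 1 3 17 113 TP53-2_17p13",
    "1 2 2 1",
]
genotypes, status, annotations = parse_sumstat(example, n_annot=3)
print(genotypes, status, annotations)
```

For a real datafile one would pass the lines of the file, for example open("ZeeData.txt").read().splitlines(), after checking the delimiter and column layout by eye.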
Your data may be in plink format, in which case you may use the included program, p2s, to convert the plink files to sumstat format. Using plink and assuming you have files called data.ped and data.map, you must first produce transposed datasets with the command
plink --file data --transpose --recode12
Then run the conversion program, p2s, and follow its instructions.
Interpretation of Output
As described in our publication (Long et al 2009), the randompat program picks m SNPs on the basis of their individual significance for association. This can be done with the allele test (based on 2x2 tables of alleles) or the genotype test. Naturally, the order of SNPs picked generally differs depending on which test statistic is used to pick SNPs. The number m of SNPs for which genotype patterns are formed is an input quantity. Two parameter files are included in this package, RPparamZee2.txt and RPparamZee3.txt. They differ in the test type used to pick SNPs. The sample dataset provided here was previously described (Hoh et al 2001).
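The difference between the two single-SNP tests can be illustrated with a short Python sketch. It assumes the ordinary Pearson chi-square statistic on a 2x3 table of genotypes (2 df) and on a 2x2 table of alleles (1 df); whether RP computes its statistics in exactly this way is not stated here, and the genotype counts below are hypothetical. The 2-df tail probability exp(-x/2) does agree with the sample output shown later in this document, where chi-square values of 9.5254 and 7.8312 are reported with p-values of 8.5425E-003 and 1.9928E-002.

```python
import math


def pearson_chisq(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:                # skip empty columns
                stat += (obs - expected) ** 2 / expected
    return stat


def allele_table(cases, controls):
    """Convert (AA, AB, BB) genotype counts into a 2x2 table of A and B alleles."""
    def alleles(counts):
        aa, ab, bb = counts
        return [2 * aa + ab, ab + 2 * bb]
    return [alleles(cases), alleles(controls)]


def chisq_pvalue(x, df):
    """Upper-tail chi-square probability, closed forms for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports df = 1 or 2 only")


# Hypothetical genotype counts (AA, AB, BB) for one SNP.
cases = (30, 40, 30)
controls = (50, 40, 10)
geno_stat = pearson_chisq([list(cases), list(controls)])      # genotype test, 2 df
allele_stat = pearson_chisq(allele_table(cases, controls))    # allele test, 1 df
print(geno_stat, chisq_pvalue(geno_stat, 2))
print(allele_stat, chisq_pvalue(allele_stat, 1))
```

Because the two statistics weight the data differently, ranking SNPs by one or the other can select different SNPs, which is the point made in the text above.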
Running the program with the RPparamZee3.txt parameter file produces the following output:
Program RANDOMPAT version 11 Nov 2009
Sample data
Input file = ZeeData.txt
Number of observations = 779
Number of SNPs = 88
Pattern is rare when exp #obs < 1.00 in cases or controls
Number of permutations = 10000
SNPs picked by genotype test. Lambda = 1.0000

=== Observed data, best 2 SNPs ===
TestSNP  seq.#  chi-sq  p-value      chr  position  name
1        24     9.5254  8.5425E-003  5    231       CD14_05q31
2        75     7.8312  1.9928E-002  17   113       TP53-2_17p13

779 of 779 individuals showed complete patterns

Observed table of genotype patterns and odds ratios
Controls  Cases  Pattern  OR
95        112    2 3      1.753
101       73     2 2      0.903
59        27     1 2      0.549
63        30     3 3      0.571
55        59     1 3      1.448
40        27     3 2      0.851
11        8      2 1      0.928
8         1      3 1      0.157
5         5      1 1      1.282
sum 437   342    Total = 779

p = 5.0810E-004 for table of genotype patterns

=== Randomized adjusted p-values ===
Test SNP 1  5123/10000 = 5.1230E-001 = 0.5123
Test SNP 2  5118/10000 = 5.1180E-001 = 0.5118
Table       1272/10000 = 1.2720E-001 = 0.1272

Initial seed = 6668993751905812589
*** Sun 15 Nov 2009 4:36:47h ***
*** Sun 15 Nov 2009 4:36:55h ***
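The observed table in this output counts, separately for controls and cases, how often each combination of genotypes at the selected SNPs occurs. A minimal Python sketch of such a tally follows. It is an illustration, not the RP source code: the function name and the toy data are hypothetical, and dropping individuals with an unknown genotype (code 0) at any selected SNP is an assumption suggested by the "complete patterns" line in the output.

```python
from collections import Counter


def count_patterns(genotypes, status, snp_rows):
    """Tally genotype patterns at the chosen SNP rows, split by disease status.

    genotypes: list of SNP rows, each a list of codes per individual
               (1 = AA, 2 = AB, 3 = BB, 0 = unknown)
    status:    per-individual codes (1 = control, 2 = case)
    snp_rows:  0-based indices of the SNPs that form the pattern
    """
    controls, cases = Counter(), Counter()
    complete = 0
    for ind, stat in enumerate(status):
        if stat not in (1, 2):
            continue                      # unknown disease status
        pattern = tuple(genotypes[r][ind] for r in snp_rows)
        if 0 in pattern:
            continue                      # incomplete pattern, assumed to be dropped
        complete += 1
        (cases if stat == 2 else controls)[pattern] += 1
    return controls, cases, complete


# Toy data: 2 SNPs, 6 individuals (values invented for illustration).
toy_genotypes = [[1, 2, 2, 3, 0, 2],
                 [3, 3, 3, 1, 2, 3]]
toy_status = [1, 2, 2, 1, 2, 1]
controls, cases, complete = count_patterns(toy_genotypes, toy_status, [0, 1])
print(dict(controls), dict(cases), complete)
```

With m selected SNPs there are at most 3^m distinct complete patterns, so two SNPs give at most 9 rows in the table.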
For this run, the program picked the two SNPs with the smallest p-values in the genotype association test (chi-square with 2 df), then formed genotype patterns and listed all patterns with an expected number of observations of at least 1 in both cases and controls. It turns out that all possible 3 x 3 = 9 patterns occur in this dataset. The pattern with the strongest disease association, judged by the odds ratio, OR = 1.753, is 2-3 (AB-BB) at test SNPs 1 and 2, respectively. On the other hand, the pattern showing the strongest association with absence of disease is 3-1 (BB-AA), with OR = 0.157 (1/OR = 6.37). Based on the numbers of observations, one should now compute confidence intervals for these ORs in order to interpret them reasonably. For example, for the 2-3 pattern, the numbers of observations may be displayed as follows:
| | cases | controls |
| pattern 2-3 | 112 | 95 |
| other patterns | 230 | 342 |
| total | 342 | 437 |
The 2BY2 program furnishes OR = 1.753 and an associated 95% confidence interval of (1.272, 2.415).
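For readers without the 2BY2 program, the same figures can be obtained with a few lines of Python. The sketch below computes the odds ratio of one pattern against all other patterns and a log-based (Woolf) confidence interval. Which method 2BY2 uses is not stated here, so the Woolf interval is an assumption; it does reproduce the values quoted above for the 2-3 pattern. The function names are illustrative only.

```python
import math


def pattern_vs_rest(cases_with, controls_with, n_cases, n_controls):
    """Collapse the pattern table into a 2x2 table: this pattern vs all others."""
    return (cases_with, n_cases - cases_with,
            controls_with, n_controls - controls_with)


def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Woolf (log-based) confidence interval.

    a, b = cases with / without the pattern
    c, d = controls with / without the pattern
    z    = normal quantile (default gives a 95% interval)
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Pattern 2-3 from the sample output: 112 of 342 cases, 95 of 437 controls.
or_23, lo_23, hi_23 = odds_ratio_ci(*pattern_vs_rest(112, 95, 342, 437))
print(f"OR = {or_23:.3f}, 95% CI = ({lo_23:.3f}, {hi_23:.3f})")
# prints: OR = 1.753, 95% CI = (1.272, 2.415)
```

The same two functions can be applied to any other row of the pattern table, for example pattern 3-1 with 1 case and 8 controls, keeping in mind that intervals based on very small counts are wide.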
References
Hoh, J., Wille, A., and Ott, J. 2001. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11(12): 2115-2119.
Long, Q., Zhang, Q., and Ott, J. 2009. Detecting disease-associated genotype patterns. BMC Bioinformatics 10(Suppl 1): S75.