Interpolated Markov Chains for Eukaryotic Promoter Recognition
Uwe Ohler1,2, Stefan Harbeck1,
Heinrich Niemann1, Elmar Nöth1
and Martin G. Reese2
1Chair for Pattern Recognition (Computer Science V), University of Erlangen-Nuremberg,
Martensstraße 3, D-91058 Erlangen, Germany and
2Department of Molecular and Cell Biology,
University of California at Berkeley, 539 Life Sciences Addition, Berkeley, CA
94720-3200, USA
Bioinformatics,
15(5): 362-369 (May 1999)
Abstract
Motivation: We describe a new content-based approach for the detection of promoter regions
of eukaryotic protein-encoding genes. Our system is based on three interpolated Markov
chains (IMCs) of different order which are trained on coding, non-coding and promoter
sequences. It was recently shown that the interpolation of Markov chains leads to stable
parameters and improves on the results in microbial gene finding (Salzberg et al., Nucleic
Acids Res., 26, 544-548, 1998). Here, we present new methods for an automated estimation
of optimal interpolation parameters and show how the IMCs can be applied to detect
promoters in contiguous DNA sequences. Our interpolation approach can also be employed
to obtain a reliable scoring function for human coding DNA regions, and the trained models
can easily be incorporated in the general framework for gene recognition systems.
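To make the idea of an interpolated Markov chain concrete, the following is a minimal sketch and not the system described in this paper. It assumes a single fixed interpolation weight `lam` shared by all orders (the paper instead estimates the interpolation parameters automatically), a uniform distribution as the lowest-level fallback, and classification by the highest log-likelihood. The class name `InterpolatedMarkovChain`, the function `classify` and the toy training strings are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class InterpolatedMarkovChain:
    """Toy IMC: mixes maximum-likelihood estimates of orders 0..order."""

    def __init__(self, order, lam=0.5):
        self.order = order
        self.lam = lam  # assumed fixed weight; not the paper's estimated parameters
        # counts[k][context][symbol], where the context has length k
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, sym in enumerate(seq):
                # Count the symbol under every context length available at i.
                for k in range(min(i, self.order) + 1):
                    self.counts[k][seq[i - k:i]][sym] += 1

    def prob(self, sym, context):
        # Start from the uniform distribution and refine it order by order.
        p = 1.0 / len(ALPHABET)
        context = context[-self.order:] if self.order else ""
        for k in range(len(context) + 1):
            table = self.counts[k].get(context[len(context) - k:])
            if table:  # unseen contexts keep the lower-order estimate
                total = sum(table.values())
                p = self.lam * table.get(sym, 0) / total + (1 - self.lam) * p
        return p

    def log_likelihood(self, seq):
        return sum(
            math.log(self.prob(seq[i], seq[max(0, i - self.order):i]))
            for i in range(len(seq))
        )


def classify(models, seq):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(seq))


# Toy data for illustration only; these are not real promoter or coding sequences.
promoter = InterpolatedMarkovChain(order=2)
promoter.train(["TATAAATATAAA", "TATATATAAA"])
coding = InterpolatedMarkovChain(order=2)
coding.train(["GCGGCGGCCGCG", "GCCGCGGCGC"])
models = {"promoter": promoter, "coding": coding}
```

Because every step is a convex combination of probability distributions, the interpolated estimate remains a proper distribution over the four nucleotides, and the uniform fallback keeps every probability strictly positive for `lam` below 1. A three-class setup like the one in the abstract would add a third model trained on non-coding sequences to the `models` dictionary.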
Results: A 5-fold cross-validation evaluation of our IMC approach on a representative
sequence set yielded a mean correlation coefficient of 0.84 (promoter versus coding
sequences) and 0.53 (promoter versus non-coding sequences). Applied to the task of
eukaryotic promoter region identification in genomic DNA sequences, our classifier identifies
50% of the promoter regions in the sequences used in the most recent review and comparison
by Fickett and Hatzigeorgiou (Genome Res., 7, 861-878, 1997), while producing one false
positive per 849 bp.
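The two evaluation measures quoted above can be sketched as follows. The abstract does not define the correlation coefficient, so this sketch assumes the form commonly used in gene finding, computed from the counts of true and false positives and negatives. The function names and all numbers in the usage lines are illustrative and are not results from this paper.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient of a two-class confusion matrix (assumed definition)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def bp_per_false_positive(total_bp, false_positives):
    """Average number of base pairs scanned per false-positive prediction."""
    return total_bp / false_positives


# Illustrative counts only.
perfect = correlation_coefficient(tp=50, tn=50, fp=0, fn=0)    # 1.0
chance = correlation_coefficient(tp=25, tn=25, fp=25, fn=25)   # 0.0
spacing = bp_per_false_positive(total_bp=8490, false_positives=10)  # 849.0
```

A coefficient of 1 corresponds to perfect separation of the two classes and 0 to chance-level prediction, and a rate written as 1/849 bp corresponds to a `bp_per_false_positive` value of 849.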
Contact:
ohler@informatik.uni-erlangen.de