Prediction of Complete Gene Structures in Human Genomic DNA
C. Burge, S. Karlin
Department of Mathematics, Stanford University, CA 94305, USA.
Journal of Molecular Biology
268(1):78-94 (April 25, 1997)
Abstract
We introduce a general probabilistic model of the gene
structure of human genomic sequences which incorporates
descriptions of the basic transcriptional, translational and
splicing signals, as well as length distributions and
compositional features of exons, introns and intergenic
regions. Distinct sets of model parameters are derived to
account for the many substantial differences in gene density
and structure observed in distinct C + G compositional
regions of the human genome. In addition, new models of
the donor and acceptor splice signals are described which
capture potentially important dependencies between signal
positions. The model is applied to the problem of gene
identification in a computer program, GENSCAN, which
identifies complete exon/intron structures of genes in
genomic DNA. Novel features of the program include the
capacity to predict multiple genes in a sequence, to deal with
partial as well as complete genes, and to predict consistent
sets of genes occurring on either or both DNA strands.
GENSCAN is shown to have substantially higher accuracy
than existing methods when tested on standardized sets of
human and vertebrate genes, with 75 to 80% of exons
identified exactly. The program is also capable of indicating
fairly accurately the reliability of each predicted exon.
Consistently high levels of accuracy are observed for
sequences of differing C + G content and for distinct groups
of vertebrates.
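
As a minimal illustration of what "exons identified exactly" could mean in practice, the Python sketch below scores a predicted exon set against an annotated one. It is not taken from GENSCAN: the `Exon` tuple, the `exon_level_accuracy` function, the coordinates, and the rule that a prediction counts only when start, end and strand all match are assumptions made for this example.

```python
from typing import Iterable, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical exon record: inclusive coordinates plus strand."""
    start: int
    end: int
    strand: str


def exon_level_accuracy(
    annotated: Iterable[Exon], predicted: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (sensitivity, precision) under exact-match scoring.

    An exon counts as identified exactly only if start, end and strand
    all agree with an annotated exon (an assumption of this sketch).
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    exact = annotated_set & predicted_set

    # Fraction of annotated exons that were predicted exactly.
    sensitivity = len(exact) / len(annotated_set) if annotated_set else 0.0
    # Fraction of predicted exons that are exactly correct.
    precision = len(exact) / len(predicted_set) if predicted_set else 0.0
    return sensitivity, precision


# Made-up coordinates, for illustration only.
annotated = [
    Exon(100, 200, "+"),
    Exon(300, 450, "+"),
    Exon(600, 700, "+"),
    Exon(900, 1000, "+"),
]
predicted = [
    Exon(100, 200, "+"),
    Exon(300, 450, "+"),
    Exon(610, 700, "+"),    # wrong start, so not an exact match
    Exon(900, 1000, "+"),
    Exon(1200, 1300, "+"),  # no annotated counterpart
]

sensitivity, precision = exon_level_accuracy(annotated, predicted)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

In this toy case three of the four annotated exons are matched exactly, so the sensitivity is 0.75, and three of the five predictions are correct, so the precision is 0.60. Scoring on exact boundaries is stricter than nucleotide-level overlap: the exon predicted at 610-700 overlaps the annotated one almost entirely but still counts as a miss.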