A Clean Data Set of EST-confirmed Splice Sites from
Homo sapiens and Standards for Clean-up Procedures
T. A. Thanaraj
Nucleic Acids Research, 27(13):2627-2637 (1999)
Abstract
A clean data set of verified splice sites from Homo sapiens is reported, as well as the
standards used for the clean-up procedure. The sites were validated by: (a) Standard
cleaning procedures such as requiring consistency in the annotation of the gene
structural elements, completeness of the coding regions and elimination of redundant
sequences; (b) Clustering by decision trees coupled with analysis of ClustalW
alignments of the translated protein sequence with homologous proteins from
SWISS-PROT; (c) Matching against human EST sequences. The sites are categorised
as: (i) Donor sites - a set of 619 EST-confirmed donor sites, for 138 of which either
the site itself or the region around it is involved in alternative splice events; (ii)
Acceptor sites - a set of 623 EST-confirmed acceptor sites, for 144 of which either
the site itself or the region around it is involved in alternative splice events; (iii)
Genuine splice sites - a set of 392 splice sites wherein both the donor and acceptor
sites had EST confirmation and were not involved in any alternative splicing; (iv)
Alternative splice sites - a set of 209 splice sites wherein both the donor and acceptor
sites had EST confirmation and the sites or the regions around them were involved in
alternative splicing. A set of nucleotide regions is also reported that can be used to
generate a control set of false splice sites with a high confidence of being
non-functional.
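
The four categories (i)-(iv) overlap: a splice site counted in (iii) or (iv) also contributes its donor to (i) and its acceptor to (ii). The following is a minimal Python sketch of that labelling logic only, not the paper's actual procedure. The `SpliceSite` record, its field names, and the `categorise` function are hypothetical and introduced purely for illustration; the sketch assumes EST confirmation and alternative-splicing involvement have already been determined for each site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpliceSite:
    """One donor/acceptor pair. All field names are hypothetical."""
    donor_est_confirmed: bool
    acceptor_est_confirmed: bool
    # True if the site itself or the region around it is involved
    # in an alternative splice event.
    donor_alternative: bool = False
    acceptor_alternative: bool = False


def categorise(site: SpliceSite) -> set:
    """Return the labels, mirroring categories (i)-(iv), that apply to a site."""
    labels = set()
    if site.donor_est_confirmed:
        labels.add("donor")        # category (i)
    if site.acceptor_est_confirmed:
        labels.add("acceptor")     # category (ii)
    if site.donor_est_confirmed and site.acceptor_est_confirmed:
        if site.donor_alternative or site.acceptor_alternative:
            labels.add("alternative")  # category (iv)
        else:
            labels.add("genuine")      # category (iii)
    return labels


if __name__ == "__main__":
    # A site with both ends EST-confirmed and no alternative splicing.
    print(sorted(categorise(SpliceSite(True, True))))
```

Under these assumptions, a site with only one EST-confirmed end is labelled as a donor or an acceptor but never as genuine or alternative, which matches the requirement in (iii) and (iv) that both ends have EST confirmation.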