Human Molecular Genetics, 2002, Vol. 11, No. 4 451-464
© 2002 Oxford University Press
Categorization and characterization of transcript-confirmed constitutively and alternatively spliced introns and exons from human
Advanced Computation and Modeling Center, University of Queensland, St Lucia, 4072, Australia and European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Received October 26, 2001; Revised and Accepted December 6, 2001.
| ABSTRACT |
|---|
By spliced alignment of human DNA and transcript sequence data we constructed a data set of transcript-confirmed exons and introns from 2793 genes, 796 of which (28%) were seen to have multiple isoforms. We find that over one-third of human exons can translate in more than one frame, and that this is highly correlated with G+C content. Introns containing adenosine at donor site position +3 (A3), rather than guanosine (G3), are more common in low G+C regions, while the converse is true in high G+C regions. These two classes of introns are shown to have distinct lengths, consensus sequences and correlations among splice signals, leading to the hypothesis that A3 donor sites are associated with exon definition, and G3 donor sites with intron definition. Minor classes of introns, including GC-AG, U12-type GT-AG, weak, and putative AG-dependent introns are identified and characterized. Cassette exons are more prevalent in low G+C regions, while exon isoforms are more prevalent in high G+C regions. Cassette exon events outnumber other alternative events, while exon isoform events involve truncation twice as often as extension, and occur at acceptor sites twice as often as at donor sites. Alternative splicing is usually associated with weak splice signals, and in a majority of cases, preserves the coding frame. The reported characteristics of constitutive and alternative splice signals, and the hypotheses offered regarding alternative splicing and genome organization, have important implications for experimental research into RNA processing. The AltExtron data sets are available at http://www.bit.uq.edu.au/altExtron/ and http://www.ebi.ac.uk/~thanaraj/altExtron/.
| INTRODUCTION |
|---|
Although the mechanisms that generate multiple protein isoforms from a single gene occur during transcription, through processing of transcripts, at translation and then post-translation, it has become clear that alternative splicing (during the processing of transcripts) is a major contributor. Alternative splicing can be frequent (1-3) and estimates of alternative splicing in as many as 60% of human genes have been reported [4-7; also see Thanaraj, T.A. (2000) Extent of alternative splicing observed in human genes at http://www.ebi.ac.uk/~thanaraj/gene_altSplice.html]. Alternative splicing can often be specific to tissue type, developmental stage or physiological conditions. Further, it has been estimated that up to 15% of human genetic disease is caused by point mutations that occur at or near splice junctions, where they most likely promote aberrant splicing (8).
Removal of intron sequences from pre-mRNA is carried out by the spliceosome, an RNP complex that contains five snRNA molecules and a large number of snRNA-associated proteins as well as other protein factors. Most introns start with GT and end with AG (so-called GT-AG introns), the well-known exceptions being GC-AG and AT-AC. To date, two types of metazoan spliceosome have been identified, namely an abundant U2-type (named after the U2 snRNA that base pairs with the donor site) and a minor U12-type (named after the U12 snRNA equivalent of U2 snRNA). While the U2-type processes GT-AG (and GC-AG) introns, the U12-type processes the minor AT-AC and the so-called U12-type GT-AG introns (9,10). The spliceosome is assembled upon the pre-mRNA sequence in a stepwise manner through interactions of its RNA and protein components with specific recognition sequences located on the pre-mRNA at the donor splice site, branch point, and the acceptor splice site [which includes a polypyrimidine tract (PPT) and the terminal AG dinucleotide]. These splice signals are composed of short sequences that are recognized in combination. Deviation from consensus can result in an overall decreased affinity for the spliceosome, and such introns and exons may require additional cis-acting RNA sequence elements (called splicing enhancers) for efficient splicing. Alternative splicing is signaled by various weak signals (including regulatory sequences) and, to understand the underlying molecular mechanisms, it is necessary to have a detailed understanding of the primary splicing signals, the correlations between these elements, and the effect on these elements of such global contexts as G+C composition and the phase (or frame) of successive exons in the gene.
In this work we generate and analyze a high quality data set of EST and mRNA confirmed constitutive and alternatively spliced introns and exons from human. The observed alternative events are described as: (i) cryptic (or skipped) exons, where an entire alternative (constitutive) exon is seen in some transcripts but not in others (i.e. a cassette exon); (ii) exon/intron isoforms, where use of alternative donor or acceptor splice sites leads to truncation/extension of exons/introns; and (iii) intron retention. The systematic categorization and characterization of normal and alternative exons and introns presented in the AltExtron data set may be expected to lead to both computationally useful rules for the prediction of multiple gene isoforms, and to the design of experiments that will help elucidate the biochemical mechanisms underlying the regulation of splicing.
| RESULTS |
|---|
Part I: characterization of confirmed introns and exons
As described in Materials and Methods, genes and transcripts extracted from GenBank (release 117) (11) were aligned under stringent criteria. These alignments were examined to identify transcript-confirmed introns and exons. For an intron to be considered as confirmed, consistently positioned GT-AG or GC-AG terminal dinucleotides must be observed in the spliced alignment (Fig. 1). An exon is considered as confirmed when a match between a transcript and a gene is flanked on both sides by confirmed introns. The alignments allow the identification of 16 269 introns and 11 812 exons in 2793 genes. In this section we examine some gross properties of the intron and exon data sets, including the translatability of the exons, the distribution of intron phase, the identification and classification of splice signals, and in particular, the relationship between these attributes and G+C content.
The distribution of G+C content in the 2793 genes in our data set, with few exceptions, ranged from 30 to 70%, and appeared weakly bimodal, with one mode at 42% and the other at 54%. For descriptive convenience we partition the data set into genes with G+C contents of ≥49% (GC-high) and those with G+C contents of <49% (GC-low). It is known that genes in regions of low G+C content tend to be longer than those in high G+C regions (12), and we observe this to be the case in our data set (median lengths of 10 458 nt in the GC-low partition and 3284 nt in the GC-high partition). We observed that the exons in these two groups have similar mean lengths at 142 and 137 nt, respectively, and that it is the lengths of the introns that differ (median length of 1597 nt in the GC-low partition and 462 nt in the GC-high partition).
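The partition described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, ambiguous bases are simply ignored, and assigning a gene at exactly 49% to the GC-high group is an assumption rather than something the text states explicitly.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(b in "GC" for b in bases) / len(bases)


def gc_partition(seq, threshold=0.49):
    """Label a gene sequence GC-high or GC-low.

    The 49% threshold is taken from the text; treating the boundary value
    itself as GC-high is an assumption made for this sketch.
    """
    return "GC-high" if gc_fraction(seq) >= threshold else "GC-low"
```

For example, `gc_partition("GGCCAT")` gives `"GC-high"` (four of six bases are G or C), while `gc_partition("ATATGC")` gives `"GC-low"`.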
Translatability in multiple frames
We examined the ability of the exons in our data set to translate (without stop codons) in each of the three possible sense frames. Considering only the exons observed within annotated coding sequence (CDS) we found that 36.2% of them can code in more than one frame, with 6.2% coding in all three frames; we call exons that can code in more than one frame multi-frame. For exons that code in two frames, we find a preference for the alternative frame to have the first codon position in the middle position of the constitutive frame codon (57 to 43%). It is further observed that the occurrence of exons that code in more than one frame is highly correlated with G+C content, and is well described by the linear model shown in Figure 2. This observation is explicable, at least in part, by the fact that the stop codons (TAG, TAA and TGA) are themselves A+T-rich, and are thus less likely to occur by chance as G+C content increases (Fig. 2, insert). The multi-frame exons tend to be of shorter than average length, with a mean length of 108 nt for those that translate in two frames and 82 nt for those that translate in all three frames.
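As a rough illustration of the multi-frame test, the sketch below checks each of the three sense frames of an exon for in-frame stop codons. The function names are hypothetical, and the treatment of exon ends is a simplification: incomplete codons at either end are ignored, whereas in a real transcript they would be completed by the neighbouring exons.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def open_frames(exon):
    """Return the frame offsets (0, 1, 2) in which exon has no stop codon.

    Only complete codons inside the exon are examined; partial codons at
    the exon ends are skipped (a simplification for this sketch).
    """
    exon = exon.upper()
    frames = []
    for offset in range(3):
        codons = (exon[i:i + 3] for i in range(offset, len(exon) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(offset)
    return frames


def is_multi_frame(exon):
    """True when the exon is free of stop codons in more than one frame."""
    return len(open_frames(exon)) > 1
```

For instance, `open_frames("ATGTAAGC")` returns `[1, 2]`, because frame 0 contains the stop codon TAA, so this toy exon counts as multi-frame even though it cannot be read in frame 0.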
We examined two null models of exon translatability, one based on random association of random codons (excluding stop codons), and another with codon usage weighted to reflect observed codon usage in human sequence (13). These models were integrated over the exon length distribution observed for coding exons in the data set, and both gave expected values of 28% for the number of multi-frame exons. The 36% of exons observed to be multi-frame might represent a modest increase on that expected by chance alone. Further, the following analysis demonstrates that the manner in which multi-frame exons are associated is not a product of chance alone. We identified cases where a confirmed intron was flanked at both ends by confirmed exons that were within annotated CDS (8774 cases), and from this set we isolated those instances where both exons translated in precisely two of the three possible frames (934 cases). Of these we observe a 534/400 split between those exon pairs where the alternative frames match, and those where they do not. We interpret this to indicate that some alternative frames either do, or have in the past, code(d) for protein. Note that the preference for a second coding frame to have codon position one in the middle base of the constitutive frame codon does create a small bias; however this bias can only explain an expected split of 476/459 (and so we arrive at a χ² value of 14.6, making the above observation on the matching of alternative frames significant at a level >99.9%).
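The test statistic quoted above can be checked with a short calculation. The sketch below uses the rounded observed and expected counts given in the text; because the expected counts are rounded, the result (about 14.65) agrees with the reported value only to within rounding. The helper name is hypothetical, and 10.828 is the standard critical value of the chi-square distribution with one degree of freedom at the 0.001 level.

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic for matched lists of counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed split (alternative frames match / do not match) and the split
# expected from the codon-position bias alone, both as quoted in the text.
statistic = chi_square([534, 400], [476, 459])

# Critical value for 1 degree of freedom at p = 0.001.
CRITICAL_0_001 = 10.828
significant = statistic > CRITICAL_0_001
```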
Analysis of intron phase
An intron contained within CDS is said to have a phase of zero if the intron demarcates a codon boundary, a phase of one if it divides a codon between the first and second nucleotides, and a phase of two if the intron divides a codon between the second and third nucleotides. The position of an exon with respect to the codon positions can be defined by the phases of the upstream and downstream flanking introns, and when an exon is flanked by introns of the same phase it will be a multiple of three nucleotides in length; we call such exons modular. The intron phase distribution in our data set is 46% phase 0, 32% phase 1 and 22% phase 2, with the association of intron phases across exons such that 40% of exons are modular, and 23% contain a whole number of whole codons (Supplementary Material). These results are in agreement with previous reports (14-19). Also, nucleotide biases in the three codon positions act to modulate the consensus of splice sites at the exon nucleotides in a phase specific manner, and this results in phase zero introns having better consensus at the exon positions of both donor and acceptor splice sites (Supplementary Material).
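The relationship between intron phase and exon length can be made concrete with a small sketch. The function names are hypothetical; the arithmetic follows directly from the phase definition above, since an exon preceded by a phase p intron begins with the remaining 3 - p nucleotides of a split codon (or with a whole codon when p is 0).

```python
def downstream_phase(upstream_phase, exon_length):
    """Phase of the intron that follows an internal coding exon."""
    return (upstream_phase + exon_length) % 3


def is_modular(upstream_phase, exon_length):
    """True when both flanking introns share a phase (length is 3n)."""
    return downstream_phase(upstream_phase, exon_length) == upstream_phase


def has_whole_codons_only(upstream_phase, exon_length):
    """True when the exon holds a whole number of whole codons."""
    return upstream_phase == 0 and exon_length % 3 == 0
```

A 120 nt exon after a phase 1 intron is followed by a phase 1 intron, so it is modular but still begins and ends inside a codon, whereas a 100 nt exon after a phase 0 intron is followed by a phase 1 intron.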
Classification of introns by donor splice site
Donor splice sites must have a certain level of consensus in order to be recognized effectively by the splicing machinery. In order to examine and classify the introns in our data set, we scored the individual donor (and acceptor) sites using a weight matrix approach, as described in Materials and Methods. This results in an information content score (units of bits) for each donor splice site, which is essentially a measure of the binding affinity for U1 snRNA. The GT donor sites had an average strength of 8.2 bits with a standard deviation of 2.3 bits, while the GC donor sites had an average strength of 8.8 bits with a standard deviation of 4.5 bits (note that all sites score 4 bits for the presence of the compulsory dinucleotide). The scoring of all acceptor sites resulted in a mean score of 9.3 bits with a standard deviation of 3.3 bits. Motivated by the discussion in Rogan et al. (20) we decided to classify donor sites scoring less than 3 bits as weak and to examine them further. There are 294 introns containing such weak donor sites. Some of these donor sequences displayed the consensus sequence for U12-type introns, but with GT and AG as start and end dinucleotides instead of AT and AC (so-called U12-type GT-AG introns) (10,21). We re-classified as U12-type GT-AG introns those weak introns that had at least a 5/7 match with donor positions +3 to +9 of the U12 donor consensus (ATCCTTT). This resulted in 60 introns being classified as U12, 45 of which become our curated set after manual examination of branch point signals (BPSs) (against the consensus sequence TTCCTTRaCTCR) (21,22). It is noted that only three of these 45 are described in the compilation of U12 introns given by Burge et al. (21) and Wu and Krainer (23). We also comment that only 10 AT-AC introns were observed in the construction of the data set; these were not included in the analysis, although they have been listed within the AltExtron data sets.
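The 5/7 matching rule for candidate U12-type GT-AG donors can be sketched as follows. The names are hypothetical, the intron is assumed to be given as a DNA string whose first character is donor position +1, and the sketch covers only the consensus match: in the analysis described above the rule was applied only to donors already scored as weak (<3 bits), and the branch point signals were then examined manually.

```python
U12_DONOR_CONSENSUS = "ATCCTTT"  # donor positions +3 to +9, from the text


def u12_donor_matches(intron):
    """Count matches to the U12 consensus at donor positions +3 to +9.

    intron[0] is taken to be position +1, so positions +3..+9 are the
    slice intron[2:9].
    """
    window = intron[2:9].upper()
    if len(window) < len(U12_DONOR_CONSENSUS):
        raise ValueError("intron is shorter than 9 nt")
    return sum(a == b for a, b in zip(window, U12_DONOR_CONSENSUS))


def is_u12_candidate(intron, min_matches=5):
    """Apply the at-least-5-of-7 rule to a single donor site."""
    return u12_donor_matches(intron) >= min_matches
```

With this sketch, a donor beginning `GTATCCTTT` matches at all seven positions, while a typical U2-type donor beginning `GTAAGTATG` matches at only two and is not a candidate.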
The remaining GT-AG introns were partitioned into three groups based on the nucleotide present at donor position +3, giving the G3 group (guanosine), the A3 group (adenosine) and the Y3 group (cytosine or thymidine/uridine). Thus we arrive at the categorization of donor splice sites shown in Table 1. It is of immediate note that the occurrence of G3 and A3 type introns is highly correlated with gene G+C content, and we find that this relationship is well described by a linear model, as shown in Figure 3. The correlations shown in the Figure highlight the pervasive manner in which G+C content can affect the distribution of splice signals. In the case of the division between the A3 and G3 intron groups (24), it is the case that very different average intron lengths are observed for these two groups although, within a narrow G+C band, the average lengths are very similar.
Sequence logos (derived using the RNA Structure Logo program) (25) showing, for each group, the consensus nucleotides around the donor and acceptor sites are given in Figure 4. This categorization, in which the major groups are defined on the basis of the +3 donor position, highlights the interplay between donor positions -1, +3 and +5. The consensus of G at +5 is much stronger in the G3 introns than in the A3 introns. In a similar manner, the positions +5 and -1 show particularly strong consensus in the Y3 introns; previous work (26) using decision tree models also identified -1 and +5 as involved in one of the two most prominent long-range dinucleotide correlations (further, these positions have been identified as maximal discriminant candidates in the analysis of donor sites) (22).
Consensus at acceptor site nucleotide positions
All of the acceptor splice sites in the data set have AG as the final two nucleotides of the intron, leaving only the positions -3, +1 and +2 for study. The G3, A3, Y3 and weak groups all display similar qualitative consensus with 95% C/T at acceptor position -3, 50% G at position +1 and 55-61% C/T at position +2 (Table 1). These values fluctuate considerably with G+C content, in particular acceptor position -3 where the C/T division is 82/15 for the GC-high and 56/36 for the GC-low partitions, and this is partially reflected in the C/T division for the G3 versus A3 categories.
Correlations between the strength of donor and acceptor splice sites
Examination of possible correlations between the strengths of donor and acceptor sites reveals that there is, overall, a compensatory (negative) correlation between the strengths of donor and acceptor sites across exons (correlation coefficient, r = -0.034), whereas there is a positive correlation across introns (r = 0.051), as has also been noted by Zhang (27). Although these correlation values are small, the size of the data set makes them significant. These correlations are further classified in terms of exon and intron lengths and categories, and this is shown in Figure 5. While these plots allow only a crude representation of the complex interplay between exon/intron length, donor splice site type, and the splice site strengths, it is clear that the compensatory relationship between splice sites across exons is a particular property of A3 exons (r = -0.051), and not of G3 exons (r = 0.0001). A compensatory relationship across exons is to be expected on the basis of the exon definition model of splicing (27,28), and so this division is of particular interest.
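The remark that small correlations become significant in a large data set can be illustrated with the usual t statistic for a Pearson correlation, t = r·sqrt((n − 2)/(1 − r²)). The sketch below is only indicative: it assumes that the numbers of site pairs equal the total numbers of confirmed introns (16 269) and exons (11 812), which the text does not state, and it takes the exon correlation as negative, as the compensatory relationship implies.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing a Pearson correlation r from n pairs."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Sample sizes are assumptions: every confirmed intron or exon is taken
# to contribute one donor/acceptor pair.
t_introns = correlation_t_statistic(0.051, 16269)
t_exons = correlation_t_statistic(-0.034, 11812)

# For samples this large the t distribution is close to normal, where
# |t| > 3.29 corresponds to a two-sided p-value below 0.001.
both_significant = abs(t_introns) > 3.29 and abs(t_exons) > 3.29
```

Under these assumptions the statistics come out at roughly 6.5 for introns and -3.7 for exons, so both correlations would be distinguishable from zero despite their small size.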
Other important features that emerge (Fig. 5) are: (i) the positive correlation across introns is only observed for long introns (the 2440 nt and greater group), (ii) smaller exons (length <88 nt) tend to have stronger donor and acceptor sites (this is in agreement with other observations related to exon definition models in that exon size and splice site strength are additive factors in exon recognition) (28,29); and (iii) longer introns tend to have stronger donor and acceptor sites; it is known that juxtaposition of exons across long introns is a formidable problem (28) and that splicing efficiency is inversely related to intron length (30).
Analysis of PPTs and BPSs
We searched for PPTs and U2 BPSs in the terminal 100 nt of each intron. On the basis of the definitions given in Materials and Methods, we observe 73% of the introns in the data set to have a strong PPT and 39% to have a strong BPS (with 28% having both and 15% having neither). The GC-AG and U12 categories show a significant reduction in the number of strong PPTs observed (Table 1); however, U12 introns are known to lack PPTs (21-23), and the GC-AG introns possess strong PPTs in the constitutive introns and tend to have poor PPTs in the alternative cases (31).
Examination of the positional distribution of candidate U2 branch points (Supplementary Material) reveals 246 cases of a candidate BPS at position -4, and while this can be explained in part by base composition near the acceptor splice site, this is not a sufficient explanation. Examination of the BPS-PPT data also revealed a number of interesting facts. (i) In terms of G+C content, we observed in the GC-low partition that 75% of introns had strong PPTs and 45% had a strong BPS, compared to 72% and 34%, respectively, in the GC-high partition. This indicates that the presence of a strong PPT (as we have defined it) is not strongly affected by G+C content, whereas the converse may be said for BPSs. Further, the percentages of C/T in the strong PPTs were seen to be 27/58 in the GC-low partition and 49/38 in the GC-high partition. (ii) The exon nucleotides flanking an intron display a compensatory relationship with the BPS-PPT arrangement: for the G3 and A3 groups there are increases of between 5 and 10% in the consensus nucleotides at donor positions -3 to -1 and acceptor positions +1 and +2, from those introns with a strong PPT and BPS to those with neither. For the Y3 and weak groups there are similar compensatory increases of between 10 and 20% at acceptor positions +1 and +2. (iii) There were 2461 cases (36% of the strong BPSs) where the branch point adenosine was located 1 nucleotide upstream of a candidate PPT.
Part II: characterization of observed alternative splicing
To produce a data set of transcript-confirmed alternative exons and introns requires that they be both confirmed and alternative. An alternative exon or intron is one that overlaps with another confirmed exon or intron. In this section we detail the confirmed alternative events that were observed in the data set. At a global level, we report that, while the overall distribution of G+C contents for genes with alternative isoforms does not differ markedly from that of the entire gene data set, a breakdown of the splicing events does reveal an interesting bias. If we examine cassette exon events as one group, and exon isoform events as another group, we find that the former have a tendency to occur in genes with low G+C content, while the latter have a tendency to occur in genes with high G+C content (Fig. 6).
The nomenclature for describing alternative splicing events used in this paper extends on that conventionally used (32,33); in particular, (i) we categorize cassette exons as skipped or cryptic depending on their presence or absence in the constitutive form, and (ii) we categorize alternative splice site events as exon extension or truncation events. Also, it is a useful convention, and one that we follow here, for the terms extension and truncation to refer to the effect on the exon, and hence the mRNA, even when discussing intron isoforms. Further, because of the fact that our analysis swaps between examining exons and introns, it is desirable to refer to donor and acceptor splice sites, and thus avoid the confusion that can arise through the specification of splice sites as 5' or 3'. We find that skipped exon observations outnumber other events with 538 distinct skipped exons, 166 distinct cryptic exons, 351 exon extension/truncation isoforms and 57 retained introns.
Level of alternative splicing
In terms of a comparison to annotation we observe that 5% of the transcript data describe 2050 unannotated introns in 1045 genes (37%), and we note that groups of putative alternative events follow a consistent pattern of transcript coverage, with 40% being supported by two or more transcripts and 10% by five transcripts or more. This contrasts with annotated events, which show corresponding values of 75% and 30%, and this is consistent with the alternative forms representing minor isoforms.
In terms of confirmed alternative events, where transcript-confirmed introns and exons overlap, we observe 796 genes (28%) to have alternative isoforms. This compares with published estimates of alternative splicing of 35% (1), 38% (2) and 22% (3), but contrasts with the high-end estimates (4-7). Our low-end figure is a consequence, at least in part, of the stringent criteria used whereby only those alternative events that could be unambiguously assigned have been included.
Cassette exons: cryptic and skipped exons
It is useful to differentiate between cassette exons that are skipped (a skipped exon being a constitutive exon that is skipped in the alternative form) and those that are cryptic (a cryptic exon being one that is absent in the constitutive form). For the practical purpose of making this distinction we utilize the annotated gene structure, and define skipped and cryptic exons in terms of the covering intron. If a covering intron incorporates an entire annotated exon, any covered exon is considered to have been skipped, otherwise it is cryptic (this definition does lead to some cases of double counting when a cassette exon is covered by introns of both types). The numbers of cryptic and skipped exons, including the transcript coverage statistics, are given in Table 2, and it is seen that the coverage statistics are completely consistent with the defined cryptic exons being absent, and skipped exons being present, in the constitutive isoform.
The observed cryptic and skipped exon isoforms are classified as simple, complex or alternating. A cassette exon is deemed as simple when there is no change in the use of the flanking splice sites, as complex when one or both of the flanking exons utilizes a different inner splice site, or as alternating when transcripts contain one of two consecutive exons, and this classification is detailed in Table 3. Further, cassette exons are seen to be of below average length (more so for cryptic exons than skipped exons), to be modular in 48% of cases, which is a modest but significant increase on the overall figure of 40%, while overall the events are seen to preserve phase in a little over one-half of instances. For cases where the frame is broken, and for both cryptic and skipped exons, the numbers of multi-frame exons observed as cassette exons and flanking cassette exons was close to that expected by chance alone (given the linear model of multi-frame exon distribution given in Figure 2).
The average strengths of the splice sites involved in cryptic and skipped exon events are shown in Table 4. Cassette exons in general have splice sites of below average strength, with the splice sites of cryptic exons weaker than those of skipped exons. A breakdown of the types of donor splice sites used in different types of alternative splicing is given in Table 5, and examination of the cryptic and skipped exons and their covering introns reveals an avoidance of G3 type donor splice sites associated with cryptic exons, and also with the covering introns for skipped exons. This observation is related to G+C content biases associated with different types of alternative splice events (Fig. 6); however, the fact that the skipped exons and the covering introns of cryptic exons do not demonstrate a similar bias indicates that G+C content alone is not a sufficient explanation.
Finally we observe that the phase distributions for the introns associated with cassette exons are close to that observed overall, apart from the introns downstream flanking cryptic exons, where we observe a phase distribution split of 33/36/31%, indicating fewer than expected exons with end phase 0, and an increase in end phase 2 exons (this observation is statistically significant at the 99% level with a χ² value of 11.1).
Exon and intron isoforms
Exon isoforms are created by extension or truncation events occurring at one or both ends of an exon, and can be observed in the data set through both the comparison of confirmed exons, and through comparison of confirmed introns. Comparison of confirmed exons leads to the identification of 247 isoforms, while for intron comparison we observe 351 isoforms. In both cases we observe twice as many alternative splicing events occurring at the acceptor splice site (in agreement with previous studies) (1) as we do at the donor splice site, and over twice as many truncations of annotated forms as we do extensions. Only nine of the intron comparisons and four of the exon comparisons (about 2%) show isoforms using alternative splice sites at both ends. Given the larger size of the isoform data set derived from intron comparison, and the limited potential for analysis of frame preservation with the overlapping exon data, we use the intron overlap data in the analysis presented below.
A breakdown of the intron isoform events is given in Table 6, and it is seen that: (i) over one-third of the events observed involve 10 nt or less (with over half being 30 nt or less), and many of these are small acceptor site truncations; (ii) almost two-thirds (61%) preserve the frame of the transcript; and (iii) there is a significant over-representation of phase one introns. The average strengths of the splice sites involved in donor and acceptor truncation and extension events are shown in Table 4, and it can be seen that the constitutive splice sites tend to have below average strengths, while the alternative splice sites tend to be weaker still. It is of particular note that the acceptor site truncation events are both the largest group, and show the greatest difference in consensus between the constitutive and alternative form. A breakdown of intron groups used in isoform events reveals in particular an over-representation of G3 introns in isoforms involving alternative acceptor site usage (Table 5).
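The frame-preservation criterion used for intron isoforms reduces to simple arithmetic: moving a splice site by a number of nucleotides that is a multiple of three leaves the downstream reading frame unchanged. The sketch below uses hypothetical function names and a made-up list of shifts purely for illustration; it does not reproduce the data in Table 6.

```python
def preserves_frame(shift_nt):
    """True when moving a splice site by shift_nt keeps the reading frame.

    shift_nt may be positive (extension) or negative (truncation).
    """
    return shift_nt % 3 == 0


def frame_preserving_fraction(shifts):
    """Fraction of isoform events whose shift is a multiple of three."""
    if not shifts:
        raise ValueError("no events supplied")
    return sum(preserves_frame(s) for s in shifts) / len(shifts)


# Illustrative values only; these are not taken from the data set.
example_shifts = [3, -4, 6, 10]
example_fraction = frame_preserving_fraction(example_shifts)
```

In this toy list, the shifts of 3 and 6 nt preserve the frame while those of -4 and 10 nt do not, giving a fraction of 0.5.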
In the case of the small modifications (10 nt or less), only 59% of the normal forms and 15% of the alternative forms have a strong PPT, while 40% and 41%, respectively, have strong BPSs. While the reduction in strong PPTs observed for the alternative forms may be attributed to the definition that we have used (for a PPT to be defined as strong, it must extend to within 5 nt of the AG), it does indicate that these events are utilizing the same BPS and PPT, with only the acceptor dinucleotide AG being alternative. This further suggests that variation of the distance between the BPS and the alternative AGs plays a dominant role in determining the competitiveness of the nearby AGs. This is further substantiated by the observation that there is no discernable consensus at acceptor positions +1 and +2 for either the normal or alternative small acceptor site isoforms.
As for the cryptic and skipped exons, we observe that close to the expected number of multi-frame exons flank the intron isoform events; however, the proportion of downstream exons that code after frame shifts is higher than expected. Specifically, we observe 20 cases where a frame breaking intron isoform is followed by an exon that can translate in exactly two of the three possible frames, and of these we observe 16 cases that translate in the new frame, and four that do not (the null hypothesis would expect an even split).
Intron retention
We observe 57 intron retention events within 38 confirmed exons (there are cases of multiple intron retention). We find that the observed whole intron retention events are biased towards genes with high G+C contents, and that the splice sites of the exons tend to be of above average strength, while the splice sites of the retained introns tend to be of below average strength. Also, only five of these exons were capable of translation (in any one of the three sense frames) without introducing a stop codon.
| DISCUSSION |
|---|
We have constructed a data set of 16 269 confirmed introns and 11 812 confirmed exons from within 2793 human genes. Many of these introns and exons are seen to be incompatible with a single intron-exon gene structure and therefore represent alternative splicing.
The intron groups
The confirmed introns were classified into one of six groups on the basis of the donor splice site (Table 1). This results in 93% of the GT-AG introns being categorized as having acceptable donor strength (≥3 bits), and with either adenosine (53%) or guanosine (40%) in the third donor position (and hence the labels A3 and G3 for these groups). The G3 introns were observed to have a much stronger consensus at donor positions +4 and +5 than the A3 group (Fig. 4), as well as being, on average, substantially shorter. They are over-represented in events involving alternative acceptor site usage, and under-represented in events involving cryptic and skipped exons (Table 5). These observations can, in part, be explained in terms of G+C content biases.
A further 4.9% of GT-AG introns had acceptable overall donor strength, but with a cytosine or uracil at donor position +3 (the Y3 group), and these introns were seen to have greatly increased consensus at the donor exon positions (-1 and -2), as well as a strong consensus at +5. There remained 1.4% of GT-AG (U2-type) introns with very weak donor signals (the weak group), and recognition of their donor sites probably requires the aid of regulatory sequences. It is of note that this group has the shortest average intron length. Finally, 1.1% of the introns were GC-AG introns, and 0.4% were GT-AG U12-type introns (the U12 group). These minor intron categories, particularly the U12 and weak groups, display atypical phase distributions (Table 1).
Rare introns
GC-AG introns. These introns [which are discussed in some detail by Thanaraj and Clark (31) and of which 60% are alternative forms] show a very strong donor consensus for the constitutive forms, while many of the alternative forms show very weak donor consensus but strong consensus at acceptor exon positions. It is interesting to note that GC-AG introns are equally distributed between the high and low GC partitions, indicating that compositional bias is not responsible for the choice of these minor introns. This fact, combined with their high average length and prevalence in alternative forms, gives weight to the suggestion that these forms are highly regulated. We further observe that the genes involved are associated with a wide spectrum of diseases, including Batten disease, leukemia and Duncan's disease (31) (http://www.ebi.ac.uk/~thanaraj/gcag/).
Putative U12-type GT-AG introns. This group consisted of 60 introns with a donor consensus strongly indicative of U12-type GT-AG introns, 45 of which also have clearly identifiable U12-type BPSs. These 45 U12-type GT-AG introns showed a consensus of [CT][TA] at the donor exon positions -2 and -1 and ATAT at the acceptor exon positions (+1 to +4), and such sequences are very different from those observed for U2-type GT-AG introns (AG|gt...ag|GT). Such consensus sequences conflict with the conventional role of U5 snRNA (which tethers the two exons for ligation) and thus it would be interesting to know which component of the U12-spliceosome performs this function. Inspection of the genes containing such introns revealed that they code for biochemically important proteins, including kinases, ion carrier proteins, epilepsy holoprosencephaly candidate protein, stress activated protein kinase, Lowe's oculocerebrorenal syndrome protein, anti-inflammatory drug binding protein, RNA polymerase and mismatch repair protein (AltExtron data sets). A compilation of related U12-type AT-AC introns can be found in the AltExtron data sets and in Burge et al. (21) and Wu and Krainer (23).
Putative AG-dependent introns. We have identified a set of 280 putative AG-dependent introns that have no apparent polypyrimidine sequence. These are interesting candidates for the study of AG-dependent definition of the branch point in the second step of splicing.
Exons/introns of unusual lengths. Given further as individual data sets are (i) 572 exons that are of length <50 nt (including 57 <25 nt), (ii) 177 exons that are of length >400 nt and (iii) 48 introns that are of length <70 nt (including 23 <50 nt). These data sets may be of use to study 'intron definition' versus 'exon definition' models.
Exon modularity, translatability and protein evolution
It is reasonable to propose that the observed phase and translatability characteristics of exons provide an underlying structure that is important for facilitating gene evolution through changes in the pattern of splicing. In particular, the phase distribution results show an overall preference for modular exons (multiple of three nucleotides in length), and it has been demonstrated in the literature (14,15,19 and references therein) that such a distribution can be accounted for, in part, by exon duplication and shuffling events. Such events are part of the evolutionary process by which genes and proteins evolve. Since alternative splicing is a cellular mechanism for the generation of protein variants, it is possible that the observed phase distribution may reflect the evolution of both constitutive and alternative splicing. This proposition is supported by our observation that 48-50% of skipped/cryptic exons are modular (Table 3), a significant increase over the 40% observed for exons in general. Further, over one-third of exons can translate in more than one frame, with the presence of so many multi-frame exons made possible by the comparatively short exons in human (and vertebrates in general). We have also observed a tendency for alternative frames of translation to persist across consecutive exons.
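The two exon properties discussed here, modularity and multi-frame translatability, can be illustrated with a minimal Python sketch. It assumes that an exon "can translate" in a frame when that frame contains no in-frame stop codon; this criterion, and the function names `open_frames` and `is_modular`, are illustrative assumptions rather than the procedure used to build the data set.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(exon: str) -> list[int]:
    """Return the frame offsets (0, 1, 2) that contain no in-frame stop codon.

    Assumption: a frame is 'translatable' if it is free of stop codons.
    """
    seq = exon.upper().replace("U", "T")
    frames = []
    for offset in range(3):
        # Walk complete codons only, starting at this frame offset.
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(offset)
    return frames


def is_modular(exon: str) -> bool:
    """A modular exon has a length that is a multiple of three nucleotides."""
    return len(exon) % 3 == 0


exon = "ATGGCCTAAGGC"
print(open_frames(exon), is_modular(exon))
```

For the 12 nt example, frame 0 contains TAA and is closed, while frames 1 and 2 remain open, so the exon would count as a multi-frame exon under this assumed criterion.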
Pellegrini and Yeates (34) have shown, using amino acid substitution matrices for alternate frames of translation, a weak but measurable relatedness between the protein sequences of distinct protein families and have proposed that some proteins may have evolved from others through changes in the frame of translation. Alternative splicing is one mechanism through which changes in the frame of translation can be achieved, and it appears that some of the alternative splicing events that introduce frame shifts utilize an alternative translation of the downstream exon(s). As it is known that these possibilities are utilized in some instances [for example, the rat heparin-binding protein FGFR2 (35), β-adducin from human bone marrow (36), and plasma membrane Ca2+-pumping ATPase 4 PMCA4 (37)], the question that requires answering is that of the extent to which translation in multiple frames occurs in alternative isoforms.
Relationship between donor and acceptor splice signals and intron/exon lengths
Our observation of a compensatory relationship in the strength of splice signals across exons (Fig. 5) is consistent with the exon definition model of splicing, which holds that identification of splice sites is coupled (28), and that in cases where the introns tend to be much longer than the exons, this coupled identification occurs across exons. Experimental evidence indicates that both intron and exon definition occurs in vertebrates (38), and thus it is of particular interest that we observe this compensatory relationship to be a characteristic of A3 exons, and not of G3 exons. It is the case that A3 exons tend to be co-incident with both low G+C content and with long introns, while the converse is the case for G3 exons. We hypothesize that A3 and G3 donor splice sites are functionally different, and that A3 donor sites are associated with exon definition, while G3 donor sites are associated with intron definition. It is thus of interest to note the bias away from the use of G3 donor sites in the introns covering skipped exons, and in the introns immediately downstream of cryptic exons (Table 5). It is also interesting to note that the donor sites of exons with acceptor site truncations have a tendency, according to this hypothesis, to be defined by intron definition.
G+C content
While the overall G+C content of the human genome is 41% (5), one-half of the genes observed in this analysis have G+C contents >49%. It has been reported that relative gene density increases more than 10-fold as G+C content increases from 30 to 50% (5). That the presence of genes is so heavily biased towards regions of comparatively high G+C content is intriguing, especially in the light of our observation that these regions have a large percentage of exons able to translate in multiple frames (Fig. 2), as well as smaller introns, and hence genes. It can be said that high G+C regions have a high coding potential.
The reported observations on individual relationships between splice signals and G+C content demonstrate the pervasive effect of base composition on regulatory signals. It might be that the base composition of cis-acting elements in general correlates with G+C content and other regional base composition properties, and that this introduces a degree of order into regulatory organization. It is interesting to combine this speculation with the commonly held view, and one that is supported by the data presented in this paper, that alternative isoforms have weak constitutive signals allowing regulation by other cis-acting regulatory elements, and to consider the particular observation that the cassette exons occur more often in genes with low G+C content, while exon isoforms occur more often in genes with high G+C content. Thus we offer the hypothesis that the regulatory elements that regulate cryptic and skipped exons are A+T-rich, whereas those that regulate exon isoform events are, overall, G+C-rich. It has conventionally been thought that such regulatory elements are either purine-rich [recognized through serine/arginine-rich (SR) proteins] or pyrimidine-rich (through hnRNP or polypyrimidine binding proteins); however, there is an emerging view that G+C-rich (or A+T-rich) motifs are also quite possible. For example, when Stamm et al. (32) consider all the regulatory motifs that they identify in subgroups of tissue-specific alternatively spliced exons, they find the G+C composition is 61%. It is also known that there is a high degree of degeneracy in motifs recognized by SR proteins and the consensus for individual SR proteins can be skewed to a high G+C content (39).
Alternative splicing/alternative forms
We have demonstrated that constitutive cassette exons that are omitted in a minor alternative form (skipped exons) have different properties to cassette exons that are absent in the constitutive form (cryptic exons). These properties include (i) the avoidance of G3 donor sites in cryptic exons, and in the introns covering skipped exons (but not in the skipped exons and the introns covering cryptic exons) (Table 5) and (ii) the transcript coverage for cryptic exons and the introns covering skipped exons is typical of alternative events, whereas the coverage of skipped exons and the introns covering cryptic exons is typical of constitutive events (Table 2).
The splice sites utilized in alternative gene structures are in general of below average strength and differential splice strength is seen to be a fundamental factor in alternative splicing (Table 4). In the case of cryptic and skipped exons, the splice sites are seen to be weak, both in comparison to the covering introns and the data set overall (with an equivalent observation for the retained intron events). In the case of the exon truncation/extension events, the constitutive splice sites were seen to be weak, with the alternative sites being even weaker.
Donor sites from cryptic exons and the three snRNA molecules that interact at the donor site. Examination of the splice sites from cryptic exons reveals a number of distinctive features (data not shown). First, donor position +3 shows a stronger consensus overall to A (up by 7% to 63%), which enables Watson-Crick type (WC) base pairing at that position with U1 snRNA. Secondly, there is a drop at +4 of 13% to 56%, leading to less WC base pairing with U1 snRNA and probably maximizing for base pairing with U6 (U1 snRNA requires A at donor position +4 while U6 snRNA requires U). Also observed is a 12% increase (to 41%) in A at donor position -4 (this would enable WC base pairing with U5 snRNA). This modulation of the binding efficiency of the three snRNAs at the donor site is consistent with their order of binding to the donor region on the pre-mRNA; the binding of U1 snRNA at the donor exon and intron positions is replaced in the latter part of the first stage of splicing by the binding of U6 snRNA with the intron nucleotides and U5 snRNA with the exon nucleotides. In addition, it is known that the relative levels of binding of the U6 and U5 snRNA molecules can determine the choice between nearby donor sites (40) as supported by the following observations: (i) U1 snRNAs stimulated U6-defined proximal donor sites more efficiently than distal ones, and thus the extent of U6 snRNA base pairing to alternative donor sites can determine preferences if there is a sequence nearby to which U1 snRNA can base pair (41) and (ii) there are circumstances in which the base pairing of U5 snRNA can affect alternative donor site preferences (42).
Acceptor site truncations and the scanning process for AG selection. It was observed that acceptor site truncations are the most common form of exon isoform event and, further, that nearly half of such truncation events are by <10 nt. It is reasonable to propose that, in the case of small acceptor site truncations, a single branch point and PPT are shared by the normal and alternative forms, and that regulation between these forms is achieved by modulation of the 'scanning mechanism' that has been proposed for the selection of AG (43,44). It is also the case that when two AG dinucleotides are located downstream of the BPS, and are themselves separated by <6 nt, the AG closest to the BPS is initially recognized, but the downstream AG may ultimately be utilized in the splicing reaction (45,46).
Concluding remarks
We have presented an analysis of transcript-confirmed introns and exons (the AltExtron data set), in which specific groups of introns and exons have been identified, and the properties of these groups, particularly with regard to the splicing signals, have been examined. We have illustrated variations in these properties between constitutive and alternative exon-intron structures, and analyzed the associations of both splicing signals and the types of alternative splicing with global contexts, including G+C content and intron phase. The effect of alternative events on the coding frame of transcripts has been determined and quantified. The modularity and translatability of human exons has been assessed and found, at least in part, to facilitate alternative splicing. The AltExtron data set is presented as a number of flat files, tables and lists, and is available at http://www.bit.uq.edu.au/altExtron/ and http://www.ebi.ac.uk/~thanaraj/altExtron/. Finally, we have presented numerous observations, and some specific hypotheses, that are now best addressed by direct experimental research.
| MATERIALS AND METHODS |
|---|
Gene and transcript data
The data set of genes and transcripts used in this work is based on GenBank (11) release 117, from which we extracted 5468 intron-containing human gene entries, 1 844 703 human EST sequences and 45 499 human mRNA sequences. In cases where a gene had alternative annotated gene structures, one was arbitrarily chosen. From this data set of gene sequences we removed: (i) any duplicate gene or gene fragment based on 99% un-gapped sequence identity (415 duplicates); (ii) any gene with a match to a data set of hypervariable genes (derived from GenBank IgBlast 11-May-2000) with expectation less than 1e-10 (466 hypervariable genes); and (iii) any gene where an annotated intron did not conform to either the GT-AG or GC-AG splice site consensus (500 genes containing a non-canonical splice site).
Gene-transcript alignments and redundancy removal
We used BLAST (version 2.0.9) (47,48) to identify matches between gene and transcript sequences. BLAST was used without the filter for cryptically simple sequence and with an expectation cut-off of 1e-10 applied (which corresponds to the smallest reported matches being 35 nt long). We did not mask the sequences for repeats; instead we identified repeats in the alignment data: where two or more genes had matches to the same region of a transcript (given an end point tolerance of 20% of the match length), these matches were considered to be repeat matches and were discarded. This method of identifying repeats, while requiring that a large data set be examined, has the distinct advantage that no a priori knowledge of the repetitive elements is required. After removal of repeats, any match with <95% sequence identity was removed, the value of 95% being chosen as the cut-off after examination of the distribution of match identity values.
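The repeat-identification rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original implementation: the `Match` record, the function names, and the choice to measure the 20% end point tolerance against the shorter of the two matches are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Match(NamedTuple):
    gene: str
    transcript: str
    start: int  # transcript coordinate, inclusive
    end: int    # transcript coordinate, inclusive


def same_region(a: Match, b: Match, tol: float = 0.20) -> bool:
    """True if both end points agree within tol * (shorter match length)."""
    slack = tol * min(a.end - a.start + 1, b.end - b.start + 1)
    return abs(a.start - b.start) <= slack and abs(a.end - b.end) <= slack


def drop_repeat_matches(matches: list[Match], tol: float = 0.20) -> list[Match]:
    """Discard matches where two or more genes hit the same transcript region."""
    by_transcript = defaultdict(list)
    for m in matches:
        by_transcript[m.transcript].append(m)

    repeats = set()
    for group in by_transcript.values():
        # Pairwise comparison within one transcript (quadratic, but simple).
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if a.gene != b.gene and same_region(a, b, tol):
                    repeats.update((a, b))
    return [m for m in matches if m not in repeats]


matches = [
    Match("g1", "t1", 1, 100),
    Match("g2", "t1", 5, 98),     # same region of t1 as the g1 match
    Match("g1", "t1", 150, 300),
    Match("g3", "t2", 1, 100),
]
print(drop_repeat_matches(matches))
```

In the example, the first two matches come from different genes and cover essentially the same stretch of transcript t1, so both are discarded; the remaining two matches are kept.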
Further, we removed any alignments involving transcripts that were associated with more than one gene. This step acts as a specific and local form of redundancy analysis. Note that if we had utilized a conventional redundancy analysis, whereby all but one copy of highly homologous gene groups are removed at the outset, then it is possible that the data set may contain otherwise avoidable alignments between genes and transcripts transcribed from homologous, but distinct, genes. A conventional redundancy analysis on the final gene data set revealed only 66 genes (2.4%) that matched other genes in the data set with an identity of 90% or more. These genes contribute 158 introns and 67 exons to the data set, which is <1%.
Finally we removed any alignments that consisted of only a single match. The remaining alignments (incorporating 2793 genes, 84 065 ESTs and 4227 mRNAs) contained 185 710 consecutive match pairs (Fig. 1), many of these existing within groups demonstrating the same alignment.
Patching the alignment data
Though the treatment of the data in the stringent fashion described above acted to remove artefactual matches, it also removed many valid matches and contributed to fragmented transcript coverage for many genes. There are three reasons why matches between exons and transcript data may have been missed: (i) the identity of the match was <95%; (ii) the length of match was less than 35 nt; or (iii) the match was rejected as a repeat. In order to ameliorate this effect (to maximize the number of consecutive exons observed) we carried out the following procedure, one that patched the scaffold of determined matches.
Where a transcript contained an unmatched region of at least 10 nt, either between two matches or adjacent to a terminal match, we examined any BLAST matches between the gene and transcript without the application of criteria (ii) and (iii) above. In particular, when patching for missed internal exons, the highest scoring BLAST match was included in the gene-transcript alignment data if it closed the gap. If the first and second highest scoring BLAST matches (corresponding to two exons) were required to cover the gap in the transcript alignment then, provided the two matches did not overlap each other by more than 10 nt, both matches were incorporated into the gene-transcript alignment data. The same principles were applied to search for missed terminal exons.
Examining the alignments to identify confirmed introns and exons
For each gene-transcript alignment, each pair of consecutive matches was examined to determine if they unambiguously defined the position of an intron in the gene sequence as shown in Figure 1. Consecutive match pairs were considered to demonstrate a confirmed intron if they (i) overlapped on the transcript by between 0 and 10 nt; (ii) indicated the presence of an intron of at least 10 nt in length; and (iii) allowed for a GT-AG or GC-AG intron consistent with the positioning of the matches. Where a gene-transcript match was flanked on both sides by confirmed introns, the match was considered to demonstrate a confirmed exon. Analysis of the 185 710 match pairs gave 9474 cases (5.1%) which failed to satisfy alignment conditions (i) or (ii) and 4488 cases (2.4%) which failed to satisfy condition (iii). There remained 166 147 cases (89%) showing annotated forms and 5601 cases (3%) showing un-annotated forms.
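The three conditions for a confirmed intron can be expressed as a short Python sketch. The `Aln` record, the 0-based half-open coordinates, and the way an overlap on the transcript is resolved (sliding the intron boundary across the overlapping nucleotides until a GT-AG or GC-AG intron fits) are assumptions made for illustration; the text does not specify these details.

```python
from typing import NamedTuple, Optional


class Aln(NamedTuple):
    g_start: int  # gene coordinates, 0-based, half-open
    g_end: int
    t_start: int  # transcript coordinates, 0-based, half-open
    t_end: int


def confirmed_intron(gene: str, a: Aln, b: Aln,
                     max_overlap: int = 10,
                     min_intron: int = 10) -> Optional[tuple[int, int]]:
    """Return (start, end) of the intron on the gene, or None if unconfirmed."""
    # Condition (i): consecutive matches overlap on the transcript by 0-10 nt.
    overlap = a.t_end - b.t_start
    if not 0 <= overlap <= max_overlap:
        return None
    # Condition (ii): the implied intron is at least 10 nt long.
    intron_len = (b.g_start - a.g_end) + overlap
    if intron_len < min_intron:
        return None
    # Condition (iii): some placement consistent with the matches gives
    # a GT-AG or GC-AG intron.
    for shift in range(overlap + 1):
        start = a.g_end - shift
        end = start + intron_len
        intron = gene[start:end].upper()
        if intron[:2] in ("GT", "GC") and intron[-2:] == "AG":
            return start, end
    return None


gene = "ATGGCCAAG" + "GTAAGTCCCCTTTTCAG" + "GTCCTGA"
# The first match runs 2 nt into the intron because the intron and the
# second exon both start with GT, giving a 2 nt overlap on the transcript.
first = Aln(g_start=0, g_end=11, t_start=0, t_end=11)
second = Aln(g_start=26, g_end=33, t_start=9, t_end=16)
print(confirmed_intron(gene, first, second))
```

The example shows why a small overlap is tolerated: the aligner cannot tell which copy of the shared GT belongs to the exon, and only one placement yields canonical splice site dinucleotides.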
Methodological biases in the data set
The method fundamentally identifies introns, with exons determined via the identification of both flanking introns within a single transcript. This, combined with the fact that our initial gene sequences did not include flanking sequence, and the fact that the annotation of many genes does not include the untranslated region sequences, biases the data set away from introns and, particularly, exons that lie at the extremities of genes. This is in addition to the fact that terminal exons, which in any event have quite different properties to internal exons, are absent from our analysis. It might be expected that our method would create a bias towards short exons; however, we observe a mean exon length of 140 nt, which compares with a value of 145 nt reported by the International Human Genome Sequencing Consortium (5) for human internal exons.
Verification of annotated CDS
Each gene in the data set that had annotated CDS was examined in order to verify the annotation. Where the translation was also provided in the annotation we compared this to a translation of the annotated CDS. Where no translation was provided we examined the annotated CDS to ensure that it did indeed translate; where necessary phase adjustments were made to the annotated starting point. There were 134 cases where no CDS annotation was provided, and 72 cases where the annotation failed verification.
Weight matrices and Schneider information scores
We utilized a weight matrix approach to measure the strength of splice sites and BPSs. A collection of sequences that represent a sequence motif, such as a donor splice site, can be used to construct a matrix that describes the frequency of nucleotide usage at each position. The information content (in units of bits) of an individual sequence is calculated as a function of this consensus matrix (20). We chose to modify the standard calculation (20) such that the minimum score allowed for a given (rare and non-consensual) nucleotide is bounded below at -2 bits, reflecting the maximum possible score of +2 bits. In this work we considered the donor splice site to consist of the nucleotides -4 to +8, and the acceptor site to consist of the nucleotides -20 to +4. We used the acceptor sites of all observed introns to construct the acceptor site matrix, while dividing donor sites into GT and GC categories.
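A minimal Python sketch of this kind of scoring is given below. It assumes the per-position individual information takes the form 2 + log2 f(b, l), where f(b, l) is the frequency of base b at position l, omits the small-sample correction of the standard calculation, and applies a per-position floor taken here to be -2 bits. The function names and the toy site collection are hypothetical.

```python
import math
from collections import Counter


def frequency_matrix(sites: list[str]) -> list[dict[str, float]]:
    """Per-position nucleotide frequencies from equal-length aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in "ACGT"})
    return matrix


def site_score(site: str, matrix: list[dict[str, float]],
               floor: float = -2.0) -> float:
    """Sum of per-position information (bits), each bounded below at `floor`."""
    total = 0.0
    for base, freqs in zip(site, matrix):
        f = freqs[base]
        # An unseen base would score minus infinity; the floor caps the penalty.
        bits = 2.0 + math.log2(f) if f > 0 else float("-inf")
        total += max(bits, floor)
    return total


sites = ["AGGT", "AGGT", "CGGT", "AGGT"]  # toy data, not from the paper
matrix = frequency_matrix(sites)
for candidate in ("AGGT", "CGGT", "TGGT"):
    print(candidate, round(site_score(candidate, matrix), 3))
```

The floor makes the score symmetric around zero for a single position: a perfectly conserved base contributes +2 bits, and a base never seen at that position costs at most 2 bits rather than disqualifying the site outright.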
Identification of U2 BPSs
The U2 BPS has a consensus of [CT]T[AG]a[CT] (ideally CTAaC) around the bulge adenosine a (9). Although this consensus is of low information content, we were still able to perform a useful analysis. In the first instance we examined the region -40 to -20 from the acceptor splice site (9) for matches to the above consensus, but allowed only two of the three degenerate positions to differ from the ideal case. This generated a reference set of 2912 sequences with which we constructed a nucleotide frequency matrix, and we used this matrix to search the (up to) 100 nt 5' adjacent to the acceptor site in each intron. In this search we imposed the requirement that the branch point position must contain an adenosine. This analysis was also performed on a control set of 10 000 pseudo random sequences. Examining only those cases where the putative branch point was located in the region -40 to -20, we observed a substantial departure of the data set from the control set at a cut off value of 4 bits. We used this as the cut off for inclusion of candidate U2 branch points in the data set, although we apply a threshold of 5 bits to all analysis presented in this manuscript, this being equivalent to requiring [CT]T[AG]a[CT] (12 971 cases). Finally, an intron is said to have a strong BPS if there is a candidate BPS, scoring at least 5 bits, in the region -50 to -10.
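Since the 5 bit threshold is stated to be equivalent to requiring the consensus [CT]T[AG]a[CT], the search for strong candidates can be approximated by a simple pattern match. The Python sketch below does this with a regular expression rather than the frequency matrix used in the analysis; the function name, the coordinate convention (the last intron nucleotide is position -1) and the example intron are illustrative assumptions.

```python
import re

# Lookahead so that overlapping consensus matches are all reported.
BPS_PATTERN = re.compile(r"(?=([CT]T[AG]A[CT]))")


def candidate_branch_points(intron: str, lo: int = -40,
                            hi: int = -20) -> list[tuple[int, str]]:
    """Return (branch_position, motif) for consensus matches in [lo, hi].

    Positions are relative to the acceptor site: the final nucleotide of the
    intron is -1, so the branch adenosine position is always negative.
    """
    seq = intron.upper().replace("U", "T")
    hits = []
    for m in BPS_PATTERN.finditer(seq):
        # The bulged adenosine is the fourth nucleotide of the 5 nt motif.
        branch_pos = m.start() + 3 - len(seq)
        if lo <= branch_pos <= hi:
            hits.append((branch_pos, m.group(1)))
    return hits


intron = "GTAAGT" + "G" * 20 + "CTAAC" + "T" * 22 + "AG"
print(candidate_branch_points(intron))
print(candidate_branch_points(intron, lo=-50, hi=-10))
```

Calling the function with `lo=-50, hi=-10` corresponds to the wider region used in the definition of a strong BPS.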
Identification of PPTs
We developed a simple heuristic technique for identifying candidate PPTs based on the following observations (49): (i) a PPT can be as short as 5 nt, but in such a case it is a polyuridine; (ii) a shorter PPT requires a higher uridine content than a longer one; (iii) uridine content is more important than cytosine content; and (iv) a polymer of alternating guanosine and uridine can be a functional PPT.
We examined, for each intron, the 3'-terminal 100 nt (or the whole length for shorter introns), and assigned a score to each position based on the nucleotide at that position and those in the flanking four positions (two on either side). The nucleotides were assigned values of -2 for A and G, +2 for C and +3 for U, as well as a positional weighting of 1 for the ±2 positions, 2 for the ±1 positions and 3 for the central position. After scoring, all runs of nucleotides with scores of 2 or more were examined further. After each run was, if necessary, pruned or extended by one nucleotide at the end points to ensure that it started and ended with a pyrimidine, any run of 10 nt or greater was considered to represent a candidate PPT, while any run of length 5-9 nt containing at least five uridines was also considered as a candidate PPT. This analysis was also applied to a control data set of 10 000 pseudo random sequences. We identified an average of 2.38 PPTs per intron, with a length distribution of 15.5 ± 7.9 nt, compared to an average of 1.53 PPTs per control sequence, with a length distribution of 12.6 ± 4.4 nt. Further, 86% of the introns in the data set had a candidate PPT within 10 nt of the acceptor splice site, compared to only 22% in the control set. Finally, an intron is said to have a strong PPT if a candidate PPT extends to within 5 nt of the acceptor splice site.
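The heuristic can be sketched in Python as follows. Several details are not given in the text and are assumptions here: the position score is taken as the raw weighted sum over the five-nucleotide window (with purines valued at -2, under which an alternating GU polymer scores at least 2 at every interior position), windows are truncated at the sequence ends, unknown bases score 0, and a purine at a run end is resolved by extending one nucleotide if that reaches a pyrimidine and pruning one nucleotide otherwise. All names are hypothetical.

```python
BASE_VALUE = {"A": -2, "G": -2, "C": 2, "T": 3}  # T stands for uridine
WEIGHTS = {-2: 1, -1: 2, 0: 3, 1: 2, 2: 1}
PYRIMIDINES = "CT"


def position_scores(seq: str) -> list[int]:
    """Raw weighted sum over each position and its two neighbours per side."""
    scores = []
    for i in range(len(seq)):
        total = 0
        for offset, weight in WEIGHTS.items():
            j = i + offset
            if 0 <= j < len(seq):
                total += weight * BASE_VALUE.get(seq[j], 0)
        scores.append(total)
    return scores


def adjust_ends(seq: str, start: int, end: int) -> tuple[int, int]:
    """Extend or prune each end by one nt so the run ends on pyrimidines."""
    if seq[start] not in PYRIMIDINES:
        extend = start > 0 and seq[start - 1] in PYRIMIDINES
        start = start - 1 if extend else start + 1
    if end > start and seq[end - 1] not in PYRIMIDINES:
        extend = end < len(seq) and seq[end] in PYRIMIDINES
        end = end + 1 if extend else end - 1
    return start, end


def candidate_ppts(intron: str, tail: int = 100,
                   threshold: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, tract) for candidate PPTs in the 3' tail (half-open)."""
    seq = intron.upper().replace("U", "T")[-tail:]
    scores = position_scores(seq)

    # Collect maximal runs of positions scoring at least `threshold`.
    runs, run_start = [], None
    for i, score in enumerate(scores + [threshold - 1]):  # sentinel closes a run
        if score >= threshold and run_start is None:
            run_start = i
        elif score < threshold and run_start is not None:
            runs.append((run_start, i))
            run_start = None

    ppts = []
    for start, end in runs:
        start, end = adjust_ends(seq, start, end)
        tract = seq[start:end]
        if len(tract) >= 10 or (5 <= len(tract) <= 9 and tract.count("T") >= 5):
            ppts.append((start, end, tract))
    return ppts


print(candidate_ppts("G" * 10 + "T" * 12 + "AG"))
print(candidate_ppts("AACG" + "T" * 10 + "AA"))
```

Coordinates in the result refer to the examined 3'-terminal window, so for introns longer than `tail` they would need to be offset before comparison with the acceptor site position.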
| ACKNOWLEDGEMENTS |
|---|
The authors thank Kevin Burrage, John Mattick and Alan Robinson for their support and encouragement, and also Soren Schandorff and Larry Croft for their involvement in the initial development of the data set. Further, we thank Stefan Stamm and Juan Valcarcel for a careful reading of the manuscript and for their suggestions.
| FOOTNOTES |
|---|
+ To whom correspondence should be addressed. Tel: +44 1223 494650; Fax: +44 1223 494468; Email: thanaraj@ebi.ac.uk
| REFERENCES |
|---|
1 Mironov,A.A., Fickett,J.W. and Gelfand,M.S. (1999) Frequent alternative splicing of human genes. Genome Res., 9, 1288-1293.
2 Brett,D., Hanke,J., Lehmann,G., Haase,S., Delbruck,S., Krueger,S., Reich,J. and Bork,P. (2000) EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett., 474, 83-86.
3 Croft,L., Schandorff,S., Clark,F., Burrage,K., Archtander,P. and Mattick,J.S. (2000) ISIS, the intron information system, reveals the prevalence of alternative splicing in the human genome. Nat. Genet., 24, 340-341.
4 Thanaraj,T.A. (1999) A clean data set of EST-confirmed splice sites from Homo sapiens and standards for clean-up procedures. Nucleic Acids Res., 27, 2627-2637.
5 International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
6 Kan,Z., Rouchka,E.C., Gish,W.R. and States,D.J. (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res., 11, 889-900.
7 Modrek,B., Resch,A., Grasso,C. and Lee,C. (2001) Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res., 29, 2850-2859.
8 Krawczak,M., Reiss,J. and Cooper,D.N. (1992) The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum. Genet., 90, 41-54.
9 Burge,C.B., Tuschl,T. and Sharp,P.A. (1999) Splicing of precursors to mRNAs by the spliceosomes. In Gesteland,R.F., Cech,T.R. and Atkins,J.F. (eds), The RNA World: The Nature of Modern RNA Suggests a Prebiotic RNA. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, pp. 525-560.
10 Dietrich,R.C., Incorvaia,R. and Padgett,R.A. (1997) Terminal intron dinucleotides do not distinguish between U2- and U12-dependent introns. Mol. Cell, 1, 151-160.
11 Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2000) GenBank. Nucleic Acids Res., 28, 15-18.
12 Duret,L., Mouchiroud,D. and Gautier,C. (1995) Statistical-analysis of vertebrate sequences reveals that long genes are scarce. J. Mol. Evol., 40, 308-317.
13 Nakamura,Y., Gojobori,T. and Ikemura,T. (2000) Codon usage tabulated from the international DNA sequence databases. Nucleic Acids Res., 28, 292.
14 de Souza,S.J., Long,M., Klein,R.J., Roy,S., Lin,S. and Gilbert,W. (1998) Toward a resolution of the introns early/late debate: Only phase zero introns are correlated with the structure of ancient proteins. Proc. Natl Acad. Sci. USA, 95, 5094-5099.
15 Fedorov,A., Fedorova,L., Starshenko,V., Filatov,V. and Grigorev,E. (1998) Influence of exon duplication on intron and exon phase distribution. J. Mol. Evol., 46, 263-271.
16 Long,M., de Souza,S.J., Rosenberg,C. and Gilbert,W. (1998) Relationship between 'proto-splice sites' and intron phases: evidence from dicodon analysis. Proc. Natl Acad. Sci. USA, 95, 219-223.
17 Long,M. and Deutsch,M. (1999) Association of intron phases with conservation at splice site sequences and evolution of spliceosomal introns. Mol. Biol. Evol., 16, 1528-1534.
18 Long,M. and Rosenberg,C. (2000) Testing the proto-splice sites model of intron origin: evidence from analysis of intron phase correlations. Mol. Biol. Evol., 17, 1789-1796.
19 Tomita,M., Shimizu,N. and Brutlag,D.L. (1996) Introns and reading frames: Correlation between splicing sites and their codon positions. Mol. Biol. Evol., 13, 1219-1223.
20 Rogan,P.K., Faux,B.M. and Schneider,T.D. (1998) Information analysis of human splice site mutations. Hum. Mutat., 12, 153-171.
21 Burge,C.B., Tuschl,T.H. and Sharp,P.A. (1998) Evolutionary fates and origins of U12-type introns. Mol. Cell, 2, 773-785.
22 Burge,C.B. (1998) Modelling dependencies in pre-mRNA splicing signals. In Salzberg,S., Searls,D. and Kasif,S. (eds), Computational Methods in Molecular Biology. Elsevier Science, Amsterdam, The Netherlands, pp. 127-163.
23 Wu,Q. and Krainer,A.D. (1999) AT-AC pre-mRNA splicing mechanisms and conservation of minor introns in voltage-gated ion channel genes. Mol. Cell. Biol., 19, 3225-3236.
24 Kriventseva,E.V. and Gelfand,M.S. (1999) Statistical analysis of the exon-intron structure of higher and lower eukaryote genes. J. Biomol. Struct. Dyn., 17, 281-288.
25 Gorodkin,J., Heyer,L.J., Brunak,S. and Stormo,G.D. (1997) Displaying the information contents of structural RNA alignments: the structure logos. Comput. Appl. Biosci., 13, 583-586.
26 Thanaraj,T.A. and Robinson,A.J. (2000) Prediction of exact boundaries of exons. Brief. Bioinform., 1, 343-356.
27 Zhang,M.Q. (1998) Statistical features of human exons and their flanking regions. Hum. Mol. Genet., 7, 919-932.
28 Berget,S.M. (1995) Exon recognition in vertebrate splicing. J. Biol. Chem., 270, 2411-2414.
29 Dominski,Z. and Kole,R. (1992) Cooperation of pre-mRNA sequence elements in splice site selection. Mol. Cell. Biol., 12, 2108-2114.
30 Bell,M.V., Cowper,A.E., Lefranc,M.P., Bell,J.I. and Screaton,G.R. (1998) Influence of intron length on alternative splicing of CD44. Mol. Cell. Biol., 18, 5930-5941.
31 Thanaraj,T.A. and Clark,F. (2001) Human GC-AG alternative intron isoforms with weak donor sites show enhanced consensus at acceptor exon positions. Nucleic Acids Res., 29, 2581-2593.
32 Stamm,S., Zhu,J., Nakai,K., Stoilov,P., Stoss,O. and Zhang,M.Q. (2000) An alternative-exon database and its statistical analysis. DNA Cell Biol., 19, 739-756.





