Non-coding RNA
Australian Research Council Centre for Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, St Lucia, QLD 4072, Australia
* To whom correspondence should be addressed. Tel: +61 733462079; Fax: +61 733462111; Email: j.mattick{at}imb.uq.edu.au
Received February 13, 2006; Accepted February 22, 2006
| ABSTRACT |
|---|
The term non-coding RNA (ncRNA) is commonly employed for RNA that does not encode a protein, but this does not mean that such RNAs do not contain information nor have function. Although it has been generally assumed that most genetic information is transacted by proteins, recent evidence suggests that the majority of the genomes of mammals and other complex organisms is in fact transcribed into ncRNAs, many of which are alternatively spliced and/or processed into smaller products. These ncRNAs include microRNAs and snoRNAs (many if not most of which remain to be identified), as well as likely other classes of yet-to-be-discovered small regulatory RNAs, and tens of thousands of longer transcripts (including complex patterns of interlacing and overlapping sense and antisense transcripts), most of whose functions are unknown. These RNAs (including those derived from introns) appear to comprise a hidden layer of internal signals that control various levels of gene expression in physiology and development, including chromatin architecture/epigenetic memory, transcription, RNA splicing, editing, translation and turnover. RNA regulatory networks may determine most of our complex characteristics, play a significant role in disease and constitute an unexplored world of genetic variation both within and between species.
| INTRODUCTION |
|---|
Until recently most of the known non-coding RNAs (ncRNAs) fulfilled relatively generic functions in cells, such as the rRNAs and tRNAs involved in mRNA translation, small nuclear RNAs (snRNAs) involved in splicing and small nucleolar RNAs (snoRNAs) involved in the modification of rRNAs. The central tenet of molecular biology, developed from the study of simple organisms like Escherichia coli, has been that RNA functions mainly as an informational intermediate between a DNA sequence (gene) and its encoded protein. The presumption has been that most genetic information that specifies biological form and phenotype is expressed as proteins, which not only fulfill diverse catalytic and structural functions, but also regulate the activity of the system in various ways. This is largely true in prokaryotes and presumed also to be true in eukaryotes. Reciprocally, the extensive sequences in the higher eukaryotes that do not encode proteins or cis-acting regulatory elements (i.e. the majority of the vast tracts of intronic and intergenic sequences) have been regarded as simply accumulated evolutionary debris arising from the early assembly of genes and/or the insertion of mobile genetic elements.
However, most of these supposedly inert sequences are transcribed. It is also increasingly evident that RNA itself can and does have a very wide repertoire of biological functions (1) and, in particular, as first predicted by Jacob and Monod 45 years ago (2), that it is widely employed as a means of gene regulation, both in cis and in trans, especially in the higher eukaryotes. These RNAs are the subject of this review.
| EXPANSION OF ncRNAs AND RNA METABOLISM IN EUKARYOTES |
|---|
A limited number of trans-acting small ncRNAs have been described in prokaryotes that appear mainly to regulate mRNA translation or stability. Over 60 such RNAs have been identified during the past few years in E. coli, with another 200 or so predicted bioinformatically (3
In contrast, the higher organisms have a relatively stable proteome, and a relatively static number of protein-coding genes, which is not only much lower than expected but also varies by less than 30% between the simple nematode worm Caenorhabditis elegans (which has only ~10^3 cells) and humans (~10^14 cells), which have far greater developmental and physiological complexity (15). Moreover, only a minority of the genomes of multicellular organisms is occupied by protein-coding sequences, the proportion of which declines with increasing complexity, with a concomitant increase in the amount of non-coding intergenic and intronic sequences, most of which are in fact transcribed [(15,16); discussed in more detail subsequently]. Thus, there seems to be a progressive shift in transcriptional output between microorganisms and multicellular organisms from mainly protein-coding mRNAs to mainly non-coding RNAs, including intronic RNAs.
The eukaryotes, particularly the higher eukaryotes, also have a far more developed RNA processing and signaling system than prokaryotes, which appears to be linked to the more sophisticated pathways of gene regulation and complex genetic phenomena in eukaryotes, including transcriptional and post-transcriptional gene silencing, RNA interference (RNAi), DNA methylation and chromatin modification, imprinting, and other phenomena such as transvection, transinduction, dosage compensation and position effect variegation (8,17,18). The higher eukaryotes also have a large repertoire of RNA-binding proteins as well as many nucleic acid- and chromatin-binding proteins whose exact specificity is unknown or uncertain, but which may recognize different types of RNA:RNA and RNA:DNA complexes (8,18).
Both theoretical considerations and empirical evidence indicate that the amount of regulatory overhead scales non-linearly with complexity in all integrated systems, and that regulatory architecture will progressively dominate the information content of more complex systems, leading to complexity limits, until and unless there is a change in the physical basis of the regulatory architecture itself (19). The generic solution to this accelerating regulatory problem is the superimposition of digital communication and control systems, which have only been broadly established in the human intellectual lexicon during the past 20 to 30 years, well after the central tenets of molecular biology were developed and after introns were discovered. Interestingly, although it is widely appreciated that DNA itself is a digital storage medium, it has not been considered that some of its outputs may themselves be digital signals, communicated via ncRNA, in addition to the mRNAs encoding analog components (i.e. the proteins), albeit with many design variations elaborated by alternative splicing (which itself requires regulation).
Regulatory proteins scale almost quadratically with genome size in prokaryotes (20,21), and extrapolation of this relationship suggests that prokaryotes have been limited in their complexity by their reliance on a protein-based regulatory architecture, probably for most of their evolutionary history (13,19,20,22). Conversely, it appears that the eukaryotes breached this limit by the co-option of RNA as a digital regulatory solution, in concert with the evolution of the necessary protein infrastructure to recognize and act on these signals (13). Indeed, both logic and evidence suggest that both developmental programming and the phenotypic differences between species and individuals are heavily influenced, if not fundamentally controlled, by the repertoire of regulatory ncRNAs (13,16-18,23), which are only now being recognized and beginning to be studied in any systematic way.
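The scaling argument above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the number of regulatory genes R grows as R = c * N^2 with the total gene number N. The coefficient c used below is hypothetical and is not taken from the cited studies, and the "ceiling" is defined here, as one simple choice among several, as the gene number at which every gene would have to be a regulator.

```python
def regulator_count(total_genes, coefficient, exponent=2.0):
    """Regulatory genes implied by a power law R = coefficient * N**exponent."""
    return coefficient * total_genes ** exponent


def regulatory_fraction(total_genes, coefficient, exponent=2.0):
    """Share of the gene complement devoted to regulation at gene number N."""
    return regulator_count(total_genes, coefficient, exponent) / total_genes


def complexity_ceiling(coefficient, exponent=2.0):
    """Gene number at which R = N, i.e. all genes would be regulators.

    Solving coefficient * N**exponent = N gives
    N = coefficient ** (-1 / (exponent - 1)), which only exists
    for super-linear scaling (exponent > 1).
    """
    if exponent <= 1.0:
        raise ValueError("a ceiling exists only for super-linear scaling")
    return coefficient ** (-1.0 / (exponent - 1.0))


# Hypothetical coefficient, chosen only to give round numbers.
C = 1e-5

for n in (2000, 4000, 8000):
    # Under quadratic scaling the regulatory share doubles when N doubles.
    print(n, round(regulatory_fraction(n, C), 3))

print("ceiling (genes):", round(complexity_ceiling(C)))
```

Under these assumptions the regulatory share of the genome grows linearly with gene number, which is one way to picture why a purely protein-based regulatory architecture might run into a limit.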
| INFRASTRUCTURAL ncRNAs |
|---|
Some infrastructural ncRNAs have been known for a long time and have well-established functions. These include tRNAs, rRNAs, spliceosomal uRNAs or snRNAs and the common snoRNAs. Both translation and splicing require core infrastructural RNAs not only for sequence-specific recognition of RNA substrates, but also for the catalytic process itself (1
ncRNAs also play a role in chromosome maintenance and segregation (35). A small RNA with similarity to box H/ACA snoRNAs is a component of telomerase (for review see 36) and is mutated in autosomal dominant dyskeratosis congenita (37). In human-chicken hybrid cells, mutation of Dicer, a key component of the siRNA/miRNA processing machinery, leads to the accumulation of transcripts derived from centromeric-satellite repetitive sequences, premature separation of sister chromatids and cell death (38). ncRNA has also been implicated in the control of chromatin architecture and epigenetic memory (35,39; discussed further below).
There are also other types of infrastructural ncRNAs that are involved in central cell biological processes. The ncRNA 7SL RNA is a core component of the signal recognition particle (SRP), a ribonucleoprotein complex that interacts with the ribosome and is essential for targeting/transportation of nascent proteins containing signal peptides to the endoplasmic reticulum membrane for secretion or membrane insertion (40-43).
The 13 MDa vault complex (discovered in 1986) is the largest ribonucleoprotein complex described to date, three times bigger (albeit far less complex) than the ribosome. It is present in 10^4 to 10^5 copies per cell, forms a barrel-like structure predominantly localized in the cytoplasm and is presumably involved in transport (for review see 44). Different species have between one and three vault RNAs, ranging in length from 86 to 141 nucleotides. In multi-drug resistant cells, the vault complex is upregulated and has a different ratio of vault RNAs in comparison with normal cells (44). Moreover, two human vault RNAs, hvg-1 and hvg-2, specifically bind to mitoxantrone (45), a chemotherapeutic agent commonly used for treatment of breast cancer, myeloid leukemia and non-Hodgkin's lymphoma.
| cis-ACTING REGULATORY SEQUENCES IN NON-CODING REGIONS OF mRNAs and PRE-mRNAs |
|---|
Regulatory RNAs function in most cases by base-pairing with complementary sequences in other RNAs and DNA, to form RNA:RNA (and probably RNA:DNA) complexes that are recognized, and acted upon, by a relatively generic infrastructure [such as RNA-induced silencing complex (RISC) complexes or RNA editing enzymes]. There are many well-characterized examples of regulatory RNA sequences in the untranslated regions (UTRs) of mRNAs that act in cis as receivers of other trans-acting signals, by forming secondary structures that bind regulatory proteins or small molecular weight ligands. Examples of the former include sequences in UTRs that can bind regulatory proteins or be the targets of RNA editing to control the stability, translatability or localization of mRNAs (46
UTRs in mRNAs (as well as the coding sequences themselves) can also be the sensors of trans-acting regulatory RNAs, specifically miRNAs (at least some of which are encoded in introns of other genes), by base sequence recognition (8,55,56), which appear to have significant influence on their evolution (57). That is, ncRNAs can either be receivers or transmitters, or both, of regulatory signals. Interestingly, the average length of the UTRs in mRNAs increases with developmental complexity in animals, and is almost equivalent to the length of the protein-coding sequences in humans (total 34 Mb of coding sequences and 32 Mb of UTR at last count) (15), indicative of the much greater sophistication of mRNA regulation in the higher organisms.
There are also cis-acting regulatory sequences in and around splice junctions, some of which (the so-called exon-splicing enhancers or ESEs) occur within protein-coding sequences (58). Nucleotide sequence conservation is higher around alternative splice sites than constitutive splice sites, albeit in complex patterns (59-61). These sequences are thought to bind regulatory proteins that influence splice selection, but two recent papers have suggested that such selection may, at least in some cases, involve complex RNA:RNA interactions, which are themselves presumably regulated by other trans-acting signals, including other RNAs (62-64). Consistent with this, small artificial antisense RNAs and introduced riboswitches have been shown to easily regulate splicing in vitro and in vivo (65-68), with obvious implications for the natural mechanisms of splicing control (8). A snoRNA has also been shown to control splicing of serotonin receptor 5-HT(2C)R mRNA (64). In addition, a significant number of ultra-conserved sequences in mammals and insects are located at splice sites (63,69). It should be borne in mind that some protein-coding sequences may have dual function, and be themselves the targets of regulatory molecules, such as miRNAs and siRNAs, as has been well documented in plants (70) and has been recently shown to occur in mammals (71-73). It should also be borne in mind that many RNAs may combine both digital (i.e. sequence-specific) and analog (structure-based ligand/protein binding or catalytic) functions, and that we have barely yet scratched the surface of these functions and networks.
| LARGE NUMBERS OF ncRNAs EXPRESSED FROM THE MAMMALIAN GENOME |
|---|
The Ensembl 34b version of the Human Genome annotation lists 22 287 known or predicted protein-coding gene loci. The coding regions occupy ~34 Mb (~1.2%) of the euchromatic genome, and the total fraction of bases occupied by known protein-coding transcripts is only about 2% (15
Large-scale cDNA cloning studies have recently shown that there are many tens of thousands of transcripts expressed from the mouse genome, a large fraction of which (over 34 000) do not appear to encode proteins (75
It is also apparent that much of the mammalian genome is transcribed from both strands. It is estimated that 5880 human transcription clusters (22% of those analyzed) form sense-antisense pairs, with most antisense transcripts being ncRNA (80), an arrangement that exhibits considerable evolutionary conservation between the human and pufferfish genomes (81). A detailed analysis of the mouse transcriptome indicated that 43 553 (72%) transcriptional units overlap with transcripts coming from the opposite strand (82). In fact, there is evidence from spliced ESTs, annotated mRNAs and protein-coding genes listed on the UCSC Genome Database (83) that at least 2.4 Gb of the human genome is transcribed, at least 25% from both strands (Fig. 1; M. Pheasant and J.S. Mattick, unpublished analysis). It would not be surprising if the true extent of transcription was greater than the size of the genome itself, noting that the upper limit is twice the genome size.
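The notion of a sense-antisense pair can be made concrete with a short sketch. It assumes transcripts are described by a name, half-open genomic coordinates and a strand, and it calls two transcripts a pair when they lie on opposite strands and their genomic spans overlap. This is a simplification: the cited studies work with transcription clusters and their own overlap criteria, and the names and coordinates below are invented.

```python
from itertools import combinations


def spans_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def sense_antisense_pairs(transcripts):
    """Return name pairs that overlap on opposite strands.

    transcripts: iterable of (name, start, end, strand) tuples on one
    chromosome, with strand given as "+" or "-".
    """
    pairs = []
    for a, b in combinations(transcripts, 2):
        opposite_strands = a[3] != b[3]
        if opposite_strands and spans_overlap(a[1], a[2], b[1], b[2]):
            pairs.append((a[0], b[0]))
    return pairs


# Invented toy annotation for illustration only.
toy_transcripts = [
    ("geneA", 100, 500, "+"),
    ("antisenseA", 400, 800, "-"),
    ("geneB", 900, 1200, "+"),
    ("geneC", 450, 600, "+"),
]

for pair in sense_antisense_pairs(toy_transcripts):
    print(pair)
```

The all-against-all comparison is quadratic in the number of transcripts, which is fine for a toy example; a genome-scale analysis would sort the intervals or use an interval tree instead.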
Genome tiling array (76,77) and massively parallel signature sequencing (MPSS) (78) studies of various tissues and cell lines have independently revealed many thousands of non-coding transcripts from intergenic and intronic sequences in the human genome. Over 37% of the MPSS signatures matched known loci, but outside of annotated exons, with another 20% matching the complementary strand of known transcripts, indicating the presence of as many as 50 000 additional non-annotated RNAs in analyzed human tissues (78). These findings are reinforced by the analysis of conserved RNA secondary structures which predict thousands of functional ncRNAs in the human genome (84,85).
High-density genome tiling array studies of 10 human chromosomes (approximately one-third of the human genome) showed that 9% of the non-repetitive sequences were expressed as detectable transcripts (transfrags) in individual cell lines, and that 16.5% of non-repetitive bases were transcribed in at least one out of eight cell lines analyzed, indicating that many of the observed RNAs are cell-type specific (77), consistent with MPSS studies (78). It should be noted that this figure is much higher than the total length of all mRNAs expected from these chromosomes. Over 56% of the detected transfrags do not overlap with any well-characterized exon, mRNA or EST annotation; 30% map to intergenic regions and 26% to introns of known genes. The latter do not appear to represent pre-mRNA contamination, as the signals were not generally spread across the introns, but rather showed discrete foci, indicative of previously unknown exons or of other RNAs (perhaps regulatory ncRNAs or their precursors) derived from these regions (77). Moreover, for technical reasons these analyses are likely to overlook many important small regulatory RNAs such as miRNAs, which may be present in only trace amounts and are difficult to label by reverse transcription.
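The gap between coverage in a single cell line and coverage pooled across cell lines is what points to cell-type specificity. A minimal sketch of that calculation follows, assuming transfrags are given as half-open (start, end) intervals on one region. The coordinates, region length and cell-line names are invented for illustration and are not data from the cited study.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval rather than double-counting bases.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(intervals, region_length):
    """Fraction of a region covered by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / region_length


# Invented toy data: two cell lines sharing one transfrag.
REGION_LENGTH = 1000
transfrags = {
    "line_A": [(0, 50), (100, 140)],
    "line_B": [(100, 140), (500, 550)],
}

for name, intervals in transfrags.items():
    print(name, covered_fraction(intervals, REGION_LENGTH))

# Pool all cell lines: bases transcribed in at least one of them.
pooled = [iv for intervals in transfrags.values() for iv in intervals]
pooled_fraction = covered_fraction(pooled, REGION_LENGTH)
print("pooled", pooled_fraction)
```

If every cell line expressed the same transfrags, the pooled fraction would equal the per-line fraction; the more the pooled value exceeds it, the more of the signal is specific to particular cell lines.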
Rapid amplification of cDNA ends (RACE) analysis of selected genomic regions (79) confirmed the existence of these RNAs, and revealed an amazingly complex landscape of interlacing and overlapping transcripts, not only on opposite strands, but also on the same strand, so that there is often no clear distinction between splice variants and overlapping and neighboring genes, which had also been indicated by cDNA cloning studies (75,82). This study also showed that there are many hitherto unrecognized exons and splice variants even in very well-studied genes, such as that encoding Sonic Hedgehog, and that it is not unusual for a single base pair to be part of an intricate network of multiple isoforms of overlapping sense and antisense transcripts (Fig. 2). These observations all have important and challenging implications for genotype-phenotype correlations, the complexity of the transcriptional regulation, and the definition of a gene (79), which may now be best viewed as fuzzy transcription clusters with multiple products (18).
Just as disturbingly, it appears that a large proportion of the transcripts in human and mouse are unique to the largely unstudied polyA- and the nuclear polyA+ fractions of the transcriptome (77
| TRANSCRIPTIONAL NOISE OR MEANINGFUL OUTPUT? |
|---|
The observation that there are literally tens of thousands of ncRNAs expressed in mammals, and that most of the genome is transcribed, confronts and very largely contradicts the traditional protein-centric view of genetic information and genome organization. There are two opposing alternatives: either the bulk of the transcription which does not yield mRNAs is transcriptional noise and/or (in the case of introns) the residue of evolutionary baggage retained or accumulated within genes, or this transcription comprises another level of expression and transaction of RNA information that is important to the evolution and developmental ontogeny of the higher organisms (13
Most of the ncRNAs identified in genomic transcriptome studies have not been studied and have yet to be ascribed any function. However, there are many lines of evidence that suggest that these RNAs are biologically meaningful.
First, most intensively studied gene loci, including both those that are imprinted and conventional loci such as beta-globin, have been shown to express non-coding transcripts (91-96). This includes some enhancers and conserved intergenic sequences (92,97).
Second, it is clear that many of these transcripts are cell-type specific, with specific subcellular locations, and are developmentally regulated (77,98,99). A large number of ncRNAs are specifically expressed from either the paternal or maternal allele at imprinted loci, and some are associated with human diseases, such as the Prader-Willi and Angelman syndromes (39). Hence, the genetic cause for some, and perhaps many, diseases may be associated with mutations within ncRNAs. An imprinted ncRNA, LANCAT, spanning more than a megabase in the murine region orthologous to the human Prader-Willi/Angelman syndrome locus, exhibits a distinct expression pattern in brain, as well as a cytoplasmic location (100). It has also been shown that some snoRNAs and miRNAs may be encoded within the introns of imprinted ncRNA genes (95,101). The snoRNA HBII-52, which regulates the splicing of the serotonin receptor 5-HT(2C)R gene, is not expressed in Prader-Willi syndrome patients, who have different 5-HT(2C)R mRNA isoforms from normal, suggesting that this defect contributes to the Prader-Willi syndrome (64,102). Antisense transcripts associated with eight transcription factor genes involved in eye development also display specific expression patterns in brain, and in the retina in particular (103). Another non-coding antisense transcript, which has several alternatively spliced isoforms, shows an expression pattern similar to the sense-strand Foxl2 gene, which encodes a forkhead transcription factor involved in development of eyelid and ovary (104).
Third, the upstream regions of ncRNA transcripts show many of the features normally associated with promoters (75,105,106) and, somewhat surprisingly, may be more highly conserved than the promoters of protein-coding genes (75). A recent large-scale study of the binding sites for the transcription factors Sp1, cMyc and p53 found that a large proportion (36%) correlate with ncRNA transcripts, a significant number of which are regulated in response to retinoic acid, leading to the general conclusion that the human genome contains comparable numbers of protein-coding and non-coding genes that are bound by common transcription factors and regulated by common environmental signals (106).
Finally, an increasing number of ncRNAs have been shown to be functional, including the well-characterized ncRNAs Xist and Tsix that control X-chromosome inactivation in mammals (107,108). They also include a number of well-characterized antisense transcripts which appear to play regulatory roles in relation to their sense gene, including those opposite FGF-2 (fibroblast growth factor-2), HIF-1 (hypoxia inducible factor-1) and myosin heavy chain [for review see (109)]. Increasing numbers of functional studies of ncRNAs are being conducted using ectopic expression and RNAi-mediated knockdowns. For example, ectopic expression of the murine brain-specific ncRNA SCA8, which has been implicated in Spinocerebellar Ataxia Type 8 (110), under the control of a promoter specific to photoreceptors, results in late-onset, progressive neurodegeneration in the Drosophila eye (111). Moreover, using this neurodegenerative phenotype as a sensitized background for a genetic modifier screen, mutations were identified in four genes, all of which encode neuronally expressed RNA binding proteins conserved in Drosophila and humans (111). The knockdown by RNAi of a 6.7 kb spliced and polyadenylated murine ncRNA (TUG1) that is expressed in the retina and brain and upregulated by taurine in developing retinal cells resulted in malformed or non-existent outer segments of transfected photoreceptors in mice (112).
This approach has recently been extended into large-scale screening strategies of ncRNAs. Pairs of siRNAs directed against 512 ncRNA sequences from the RIKEN Fantom2 mouse cDNA collection (113) were used to interrogate a battery of 12 cell-based reporter assays representing key cellular processes and signaling pathways (114). Eight functional ncRNAs were identified (114; J.B. Hogenesch and P.G. Schultz, personal communication), a good rate of return given the limited functional scope of the assays: six essential for cell viability, one repressor of Hedgehog signaling, and one (termed NRON) which acts as a repressor of the transcription factor NFAT, which itself is required for T-cell receptor-mediated immune response, and the development of the heart, vasculature, musculature and nervous tissue. NRON occurs as a variety of alternatively spliced transcripts ranging from 0.8 to 3.7 kb, and interacts with 11 different proteins, possibly as scaffolding for a complex including a translation initiation factor, RNA helicase and proteins involved in nucleocytoplasmic transport, proteolysis and signal transduction (114).
The number of known functional ncRNA genes has risen dramatically in recent years and over 800 ncRNAs (excluding tRNAs, rRNAs and snRNAs) have been catalogued in mammals, at least some of which are alternatively spliced (115,116). ncRNAs have also been implicated in many diseases, including various cancers and neurological diseases (18,115).
There is a rapidly looming nomenclature problem for the large number of ncRNAs (117), especially as the function and mode of action of the vast majority are unknown, and their complex structures and interlacing/overlapping nature make discrete classification difficult. As a considerable fraction of eukaryotic transcripts are spliced, most approaches used, including cDNA cloning, detect only portions of transcripts, which often correspond to exons. Depending upon the method used, these detected sites of transcription have been given an assortment of names, such as ditags, CAGE tags, transfrags and ESTs, to mention a few. In some cases, experiments are used to connect these fragments into full-length or near full-length transcript structures [see e.g. (79)]. When transcripts are found to contain reduced protein-coding potential these have also been given various names including npcRNA (non-protein-coding RNA), utRNA (untranslated RNA) (117) or TUF (transcript of unknown function) (77). A structured system that may be used to catalog and refer to ncRNAs until they can be grouped and re-classified into recognized structural and/or functional classes is currently being considered by the HUGO Gene Nomenclature Committee (see http://www.gene.ucl.ac.uk/nomenclature/).
| SMALL REGULATORY ncRNAs |
|---|
The past few years have seen an explosion in the discovery of small regulatory RNAs in animals and plants (8
Small nucleolar RNAs
snoRNAs generally range from 60 to 300 nucleotides in length and guide the site-specific modification of nucleotides in target RNAs via short regions of base-pairing. There are two major classes, the box C/D snoRNAs which guide 2'-O-ribose-methylation, and the box H/ACA snoRNAs which guide pseudouridylation of target RNAs (36,121-123). Initially, it was thought that the role of snoRNAs was restricted to rRNA modification in ribosome biogenesis, but it is now evident that they can target other RNAs, including snRNAs and mRNAs (36,64,121-123). Most mammalian snoRNAs come from the introns of either protein-coding or non-coding genes (124) but apparently some human C/D snoRNAs are independently transcribed, as indicated by the presence of methylated guanosine caps at their 5' ends (125). Although the snoRNAs involved in ribosome biogenesis are located in the nucleolus where this type of ncRNA was first characterized (hence their name), a subset of H/ACA snoRNAs is located in Cajal bodies (a class of small nuclear organelle) and are sometimes called scaRNAs (small Cajal body RNAs) (36). Telomerase RNA is also found in Cajal bodies in a cell-cycle dependent manner (126,127).
At least some snoRNAs exhibit tissue-specific and developmental regulation, and/or imprinting (101,102,128,129), indicative of a regulatory function. There are also a number of so-called orphan snoRNAs without known targets (101,102,123,128,130,131). As noted earlier, one of these snoRNAs is linked to the aberrant splicing of the serotonin receptor 5-HT(2C)R gene in Prader-Willi syndrome patients (64,102). It is also evident that there are many other snoRNAs, as well as, most likely, other as yet functionally uncharacterized classes of small regulatory RNAs, that have yet to be discovered (36,132).
MicroRNAs and small interfering RNAs
miRNAs and siRNAs are short RNA molecules, approximately 22 nucleotides long, derived either from hairpin or double-stranded RNA precursors. Details of miRNA and siRNA biology and biochemistry can be found in a number of recent reviews (8,133-135). miRNAs suppress translation via non-perfect pairing with target mRNAs, usually involving a seed pairing of just six to eight nucleotides in length (56), or (as with siRNAs) cause degradation of target RNAs by the RISC complex in the case of perfect complementarity with the target site, the phenomenon known as RNAi. It is estimated that approximately one-third of human protein-coding genes are controlled by miRNAs [reviewed in (119)]. In addition, siRNAs derived from repeats participate in the establishment of silenced (heterochromatic) chromatin, as well as in other aspects of chromosome dynamics, phenomena best studied in yeast [for reviews see (8,136)].
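Seed pairing lends itself to a short worked example. The sketch below assumes the common convention, not spelled out in the text, that the seed is nucleotides 2 to 8 counted from the 5' end of the miRNA, and it looks for exact Watson-Crick matches to that seed in a target sequence. The miRNA shown is the widely reported let-7a sequence and should be checked against a current database before reuse; the target fragment is invented, and real target prediction also weighs conservation, site context and G:U wobble pairs, none of which is modelled here.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(sequence):
    """Upper-case a sequence and convert DNA-style T to U."""
    return sequence.upper().replace("T", "U")


def reverse_complement(rna):
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return to_rna(rna).translate(COMPLEMENT)[::-1]


def seed_match_sites(mirna, target, seed_start=2, seed_length=7):
    """Return 0-based positions in target that pair perfectly with the seed.

    seed_start is 1-based from the miRNA 5' end; the default (2, 7)
    is an assumed convention covering nucleotides 2 to 8.
    """
    seed = to_rna(mirna)[seed_start - 1:seed_start - 1 + seed_length]
    site = reverse_complement(seed)
    target = to_rna(target)
    positions = []
    position = target.find(site)
    while position != -1:
        positions.append(position)
        # Step by one base so that overlapping sites are not missed.
        position = target.find(site, position + 1)
    return positions


LET7A = "UGAGGUAGUAGGUUGUAUAGUU"  # reported let-7a sequence, for illustration
TOY_UTR = "AAACUACCUCAAA"          # invented target fragment

print(seed_match_sites(LET7A, TOY_UTR))
```

Because only seven nucleotides are compared, matches of this kind occur frequently by chance in long sequences, which is one reason a single miRNA can plausibly have many candidate targets.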
miRNAs are derived from the introns and exons of both protein-coding and non-coding transcripts that are synthesized by RNA polymerase II (8,137,138). It has also recently been shown that a number of mammalian miRNAs are derived from repeats, mainly various transposons (139), which may lead to a re-examination of the functional role of transposons, especially since it also appears that transposon sequences can play a significant role in developmental processes and epigenetic variation (140,141). Some miRNAs also appear to be derived from processed pseudogenes (142).
The expression of many miRNAs is regulated and miRNAs have been shown to be central to a wide range of developmental processes, including developmental timing, cell proliferation, left-right patterning, neuronal cell fate, apoptosis and fat metabolism [for reviews see (8,133-135,143)], as well as neuronal gene expression (144), brain morphogenesis (145), muscle differentiation (146) and stem cell division (147). Not surprisingly, therefore, alterations in the expression, sequence or target sites for miRNAs may be a significant but hitherto unrecognized source of human genetic disease, including cancer. Sequence variants in the binding site for the miRNA miR-189 in the SLITRK1 mRNA have recently been shown to be associated with Tourette's syndrome (148). miRNA expression is dysregulated in cancer cells (143,149,150) and miRNA profiling can be used as a very accurate diagnostic tool for cancer classification (151,152). The proto-oncogene c-Myc has been shown to activate expression of an miRNA cluster on human chromosome 13, and two miRNAs (miR-17-5p and miR-20a) from this cluster downregulate expression of the transcription factor E2F1 that activates cell cycle progression (153). Enforced expression of the same miR-17-92 miRNA cluster has also been shown to promote tumor development (154), as has misexpression of the Drosophila miRNA mirvana/mir-278 (155), indicating that some miRNAs may also function as proto-oncogenes.
Until recently, it was believed that the post-transcriptional suppression of gene expression by miRNA in vertebrates occurs through translation suppression directed by a non-perfect duplex formed between miRNA and mRNA in the 3'-UTR. However, in 2004, two groups described suppression of HOX gene expression by mRNA degradation because of a perfect match between miRNA and mRNA in the 3'-UTR (71,72). Another example of mRNA degradation because of a perfect match with a trans-acting miRNA has been reported for the imprinted Rtl1/Peg11 locus (73). The maternally transcribed anti-Peg11 transcript is processed into several miRNAs, which cause RISC-mediated cleavage of paternally expressed Rtl1/Peg11 mRNA. Interestingly, the miRNAs are complementary to the coding region, not to the 3'-UTR (73), indicating that miRNA target sites may be located anywhere in the transcript, and indeed in any functional transcript, not just mRNAs. In addition, it has recently been shown that certain miRNA precursors are edited by ADAR1 and ADAR2, resulting in both suppression of processing by Drosha, and degradation by Tudor-SN, which is a component of RISC (156).
The miRBase database (http://microrna.sanger.ac.uk/) lists over 300 experimentally verified miRNAs in human as well as predicted miRNA target genes (157). However, many more miRNAs have been identified computationally, with a proportion validated post hoc (158). Most miRNA prediction methods rely on identification of a stable stem–loop precursor and phylogenetic conservation [see e.g. (158)]. However, these criteria may be far too narrow. Although many of the known miRNAs are highly conserved (and have been mainly identified on this basis), there is no reason why they all should be. As far as one can tell, these short RNAs have no intrinsic catalytic activity and function simply by target recognition, and thus should be able to evolve relatively quickly by co-variation with their targets, and by positive selection for new connections in regulatory networks underpinning adaptive radiation. Consistent with this, the known miRNAs appear to have many targets, which makes co-variation difficult and explains their strong conservation, which in many cases surpasses that of protein-coding sequences (108). A recent study that did not require substantial evolutionary conservation identified many new human miRNAs, a significant number of which appear to be primate-specific (159).
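A small sketch shows how combining the two prediction criteria narrows the search. It is not the method of any published predictor: the sequences are invented, the thresholds are arbitrary, and the helpers `arm_pairing_fraction`, `identity` and `is_candidate` are hypothetical. Real tools assess stem–loop stability by thermodynamic folding rather than by the simple arm-pairing count assumed here.

```python
# Toy sketch: invented sequences and arbitrary thresholds. Real predictors use
# thermodynamic folding rather than this simple arm-pairing count.

# Watson-Crick pairs plus G-U wobble pairs
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def arm_pairing_fraction(seq: str, loop_len: int = 4) -> float:
    """Fraction of 5'-arm positions that can pair with the mirrored 3'-arm position."""
    arm_len = (len(seq) - loop_len) // 2
    if arm_len <= 0:
        return 0.0
    five_prime = seq[:arm_len]
    three_prime = seq[-arm_len:][::-1]  # read the 3' arm backwards so index i faces index i
    paired = sum((a, b) in PAIRS for a, b in zip(five_prime, three_prime))
    return paired / arm_len


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_candidate(seq: str, ortholog: str, min_pairing: float = 0.8,
                 min_identity: float = 0.9, require_conservation: bool = True) -> bool:
    """Apply the stem-loop criterion, optionally combined with a conservation criterion."""
    if arm_pairing_fraction(seq) < min_pairing:
        return False
    if require_conservation:
        return identity(seq, ortholog) >= min_identity
    return True


# Hypothetical hairpin (8-nt arms, 4-nt loop) and a diverged "ortholog" of the same length.
hairpin = "GGCAUCGA" + "UUCG" + "UCGAUGCC"
diverged = "GACUUCGAUACGUCAAUGUC"

print(is_candidate(hairpin, diverged))                              # conservation required
print(is_candidate(hairpin, diverged, require_conservation=False))  # stem-loop only
```

The toy hairpin passes the stem–loop test but is rejected once conservation is required. A lineage-specific miRNA would be missed by conservation-based screens in the same way.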
The number of predicted human miRNAs is rising rapidly (8,135,159). Sensitive genetic screens in C. elegans have also identified rare miRNAs with limited evolutionary conservation, such as lsy-6, which is required for left–right neuronal patterning (160), suggesting that many miRNAs may be cell-type specific and that many more remain to be found.
| BIOLOGICAL ROLES OF ncRNAs |
|---|
|
|
|---|
As outlined earlier, ncRNAs are already known to fulfill a wide range of functions, including the control of chromosome dynamics, splicing, RNA editing, translational inhibition and mRNA destruction. It is obvious that we have only begun to explore the true extent of RNA regulation of these processes. It also appears that RNA may play a role in virtually all levels of gene regulation in eukaryotes.
A range of evidence suggests that RNA signaling underpins chromatin remodeling and epigenetic memory, although the mechanisms are unknown, and the matter is not without controversy [for reviews and discussion see (8,18,35,161–163)]. There is evidence that transcription from upstream regions can affect the expression of the adjacent gene, either by promoter interference (164) or by altering chromatin structure (165–167), leading to the hypothesis that it is the act of transcription which is responsible for the regulatory effects, and that the transcript itself (an ncRNA) is just a by-product (168). However, it is hard to imagine how transcription per se could convey sufficient information to account for the precise and quite complex changes in histone modification and chromatin remodeling that are observed at most loci. Indeed, there are only a limited number of chromatin-modifying enzymes in animals, suggesting that these enzymes must be targeted to their sites of action, which vary at thousands of loci around the genome during differentiation and development, by another level of sequence-specific signals, most logically RNA. In agreement with this prediction, small RNAs have been shown to induce transcriptional silencing and alterations to DNA methylation in human cells (169,170).
There are also good reasons to expect that splicing is regulated, at least in part, by trans-acting RNAs that guide splice site selection (8,18,64,171) or modify sequences around splice sites to render them accessible or otherwise to the splicing machinery (64).
Evidence is also emerging that transcription itself may be regulated by ncRNAs (18,163). As noted earlier, RNA polymerase II itself appears to be regulated in part by ncRNA signaling (30–34). An ncRNA has been reported to be required for the repression of RNA polymerase II-dependent transcription in primordial germ cells in Drosophila (172). At least some transcription factors (and chromatin-modifying proteins) appear to have affinity for structures involving RNA (173–179). A small double-stranded RNA termed NRSE activates transcription of neuron-specific genes (180), and short artificial RNAs have been shown to inhibit transcription of targeted genes in the absence of concomitant DNA methylation, with considerable potential for therapeutic use (181,182). An interesting case is the steroid receptor RNA activator (SRA), which was originally described as a functional non-coding RNA involved in the regulation of gene expression by steroid hormones (183). The gene produces several transcripts, one of which encodes a protein (184), and both the ncRNA and the encoded protein affect the activity of the estrogen receptor in breast cancer cells (185). Recently, it was shown that the pseudouridine synthase mPus1p (an enzyme that converts uridine to pseudouridine in RNA) is a coactivator for the retinoic acid receptor, which acts by pseudouridylation of SRA RNA (186). In addition, the thyroid hormone receptor has an RNA-binding domain which binds SRA, and the binding enhances expression of reporter genes (187).
ncRNAs also play a role in stress responses. The small non-coding transcript B2 is produced by RNA polymerase III from murine short interspersed elements (SINEs) under heat shock. The B2 RNA binds to RNA polymerase II and represses transcription after heat shock (188,189). In primates, RNA polymerase III also produces the brain-specific Alu-derived transcript BC200 (190). Non-coding repetitive RNAs are also transcribed in stressed human cells and are localized in nuclear stress bodies that are assembled on specific pericentromeric heterochromatic domains that change their epigenetic status from heterochromatin to euchromatin in response to stress (191). The non-coding RNA omega is among the few heat-shock-inducible genes in Drosophila (192), and although its exact role is unknown, it binds to a number of RNA-binding proteins involved in the processing of nuclear RNA (hnRNP complexes) (193).
ncRNAs may also act as scaffolding for the assembly of macromolecular complexes. Examples include rRNA in ribosomes, the 7SL RNA in the SRP (40), and possibly RNAs involved in the assembly of chromatin complexes (35), as well as NRON, recently shown to interact with a number of proteins involved in nuclear transcription factor trafficking (114).
| INTRONS AS A SOURCE OF FUNCTIONAL NCRNAS |
|---|
|
|
|---|
Introns account for at least 30% of the human genome and may be a significant, perhaps major, source of regulatory ncRNAs (17
| CONCLUSION |
|---|
|
|
|---|
We may have fundamentally misunderstood the nature of genetic programming in the higher organisms. It appears that the human genome and those of other complex organisms express an enormous repertoire of ncRNAs, and that their cells are awash with these RNAs, which constitute a hidden layer of molecular genetic signals. Although the functions of these RNAs are likely to be many and varied, both logic and evidence strongly suggest that their main role is to regulate and direct the complex pathways of developmental ontogeny, which must require enormous amounts of information in an organism as precisely sculptured as a human (13
The existence of a sophisticated RNA-based regulatory system would also largely explain the paradox of the tremendous diversity of characteristics observed among mammals and other complex organisms, despite the relative commonality of their proteomes. That such RNAs have remained hidden from view for so long appears to have been a consequence of their sheer numbers and population complexity, which makes biochemical detection of individual sequences difficult, combined with the subtlety of their genetic signatures. Indeed, with few exceptions, until recently most known ncRNAs were those that are present in relatively large amounts, such as rRNAs, tRNAs and the common snoRNAs and snRNAs, and it has only been the combination of sensitive genetic screens (such as those that first identified miRNAs), large-scale cDNA and whole genome sequencing, new sensitive analytical methods (such as RT–PCR and genome tiling arrays) and bioinformatics, based on clues from known examples, that has begun to reveal the true complexity of what lies under the surface.
It is also evident that many ncRNAs, including those of demonstrated functionality like Xist, are evolving quickly (108). This rapid evolution has been taken as evidence of a lack of functionality (204). This may be incorrect: these sequences may simply be able to drift easily because they are under different constraints, and/or they may be subject to positive selection related to phenotypic variation. Recent analyses of the Drosophila genome have indicated that, contrary to long-held expectation, a large fraction of the non-coding sequence is functionally important and subject to various levels of purifying selection and adaptive evolution (205).
The extent of non-coding sequence conservation in mammals is also much higher than that of protein-coding sequences (202,206), perhaps as high as 10% by some estimates (207
