Human Molecular Genetics, 2002, Vol. 11, No. 6 669-674
© 2002 Oxford University Press
Classification of common conserved sequences in mammalian intergenic regions
National Center for Biotechnology Information, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892, USA
Received November 7, 2001; Revised and Accepted January 14, 2002.
| ABSTRACT |
|---|
Comparisons between orthologous intergenic regions of related genomes reveal numerous hits, i.e. pairs of relatively short highly similar sequences that evolved slowly, perhaps due to selective constraint. We analyzed and classified 2638 hits found within 100 pairs of complete, orthologous intergenic regions of human and murine genomes. We identified all common fragments of hits that align well with many other hits and constructed their classification. Our analysis revealed 20 abundant classes each containing 10 or more fragments. Fragments of the same class may perform the same function, e.g. bind a particular protein. Ten of the abundant classes apparently correspond to known functional consensuses, whereas others may represent novel conserved sites. Thus, large-scale comparative analysis of slowly evolving intergenic sequences can provide valuable insights into their function.
| INTRODUCTION |
|---|
The level of similarity between related eukaryotic genomes varies drastically along their alignments. Hits, i.e. pairs of highly similar orthologous segments of genomes, often with length of approximately 100 nucleotides, alternate with dissimilar segments of sequences, interhits (1–4). Presumably, slowly evolving sequences constituting hits are under selective constraint due to their conservative functions. Locations and sequences of hits are often the main source of information on otherwise poorly understood functioning of non-coding DNA (5–7).
A hit can be represented by a local alignment, or by a high-quality region within a global alignment, of orthologous DNA sequences from different genomes (8,9). Even within a hit, up to 50% of nucleotides can differ between the species. Accumulation of data on genome sequences has already led to the discovery of thousands of hits, and the total number of hits within a pair of moderately related eukaryotic genomes, such as human and murine, must be in the millions (4). Different hits can perform different functions and be dissimilar from each other, but they can also possess substantial similarity.
Thus, it makes sense to compare and classify hits. A useful classification of hits may offer important insights into functioning of DNA sequences that constitute hits. Hits from the same class are likely to perform the same or similar function, e.g. to interact with a particular regulator of transcription. Ideally, classification of hits should be related to experimental data on expression of the flanking genes.
Here we report the results of comparative analysis of a set of 2638 hits which have recently been identified within alignments of 100 pairs of complete orthologous intergenic regions of human and murine genomes (4). We were unable to obtain a meaningful classification of whole hits since two hits are almost never similar to each other throughout their total lengths. Perhaps this is not surprising, since human–mouse hits cover ~20% of their intergenic regions (4), so that the enrichment of hits by interesting sequences, although substantial (10), is not enough to strongly influence hits as a whole.
In contrast, we were able to obtain an apparently meaningful classification of short, common fragments of hits, i.e. their parts which align well to many other hits. Since conserved sequences that take part in regulation of transcription and are disproportionally common within hits are often rather short (10), classifying short fragments of hits makes more sense biologically. Of course, one can try to infer the function of the whole hit from the positions of its common fragments within this classification.
We performed pairwise comparisons between hits by aligning the alignments that represent individual hits (11). This approach is superior to aligning individual DNA sequences covered by hits, since it utilizes information about interspecies similarity within a hit. Alignment of alignments is commonly used for comparing proteins but has apparently never been applied to the analysis of intergenic DNA.
| RESULTS |
|---|
The alignment that represents a particular hit was successively aligned with alignments representing all other 2637 hits, and a histogram was constructed showing how many times each site within the hit was overlapped by alignments with other hits. Under reasonably stringent parameters of alignment, most sites of most hits were covered only 0–3 times (background noise), but some sites were covered many times. Histograms obtained by aligning individual sequences constituting hits showed a much worse discrimination between background noise and common sites (Fig. 1).
Within an individual hit, patterns of coverage obtained by aligning alignments and by aligning individual sequences are often rather different (Fig. 2). Naturally, aligning alignments leads to higher coverage of the more conserved parts of a hit and, thus, probably reveals the fragments of the hit that are most relevant to its function, as such fragments must be more conserved within the hit.
In total, we defined 1662 fragments of hits that are common within the set of all hits (Materials and Methods). These were subdivided into 909 non-overlapping classes, 20 of which each contained at least 10 fragments (Table 1; for complete data see ftp://ftp.ncbi.nih.gov/pub/kondrashov/Classes/). To assess the statistical significance of our classification, we randomized the sequence of each fragment and classified 10 independent sets of 1662 randomized fragments. On average, a classification of randomized fragments contains fewer than three classes with 10 or more fragments in each (Fig. 3), and all these abundant classes consist of low-complexity fragments (data not shown). Thus, our classes with 10 or more fragments appear to be statistically significant.
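The randomization control described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say how fragment sequences were randomized, so the sketch assumes a simple composition-preserving shuffle of each fragment string, and the function names (`shuffle_fragment`, `randomized_sets`, `count_abundant`) are hypothetical. The classification step itself is not reproduced here.

```python
import random


def shuffle_fragment(fragment, rng):
    """Shuffle the letters of one fragment, preserving its composition.

    Assumption: 'randomizing the sequence' means a plain permutation
    of the fragment's letters; the paper does not specify the procedure.
    """
    letters = list(fragment)
    rng.shuffle(letters)
    return "".join(letters)


def randomized_sets(fragments, n_sets=10, seed=0):
    """Build n_sets independent randomized copies of the fragment set."""
    rng = random.Random(seed)
    return [[shuffle_fragment(f, rng) for f in fragments]
            for _ in range(n_sets)]


def count_abundant(classes, min_size=10):
    """Count classes that contain at least min_size fragments."""
    return sum(1 for members in classes if len(members) >= min_size)
```

Each randomized set would then be classified with the same procedure as the real fragments, and `count_abundant` applied to the resulting classes to compare against the 20 abundant classes observed in the real data.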
A consensus sequence of each of the 20 abundant classes was compared to known regulatory signals from TRANSFAC and TRRD databases (12,13). Consensuses of 10 classes were found to be similar to known motifs of regulatory factors and signals (Table 1).
Consensus of the largest class 1 is similar to the Jelinek–Schmid hairpin, which is a structural element located between box A and box B in the polIII promoter. This region is conserved in modern Alu sequences but was apparently under strong selective pressure to mutate in early monomeric Alu-like genes (14).
Consensus of class 2 is similar to the CCa/tGGTCTACt/a motif (Fig. 4) which is homologous to a Pax-6-binding sequence found in rodent B1 repetitive elements (15). It takes only one mutation to create a strong Pax-6-binding site in a null B1 element. B1 has the potential to recruit gene targets for Pax-6 if it is inserted into their regulation regions, and therefore transpositions of B1 may play a role in evolution of regulation by Pax-6 (15).
Consensus of class 3 is similar to the GGCAGA motif. Repetitive occurrences of this motif, known as MMS10, are present in rodents within many loci (16). Also, this sequence was found in a thyroid hormone response element (17) and in a transcriptional repressor of MYC promoters (18). The GAGGC motif from the consensus of class 3 is known as a multiple binding site for polyoma virus large T antigen (19).
Four consensuses similar to known motifs are low-complexity sequences. Consensus of class 4 is GA-rich and similar to consensus of GKL factor (20). Consensus of class 6 is GT-rich and similar to the binding site of the myeloid zinc finger protein MZF1. Both DNA-binding domains of MZF1 contain a core of four or five guanine residues, reminiscent of an NF-κB half site (21). Consensus of class 7 is AC-rich and similar to the sex-determining region Y (SRY)-binding sequence (22). Consensus of class 9 is similar to the MAZ1 binding site (23).
Finally, consensus of class 12 is similar to Sp1 (24) and consensus of class 18 is similar to CDE, the regulatory element of cell-cycle-regulated promoter (25). Consensuses of the remaining 10 classes were not listed in TRANSFAC and TRRD databases, and may represent novel functional sites.
| DISCUSSION |
|---|
We classified 1662 fragments of non-coding mammalian hits that are common within a large set of such hits. While a majority of the 909 classes that we established contain only single fragments, 20 abundant classes each contain 10 or more fragments. The approach we used, aligning alignments (11), is not the only way of comparing a pair of hits; a multiple simultaneous alignment of all four relevant sequences can also be tried (26). However, a hierarchical approach to aligning different hits is justified: sequences that constitute a hit and, thus, are compared first, are orthologous, whereas sequences from different hits are either paralogous or even phylogenetically unrelated. Thus, we can expect a pairwise alignment of pairwise alignments to work better than multiple alignment, at least when we know nothing a priori about similarity of function of different hits. In contrast, in the case of a set of hits all known to perform the same function, multiple alignment of all the sequences that are covered by these hits might be the best way to find the corresponding consensus.
Since it is formally possible to classify any set of objects, a legitimate question is whether our classification makes any biological sense. We believe it does, for three reasons.
First, variability of transcription regulation of different genes is, to a large extent, due to combinatorics within a relatively small repertoire of transcription factors (27,28). Thus, many non-coding hits are expected to perform the same or similar function (5,10). Although different binding sites of the same regulator may have substantially different sequences, each of them usually contains at least a short sequence similar to some consensus (7). Thus, similarity-based classification of hits can produce classes of functionally similar sites.
Secondly, over-represented fragments of hits are often so common (Fig. 1) that this obviously cannot be due to chance. In some cases, over-representation may be due to patterns in mutation. In particular, repeat expansion produces numerous low-complexity periodic sequences (29). However, although consensuses of several of our 20 abundant classes are low-complexity, none of them are simple two- or three-nucleotide repeats (Table 1) which are the most typical microsatellites. Other abundant classes have no obvious features making them distinct at the DNA level and are probably common due to selection.
Thirdly, 10 out of the 20 abundant classes in our classification have consensuses similar to sequences with known regulatory functions. Of course, we cannot be certain that in each case this implies that all the common fragments from a particular class indeed possess the corresponding regulatory function. Caution is particularly in order when the consensus of a class and the corresponding known regulatory sequence are low-complexity. However, the overall correspondence of the abundant classes and regulatory sequences can hardly be a coincidence.
It would be interesting to relate our classification to experimental data on regulation of transcription of genes that flank the 100 intergenic regions we studied. Unfortunately, there are currently not enough data on the regulators of transcription of these genes, and for some genes which are better studied, no orthologous sequence of the complete adjacent intergenic region is known. However, this situation will change in the near future, making direct experimental verification of classifications like ours feasible.
Interestingly, the three largest classes have consensuses similar to parts of transposable elements (TEs). This suggests that sequences derived from TEs may play a significant role in mammalian gene regulation (14,30). Although TEs that can be recognized by RepeatMasker, which typically means >50% similarity to a TE consensus in a sequence of more than 50 nucleotides, were masked before the construction of hits (4), shorter sequences of apparent TE origin were not excluded.
An intriguing possibility is that fragments of hits that constitute our most abundant class, the one that corresponds to the Jelinek–Schmid hairpin, play a role in transcription of non-coding RNAs which appear to be ubiquitous in eukaryotic genomes (31). Unfortunately, ab initio prediction of non-coding genes is currently unreliable, and we cannot find any experimental evidence (positive or negative) of such genes within the genome regions that we studied. Obviously, fragments of hits from the Jelinek–Schmid hairpin class can be viewed as probable sites for non-coding genes.
Consensuses of some classes have similarities with two or more different regulatory signals. For example, we found overlapping Sp1 and AP2 binding sites in our set, as shown before for lens-specific MIP (24). Some G-rich classes, for example class 4, class 12 and class 17, are similar to the CTCF DNA target sites, which comprise a wide range of dissimilar G-rich sequences. CTCF is an evolutionarily conserved zinc finger phosphoprotein that binds through combinatorial use of its 11 fingers to ~50 bp target sites that have remarkable G-rich sequence variation. Formation of different CTCF–DNA complexes, some of which are methylation sensitive, results in distinct functions, including gene activation, repression, silencing and chromatin insulation (18).
We believe that classification of hits, especially when many more of them become available due to comparison of complete genomes, will help to discover new regulatory sequences. It may be worthwhile to study experimentally the consensuses of 10 classes for which we could find no similarity with the known regulatory signals. With the accumulation of data on intergenic regions, classification of their conserved segments must become a routine part of comparative genomics, similar to the analysis of paralogous proteins.
| MATERIALS AND METHODS |
|---|
Sequences and hits
We aligned 100 pairs of complete orthologous intergenic regions of human and murine genomes (4). These intergenic regions have average lengths of ~10 000 (mouse) and ~13 000 (human) nucleotides and are mostly untranscribed, because the putative sites of initiation and termination of transcription are usually close to the first or the last coding exon of a gene and the typical length of 5'- and 3'-non-coding regions of mammalian mRNAs is ~1000 nucleotides (32). Since most human and murine intergenic sequences are not alignable (2), alignments of their complete intergenic regions are, in fact, successions of local alignments of similar segments of sequences (hits), alternating with dissimilar interhits. A hit is an unambiguous local alignment with interspecies similarity >50%. In contrast, similarity within interhits generally does not exceed that between random sequences. Aligning was performed by the Lamarck software tool (4,33), which assembles local alignments from dense diagonals corresponding to successive matches. Our alignments contain 2638 hits, with an average length of 134 nucleotides.
Aligning alignments representing hits
There may be several ways of aligning alignments (11). We used perhaps the simplest approach, and assumed that only matches of matches indicate similarity between alignments. Formally, an alignment corresponding to a hit is represented by a string within a five-letter alphabet. Letters a, t, g and c represent nucleotide matches (aa, tt, gg and cc), and letter n represents any aligned pair of different nucleotides, or any nucleotide aligned against a gap. When aligning these strings, we treated every pair of letters that contains at least one n (including nn), as well as any pair of different letters, as a mismatch. We constructed local alignments of hits using the Lamarck algorithm (4,33) with the following parameters: minimal number of consecutive matches, 3; mismatch penalty, 1; gap initiation penalty, 2; and gap elongation penalty, 1; and counted all alignments with scores of 10 or more. When, for comparison, individual sequences from hits were aligned (Figs 1 and 2), the minimal number of consecutive matches was six and the minimal score of an alignment was 16.
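The five-letter representation and the match rule described above can be illustrated with a short sketch. It is not the Lamarck implementation: the function names `encode_hit` and `columns_match` are hypothetical, and the input is assumed to be two equal-length aligned strings with `-` marking gaps.

```python
def encode_hit(seq_a: str, seq_b: str) -> str:
    """Collapse a pairwise alignment into the alphabet a, t, g, c, n.

    Assumption: seq_a and seq_b are the two aligned rows of one hit,
    of equal length, with '-' used for gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    encoded = []
    for x, y in zip(seq_a.lower(), seq_b.lower()):
        # Keep the nucleotide only when both species agree on it;
        # mismatches and gap columns all become 'n'.
        encoded.append(x if x == y and x in "atgc" else "n")
    return "".join(encoded)


def columns_match(p: str, q: str) -> bool:
    """Match of matches: identical letters, and neither letter is 'n'."""
    return p == q and p != "n"
```

For instance, under these assumptions `encode_hit("ACGT-A", "ACTTGA")` yields `"acntna"`, and a pair of `n` columns is scored as a mismatch just like a pair of different letters.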
Finding common fragments of hits
For each site within the alignment representing a hit, we counted the number of alignments with alignments representing other hits that overlap it. We then identified common fragments within the hit, defined as at least six successive sites each overlapped by at least 10 alignments with other hits. These parameters were chosen because most regulatory signals have conservative cores of six or more nucleotides, and 10 is a coverage that is sufficiently above the background noise under our parameters of aligning alignments.
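The definition of a common fragment can be sketched as a scan over a per-site coverage vector. The helper names and the half-open `(start, end)` interval convention below are assumptions made for illustration; the thresholds of 10 alignments and six successive sites are the ones stated in the text.

```python
def coverage_from_overlaps(hit_length, overlaps):
    """Count, for each site, how many alignments with other hits overlap it.

    Assumption: each overlap is a half-open (start, end) interval in
    site coordinates of the hit.
    """
    coverage = [0] * hit_length
    for start, end in overlaps:
        for i in range(max(start, 0), min(end, hit_length)):
            coverage[i] += 1
    return coverage


def common_fragments(coverage, min_cov=10, min_len=6):
    """Return half-open intervals of >= min_len successive sites,
    each covered by at least min_cov alignments."""
    fragments = []
    run_start = None
    for i, c in enumerate(coverage):
        if c >= min_cov:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                fragments.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the hit.
    if run_start is not None and len(coverage) - run_start >= min_len:
        fragments.append((run_start, len(coverage)))
    return fragments
```

In this sketch a run of well-covered sites that is shorter than six sites is discarded, which mirrors the requirement that a common fragment span at least the typical core length of a regulatory signal.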
Classifying common fragments of hits
We constructed all pairwise alignments of alignments corresponding to common fragments of hits. Two fragments were considered similar if S/L_l > 0.6 and S/L_s > 0.8, where S is the score of their alignment, and L_l and L_s are the lengths of the alignments corresponding to the longer and the shorter fragments, respectively. After this, a classification was constructed by single-linkage clustering: if a and b, and b and c are similar, a, b and c were attributed to the same class, regardless of similarity between a and c.
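The similarity criterion and the single-linkage step can be sketched as follows. This is a minimal illustration, not the authors' code: it uses a union-find structure (one common way to implement single-linkage grouping), and the names `are_similar` and `single_linkage` are hypothetical. Alignment scores are assumed to be computed elsewhere.

```python
def are_similar(score, len_a, len_b, t_long=0.6, t_short=0.8):
    """Apply the two ratio thresholds: S/L_l > 0.6 and S/L_s > 0.8."""
    longer, shorter = max(len_a, len_b), min(len_a, len_b)
    return score / longer > t_long and score / shorter > t_short


def single_linkage(n_fragments, similar_pairs):
    """Group fragments 0..n_fragments-1 into classes by single linkage.

    similar_pairs is an iterable of (i, j) index pairs judged similar.
    """
    parent = list(range(n_fragments))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for i in range(n_fragments):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())
```

With pairs `(0, 1)` and `(1, 2)`, fragments 0, 1 and 2 end up in one class even though 0 and 2 were never compared directly, which is the chaining behaviour the text describes.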
Two criteria must be met for a class to be declared similar to an entry from the TRANSFAC or TRRD databases (www.itba.mi.cnr.it/tradat): (i) the consensus of this class must exactly match one of the actual regulatory signals presented in the databases and (ii) this consensus must be 80% similar to the consensus of the regulatory signal presented in the databases.
| FOOTNOTES |
|---|
+ To whom correspondence should be addressed. Tel: +1 301 594 5693; Fax: +1 301 480 2918; Email: shabalin@ncbi.nlm.nih.gov
| REFERENCES |
|---|
1 Hardison,R.C., Oeltjen,J. and Miller,W. (1997) Long human–mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res., 7, 959–966.
2 Jareborg,N., Birney,E. and Durbin,R. (1999) Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res., 9, 815–824.
3 Shabalina,S.A. and Kondrashov,A.S. (1999) Pattern of selective constraint in C. elegans and C. briggsae genomes. Genet. Res., 74, 23–30.
4 Shabalina,S.A., Ogurtsov,A.Y., Kondrashov,V.A. and Kondrashov,A.S. (2001) Selective constraint in intergenic regions of human and mouse genomes. Trends Genet., 17, 373–376.
5 Hardison,R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet., 16, 369–372.
6 Wasserman,W.W., Palumbo,M., Thompson,W., Fickett,J.W. and Lawrence,C.E. (2000) Human–mouse genome comparisons to locate regulatory sites. Nat. Genet., 26, 225–228.
7 Duret,L. and Bucher,P. (1997) Searching for regulatory elements in human noncoding sequences. Curr. Opin. Struct. Biol., 7, 399–406.
8 Schwartz,S., Zhang,Z., Frazer,K.A., Smit,A., Riemer,C., Bouck,J., Gibbs,R., Hardison,R. and Miller,W. (2000) PipMaker: a web server for aligning two genomic DNA sequences. Genome Res., 10, 577–586.
9 Kent,W.J. and Zahler,A.M. (2000) Conservation, regulation, synteny, and introns in a large-scale C. briggsae–C. elegans genomic alignment. Genome Res., 10, 1115–1125.
10 Levy,S., Hannenhalli,S. and Workman,C. (2001) Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics, 17, 871–877.
11 Kececioglu,J.D. and Zhang,W.Q. (1998) Aligning alignments. In Farach,M. (ed.), Proceedings of CPM 1998, Lecture Notes in Computer Science 1448. Springer Verlag, Berlin, Germany, pp. 189–208.
12 Wingender,E., Chen,X., Fricke,E., Geffers,R., Hehl,R., Liebich,I., Krull,M., Matys,V., Michael,H., Ohnhauser,R. et al. (2001) The TRANSFAC system on gene expression regulation. Nucleic Acids Res., 29, 281–283.
13 Kolchanov,N.A., Podkolodnaya,O.A., Ananko,E.A., Ignatieva,E.V., Stepanenko,I.L., Kel-Margoulis,O.V., Kel,A.E., Merkulova,T.I., Goryachkovskaya,T.N., Busygina,T.V. et al. (2000) Transcription regulatory regions database (TRRD): its status in 2000. Nucleic Acids Res., 28, 298–301.
14 Jurka,J. (1995) Origin and evolution of Alu repetitive elements. In Maraia,R.J. (ed.), The Impact of Short Interspersed Elements (SINEs) on the Host Genome. R.G. Landes Company, Austin, TX, pp. 25–41.
15 Zhou,Y., Zheng,J.B., Gu,X., Li,W. and Saunders,G.F. (2000) A novel Pax-6 binding site in rodent B1 repetitive elements: coevolution between developmental regulation and repeated elements? Gene, 245, 319–328.
16 Bois,P., Williamson,J., Brown,J., Dubrova,Y.E. and Jeffreys,A.J. (1998) A novel unstable mouse VNTR family expanded from SINE B1 elements. Genomics, 49, 122–128.
17 Sap,J., de Magistris,L., Stunnenberg,H. and Vennstroem,B. (1990) A major thyroid hormone response element in the third intron of the rat growth hormone gene. EMBO J., 9, 887–896.
18 Ohlsson,R., Renkawitz,R. and Lobanenkov,V. (2001) CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease. Trends Genet., 17, 520–527.
19 Cowie,A. and Kamen,R. (1984) Multiple binding sites for polyomavirus large T antigen within regulatory sequences of polyomavirus DNA. J. Virol., 52, 750–760.
20 Shields,J.M. and Yang,V.W. (1998) Identification of the DNA sequence that interacts with the gut-enriched Kruppel-like factor. Nucleic Acids Res., 26, 796–802.
21 Morris,J.F., Hromas,R. and Rauscher,F.J.,III (1994) Characterization of the DNA-binding properties of the myeloid zinc finger protein MZF1: two independent DNA-binding domains recognize two DNA consensus sequences with a common G-rich core. Mol. Cell. Biol., 14, 1786–1795.
22 Pontiggia,A., Rimini,R., Harley,V.R., Goodfellow,P.N., Lovell-Badge,R. and Bianchi,M.E. (1994) Sex-reversing mutations affect the architecture of SRY-DNA complexes. EMBO J., 13, 6115–6124.
23 Bossone,S.A., Asselin,C., Patel,A.J. and Marcu,K.B. (1992) MAZ, a zinc finger protein, binds to c-MYC and C2 gene sequences regulating transcriptional initiation and termination. Proc. Natl Acad. Sci. USA, 89, 7452–7456.
24 Ohtaka-Maruyama,C., Wang,X., Ge,H. and Chepelinsky,A.B. (1998) Overlapping Sp1 and AP2 binding sites in a promoter element of the lens-specific MIP gene. Nucleic Acids Res., 26, 407–414.
25 Dohna,C.L., Brandeis,M., Berr,F., Mossner,J. and Engeland,K. (2000) A CDE/CHR tandem element regulates cell cycle-dependent repression of cyclin B2 transcription. FEBS Lett., 484, 77–81.
26 Stojanovic,N., Florea,L., Riemer,C., Gumucio,D., Slightom,J., Goodman,M., Miller,W. and Hardison,R. (1999) Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions. Nucleic Acids Res., 27, 3899–3910.
27 Carey,M. and Smale,S. (1999) Transcriptional Regulation in Eukaryotes. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
28 Pilpel,Y., Sudarsanam,P. and Church,G.M. (2001) Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet., 29, 153–159.
29 Barry,A.E., Howman,E.V., Cancilla,M.R., Saffery,R. and Choo,K.H. (1999) Sequence analysis of an 80 kb human neocentromere. Hum. Mol. Genet., 8, 217–227.
30 Donnelly,S.R., Hawkins,T.E. and Moss,S.E. (1999) A conserved nuclear element with a role in mammalian gene regulation. Hum. Mol. Genet., 8, 1723–1728.
31 Mattick,J.S. (2001) Non-coding RNAs: the architects of eukaryotic complexity. EMBO Rep., 2, 986–991.
32 Makalowski,W. and Boguski,M.S. (1998) Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc. Natl Acad. Sci. USA, 95, 9407–9412.
33 Wolf,Y.I., Rogozin,I.B., Kondrashov,A.S. and Koonin,E.V. (2001) Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res., 11, 356–372.