We have sequenced and compared DNA from the ends of three human chromosomes: 4p, 16p and 22q. In all cases the pro-terminal regions are subdivided by degenerate (TTAGGG)n repeats into distal and proximal sub-domains with entirely different patterns of homology to other chromosome ends. The distal regions contain numerous, short (<2 kb) segments of interrupted homology to many other human telomeric regions. The proximal regions show much longer (~10-40 kb) uninterrupted homology to a few chromosome ends. A comparison of all yeast subtelomeric regions indicates that they too are subdivided by degenerate TTAGGG repeats into distal and proximal sub-domains with similarly different patterns of identity to other non-homologous chromosome ends. Sequence comparisons indicate that the distal and proximal sub-domains do not interact with each other and that they interact quite differently with the corresponding regions on other, non-homologous, chromosomes. These findings suggest that the degenerate TTAGGG repeats identify a previously unrecognized, evolutionarily conserved boundary between remarkably different subtelomeric domains.
Subtelomeric sequences, characterized from a large number of organisms, contain a wide variety of repetitive DNA, ranging from low copy interspersed repeats found on a few chromosome ends, to highly repetitive sequences present in many subtelomeric regions. Tandem repeats are commonly found in many subtelomeric regions and some sequences which are present as a single-copy at one subtelomeric region are also found on many other chromosome ends (reviewed in 1 ,2 ).
Despite the complexity of subtelomeric regions, previous studies have revealed no conserved elements indicative of an important function. Furthermore, naturally occurring mutations in humans show that chromosomes without subtelomeric sequence can be inherited normally (3 ). Similarly, yeast chromosomes devoid of subtelomeric sequence do not appear to be adversely affected (4 ). Although mutation analysis to date has tested function only in limited ways, it suggests that subtelomeric regions may serve no important biological role other than to act as a buffer between functional telomeric repeats and the most distal genes on the chromosome. Thus the prevailing view is that the complex structure of subtelomeric regions simply reflects dispersal of multiple repetitive sequences between chromosomal ends by relatively unconstrained recombination and gene conversion.
The processes that drive the production and maintenance of subtelomeric repeats have been studied in most detail in yeast. Immediately proximal to the yeast telomere sequence (TG1-3) are a variable number of repeats known as Y' elements which fall into two major size classes of 5.2 and 6.7 kb. Frequent homologous recombination and gene conversion events have been documented between Y' elements that can account for the spread and variability of these yeast subtelomeric sequences (5 -7 ). In addition, the presence of a mitochondrial intron in yeast subtelomeric sequence (8 ) and copies of interstitially located genes (9 ,10 ) in the absence of flanking homologous sequences indicates that non-homologous recombination events may be relatively common in subtelomeric regions. Similar processes are thought to explain the distribution of subtelomeric repeats in other organisms (2 ).
However, it is not known whether sequence exchange across the subtelomeric region is random, or whether it is constrained in some way that reflects structural organization. One way to investigate this question is to search for conserved structural features within a single species and through inter-species comparisons. Currently, complete structural information is only available for yeast, in which the only conserved feature found in the 20 telomeres currently analysed is a 475 bp (X) element containing a putative origin of replication (autonomously replicating sequence, ARS) (1 ) and binding sites for two essential proteins (11 ,12 ).
We have searched for conserved structural features in human subtelomeric DNA and have sequenced and characterized 118 kb of the 4p telomere and 40 kb of subtelomeric DNA from 22q. We have previously sequenced 285 kb of the 16p telomere (13 ) and now report comparisons of the sequence organization of the three telomeres. In addition we have analyzed the telomeric regions of the yeast genome sequence to look for conserved features within yeast and between yeast and humans.
To characterize the 4p subtelomeric region we obtained 118 768 bp of sequence extending from the terminal (TTAGGG) repeats (accession no. Z95704). To compare this sequence with other subtelomeric regions we designed primer pairs that amplify DNA free of known repeats (Alu, L1, MER etc.) across the entire subtelomeric region. Using these primers we analysed DNA from 30 YACs that contain different human telomeres (14 ) and DNA from a somatic cell hybrid panel (UK HGMP resource centre). The results of these two PCR surveys are broadly in agreement: no sequence unique to chromosome 4p is found within 60 kb of the telomere. Combining existing hybridization data (15 ) with the PCR analysis we find that the 4p subtelomeric sequence extends to a region between co-ordinates 58 321 and 60 560 that contains an L1 repeat.
The distribution of matches to other chromosomes across the ~60 kb of 4p subtelomeric sequence, as judged by PCR analysis, is not random (Fig. 1 ). Within the terminal 15 kb of 4p there are matches to 17 different telomeres whereas the more proximal 45 kb matches only chromosome 4q and four acrocentric p arms (13p, 15p, 21p and 22p) (15 ). Each sequenced PCR product from non-homologous chromosomes showed >80% similarity to the corresponding 4p sequence. However, the initial PCR analysis suggested that the matches might be complex. To investigate this further we looked for sequence matches between the 4p telomere and 661 subtelomeric sequences (average size 450 bp) that we have previously compiled from 31 telomeric YACs.
We next asked whether any other features might distinguish the distal from the proximal subtelomeric sequence of 4p. We analysed GC content, the frequency of di and tri-nucleotides, CpG islands, the presence of repetitive sequences, and screened DNA and protein databases using BLASTX and BLASTN. The proximal and distal subtelomeric sequence differed in two respects (summarized in Fig. 2 a). First, Alu family repeats are not randomly distributed throughout the sequence; they increase in frequency from 4.2% in the distal telomeric 15 kb to 7.6% in the centromeric subtelomeric sequence and to 10.2% in chromosome unique sequence. Second, distal subtelomeric DNA has a relative enrichment of partially matching ESTs, concentrated between co-ordinates 11 073 and 17 265. However, the genomic sequence contains no ORFs to match the ESTs and RT-PCR analysis failed to detect a transcript in three different tissues examined. By contrast EST matches in the proximal subtelomeric region did correspond to an ORF. The predicted protein includes zinc finger motifs encoded in chromosome unique sequence (see Fig. 2 a and legend).
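The compositional measures listed above (GC content and di- and tri-nucleotide frequencies) can be illustrated with a minimal sketch. It is not the software used in this study: the function names, the sliding-window scheme and the choice to ignore ambiguous bases are assumptions made only for illustration.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted

def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer made only of A/C/G/T.

    k=2 gives di-nucleotide and k=3 tri-nucleotide frequencies.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def windowed_gc(seq, window, step):
    """GC content in sliding windows, returned as (start, gc) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, max(len(seq) - window, 0) + 1, step)
    ]
```

Applied separately to the distal and proximal parts of a sequence, such functions allow the two domains to be compared directly; the window and step sizes are free parameters.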
A dot-plot comparison of the 16p and 4p subtelomeric sequence shows only two regions of similarity (data not shown). The most terminal 2400 bp of 4p and the terminal 1130 bp of 16p (allele A) have a similar repeated structure (though previous analyses have shown that this structure does not occur on all 16p alleles). A second region of similarity was found between co-ordinates 9518 and 9914 of 16p, and 13 188 and 13 322 of 4p. Here the 16p sequence contains 58 internal degenerate telomere repeats (TTAGGG) in the same orientation as the telomere, in a similar arrangement to those on 4p (Fig. 3 ). Furthermore, the internal telomere repeats are followed closely by matches to the consensus putative origin of replication (two on one strand, one on the other), between co-ordinates 11 441 and 12 004 (again with one mismatch to the consensus) (Table 1 ).
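One simple way to locate internal arrays of degenerate TTAGGG repeats is sketched below. This is an illustrative assumption, not the method used here: it only recognizes tandem hexamers that each differ from TTAGGG by at most a set number of substitutions, and it does not handle insertions or deletions within the array. The function names and default thresholds are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_degenerate_arrays(seq, unit="TTAGGG", max_mismatch=1, min_units=3):
    """Return (start, end, n_units) for runs of tandem near-copies of unit.

    Coordinates are 0-based and half-open. A run is reported only if it
    holds at least min_units consecutive units, each within max_mismatch
    substitutions of the unit. Indels are not modelled (a simplification).
    """
    seq = seq.upper()
    k = len(unit)
    arrays = []
    i = 0
    while i + k <= len(seq):
        n = 0
        j = i
        # Extend the run one unit at a time while units stay close enough.
        while j + k <= len(seq) and hamming(seq[j:j + k], unit) <= max_mismatch:
            n += 1
            j += k
        if n >= min_units:
            arrays.append((i, j, n))
            i = j
        else:
            i += 1
    return arrays
```

To detect repeats in the opposite orientation, the same scan can be run with the reverse-complement unit CCCTAA.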
Remarkably, the internal TTAGGG repeats separate the 16p subtelomeric region into similar distal (first 11 kb) and proximal (11.0-36.8 kb) subtelomeric domains. Using primers from the 16p sequence to analyze telomere YACs and a somatic cell hybrid panel, we found matches to 19 telomeres in the distal subtelomeric sequence with a mosaic distribution, while the proximal subtelomeric sequence contained uninterrupted matches to 9q, 10p, 18p and the pseudo-autosomal region of Xq and Yq. Comparison of 16p sequence with the telomere sequence database revealed matches to five other telomeres in the distal subtelomeric region, one of which was a split match. As with the 4p sequence, distal subtelomeric sequence was characterized by a relative dearth of Alu sequences and a comparative richness of ESTs. No Alu family repeats occur in the distal subtelomeric sequence of 16p, but in the proximal subtelomeric sequence the frequency is 23%. ESTs were concentrated between 4088 and 8026, but none could be shown to be expressed by RT-PCR. These data are summarized in Figure 2 b.
The structural organization of 4p and 16p was compared with that of the 22q telomere. A cosmid that contains subtelomeric sequence from 22q has previously been reported and its telomeric location confirmed by physical mapping (22 ). The 22q terminus is situated <5 kb from the end of this clone which contains 40 kb of subtelomeric DNA. We obtained complete sequence of the cosmid and searched it for ESTs, matches to other telomere sequences and for putative boundary elements. The results are summarized in Figure 2 c.
The 22q subtelomeric region has the same structural organization as 16p and 4p. Again there is a concentration of ESTs and matches to other telomeres in the terminal region followed proximally by internal degenerate TTAGGG repeats. The number of matches to other chromosomes drops abruptly after the internal TTAGGG repeats: while there are 10 matches to other telomeres distal to the TTAGGG repeats, none are found proximal. Furthermore, as on 16p and 4p, some of the matches to other chromosome ends are split. We have not explored the distribution of matches to other chromosomes by PCR. Preliminary sequence data from cosmids extending from subtelomeric sequence into chromosome unique DNA on 22q (22 ) confirm that internal TTAGGG repeats and matches to other telomeres are restricted to the distal subtelomeric domain.
Again like the 4p and 16p sequence, the 22q subtelomeric sequence contains a match to the consensus putative origin of replication. However rather than lying proximal to the internal repeats (as on 4p and 16p), it is found 1.5 kb distal to the repeats. The match is shown in Table 1 , along with the sequences from 4p and 16p. Note that the 22q and 4p sequences have >95% similarity. All sequences have a single mismatch to the putative consensus. Overall therefore the structural organization of subtelomeric sequence on 22q is the same as 4p and 16p.
A sequence alignment of internal TTAGGG repeats from eight chromosomes is shown in Figure 3 . In addition to sequences from 4p, 16p and 22q, the figure shows sequence obtained by PCR from somatic cell hybrids (chromosomes 13, 15, 21 and 22) and from a 4q YAC. The subtelomeric location of the acrocentric chromosome sequences is inferred from their absence from YACs containing the long arm telomere, although an interstitial location cannot be excluded. As expected, the acrocentric sequences and 4q are >95% identical to 4p. Note that sequences adjacent to the internal TTAGGG repeats (apart from the putative origin of replication) are dissimilar in 22q, 16p and 4p, as shown by dot-plot comparisons.
To see whether the division of subtelomeric sequences into proximal and distal domains might be a general feature of eukaryotic chromosomes, we examined yeast subtelomeric regions. We performed pairwise comparisons of at least 30 kb of each yeast subtelomeric region and dot-plot comparisons of human and yeast subtelomeric regions. Figure 2 d contains a summary of the conserved features of the yeast subtelomeric region and the extent of matches between chromosome ends. The parallels with human subtelomeric regions are striking.
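For readers unfamiliar with dot-plot comparisons, the idea can be sketched as follows. This is a minimal word-match version written for illustration; the programs and parameters actually used for the dot-plots are not specified in this form, and the word size below is an arbitrary assumption.

```python
from collections import defaultdict

def dot_plot(seq_a, seq_b, word=11):
    """Return sorted (i, j) pairs where seq_a[i:i+word] == seq_b[j:j+word].

    Each pair is one dot. Runs of dots along a diagonal (i and j both
    increasing by 1) indicate a shared segment in the same orientation.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    # Index every word of seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word + 1):
        index[seq_a[i:i + word]].append(i)
    dots = []
    for j in range(len(seq_b) - word + 1):
        for i in index.get(seq_b[j:j + word], ()):
            dots.append((i, j))
    return sorted(dots)
```

Exact word matching is the simplest variant; practical dot-plot programs usually tolerate mismatches within the window, which matters when the repeats being compared are degenerate.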
The yeast telomere consists of TG1-3 repeats. Immediately adjacent are variable numbers of highly conserved Y' elements that are found at 17 of the 32 ends of the sequenced strain and can account for up to 30 kb of any particular end (23 ). Comparisons between Y' elements on different chromosomes show the same interrupted homologies that we observed in the 4p and 16p distal subtelomeric sequence (6 ). Centromeric to the Y' elements are highly variable short repeats [junction (J) elements in Fig. 2 d], some combination of which is found at most ends. The combination of the Y' elements and junction sequences creates a mosaic of short tracts of homology among all ends. These homologies are analogous to those observed in the human distal subtelomeric region. Y' elements are expressed at meiosis (1 ) and are therefore yeast ESTs which may be equivalent to those found at human telomeres.
Every chromosome end has a core X element centromeric to the Y' and J elements. Within the X element, at its border with the J sequences, is an array of degenerate TTAGGG repeats, which correspond to the canonical vertebrate telomere sequence rather than to the yeast telomere sequence (TG1-3 repeats). Twenty-nine yeast telomeres contain at least one perfect copy of TTAGGG in the X element. Dot-plot comparisons reveal degenerate arrays in this vicinity at all yeast telomeres. Figure 4 shows a dot-plot comparison between the left end of yeast chromosome 1 and 16p. The repeats are in the same orientation as the actual yeast telomere and are followed proximally by an ARS element. The TTAGGG repeats and ARS element constitute the boundary between yeast distal and proximal subtelomeric sequences.
On the centromeric side of core X there are larger (6-30 kb) contiguous blocks of homology (>90%) shared by fewer ends, usually just two to three. These larger, less dispersed subtelomeric repeats are analogous to the proximal subtelomeric sequence observed in human chromosomes. Thus the general structure of yeast subtelomeric regions closely resembles that of the human subtelomeric sequences.
We have shown that there are conserved structural features between yeast and human subtelomeric regions. In both human and yeast a mosaic pattern of matches to many other chromosomes lies immediately adjacent to a discontinuous region of longer uninterrupted matches to a few chromosome ends. Between these two domains are blocks of degenerate TTAGGG repeats and sequences with homology to putative origins of replication. It is particularly noteworthy that the consensus sequence of vertebrate telomeres (TTAGGG) is present in yeast as a putative boundary between distal and proximal subtelomeric sequences.
Although we have not proven that the structural organization of 16p, 4p and 22q is common to all human chromosome ends, a combination of sequence data and PCR analyses makes it likely that this is so. PCR analysis indicates that the structure of 4q, 13p, 15p, 21p and 22p is very similar to 4p and that the structure of 9q, 10p, 18p and XqYq is like that of 16p. Furthermore, complete sequence of the 4q and 10q subtelomeric regions confirms the 4p and 4q similarity and shows that 10q has the same structural features as 4q (Jane Hewitt, personal communication; the 4q sequence is available: DDBJ/EMBL/GenBank accession no. U74496).
The sequence comparisons described here suggest that the distal and proximal sub-domains do not frequently interact with each other and furthermore that they may interact differently with other non homologous chromosomes. The pattern of sequence similarities described in this paper cannot be attributed to random exchange processes for a number of reasons.
First, the consequence of simple recombination homogenization would be that all ends with proximal subtelomeric similarities should also share distal similarities. If the process of sequence exchange simply involved occasional unconstrained swapping of chromosome ends, sequence similarities between chromosomes would be seen as a continuous gradient with the greatest number of shared sequences at the distal ends and fewer at the proximal ends. In fact the number of yeast and human chromosome ends sharing homologies with a specific end is approximately constant up to the putative boundary region, and then drops sharply. Figures 5 a and 1 show the number of yeast ends with similarity to yeast chromosome XV R and the number of human ends with similarity to 4p (as assessed by PCR) plotted as a function of distance from the telomere. 16p and 22q show the same pattern as 4p, with an abrupt fall in the number of matches to other chromosome ends proximal to the internal TTAGGG repeats.
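The kind of profile described here, the number of other chromosome ends matching a reference end as a function of distance from the telomere, can be tabulated as in the sketch below. The input layout (start, end, name of the matching end, in reference coordinates with position 0 at the telomere) and the window size are assumptions for illustration, and any coordinates passed to it would be the user's own data rather than values from this study.

```python
def ends_matched_per_window(matches, length, window):
    """Count distinct matching chromosome ends in each window.

    matches: iterable of (start, end, end_name), 0-based half-open
             coordinates on the reference, measured from the telomere.
    length:  length of the reference sequence.
    window:  window size in bases.
    Returns a list of (window_start, n_distinct_ends).
    """
    n_windows = (length + window - 1) // window
    names = [set() for _ in range(n_windows)]
    for start, end, name in matches:
        first = max(start, 0) // window
        last = (min(end, length) - 1) // window
        # Record the end in every window the match overlaps.
        for w in range(first, last + 1):
            names[w].add(name)
    return [(w * window, len(s)) for w, s in enumerate(names)]
```

Under unconstrained swapping of ends one would expect the counts to decline gradually with distance, whereas a boundary would appear as a roughly flat profile followed by a sharp step.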
Second, if sequence similarities were a function of a decreasing level of exchange with increasing distance from the telomere, then sequence divergence should reflect this gradient. Remarkably, in yeast, the average sequence divergence actually peaks at the boundary region (the core X element) with a sharp drop to near identity on either side. Figure 5 b shows that there is an average 10.5% divergence for the 31 other ends relative to the XV R X element. On either side the divergence is 0.1%, gradually increasing to 2.5-4% as distance away from X increases. It seems that recombinational homogenization occurs preferentially either side of the boundary region, which itself may be a barrier to such processes.
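A divergence figure of this kind can be computed from a pairwise alignment as sketched below. The exact definition used for Figure 5 b is not given in the text, so the convention here (mismatches as a percentage of aligned, ungapped columns) is an assumption, as are the function names.

```python
def percent_divergence(aligned_a, aligned_b, gap="-"):
    """Percentage of mismatching columns among ungapped alignment columns.

    Both inputs must be rows of the same alignment (equal length).
    Columns where either row has a gap are skipped (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared

def mean_divergence(reference, others):
    """Average divergence of several aligned sequences from a reference."""
    values = [percent_divergence(reference, other) for other in others]
    return sum(values) / len(values)
```

Evaluating this in successive windows along an alignment gives a divergence profile that can be plotted against distance from the boundary region.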
Third, unconstrained sequence exchange cannot explain the presence of an identical sequence motif (TTAGGG1-3) in yeast and human chromosomes. In yeast the TTAGGG motif is a binding site for TBF1, the function of which is unknown (11 ,24 ). Core X also contains a binding site for Abf1p, the product of an essential gene, which is involved in both transcriptional regulation and replication (12 ). It is possible that the human subtelomeric region also contains unrecognized protein binding sites. The detection of the putative consensus for an origin of replication is suggestive in this respect.
Fourth, the mechanisms of sequence homogenization appear different on either side of the putative boundary. Interrupted matches are restricted to the distal human sub-telomeric region and are unlike those produced by known mechanisms of sequence homogenization. Indeed nothing similar has been reported in analysis of meiotic recombination hotspots (25 ) nor can retroelements (of either long terminal repeat or poly-A subtypes) explain the pattern of matches observed. While retro-elements may incur 5' and 3' truncations, they are not known to produce split matches of the sort described in the human subtelomeric regions (26 -28 ). Similarly, in yeast, the distribution of many short, interrupted and polymorphic matches (6 ) is inconsistent with the action of a simple homogenization process (23 ).
By contrast, sequence exchange in the proximal region seems to be accounted for by conventional recombination processes as evidenced by well characterized rearrangements in a variety of species including yeast (1 ), human (29 ) and malaria (30 ). The observation of the action of different processes in the two subtelomeric domains argues against the action of random sequence exchange processes.
Therefore, it appears from our comparison of the human (4p, 16p and 22q) sequences with yeast chromosomes that degenerate interstitial TTAGGG sequences and a consensus sequence for putative origins of replication mark a boundary between proximal and distal subtelomeric domains. At present the nature and role, if any, of the boundary are not clear. It is possible that this region sub-compartmentalizes the distal and proximal subtelomeric domains in the nucleus limiting the extent of their interactions with other non-homologous chromosomes, and determining the nature of their interaction via the proteins with which they are sequestered. This hypothesis is supported by preliminary data showing that the X element is a barrier to recombination: deletion of the X element results in increased exchange between the distal and more proximal sequences (F.E. Pryde, T.C. Huckle and E.J. Louis, unpublished data). A model to explain our findings is given in Figure 6 . Further analyses will be necessary to understand the mechanisms underlying the striking compartmentalization of subtelomeric regions from such widely divergent species.
The 4p telomere has previously been cloned into a YAC (Y88BT) and partially subcloned into bacteriophage vectors (λA, λ14, λ18, λN, λ53) (Fig. 1 ) (15 ). The terminal 8 kb were obtained by digesting the total YAC DNA with BAL31, ligating to SmaI digested Sup F plasmid pSD followed by partial digestion with EcoRI, ligation into the lambda bacteriophage vector EMBL3A and suppressor selection on MC1061/p3 (31 ). The library was screened with the pHutel probe (a gift from Dr Sally Cross) and 11 positive clones isolated. Restriction analysis showed that clone 12.3 contained the terminal 8 kb of the YAC which was subcloned into bluescript: P6 (terminal 6 kb BamHI fragment) and P2 (adjacent BamHI/EcoRI fragment). PCR analysis showed that P6 and P2 were contiguous. To clone the region between the plasmid P2 and bacteriophage λA an EMBL3A bacteriophage library constructed from an MboI partial digest of Y88BT was screened with probes from P2 and λA. The phage 4PTEL1 was thus isolated. A gap between λN and λ53 was closed by direct sequencing of PCR products from primers designed from the end of each bacteriophage. The sequenced contig also included the cosmid B31 (32 ). All clones were sequenced by an M13-based shotgun strategy as previously described (33 ). The 16p telomere sequence has been described previously (13 ). To obtain sequence from other telomeres, cosmid libraries were constructed from MboI partial digests of 30 YACs containing human telomeres. Cosmid DNA from each YAC was pooled, sonicated and a 1-2 kb fraction cloned into M13. Both ends of ~96 M13 clones were sequenced from each cosmid library. Sequence from the 22q cosmid was also obtained by a random shotgun strategy and is available electronically from http://www.genome.ou.edu.
The cosmid DNA was isolated free from Escherichia coli host contamination by the cleared lysate, diatomaceous earth-based procedure (34 ) and its sequence was determined at a level of 4.5-fold redundancy via a double-stranded, shotgun-based approach (35 ) using the ABI PRISM fluorescent-labeled terminators and either forward or reverse universal primers. Sequencing vector regions were removed from the individual sequence reads and the resulting sequences were assembled into contiguous fragments initially using the TED and XGAP programs (36 ) and more recently using the Phred, Phrap and Consed programs (B.Ewing, D.Gordon and P.Green, http://www.genome.washington.edu/UWGC). The individual contigs were joined into a final, unique sequence using custom, synthetic primers and Taq DNA polymerase cycle sequencing with fluorescent terminators.
Genomic DNA was extracted from EBV-transformed cell lines by standard procedures. cDNA was prepared from HeLa, fibroblast and EBV-transformed cell lines by RT-PCR as described in (37 ). Northern and Southern blots were performed as described in (38 ). A mono-chromosomal somatic cell hybrid panel was acquired from the UK HGMP resource centre. The sequences of primers and PCR conditions are available from J.Flint (jf{at}worf.molbiol.ox.ac.uk).
After excluding Alu and other repetitive DNA, sequences were screened against dbEST and EMBL with BLASTN, and against a non-redundant compilation of Swissprot, PIR and wormpep (Caenorhabditis elegans genes) with BLASTX. BLAST output was filtered using MSPcrunch (39 ), requiring a minimum of 90% identity for dbEST and EMBL matches. Sequence was viewed from within an ACEDB database. BLASTN/MSPcrunch was also used to identify sequence matches between telomeres. Yeast telomere sequences were obtained during the yeast genome sequencing project and additional sequences for making the subtelomeric contigs from each end were obtained from the ftp site at MIPS (Martinsried Institute for Protein Sequences). These were compared against each other in pairwise fashion using FASTA from the GCG package (40 ). Significant homologies were then aligned using MegAlign from the DNASTAR(TM) Lasergene software package.
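The identity threshold applied to database matches can be illustrated with a minimal sketch. It does not reproduce MSPcrunch or its file formats; the `Hit` record, its fields and the optional length cut-off are hypothetical and are used only to show the filtering step.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One database match (hypothetical record layout for illustration)."""
    query: str
    subject: str
    identity: float  # percent identity over the aligned segment
    length: int      # aligned length in bases

def filter_hits(hits, min_identity=90.0, min_length=0):
    """Keep hits at or above the identity and length thresholds."""
    return [
        h for h in hits
        if h.identity >= min_identity and h.length >= min_length
    ]

def subjects_by_query(hits):
    """Group the subjects of retained hits by query name."""
    grouped = {}
    for h in hits:
        grouped.setdefault(h.query, set()).add(h.subject)
    return grouped
```

Only the 90% identity figure comes from the procedure described above; any length threshold would be an additional choice by the user.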
This work was supported by the Wellcome Trust, the European Union sequencing project and the National Human Genome Research Institute at the NIH. We thank Dr J.Hancock for help in sequence analysis and Rhona Borts for commenting on the manuscript, and David Weatherall for support and encouragement.
Human Molecular Genetics
Introduction
Results
Analysis of the 4p sequence
Sequence characteristics of distal and proximal subtelomeric sequence
Comparison between 4p and 16p
Comparison between 4p, 16p and 22q
Comparison between yeast and human subtelomeric structure
Discussion
Materials And Methods
Contig assembly and sequence determination of the 4p and other telomeres
DNA/RNA sources and analysis
Data analysis
Acknowledgements
References
Copyright Oxford University Press, 1996

