Human Molecular Genetics
Sequence analysis of an 80 kb human neocentromere
Introduction
Results
Generation of the neocentromere (NC) sequence
Sequence composition, A + T content and EST matches
Pericentric DNA sequences and putative protein-binding motifs
[alpha]-Satellite, [beta]-satellite, [gamma]-satellite, classical satellites I and III and other pericentric sequences
CENPB, pJ[alpha], HMGI and topoisomerase II (topoII) protein-binding motifs
Tandem repeats
Human transposable elements
Discussion
Materials And Methods
Generation of the NC DNA sequence
Calculation of normal A + T content and expected abundance of motifs
Computational analysis
Acknowledgements
References
DDBJ/EMBL/GenBank accession no. AF04284 (See Corrigenda).
INTRODUCTION
The centromere is an essential component of all eukaryotic chromosomes, and appears as a primary constriction on all metaphase chromosomes. It functions as the site for kinetochore assembly and spindle fibre attachment, allowing the faithful pairing and segregation of sister chromatids during cell division. An undefined interplay between the centromeric DNA and centromere-binding proteins presumably underlies the formation of the kinetochore complex to allow correct centromere function. An increasing number of kinetochore proteins have now been identified and shown to be highly conserved through evolution (1-4); however, the precise composition of the functional centromere DNA has so far eluded researchers. Despite the fact that in all higher eukaryotes this DNA is made up of highly reiterated satellite sequences, with these sequences generally having a high A + T content, its primary sequence is highly variable between species, making it difficult to correlate sequence conservation with function (5).
Normal human centromeres vary in size between 1 and 4 Mb and are largely composed of highly repetitive [alpha]-satellite sequences. This DNA consists of a 171 bp tandem repeat unit that contains numerous binding sites for centromere protein B (CENPB) (6,7) and pJ[alpha] (8-10). Other types of satellite DNA found at the pericentric regions of human chromosomes include classical satellites I, II and III, and [beta]- and [gamma]-satellites (4,11). Repeats found at or near the human centromere also include an AT-rich sequence (ATRS) (12) and a novel 48 bp repeat (13-15). Some interspersed human transposable elements have also been detected pericentrically (16,17). To date, of all the known centromeric/pericentric satellite DNA sequences, only [alpha]-satellite has been shown experimentally to exhibit functional centromeric activity (18-22).
In recent years, numerous morphologically abnormal marker chromosomes have been described whose centromeres are devoid of detectable levels of [alpha]-satellite DNA but remain functional in mitosis (5,23). These so-called analphoid ([alpha]-satellite-negative) marker chromosomes generally have lost their normal centromere through chromosomal rearrangements, leading to the formation of a new centromere or neocentromere at a previously non-centromeric region on the chromosome arm. These regions have been proposed to be the sites of latent centromeres that can become activated through unknown epigenetic mechanisms (5,24-27).
We previously described a chromosome 10-derived analphoid marker chromosome in a young boy with mild developmental impairment. This marker chromosome, designated mardel(10), has acquired neocentromere activity in a region corresponding to the q25.2 band on normal chromosome 10 (28). Using positional cloning, an 80 kb DNA spanning the core centromere protein-binding domain of the neocentromere was isolated from both a normal chromosome 10 (29) and the mardel(10) chromosome (30). Extensive restriction map comparisons between the normal chromosome 10 and the mardel(10) DNA have revealed an identical structural organization (29,30), supporting the hypothesis of neocentromere formation on mardel(10) via an epigenetic mechanism. In this communication, we present the complete nucleotide sequence of the 80 kb neocentromere DNA derived from the mardel(10) chromosome and discuss the results of our extensive computational analyses of this DNA.
RESULTS
Generation of the neocentromere (NC) sequence
In an earlier study, we designated the neocentromere DNA cloned from the normal 10q25.2 region as the HC DNA (29). To distinguish this DNA from the sequence isolated directly from the mardel(10) chromosome (30) and analysed in detail in the present study, the latter sequence is referred to as NC (neocentromere) DNA. The completed NC sequence consists of 80 155 bp and can be accessed from GenBank (accession no. AF04284). Nucleotides 1 and 80 155 correspond to the q[prime] (proximal to the normal chromosome 10 centromere) and p[prime] (distal to the normal chromosome 10 centromere) ends, respectively, of the NC DNA on the mardel(10) chromosome (29).
The NC DNA sequence represents the core centromere protein-binding domain of the mardel(10) neocentromere and is expected to contain DNA motifs or structures that enable the nucleation of key centromere proteins to elicit functional activity. A comprehensive multi-organism centromere DNA database was created for this study by compiling all the known centromeric and pericentric DNA sequences from the GenBank database (see Materials and Methods). Homology searches of the NC sequence against this database revealed no striking homologies to any of the sequences in the database. This result indicated that the NC sequence is unique with respect to previously characterized centromere DNA sequences from a range of organisms. Therefore, identification of putative functional elements required a more detailed computational analysis of the NC sequence.
Sequence composition, A + T content and EST matches
The NC sequence is composed of 28.79% A, 20.63% C, 20.87% G and 29.71% T nucleotides. This translates to an A + T nucleotide content of 58.5%, compared with 58.7% for 10 Mb of random human genomic sequences (see Materials and Methods), both of which are within the normal range for the human genome (31). A more detailed regional analysis of the NC DNA, scanning for nucleotide stretches of 30 bp with an A + T content >65%, has revealed many small AT-rich islands (Fig. 1).
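The window scan described above can be illustrated with a short sketch. This is not the BASEPAIRPLOT program used in the study; it is a minimal Python approximation, assuming the sequence is available as a plain nucleotide string, that slides a 30 bp window in 25 bp steps and reports windows whose A + T fraction exceeds 65%. The function name and the toy sequence are illustrative only.

```python
def at_rich_windows(seq, window=30, shift=25, threshold=0.65):
    """Return (1-based start, A+T fraction) for windows above the threshold."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, shift):
        chunk = seq[start:start + window]
        at_fraction = (chunk.count("A") + chunk.count("T")) / window
        if at_fraction > threshold:
            hits.append((start + 1, at_fraction))
    return hits


# Toy input (hypothetical): a GC block, an AT block, then another GC block.
toy = "GC" * 15 + "AT" * 15 + "GC" * 15
for start, fraction in at_rich_windows(toy):
    print(start, round(fraction, 3))
```

With the window and shift values quoted in the Figure 1 legend, consecutive windows overlap by 5 bp, so an island that straddles a window boundary is still sampled.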
Figure 1. Distribution of AT-rich islands in the NC sequence. Islands of 30 bp with A + T contents >65% were plotted using the BASEPAIRPLOT program (see Materials and Methods). Window size was set at 30 bp and shift at 25 bp. The arrowhead points to the position of the AT28 sequence, which forms an ~600 bp island of >80% A + T.
Table 1.
| Name (size) | Present | Expected no. (human genome) | Observed no. (NC DNA) | Homology search (%) |
| [alpha]-Satellite (171 bp) | no | nd | 0 | >80 |
| [beta]-Satellite (68 bp) | no | nd | 0 | >80 |
| [gamma]-Satellite (220 bp) | no | nd | 0 | >80 |
| Satellite IA (17 bp) | yes | 1.05 | 3.00 | >88 |
| Satellite IB (24 bp) | no | nd | 0 | >80 |
| Satellite III (5 bp)a | yes | 264.87 | 213.70 | 100 |
| 48 bp repeat (48 bp) | no | nd | 0 | >80 |
| ATRS (483 bp) | no | nd | 0 | >80 |
| CENPB box motif (15 bp) | no | 2.07 | 1.00 | >93 |
| pJ[alpha] motif (9 bp) | yes | 1.14 | 3.00 | 100 |
| HMGI motif (6 bp)b | yes | 1416.20 | 1339.00 | 100 |
| TopoII motif (18 bp) | yes | 124.30 | 121.00 | >94 |
nd, not determined.
The NC DNA spans a normal genomic region (29,30) and, therefore, may contain genes. To determine the presence of expressed sequences and, therefore, potential genes in the NC DNA, we carried out a PowerBLAST search (see Materials and Methods) of the sequence against the GenBank expressed sequence tag (EST) databases. We found three regions in the NC DNA that contained significant matches to human EST sequences. More than 96% identity to the NC DNA was observed with GenBank EST clones AA774571 (NC DNA no. 11 627-13 562), AA584927 (NC DNA no. 41 168-41 536), and AA813722 and AA860318 (NC DNA no. 45 247-45 499). This indicates the presence of expressed sequences in the NC DNA; however, further analysis is required to determine whether or not these are genes.
Pericentric DNA sequences and putative protein-binding motifs
A number of different human pericentric sequences and protein-binding motifs have been described previously. The results of homology searches for these sequences in the NC DNA are summarized in Table 1 and discussed below.
In order to determine whether the abundance of a particular sequence motif within the NC DNA was different from a random stretch of human DNA sequence, it was necessary to derive an expected abundance value for such a motif. This was achieved by creating a mini-database containing sequences derived from >10 Mb of randomly selected human genomic sequences and using this mini-database to calculate the expected abundance for any 80 kb of human genomic sequence (see Materials and Methods).
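The expected-abundance calculation can be sketched as follows. This is a hedged illustration rather than the procedure actually run in the study: it assumes the reference sequences are held in memory as strings, pools them before computing a per-kb density (the study may have averaged per sequence), and counts exact matches only. The function names and the toy reference are hypothetical. The scaling term follows the Materials and Methods, where the searchable length is 80 155 - (motif size - 1) bp.

```python
def count_overlapping(seq, motif):
    """Count exact occurrences of motif in seq, allowing overlaps."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        pos = seq.find(motif, pos + 1)
    return count


def searchable_length(target_len, motif_len):
    """Number of positions at which a motif of this size can start."""
    return target_len - (motif_len - 1)


def expected_count(reference_seqs, motif, target_len=80155):
    """Scale the per-kb motif density of the reference set to the target size."""
    hits = sum(count_overlapping(s.upper(), motif) for s in reference_seqs)
    total_kb = sum(len(s) for s in reference_seqs) / 1000
    per_kb = hits / total_kb
    return per_kb * searchable_length(target_len, len(motif)) / 1000


# A 5-mer can start at 80 151 positions in an 80 155 bp sequence.
print(searchable_length(80155, 5))

# Toy reference (hypothetical): 1 kb made entirely of TGGAA repeats.
print(expected_count(["TGGAA" * 200], "TGGAA"))
```

An observed count can then be compared directly with the value returned by `expected_count`, as is done in Table 1.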
[alpha]-Satellite, [beta]-satellite, [gamma]-satellite, classical satellites I and III and other pericentric sequences
Fluorescence in situ hybridization (FISH) studies have indicated that functional neocentromeres, including that of the mardel(10) chromosome (28,29), lack demonstrable levels of [alpha]-satellite DNA (5). The NC sequence shows no significant homologies to the consensus 171 bp [alpha]-satellite DNA (32). A direct search for sequences homologous to the pericentric 68 bp [beta]-satellite and 220 bp [gamma]-satellite also proved negative. No recognizable homologies were found with the pericentric 48 bp repeat and ATRS sequences. These results clearly indicate that these well-defined components of the centromere or pericentric regions of normal human chromosomes do not contribute to an essential part of the core protein-binding domain of the mardel(10) neocentromere.
Human satellite I DNA is present in the pericentric regions of chromosomes 3, 4, 13, 14, 15, 21 and 22 (33-35). The satellite I monomer is made up of a 42 bp sequence consisting of two parts: IA, a 17 bp conserved sequence of ACAWAAAATAWSAAAGT; and IB, a more variable 25 bp sequence of ACMYMARVYATRDATTHTATWCTGT (36). Within the NC DNA, satellite IB sequences were not detected, and the presence of three copies (compared with the expected 1.05 copies; Table 1) of the satellite IA monomer is also unlikely to be functionally significant.
Satellite III DNA has an underlying 5 bp repeated motif, TGGAA, that appears to be evolutionarily conserved and found at the pericentric regions of most, if not all, human chromosomes (37). As with the [alpha]-satellite DNA, low stringency FISH analysis failed to detect this DNA at the mardel(10) neocentromere (28). A search of the NC sequence using all five cyclic permutations of the satellite III monomer, TGGAA, GGAAT, GAATG, AATGG and ATGGA [since the phasing of the sequence of this repeat monomer is not known (36,38)], showed a prevalence similar to that expected (Table 1). Furthermore, unlike the clustered and tandemly organized structure of normal pericentric satellite III arrays, the observed matches are randomly distributed and non-contiguous. Based on such a dissimilar organization, it becomes difficult to infer any functional significance for the observed satellite III matches, especially given the limited understanding of the functional role of even the authentic pericentric satellite III arrays.
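The satellite III search can be illustrated with a small sketch. It assumes nothing about the software actually used: it simply generates the five phasings of TGGAA, counts each in a sequence string, and measures the longest uninterrupted tandem run, which is one simple way to separate a clustered array from scattered single matches. The function names and the toy sequence are hypothetical.

```python
import re


def rotations(motif):
    """All cyclic permutations (phasings) of a repeat monomer."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def count_overlapping(seq, motif):
    """Count exact occurrences of motif in seq, allowing overlaps."""
    return len(re.findall(f"(?={re.escape(motif)})", seq))


def longest_tandem_run(seq, motif):
    """Largest number of back-to-back copies of motif found in seq."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq)
    return max((len(run) // len(motif) for run in runs), default=0)


print(rotations("TGGAA"))

# Toy sequence (hypothetical): three tandem copies flanked by unrelated bases.
toy = "ACGTTGGAATGGAATGGAACC"
print({m: count_overlapping(toy, m) for m in rotations("TGGAA")})
print(longest_tandem_run(toy, "TGGAA"))
```

A sequence with many matches but a longest run of one copy would correspond to the dispersed, non-contiguous pattern reported for the NC DNA.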
CENPB, pJ[alpha], HMGI and topoisomerase II (topoII) protein-binding motifs
A homology search of the NC DNA for sequences related to the 15 bp degenerate CENPB box motif, TTCGNNNNANNCGGG (39), revealed just one match at the 14/15 nucleotide level, due presumably to a chance occurrence (Table 1). This low level of CENPB box motif is consistent with the absence of detectable CENPB protein binding on the mardel(10) chromosome (28) and shows that the CENPB protein is not necessary for neocentromere formation. Analysis of the NC sequence identified three perfect matches with the human pJ[alpha] motif at positions 10 719, 17 813 and 79 535. However, in view of the high prevalence of the pJ[alpha] motif within the [alpha]-satellite DNA of normal human centromeres (8,9), it is doubtful that these three sporadic copies of pJ[alpha] can exert a major impact on neocentromere formation or activity.
Figure 2. Distribution of putative HMGI and topoII-binding motifs in the NC DNA. (A) The number of sequences with a 100% match to the HMGI motif, WWWWWWS, was plotted using EWINDOWS/ESTATPLOT (see Materials and Methods) with window size set at 100 bp and shift at 95 bp. (B) The number of sequences with >94% match to the vertebrate topoII motif, RNYNNCNNGYNGKTNYNY, was determined using FINDPATTERNS (see Materials and Methods) allowing for a mismatch of one nucleotide. Matches to both forward (+) and reverse complement (-) strands are shown. Each bar represents one match to a motif in the NC DNA.
We also searched the NC sequence against the known DNA-binding motifs of two proteins that have been shown to interact with the centromere: the high-mobility-group protein I (HMGI) and DNA topoisomerase II (topoII). HMGI is an abundant protein that recognizes stretches of six or more A or T nucleotides (40). Thus, when searching for this motif, the array WWWWWWS was used to omit overlapping matches where there were more than six A or T nucleotides. When the NC sequence was searched for these HMGI-binding motifs, the observed prevalence was similar to that expected (Table 1). As shown in Figure
Searches for topoII-binding sites were based on >94% identity to the published 18 bp vertebrate topoII motif, RNYNNCNNGYNGKTNYNY (41). The abundance of this motif was found to be similar to the expected value (Table 1). Matches to the topoII motifs are distributed throughout the NC DNA but appear to be more concentrated at the p[prime] end segment of ~25 kb (Fig. 2B).
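The degenerate motif searches in this section (the CENPB box at 14/15 nucleotides, the topoII motif with one mismatch allowed) can be sketched with a simple IUPAC-aware scanner. This is an illustrative stand-in, not FINDPATTERNS: it assumes an in-memory sequence string, checks every position on both strands, and counts a position as mismatched when the base is not permitted by the IUPAC symbol. The function names and test sequences are hypothetical.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif, max_mismatch=0):
    """Return (1-based forward-strand start, strand, mismatches) per hit."""
    seq = seq.upper()
    k = len(motif)
    hits = []
    for strand, target in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(target) - k + 1):
            mismatches = sum(
                base not in IUPAC[symbol]
                for base, symbol in zip(target[i:i + k], motif)
            )
            if mismatches <= max_mismatch:
                # Report reverse-strand hits in forward-strand coordinates.
                start = i + 1 if strand == "+" else len(seq) - i - k + 1
                hits.append((start, strand, mismatches))
    return hits


CENPB_BOX = "TTCGNNNNANNCGGG"
perfect = "AAA" + "TTCGAAAAAAACGGG" + "AAA"   # hypothetical test sequence
one_off = "AAA" + "TTCGAAAAAAACGGA" + "AAA"   # last G changed to A
print(find_motif(perfect, CENPB_BOX))
print(find_motif(one_off, CENPB_BOX, max_mismatch=1))
```

Allowing one mismatch in the 18 bp topoII motif corresponds to 17/18 identity, which is the >94% threshold quoted in the Figure 2 legend; one mismatch in the 15 bp CENPB box corresponds to the >93% entry in Table 1.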
Tandem repeats
An initial examination of the NC sequence indicated a lack of any large stretches (>1 kb) of tandemly repeated DNA, suggesting that only relatively small repeat arrays, if any, were present. Detailed computational analysis of the NC sequence has identified a number of such small arrays distributed throughout the sequence (Table 2 and Fig. 3).
The AT28 region was previously shown to be a variable number tandem repeat (VNTR) that varied between the NC DNA on the mardel(10) chromosome and the HC DNA on normal chromosome 10 (30). The region is ~600 bp in size and is the largest member of the tandem arrays within the NC DNA, residing between nucleotides 15 164 and 15 753 (Fig. 4).
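Detection of small tandem arrays can be illustrated with a regular-expression sketch. This is not the TANDEM program used in the study, which tolerates imperfect copies; the version below finds only perfect arrays and is offered as a simplified assumption-laden example. The function name, parameter defaults and toy sequence are hypothetical.

```python
import re


def exact_tandem_repeats(seq, min_unit=2, max_unit=30, min_copies=3):
    """Find perfect tandem arrays.

    Returns (1-based start, end, repeat unit, copy number) for each array.
    The lazy quantifier prefers the shortest unit at each start position.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    arrays = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        arrays.append((match.start() + 1, match.end(), unit, copies))
    return arrays


# Toy sequence (hypothetical): 24 copies of AC flanked by unrelated bases.
toy = "GGG" + "AC" * 24 + "TTGCA"
print(exact_tandem_repeats(toy))
```

Arrays with diverged copies, such as those listed with <100% homology in Table 2, would be missed or fragmented by this exact-match approach and need a mismatch-tolerant method.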
Table 2.
| Name | Position (length in bp) | Sequence (size in bp) | No. of copies | Homology (%) | Composition (%) |
| T1 | 2558-2581 (24) | CAGGCACAGTGG (12) | 2 | 100 | 43.4 A + T |
| T2 | 3021-3050 (30) | TAACAAAGTG (10) | 3 | 83.3 | 63.3 A + T |
| T3 | 4173-4211 (39) | AAAAAAATAATTT (13) | 3 | 82.1 | 92.3 A + T |
| AC24 | 6598-6645 (48) | AC (2) | 24 | 100 | 100 A + C |
| GA1 | 10 097-10 136 (40) | GAAAGAAAGG (10) | 4 | 82.5 | 97.5 G + A |
| AT4 | 11 996-12 028 (32) | ATTT (4) | 8 | 100 | 100 A + T |
| AT28 | 15 164-15 753 (589) | ATGTATATATGTGTATATAGACATAAAT (28) | 21 | 82.1 | 81.8 A + T |
| GA2 | 17 406-17 564 (159) | AAGAAGGAAGGAAGAGAAGAAAGAAAAGAAAGAAAAAAAAGGAAAGAAAATA (53) | 3 | 81.8 | 98.7 G + A |
| CT1 | 21 375-21 639 (265) | TTCCCTCCCCCCCCCCTTCCCTCCCTCCTCCCTTCCTTCCTCCCTTCCTTCCT (53) | 5 | 74 | 99.2 C + T |
| T4 | 22 560-22 604 (45) | AATATTACAATAATT (15) | 3 | 80 | 66.0 A + T |
| T5 | 23 098-23 130 (33) | TTTTAAAAATA (11) | 3 | 81.8 | 87.8 A + T |
| T6 | 26 597-26 626 (30) | ATCAATTATT (10) | 3 | 83.3 | 73.3 A + T |
| GA4 | 28 591-28 626 (36) | AAGAAAGGAGGG (12) | 3 | 88.9 | 100 G + A |
| T7 | 28 955-28 984 (30) | TAAAAAAATT (10) | 3 | 83.3 | 93.3 A + T |
| T8 | 31 566-31 598 (33) | TATATTGTAAT (11) | 3 | 81.8 | 90.9 A + T |
| T9 | 35 484-35 525 (42) | ACTCAGCATAGTGG (14) | 3 | 81 | 38.1 A + T |
| T10 | 38 855-38 887 (33) | AAGGTGGAATA (11) | 3 | 81.8 | 54.6 A + T |
| T11 | 40 124-40 153 (30) | TTTATAAATT (10) | 3 | 83.3 | 90.0 A + T |
| T12 | 44 786-44 821 (36) | CTGTGGTTGTTG (12) | 3 | 88.9 | 61.2 A + T |
| T13 | 46 193-46 208 (12) | CATT (4) | 3 | 88.9 | 68.7 A + T |
| AT5 | 46 208-46 230 (20) | TATT (4) | 5 | 100 | 100 A + T |
| AC32 | 51 968-52 045 (78) | AC (2) | 39 | 100 | 100 A + C |
| T14 | 53 919-53 990 (72) | ACCAATCAGCACTCTGTAAAATGG (24) | 3 | 88.9 | 59.7 A + T |
| T15 | 54 208-54 473 (266) | AGGAAGAAACTCCAGACACACCATCTTTAAGAGCTGTAACACTCACTGCAAGGGTCTGCGGCTTCATTCTTGAAGTCAGCAAGACCAAGAACCCACTGGAAGGAAACAATTCCGGACACATTTTGGTGACCCA (133) | 2 | 88.3 | 52.6 A + T |
| T16 | 55 342-55 511 (170) | GTAAGGGTGCAGGTTTTCAAAAATGTGTTGGTAAGGGCCACTAAATCTGACATTCCTTGGTCCTCCTTGTGGTCTAGGAGGAAAA (85) | 2 | 89.4 | 52.3 A + T |
| T17 | 55 515-55 674 (160) | GTGTTTCTGCTGCTGCATTGGTGGGCTCAACTATTCCAATCAGCAGGGTCCAGTGACCTTTGCGGGTTCTTGGGTCGGGG (80) | 2 | 92.5 | 45.8 A + T |
| GA5 | 57 642-57 692 (51) | GGAAAGAGAGAGAGAAA (17) | 3 | 82.4 | 92.0 G + A |
| GA3 | 57 742-57 851 (110) | GAGAGAGAGAGAGGGAAAGACA (22) | 5 | 80.9 | 92.7 G + A |
| T18 | 58 130-59 211 (82) | TGTGTCTAGCTAAAGGATTGTAAATGCACCAATCAGCACTC (41) | 2 | 100 | 58.5 A + T |
| T19 | 59 465-59 730 (266) | AACAAAVTCCAGACACACCATCTTTCAGAGCTGTAACACTCACCGCAAGGGTCTGTGGCTTCATTCTTGAAGTCAGCAAGACCAAGAACCCACCGGAAGGAACAAATTCCAGACACAGTAGGAAATCTGTATT (133) | 2 | 86.5 | 51.2 A + T |
| T20 | 64 525-64 554 (30) | ATAAAATAAG (10) | 3 | 86.7 | 93.3 A + T |
| T21 | 68 662-68 703 (42) | ATAAAAAAATTAAA (14) | 3 | 85.7 | 95.2 A + T |
| T22 | 70 318-70 362 (45) | ATATATATCTGTGTG (15) | 3 | 82.2 | 82.2 A + T |
| T23 | 75 789-75 827 (39) | TAAAAAAGAATAA (13) | 3 | 87.2 | 94.8 A + T |
To discount the possibility that the previously observed difference between the NC and HC DNA in the AT28 region may be directly responsible for the neocentromeric activation of the NC DNA, sequences from a number of normal chromosomes 10 were analysed. The results indicate that the AT28 sequence varies in copy number between 11 and 19 monomeric units within the conserved core region (Fig. 4 and Table 3).
Figure 3. Distribution of low copy number tandem repeats in the NC DNA. Refer to Table 2 for details.
Figure 4. Structure and sequence of AT28. (A) Arrangement of 28 bp tandemly repeating monomers (open arrows). PCR primers N17 and N18 (30) amplify an ~1.2 kb fragment containing the AT28 region. Two RsaI (R) sites are present immediately outside the highly conserved core of the AT28 region. (B) Nucleotide sequence of AT28 showing alignment of each 28 bp monomer and derivation of a consensus sequence. Dashed lines represent gaps introduced to optimize alignment. Asterisks denote perfectly conserved monomers. The RsaI sites are underlined. The nucleotide positions of the AT28 region in the NC DNA are shown. (C) Structural features of AT28. Closed arrows show a perfect palindrome around a central TGTG sequence (underlined). Hashed arrows represent a larger and slightly imperfect (at positions 5 and 20) palindrome around a central GTGT sequence (hashed underline). Open arrows indicate a mirror image around a central A nucleotide (double underlined).
Table 3.
| Template | Copy number |
| C10-1.5RV | 19 |
| BE2C1-18-1f | 15 |
| BE2C1-18-5f | 18 |
| AE (mother) | 15 |
| CE (father) | 18 |
| GM10926D | 11 |
| MK (female) | 17 |
| C198b (male) | 17 |
| WM (female) | 16 |
Human transposable elements
Human transposable elements include all interspersed repetitive sequences and comprise 36% of the human genome. The most common of these are Alu elements which account for ~10% of human DNA (44). Using the web-based search tool Repeat Masker (45), all human transposable elements in the NC sequence were identified and classified (Table 4). The observed proportion of NC DNA that is composed of transposable elements was compared with that observed when >7 Mb of random human genomic sequences was screened (44). Overall, with ~40% of the NC sequence being composed of these elements, the representation of human transposable elements in this sequence is not greatly different from that found in the normal human genome. When the individual elements are considered, a slight increase is seen in Alu and MIR, with a much more significant increase observed in the non-mariner DNA transposons and the HERV component of the LTR elements. These increases are counterbalanced by a significant reduction in the level of the LINE1 sequences, although such an alternate distribution of SINE and LINE elements is commonly seen within the human genome (46,47). As shown in Figure
Figure 5. Distribution of human transposable elements in the NC DNA. Positions of these elements are shown by the vertical bars. Each bar represents the approximate proportion (not to scale) of regions composed of transposable elements. In this study, we have elucidated the full nucleotide sequence of the 80 kb NC DNA derived from the analphoid neocentromere at 10q25.2 (28-30). The study represents the first detailed structural characterization of the core centromere antigen-binding domain of any neocentromere DNA. The analysis has established the absence of [alpha]-satellite, [beta]-satellite, [gamma]-satellite, 48 bp repeats and ATRS in the NC sequence. Furthermore, although sequence motifs for the putative [alpha]-satellite DNA-binding proteins, pJ[alpha] (8,10) and CENPB, and the classical satellites IA- and III-related pericentric sequences (36,38) were detected, their relatively low abundance and non-tandem nature within the NC DNA indicate that they are unlikely to mimic any possible role these sequences may have in the normal centromere (11,37). In addition to CENPB and pJ[alpha], at least three other known DNA-binding proteins have been proposed to be associated with the eukaryotic centromere. These are centromere proteins A (CENPA), HMGI and topoII. CENPA is a member of a growing class of proteins referred to as histone H3-like proteins (48-50). The protein is found in association with histone H4 and the other core histones (51,52), and is postulated to act as a histone H3 homologue by replacing one or both copies of histone H3 in centromeric nucleosomes (48). The DNA recognition sequence(s) of CENPA have not been defined and could not be included in the present analysis. HMG proteins are abundant, heterogeneous and non-histone components of chromatin (53). 
These proteins have been shown to interact with the minor groove of the DNA helix, bind to irregular DNA structures and, via their capacity to bend DNA, are thought to facilitate the formation of higher-order nucleoprotein complexes (54). HMGI binds to sequences where there are stretches of six or more A/T nucleotides (40,55). This protein co-localizes with the G-band as well as the centromeric and telomeric regions of mouse and human metaphase chromosomes (56). More specifically, HMGI has been shown to bind to the 172 bp [alpha]-satellite DNA repeat of the African green monkey in vitro (57). TopoII, on the other hand, works by cleaving and opening one DNA helix transiently, passing a second intact DNA helix through the opening, and then resealing the break (58-62). Through this ability to alter DNA topology, topoII is thought to have a role in a host of cellular processes requiring the modulation of double-stranded DNA, including chromosome condensation and segregation during mitosis. The temporal and spatial distribution of topoII on the chromosome scaffold, including that of the centromere, suggest a role for this protein in the integrity of the centromere structure and/or its function (63,64). Scaffold attachment regions (SARs), which contain topoII motifs, have also been found within the [alpha]-satellite sequences of chromosome 1 (65). This, together with our own observation that the vertebrate topoII motif is present at >85% homology match in the [alpha]-satellite consensus sequence, and at 100% match to a number of other non-chromosome 1-specific [alpha]-satellite sequences, suggests that DNA binding of this protein to normal centromeres may be mediated by a typical vertebrate topoII consensus sequence in most [alpha]-satellite monomers. Our analysis has indicated that the NC sequence contains the expected abundance of putative binding motifs for both HMGI and topoII. 
Overall, these two motifs are distributed throughout the NC DNA relatively randomly and, therefore, are unlikely to exert any major influence on neocentro-mere formation. However, it is possible that some localized concentrations of these motifs (at position ~15 kb for HMGI and at the p[prime] terminal segment of the NC sequence for topoII) may have an architectural or regulatory role on the modelling of the NC region into a functional higher-order neocentromere structure. Despite the observed lack of evolutionary conservation of the centromere at the nucleotide sequence level [the CEN-DNA paradox (5)], a repetitive and high A + T content appears to be a recurring theme in many centromeric sequences from wide-ranging organisms (for examples, see refs 4,66-72), suggesting that AT-richness could be important for centromere function. Although the NC sequence as a whole does not show a higher than expected A + T content, the q[prime] end of the sequence demonstrates a significantly higher level of AT-rich islands than the remaining regions. Of particular interest is the AT28 region which constitutes an ~600 bp stretch of a tandemly repeated sequence that is comprised of >80% of A + T nucleotides. This region contains a basic 28 bp repeating unit whose palindromic and mirror-image motifs may form interesting secondary structures. The functional significance of this structure and those of the other AT-rich islands is unclear, although it is possible that through some as yet unknown pattern of distribution of these AT-rich islands, a critical A + T constellation or threshold may exist that facilitates the neocentromeric activity of the NC DNA. Our analysis has defined the positions and nucleotide sequences of various low copy number tandem repeats within the NC DNA. A number of these (e.g. T3, AT4, AT28, T5, T6, T7, T8, T11, AT5 and T20-T23; Table 2) are very high in A + T content and will clearly contribute to the AT-rich islands discussed earlier. 
On the other hand, several of the repeat sequences (e.g. GA1, GA2, CT1 reverse complement, GA4, GA5 and GA3; Table 2) are extremely rich in purine residues. It is interesting that a purine-based AAGAG satellite sequence has been identified within the defined functional centromeric region of the Drosophila Dp1187 minichromosome (72). Whether the various low copy number purine-rich tandem repeats seen in the NC sequence have any functional relevance akin to that of the Drosophila minichromosome is unclear. Furthermore, although these tandem repeats are much smaller than those seen in normal centromeres, they may nonetheless be important for the formation of essential secondary structures necessary for neocentromere function. Table 4. The Dp1187 minichromosome has also been shown to contain a substantial level of transposable elements (72). A detailed search for corresponding human elements has revealed their presence in ~40% of the NC DNA sequence. Whilst this level of transposable elements is not outstandingly different from that found in the human genome, some regional clustering is observed in the NC DNA. This itself is also not uncommon in the human genome and can result in a predisposition to instability of certain chromosomal segments (46). This instability may have implications for neocentromere activation. Furthermore, although a number of previous studies have described the presence of transposable elements in the normal human centromere regions (12,3,74), centromeric heterochromatin is thought to be poor in these elements (75). Whether the transposable elements play any role in neocentromere function is not known, but such elements clearly are not detrimental to this function. In this study, we have characterized the core centromere protein-binding domain of our neocentromere. 
We were unable to detect any large blocks of tandemly repeated DNA, the single most prevalent feature of all eukaryotic centromeres (except that of Saccharomyces cerevisiae) studied to date. Whilst the sequence (or subregions of it) loosely demonstrate some centromeric features, such as enrichment for AT-rich islands, and the presence of transposable elements as well as interspersed stretches of highly AT- and purine-rich tandem repeats, no single commanding feature comparable with those of any known centromeric DNA is apparent. Importantly, this study indicates that a large tandem array of a particular sequence motif is not essential for centromere activity. This outcome adds to the mounting evidence and belief that there is no ubiquitous magic centromere sequence for the eukaryotic centromeres. Indeed, based on existing information, two speculations can be put forward to account for the sequence nature of eukaryotic centromeres. In the first, rather than having a specific primary sequence requirement, it is the overall nucleotide composition or a particular combination and spatial distribution of different ordinary DNAs that influence centromere conformation and function (5,24,27,72). It may be the presence of lots of different motifs or repeats of broadly varied sequences within a specific distance that provides the signal. In the second perhaps more extreme speculation, there may not even be any compositional, combinational or spatial requirement for an ordinary DNA sequence to be transformed into a centromere or neocentromere. In recent years, increasing attention has been turned to the possible role of epigenetic factors in influencing centromere and neocentromere activities (24-27,29), although what these factors are remains a subject for further work. The detailed study of centromeric structures and sequences is clearly important for understanding the molecular basis of centromere function. 
The results presented here provide one such study for a neocentromere with which others can be compared. Overlapping YAC and BAC clones spanning the neocentromere (30) and total genomic DNA isolated from a somatic hybrid cell line (BE2C1-18-5f) containing the mardel(10) chromosome in a CHOK1 rodent background (29) were used as templates for subcloning the NC DNA. Long (2-4 kb) PCR products were generated using the Long Template PCR kit (Boehringer Mannheim). These products were then purified using the High-Pure kit (Boehringer Mannheim) and either subcloned into pGEM-T (Promega) and end-sequenced with vector-specific primers or sequenced directly with internal primers designed from the generated sequences. Large cloned products were subcloned further into smaller fragments utilizing restriction endonuclease sites or, for larger clones, nested deletions were performed using the Erase-a-base nested deletion kit (Promega). Automated sequencing was carried out using ABI Prism cycle sequencing and electrophoresed on an ABI377 system according to the manufacturers instructions. Sequences were edited and contigs assembled using Sequencher software (Gene Codes). To determine the presence of EST matches to the NC DNA, a PowerBLAST search (76) of the human EST databases was conducted on a UNIX interface at the Australian National Genome Information Service, ANGIS (77). A total of 10 098 857 bp (>10 Mb) of human genomic sequences randomly derived from 65 distinct loci were selected from GenBank. 
The accession numbers for these sequences are: AC000377, AC002127, AC002308, AC002366, AC002380, AC002996, AC003016, AC002427, AC004770, AC004501, AC000026, AC000100, AC002368, AC002381, AC002382, AC002383, AC002385, AC002462, AC002463, AC002487, AC002524, AC003015, AC003037, AC003099, AC003684, AC003986, AC004003, AC004083, AC004226, AC004552, AC004615, AC004673, AC004782, AC005220, AC005368, AC005392, AF017257, AF064858, HS164C20, HS175E3, HS390C10, HS445C9, HS57G9, HS941F9, HSAC002064, HSAC002087, HSAF001550, HSU91323, HSU95738, HUAC002041, AC003100, AC003666, AC003685, AC003973, AC003990, AC004038, AC004103, AC004254, AC004554, HUAC004158, HUAC004382, HUAC004531, HUAC004605, HUU91321 and HUU91324. The A + T content was determined for the listed sequences and used to calculate an average percentage of A + T nucleotides over the >10 Mb of random human genomic sequences. The abundance of CENPB, pJ[alpha], satellite IA, satellite III, topoII and HMGI-motifs was also determined for each of the listed sequences. From this, the average number of motifs per kb was determined and thus the expected number for 80 155 - (motif size - 1) bp, corresponding to the size of the NC DNA, was established. The formula for size = 80 155 - (motif size - 1) bp was used because 80 155 is the size of the NC DNA, and, for example, there would be 80 151 copies of a 5mer sequence in 80 155 bp. The expected abundances of the other sequences listed in Table 1 were not determined as they were too large for this analysis. The composition of the NC DNA was determined using COMPOSITION analysis (78). Searches for matches to small (<30 bp) sequence motifs were carried out using FIND-PATTERNS and EWINDOWS/ESTATPLOT of the WGCG Software Package (78) on a web-based interface at ANGIS (77). For the determination of homology to sequences >30 bp in size, we created our own comprehensive database for all known centromere-derived DNA sequences on the GenBank database using CreateDB (78). 
The centromere database was compiled by searching GenBank for all sequence entries containing the word 'centromere' and retaining only those derived from the centromere DNA of the organism described. BLAST and FASTA searches of the NC DNA sequence were then performed on this centromere database, using default parameters. AT-rich islands were determined with BASEPAIRPLOT (78) and tandem repeats with TANDEM (78). Palindromes were identified with STEMLOOP (78). Mirror repeats were identified by eye. The web-based search tool Repeat Masker (45) was used to identify human transposable elements and some simple repetitive regions that were not detected by TANDEM.
ACKNOWLEDGEMENTS
We thank Vivien Bonazzi, Bruno Gaeta and Kirsten Balding for advice on computer analyses, and NH&MRC and AMRAD Corp. Ltd for funding support. K.H.A.C. is a Principal Research Fellow of NH&MRC and a Senior Associate of the University of Melbourne.
DISCUSSION
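| Sequence (NC DNA intervals in kb) | SINEs: Alu | SINEs: MIR | LINEs: LINE1 | LINEs: LINE2 | DNA transposons: Mariner | DNA transposons: Others | Elements with LTRs: HERVs | Elements with LTRs: MaLRs | Elements with LTRs: Others | Unclassified elements | Total transposable elements |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human genome | 10 | 1.7 | 14.6 | 2.1 | 0.1 | 1.5 | 1.3 | 2.6 | 0.7 | 0.8 | 35.5 |
| NC DNA (80.15 kb) | 12.26 | 3.48 | 4.85 | 2.87 | 0.09 | 5.73 | 7.66 | 2.68 | 0.83 | 0 | 40.48 |
| 0-10 | 13.11 | 3.21 | 16 | 1.04 | 0 | 25.6 | 3.93 | 0 | 0 | 0 | 62.9 |
| 9-19 | 14.63 | 3.9 | 0 | 6.27 | 0 | 0 | 0 | 0 | 0 | 0 | 24.8 |
| 18-28 | 14.88 | 2.37 | 5.54 | 7.21 | 0 | 2.92 | 0 | 3.56 | 0 | 0 | 36.48 |
| 27-37 | 17.75 | 4.43 | 0 | 4.52 | 0 | 5.35 | 0 | 17.3 | 0 | 0 | 49.3 |
| 36-46 | 8.88 | 6.76 | 0 | 2.6 | 0 | 2.84 | 0 | 0 | 0 | 0 | 21.08 |
| 45-55 | 15.78 | 3.11 | 19.58 | 2.13 | 0 | 1.99 | 11.1 | 0 | 0 | 0 | 53.7 |
| 54-64 | 5.97 | 2.87 | 0 | 0.53 | 0 | 3.11 | 53.9 | 0 | 1.3 | 0 | 67.66 |
| 63-73 | 4.87 | 3.01 | 0 | 0 | 0.79 | 9.78 | 0 | 3.32 | 6.18 | 0 | 28.05 |
| 72-80.15 | 14.53 | 1.68 | 2.59 | 1.57 | 0 | 0 | 0 | 0 | 0 | 0 | 20.37 |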
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Last modification: 4 Feb 1999
Copyright©Oxford University Press, 1999.