Survey of CAG/CTG repeats in human cDNAs representing new genes: candidates for inherited neurological disorders
Christian Néri1,*, Véronique Albanèse1,+, Anne-Sophie Lebre1,+, Sébastien Holbert1, Claudine Saada1, Lydie Bougueleret1, Sebastian Meier-Ewert2, Isabelle Le Gall1, Philippe Millasseau1, Hung Bui1, Catherine Giudicelli1, Catherine Massart1, Sophie Guillou1, Patricia Gervy1, Eric Poullier1, Philippe Rigault1, Jean Weissenbach3, Greg Lennon4, Ilya Chumakov1, Jean Dausset1, Hans Lehrach2, Daniel Cohen1 and Howard M. Cann1
1Fondation Jean Dausset-CEPH, 27 rue Juliette Dodu, 75010 Paris, France, 2Max Planck Institute for Molecular Genetics, Berlin, Germany, 3CNRS URA 1922, Genethon, Evry, France and 4Lawrence Livermore National Laboratories (LLNL), Livermore, CA, USA
Received February 29, 1996; Revised and Accepted April 3, 1996
Expansion of polymorphic CAG and CTG repeats in transcripts is the cause of six inherited neurodegenerative or neuromuscular diseases and may be involved in several other genetic disorders of the central nervous system. To identify new candidate genes, we have undertaken a large-scale screening project for CAG and CTG repeats in human reference cDNAs. We screened 100 128 brain cDNAs by hybridization. We also scanned GenBank expressed sequence tags for the presence of long CAG/CTG repeats in the extremities of cDNAs from several human tissues. Of the selected clones, 286 were found to represent new genes, and 72 have thus far been shown to contain CAG/CTG repeats. Our data indicate that CAG/CTG repeated 10 or more times are more likely to be polymorphic, and that new 3'-directed cDNAs with such repeats are very rare (1/2862). Nine new cDNAs containing polymorphic (observed heterozygote frequency: 0.05-0.90) CAG/CTG repeats have been identified so far. All of the cDNAs have been assigned to chromosomes, and six of them could be mapped with YACs to 1q32-q41, 3p14, 4q28, 3p21 and 12q13.3, 13q13.1-q13.2, and 19q13.43. Three of these clones are highly polymorphic and represent the most likely candidate genes for inherited neurodegenerative diseases and, perhaps, neuropsychiatric disorders of multifactorial origin.
The expansion of highly polymorphic CTG or CAG triplet repeats is the causative mutation for six human hereditary neurological diseases including myotonic dystrophy (MD), spinobulbar muscular atrophy (SBMA), spinocerebellar ataxia (SCA) 1 and 3, dentatorubro-pallidoluysian atrophy (DRPLA) and Huntington's disease (HD) (for review, see 1 ). CAG/CTG expansion (dynamic mutation) in each of these diseases is associated with the phenomenon of anticipation where the repeat size correlates inversely with the age of onset in succeeding generations of disease families. CTG repeat expansion occurs in the non-coding region of the MD transcript and CAG repeat expansion in coding regions of the androgen receptor (SBMA), SCA 1 and 3, DRPLA and HD transcripts. Some of these transcripts appear to be expressed in tissues other than brain. The CAG repeats, when translated, result in gene products carrying an expanded polyglutamine stretch (2 -4 ). Evidence for anticipation in disease families, CAG/CTG expansion in genomic DNA of patients (5 ) and expanded polyglutamine domains at the protein level (2 ) has suggested that dynamic mutations are implicated in SCA2 (2 ,6 ), SCA5 (7 ), autosomal dominant cerebellar ataxia (ADCA) type II (2 ,8 ), and familial forms of bipolar affective disorder (BPAD) and schizophrenia (9 -11 ). Familial dementia may also be caused by a dynamic mutation (12 ).
To search for CAG/CTG repeat (hereafter referred to as [CAG]n) polymorphism in human cDNAs and characterize new candidate genes for the diseases in which dynamic mutations may be implicated, we have undertaken a large-scale screening project based on (i) oligonucleotide hybridizations of high-density membranes from two reference human brain cDNA libraries for clones that contain [CAG]n, (ii) scanning of human expressed sequence tags (ESTs) in GenBank for the presence of [CAG]n >9 in cDNAs from brain and other human tissues (13 ), (iii) sequencing of hybridization-selected clones, (iv) use of yeast artificial chromosome (YAC) physical maps of the human genome for cDNA localizations (14 ), and (v) examination of repeat sequences for polymorphism. One of the cDNA libraries screened contains fetal brain (FB) 3'-directed cDNA clones (15 ); the other library contains normalized infant brain (NIB) 3'-directed cDNA clones (16 ).
The central part of this project is the comparative sequence analysis of four groups of clones either selected by hybridization of FB or NIB libraries or selected from GenBank ESTs (Fig. 1 ). We have made the following observations. First, FB and NIB libraries represent complementary sources of potential new candidate genes. Second, the almost complete sequencing of FB clones representing new genes suggests that [CAG]n >9 are rare and best selected under high stringency hybridization conditions. Third, primers designed from 100-200 bp [CAG]n-spanning regions allow both polymorphism evaluation and localization in the genome. Fourth, polymorphic [CAG]n from human tissues other than the brain could be identified. We identified a group of polymorphic (CAG)8-28 repeats with observed heterozygote frequencies of 0.05-0.90 and corresponding to previously unknown genes. The data presented here will contribute to the cloning of genes for hereditary neurodegenerative or neuropsychiatric disorders and to the characterization of human gene-based markers.
The hybridization of 60 000 FB cDNA clones (colonies arrayed on membranes), using either low or high stringency conditions, led to the identification of 267 clones upon confirmation of positive clones by hybridization of PCR products. One hundred and forty five clones were selected only under low stringency and 26 at only high stringency (see Materials and Methods), and 96 with both conditions. Twenty eight clones were rejected from the study because they displayed two equally strong bands by PCR, probably due to the presence of two colonies per well in microtiter storage plates. An average size of 1.2 kb was observed for the 239 remaining clones, of which 211 (88.3%) could be sequenced (Table 1 ). Eighty eight unique clones representing new genes were identified, and 95% of the complete sequence was generated for these. Consistent with the observed mean insert size, an average number of 3.4 runs per clone (mean number of readable nucleotides per run, 400) was required to generate the sequences of these cDNAs. [CAG]n >3 were found in 41 clones (46.5%), of which 13 carried [CAG]n >9. Most of [CAG]n >9 (11/13) were carried by the clones selected under both high and low stringency conditions, while most of the clones carrying no [CAG]n (31/47) were selected under low stringency conditions (Table 1 ). There was no significant correlation between hybridization signal intensity and the length of [CAG]n (data not shown). The conditions used for hybridization did not distinguish perfect from complex repeats, the latter defined here as carrying triplet interspersions that show one or two nucleotide differences from CAG/CTG or blocks of adjacent, different, triplet repeats (data not shown).
As high stringency hybridization improved the yield of FB clones carrying [CAG]n >9, the 40 128 NIB arrayed clones were screened subsequently using high stringency conditions only. One hundred and seventy-nine clones were selected (Table 2 ), and ESTs were retrieved from Genbank for 127 of them. One hundred non-redundant clones were identified, of which 74 represent new genes distinct from FB new clones. Two hundred and fifty-eight hybridization-positive NIB clones have been selected at Lawrence Livermore National Laboratories (G. Lennon, list available on the IMAGE World Wide Web server). We extracted ESTs for 223 of them, of which 103 were found to represent unique, new genes and to differ from the FB and NIB groups generated at CEPH (Table 2 ). The small number of clones with [CAG]n identified in the two NIB groups (10 clones) reflects the percentage of complete sequence available. Given the average insert size for these cDNAs (1.5 kb) and the number of available ESTs, we estimate that 25-30% of the sequence was retrieved from GenBank.
Values indicate the number of cDNA clones in each category. Positives shown were selected under high (H) or low (L) stringency conditions, or both (HL) (see Materials and Methods). U: non-redundant; UN: non-redundant and new (meaning no homology with previously known coding sequences or genes; this includes sequences identical to previously known ESTs). a: Confirmed hybridization positives showing a single insert. b: 5' or 3' ESTs generated at CEPH. c: Estimated as the ratio length of sequence performed/length of cDNA (see Results). d: Majority of dinucleotide repeats and few tri- (GAG), tetra- (ATTT) or hexanucleotide repeats.
a: Retrieved from GenBank and accounting for 25-30% of the full sequence of the clones indicated. b: Selected under high stringency conditions. c: List of selected clones available on the IMAGE server. d: Found in ESTs. UN: unique and new; -: does not apply.
To select additional [CAG]n >9 which are present in cDNAs from other tissues as well as fetal and infant brain, we scanned >200 000 Genbank human ESTs (the vast majority of them 5'- or 3'-ESTs generated by the Merck-WU EST program) and selected 99 ESTs (Table 3 ). Fifty six represented new genes, and cluster analysis delineated 23 contigs or individual ESTs (see Materials and Methods). Of these, 21 from tissues other than brain were distinct from the sequenced FB and NIB clones. Twelve were rejected from the study since they carried a [CAG]n in regions which do not support the design of primers (interspaced short repeats, high percentage of GC, [CAG]n very close to vector sequences) for polymorphism analysis.
a: Extracted from GenBank release #91 using a (CAG)25 BLASTN query and selecting sequences producing a P(N) value <10^-5. b: Some [CAG]n could not be tested because they were localized in regions that did not support the design of primers (see Materials and Methods). N: corresponding to new genes; NU: corresponding to unique, as shown by clustering, and new genes.
Based on the frequency of redundant cDNAs (38%) among 211 sequenced FB clones (Table 4 ), we estimate that 37 200 unique clones are contained in the FB library. Since we found 13 clones carrying [CAG]n >9 after nearly complete sequencing of the 88 hybridization-positive FB clones representing new genes, the frequency of new [CAG]n>9 in unique 3'-directed FB cDNAs is 1/2862 (13/37 200). The frequency for unique and new clones with [CAG]n >9 from Genbank scanning is based on a small number of clones (23 clones, see Table 3 ). Assuming that 51 600 unique cDNAs are represented by >200 000 human ESTs (M. Boguski, NCBI, pers. comm.), we estimate that the frequency of new [CAG]n >9 in Genbank unique 3'-directed cDNAs is 1/2243. This frequency represents a minimal estimate because it is based on partial cDNA sequences (ESTs).
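The two frequency estimates above follow from simple arithmetic, reproduced here as a minimal Python sketch. The function names are illustrative only; the inputs (60 000 FB clones, 38% redundancy, 13 and 23 clones with [CAG]n >9, 51 600 unique cDNAs) are the figures quoted in the text.

```python
def unique_clones(library_size, redundant_fraction):
    """Estimate the number of unique clones in a library from the
    fraction of redundant clones seen among the sequenced positives."""
    return round(library_size * (1 - redundant_fraction))


def one_in(n_repeat_clones, n_unique_clones):
    """Return x such that the frequency of repeat-carrying clones is 1/x."""
    return n_unique_clones / n_repeat_clones


# FB library: 60 000 arrayed clones, 38% redundancy among sequenced clones
fb_unique = unique_clones(60_000, 0.38)

# 13 new FB clones carry [CAG]n >9
fb_one_in = one_in(13, fb_unique)

# GenBank: 23 unique, new ESTs with [CAG]n >9 among an assumed 51 600 unique cDNAs
genbank_one_in = one_in(23, 51_600)

print(fb_unique, round(fb_one_in, 1), round(genbank_one_in, 1))
```

The FB ratio evaluates to about 1/2861.5, reported as 1/2862, and the GenBank ratio to about 1/2243.5, reported as 1/2243.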
The first 19 FB [CAG]n (3<n<28) identified were examined for polymorphism in 80 independent chromosomes from the CEPH reference families. All five [CAG]n >9 found were polymorphic, whereas all other [CAG]n (n<10) were not, suggesting that repeats of 10 or more CAG units are more likely to be polymorphic. This observation led us to test all [CAG]n >9 for polymorphism. Twenty three [CAG]n >9 from the group of new clones described above (Tables 1, 2 and 3) were tested for polymorphism. No PCR product could be observed for four of them, even after using several different primer pairs, and these clones are being sequenced again. Ten [CAG]n >9 were not polymorphic, of which nine were complex repeats. Nine [CAG]n >9 displayed observed heterozygote frequencies (HTZ) of 0.05-0.90 with 2-15 alleles ranging in size from (CAG)7 to (CAG)28 (Tables 5 and 5 ); seven of these were perfect repeats.
Table 4. Summary of characterization of FB and NIB libraries

Category                      FB library     NIB library (CEPH)   NIB library (LLNL)
Clones screened               60 000 (a)     40 128 (b)           40 128
Hybridization-positive        239            179                  258
Sequenced                     211            127                  223
Unique                        131 (62%)      105 (82.6%)          183 (82.06%)
Unique and new                88 (41.7%)     75 (59%)             122 (54.7%)
Complete sequence             Yes (95%)      No                   No
Carrying [CAG]n >9            13             na                   na
Unique clones in library (c)  37 200         33 065               33 065
Frequency of new [CAG]n >9    1/2862         na                   na

a: Low and high stringency. b: High stringency. c: Estimated from the percentage of redundant hybridization-positive sequenced clones (for NIB: mean value from CEPH and LLNL data). na: not available.
cDNA clones carrying polymorphic [CAG]n were mapped with a somatic cell hybrid panel and CEPH YACs (ref. 18, Table 5). For six of these cDNAs, one to four YACs per clone were detected, each localized by their sequence-tagged sites (STS) content or that of neighbors. Clones 2.81, i.181 and i.182 could not be mapped to YACs. The primers used for YAC mapping of these clones permitted chromosome assignment and analysis for polymorphism, suggesting deletion or lack of coverage in the YAC library of the genomic regions carrying the corresponding genes. Clone 2.46 is highly abundant (found as 20 FB clones and 10 ESTs in GenBank). This clone, previously assigned to chromosomes 3 and 12 (17), is mapped here to 3p21 and 12q13.3. Clone 2.46 corresponds to two related genes: one, on chromosome 3, with a highly polymorphic [CAG]n (HTZ = 0.7), and the other, on chromosome 12, with a [CAG]n showing less polymorphism (HTZ = 0.2) (17). The other highly polymorphic [CAG]n carried by clone 2.116 and clone i.8 mapped to 3p14 and 13q13.1-q13.2, respectively.
As expansion of [CAG]n in transcribed sequences is thought to be involved in several genetic disorders (6 -11 ), cDNAs that contain these repeat sequences represent a source of potential candidate genes for these disorders. The detection of [CAG]n by hybridization of plated human brain cDNA libraries has been reported previously (17 -19 ). Li et al. (17 ) identified two highly polymorphic [CAG]n, one of which was subsequently found to be expanded in DRPLA (4 ). In the present study, we describe the comprehensive analysis of human 3'-directed reference cDNAs for the identification of clones representing new genes that carry polymorphic [CAG]n. Following hybridization selection of cDNAs carrying [CAG]n, their sequences were generated at CEPH (FB), and those existing as ESTs were extracted from Genbank (NIB). The rationale for scanning database ESTs is 2-fold. In addition to providing ESTs for clones selected by hybridization, human EST collections constitute a valuable source of additional [CAG]n carried by new genes. EST analysis is limited, given that only one or both termini of cDNA inserts are informative, and selection by hybridization as well as by database scanning are complementary approaches that should lead to the progressive identification of most polymorphic [CAG]n in new human genes.
We screened reference brain cDNA libraries because they constitute widely used resources in human genome analysis. At this time, only 3'-directed brain cDNA libraries are arrayed on high-density membranes. The use of such libraries increased the efficiency and accuracy of screening. The preparation of reference 5'-directed cDNA libraries using newly developed techniques should enhance the detection of [CAG]n in human mRNAs, especially those longer than the average size of 3'-directed cDNAs.
Our data indicate that the approach outlined here is useful for selecting [CAG]n in 3'-directed cDNAs. We have indeed identified a number of known genes which carry [CAG]n including, among others, the signal recognition particle SRP14 (GenBank accession no. X73459), the transcription factor E2F-4 (GenBank accession no. U15641) and the chromosome 14 gene SCA3 (Stratagene lung library #9372210; accession no. T61453) that determines autosomal dominant cerebellar ataxia and Machado Joseph disease. cDNAs representing the genes for SBMA, SCA1, HD and DRPLA (1 ) were not expected to be found among the 3'-directed clones analyzed, because the position of the [CAG]n for each is located in or toward the 5' end of the transcript.
The use of high stringency conditions permitted the identification of a subpopulation of FB cDNA clones which contains a higher proportion of longer [CAG]n (n >9 in this study) than with low stringency hybridizations. High stringency screening of the NIB library is expected to yield cDNAs enriched for longer [CAG]n. Our data suggest that [CAG]n >9 are more likely to be polymorphic since all [CAG]n<10 tested showed no or minimal polymorphism. Previous studies on [CAG]n polymorphism have also suggested that the length of [CAG]n is related to the degree of polymorphism (17 -21 ). However, our data indicate that there is no absolute correlation between these two variables.
The polymorphic [CAG]n in normal alleles at the SCA1 locus is interrupted by 1-3 CAT codons interspersed among CAGs, and the sequences for normal alleles at the HD, DRPLA, SCA3 and androgen receptor (SBMA) loci tend to show a small number of CAA repeats at or near one or the other of the [CAG]n (1 ). The significance of the latter variations for polymorphism and expansion is unclear. We tested perfect and complex (simple interruptions, interspersed triplets and other configurations) [CAG]n >9 for polymorphism. Most of the complex repeats were found to be monomorphic, while most of the perfect [CAG]n >9 were polymorphic.
Most of the new clones selected from the NIB library were shown to be distinct from the FB library, an observation consistent with the human brain expressing different genes at various developmental stages (13 ). The NIB library has been screened for [CAG]n both at CEPH and LLNL, both laboratories participating in the IMAGE public consortium. EST analysis showed that most of the new clones selected by the two laboratories are different, probably because of differing screening conditions. Complete sequencing and analysis of cDNAs selected in both laboratories is expected to identify additional [CAG]n >9. As expected, the mean percentage of redundant NIB clones (18%), as determined by EST analysis, is less than that observed for the FB library (38%). In addition, EST analysis shows that the mean percentages of new cDNAs in the NIB and FB libraries are similar, 57 and 42%, respectively.
The frequency of [CAG]n >9 in unique 3'-directed FB cDNAs is estimated to be 1/2862, based on the almost complete (95%) sequencing of 88 clones. The parallel but minimal estimate derived from unique ESTs in Genbank is 1/2243. These two estimates support the concept that clones containing [CAG]n>9 are rare in 3'-directed human cDNAs.
In this study, GenBank scanning permitted the identification of clones with polymorphic [CAG]n >9 in tissues other than brain. As the diseases known to be caused by CAG expansion mainly involve the brain, it is relevant to test clones with highly polymorphic [CAG]n for brain expression in order to assess their candidate status. Although the EST approach allows for a global examination of steady-state mRNA levels based on cDNA sequences (13 ), the failure to observe a transcript by sequencing brain cDNA libraries does not prove the lack of brain expression, as illustrated in this study by an EST corresponding to the SCA3 gene and found only in a lung library.
Except for clone 2.116, which is localized to 3p14, the other two cDNAs with highly polymorphic [CAG]n currently identified do not map to the regions of genes for disorders which may be caused by [CAG]n expansion. Clone 2.116 was considered a candidate for ADCA II, localized to 3p12-p21, but the [CAG]n is not expanded in patients (unpublished data). Loci for SCA2 and 5 are localized to chromosomes 12q24.1 (6 ) and 11q, respectively (7 ). Loci for BPAD on chromosome 18 (22 ), schizophrenia on 6p24-p22 (23 -25 ) and familial dementia on chromosome 3 (12 ) have been reported. Schizophrenia and BPAD are almost certainly genetically heterogeneous for mapped and unmapped loci. It may therefore be useful to test patients with disorders suspected of being caused by CAG/CTG expansion at unmapped loci with the current or forthcoming candidate clones.
An additional criterion for assessment of [CAG]n candidate status is the evidence for open reading frames (ORFs) in which [CAG]n encodes a polyglutamine stretch (2 -4 ). Such tentative ORFs were detected for the highly polymorphic [CAG]n carried by clones 2.46 (17 ), 2.116 and i.8 using extended (700-1300 bp) nucleotide sequences. No homology with known genes was found using these sequences.
This study suggests that the frequency of polymorphic [CAG]n in 3'-directed human cDNAs representing new genes is low. This coincides with previous observations indicating that all classes of trinucleotide repeats are less frequent in the genome than (AC)n and that the [CAG]n class is less informative than other trinucleotide microsatellites (20 ). This also coincides with recent observations indicating that the frequency of polymorphic [CAG]n in genomic clones is low (21 ). The sequencing and analysis of NIB groups of [CAG]n-containing new cDNAs will allow more robust estimations of their frequency.
Our data suggest that the number of hereditary neurological diseases caused by CAG/CTG expansion is likely to be small. We are continuing to sequence hybridization-selected cDNAs which show no EST in Genbank nor carry [CAG]n >9 in ESTs, and we expect that this strategy will permit the identification of other transcripts with highly polymorphic [CAG]n, potentially expanded in human genetic diseases. Finally, in addition to providing a source of candidate genes, the cDNAs with polymorphic [CAG]n found in this study constitute a series of gene-based STSs.
The data generated in this study are available through the World Wide Web CEPH server and GDB (see Materials and Methods).
Two libraries prepared from oligo(dT)-primed cDNAs were screened. The first library (15) contains 60 000 non-normalized, human, fetal brain cDNAs (FB clones) directionally subcloned into the pSPORT-1 vector (Life Technologies, Inc., Gaithersburg, MD). High-density membranes prepared from colonies (15) and selected FB clones were kindly provided by Dr S. Meier-Ewert and Dr H. Lehrach (Max Planck Institute for Molecular Genetics MPI/MG, Berlin, Germany). The second library contains 40 128 normalized infant brain cDNAs (NIB clones) directionally subcloned into a lafmid vector (16), and is part of a resource used by the IMAGE public consortium (Lawrence Livermore Laboratories, Livermore, CA) and the Merck-WU EST program (Washington University, St Louis, MO). High-density membranes prepared from PCR products of NIB clones were provided by Dr C. Auffray (CNRS URA41, Villejuif, France). The gridded NIB library was obtained from Research Genetics (Huntsville, AL). The sequences of ESTs corresponding to hybridization-positive NIB clones or clones found by database scanning to carry [CAG]n in ESTs were retrieved from the GenBank database (NCBI, Bethesda, MD).
The oligonucleotides (CAG)6 and (CAG)12 and the oligonucleotides flanking the cloning sites of the lafmid vector (P2a: 5'-gaattgtgagcggataacaatttcacacag-3', and P8b: 5'-tcccagtcacgacgttgtaaaacgac-3') were labeled using polynucleotide kinase and [γ-32P]dATP. The pSPORT-1 vector was doubly labeled by random priming using [α-35S]dATP and [α-35S]dCTP. Prehybridization of membranes was performed in 6× SSC, 7% Sarcosyl at 30°C or 70°C for 1 h. Hybridizations were performed overnight with identical buffer and conditions. Labeled pSPORT-1 vector (hybridization of FB filters) and lafmid vector oligonucleotides (hybridization of NIB filters) were added for background in order to improve the scoring of positive clones. Membranes were washed in 3 M tetramethylammonium chloride, 7% SDS for 10-20 min at 65°C and exposed to Amersham hyperfilm for 4-8 h at -80°C.
Hybridization images were acquired from autoradiograms with a scanning device (Truvel, Herndon, VA). Positive clones were scored and signal intensities quantified using the XDotsReader software (COSE, Dugny, France) on a Sun Sparc station. After determination of storage plate well coordinates in the library of origin, selected clones were picked and stored at -80°C for further analysis.
Plasmid minipreps were prepared from hybridization-positive clones with plasmid mini kit Tip 20 (QIAGEN, Chatsworth, CA), as recommended by the manufacturer, and then subjected to PCR amplification. The vector primers used for the amplification of cDNA inserts were: Pa (5'-ccggtccggaattcccgggt-3') and Pb (5'-gcacgcgtacgtaagcttggatcct-3'). Amplification conditions were initial denaturation 96°C, 5 min followed by 94°C, 1 min; 68°C, 1 min; 72°C, 1 min for 25 cycles. PCR products were purified on S-400HR Microspin columns (Pharmacia, Uppsala, Sweden) as recommended by the manufacturer. Purified PCR products were analyzed on 1% agarose gels stained with ethidium bromide, and spotted manually onto Hybond N+ membranes (Amersham) with a 96 pin spotting device. PCR products were spotted in duplicate. These membranes were hybridized under conditions identical to those used for the screening of colony membranes. Initially, positive colonies that were not confirmed by hybridization of PCR products and/or that showed double inserts were not sequenced.
5' and 3' end-sequencing of FB (using Pa and Pb primers) and NIB clones (using P2a and P8b primers) was performed on double-stranded DNA templates. Either purified PCR products (5'- and walk sequencing) or DNA minipreps (3'-sequencing) were used as templates. DNA minipreps were prepared from overnight 2.5 ml bacterial cultures using the QIAGEN plasmid mini kit Tip 20 as recommended by the manufacturer. The PCR amplification and purification conditions of cDNA inserts were identical to those used for the quality control of FB clone hybridization described above. Direct sequencing using the dye-terminator technology was performed on ABI373A or ABI377A automatic sequencers (Applied Biosystems) as recommended by the manufacturer. More recently, we used newly available thermostable sequenases which greatly improved the quality of sequencing data.
Homology searches were performed on FB clone sequences masked for Alu, vector, bacterial and simple repeat sequences. cDNA sequences were ordered into contigs and protein sequences derived with the Staden package (26). Homologies between cDNA consensus sequences or individual ESTs and GenBank were scored using BLASTN (27) and BLASTX (28). Walk sequencing was performed only for those cDNA clones different from previously known full coding sequences and until a [CAG]n >9 was detected in order to maximize capture of long [CAG]n. Oligonucleotides for walk sequencing were designed using the Oligo Selection Program (OSP) (29), following standard criteria [GC content (40-60%), melting temperature (55-65°C), oligo length (17-23mer)].
For NIB clones, a similar approach was used except that unique ESTs representing new genes were selected from cDNA sequences available in GenBank. 5' and 3' end sequencing is being performed for those clones with no sequence available in GenBank, and for the clones selected at CEPH or LLNL that did not show [CAG]n >9 in their ESTs.
GenBank releases were scanned for the presence of [CAG]n >9 in ESTs. We found that the use of the BLAST software with a (CAG)25 query sequence and the selection of subject sequences producing a P(N) value less than or equal to 10^-5 permits all [CAG]n >9 to be retrieved. After sorting human sequences, GenBank similarity searches were performed to identify ESTs representing new genes. Resulting ESTs were then assembled into clusters using the Staden package in order to assess the putative redundancy and to select unique ESTs. The most recent analysis of GenBank was performed on release #91.
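For readers who want to check individual sequences for [CAG]n >9, a plain regular-expression scan is a simple alternative to the BLASTN procedure described above. The sketch below is not the method used in this study: it detects only perfect tandem repeats (complex repeats with interruptions are missed), and the function name and the min_copies parameter are illustrative assumptions.

```python
import re


def find_long_repeats(seq, min_copies=10):
    """Return (start, unit, copies) for every perfect run of at least
    min_copies tandem CAG or CTG triplets in seq (0-based start)."""
    pattern = re.compile(
        r"(?:CAG){%d,}|(?:CTG){%d,}" % (min_copies, min_copies)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        run = match.group()
        hits.append((match.start(), run[:3], len(run) // 3))
    return hits


# Hypothetical EST fragment carrying a (CAG)12 run after four flanking bases
example = "acgt" + "cag" * 12 + "ttt"
print(find_long_repeats(example))
```

Because the pattern may start at any position, runs written in a shifted register such as (AGC)n or (GCA)n are also reported, with one copy fewer at most.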
The four sets of resulting sequences representing new genes (FB consensus or individual ESTs, NIB clusters or individual ESTs from CEPH or LLNL, and additional GenBank ESTs carrying [CAG]n >9) are tested on a regular basis for possible overlaps using pairwise comparisons (BLASTN) and for homology against GenBank (BLASTN and BLASTX) to confirm novelty. Homology analysis was performed with the BLAST series on GenBank release #91. Sequenced repeats (CAG/CTG repeats, other triplet repeats, as well as di, tetra or hexanucleotide repeats) were classified according to the number of copies and to their structure (perfect or complex).
The selection of primers suitable for polymorphism and mapping studies was performed using OSP with the following criteria (PCR product size, 100-300 bp; GC content, 40-60%; melting temperature, 52-65°C; oligo length, 17-25mer; and absence of perfect homology with previously known human sequences). All tests were repeated on sequences showing <1% indetermination. The regions immediately upstream and downstream of [CAG]n in FB clones were sequenced at least twice to ensure accuracy of primer sequences. Some [CAG]n retrieved from databases could not be analyzed readily because the flanking regions were too short, highly GC-rich or carried several interspersed short repeats. Each primer pair derived from cDNA sequences was first tested on total human, hamster and mouse genomic DNAs using standard PCR amplification conditions to ensure the absence of both intronic sequences in genomic DNA and high background.
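The oligo-level criteria listed above (GC content, melting temperature, length) can be illustrated with a short Python filter. This is a rough sketch, not a reimplementation of OSP: the melting temperature is approximated with the Wallace rule (2°C per A/T, 4°C per G/C), which is an assumption on our part and will differ from the values OSP computes; product size and homology checks are omitted. The oligo sequences are hypothetical.

```python
def gc_percent(oligo):
    """Percentage of G and C bases in the oligo."""
    oligo = oligo.upper()
    return 100.0 * (oligo.count("G") + oligo.count("C")) / len(oligo)


def wallace_tm(oligo):
    """Approximate melting temperature (deg C) by the Wallace rule.
    Assumption: OSP uses its own formula; this is only a stand-in."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def passes_criteria(oligo, gc_range=(40, 60), tm_range=(52, 65),
                    length_range=(17, 25)):
    """True if the oligo meets the GC, Tm and length windows quoted in the text."""
    return (length_range[0] <= len(oligo) <= length_range[1]
            and gc_range[0] <= gc_percent(oligo) <= gc_range[1]
            and tm_range[0] <= wallace_tm(oligo) <= tm_range[1])


# Hypothetical candidate primers
print(passes_criteria("AGCTGACCTGAAGTCATGCA"))  # balanced 20mer
print(passes_criteria("GCGCGCGCGCGCGCGCGC"))    # GC-rich 18mer
```

Flanking regions that are highly GC-rich, as noted above for some database sequences, fail the GC window regardless of length.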
DNAs of 40 parents from CEPH pedigrees (02, 12, 17, 21, 23, 28, 35, 37, 45, 66, 102, 884, 1331, 1332, 1340, 1347, 1362, 1413, 1416, 1423) were used to test for polymorphism of [CAG]n. PCR amplifications were performed on PTC-100 thermocyclers (MJ Research) in 50 µl reactions containing standard 1× PCR buffer (Perkin Elmer Cetus), 250 µM of each dNTP, 20 pM of each primer, 8% dimethyl sulfoxide (DMSO), 100 ng of genomic DNA, and 2.5 U of Taq DNA polymerase (Perkin Elmer Cetus). Amplification conditions were as follows: initial denaturation 96°C, 5 min followed by 94°C, 1 min; 48-59°C, 1 min; 72°C, 1 min; for 30 cycles. Amplification conditions were optimized for each primer pair. PCR products were denatured for 5 min at 96°C, loaded onto 7 M urea-6% acrylamide gels, and blotted following migration on Hybond N+ membranes (Amersham) by standard methods. Membranes were hybridized and band patterns visualized with the ECL™ nucleotide detection and revelation system (Amersham) and a (CAG)10 oligonucleotide probe as recommended by the manufacturer.
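The observed heterozygote frequency (HTZ) reported for each repeat is the fraction of typed individuals carrying two alleles of different size. A minimal Python sketch of that calculation follows; the genotype list is invented for illustration and the function names are ours.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """genotypes: list of (allele1, allele2) repeat numbers, one pair per
    individual. Returns the fraction of heterozygous individuals."""
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)


def allele_counts(genotypes):
    """Count each allele over all chromosomes (two per individual)."""
    return Counter(allele for pair in genotypes for allele in pair)


# Hypothetical genotypes for four parents (eight chromosomes)
genotypes = [(10, 10), (10, 12), (11, 12), (10, 10)]
print(observed_heterozygosity(genotypes))
print(allele_counts(genotypes))
```

With 40 parents typed, each locus is scored on 80 independent chromosomes, so a single heterozygous individual already corresponds to HTZ = 0.025.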
For chromosomal assignment, the Coriell human × rodent somatic cell hybrid panel NIGMS #2 was used (30). Fifty ng of DNA were used from each somatic cell hybrid in the panel and from human, mouse and hamster genomic DNA. The primer pairs used were identical to those used for the analysis of polymorphism. PCR amplifications were performed as mentioned above, except that an annealing temperature of either 50 or 55°C was used. PCR products were analyzed on 2% agarose gels stained with ethidium bromide. A concordance over three different experiments was required for a chromosome assignment.
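The concordance requirement can be expressed as a small check, sketched here in Python. The data layout is an assumption made for illustration: each experiment is a mapping from the human chromosome retained by a hybrid to a positive/negative PCR call, which simplifies the real panel; the function name and min_runs parameter are ours.

```python
def consensus_assignment(experiments, min_runs=3):
    """experiments: list of dicts mapping chromosome label -> bool (PCR
    positive). Returns the sorted chromosomes that are positive in every
    run, or None if there are too few runs, no positives, or the runs
    disagree."""
    if len(experiments) < min_runs:
        return None
    calls = [frozenset(c for c, positive in run.items() if positive)
             for run in experiments]
    first = calls[0]
    if first and all(call == first for call in calls):
        return sorted(first)
    return None


# Hypothetical result for a primer pair amplifying from two chromosomes
run = {"3": True, "4": False, "12": True}
print(consensus_assignment([run, run, run]))
```

A primer pair such as the one for clone 2.46, which detects related genes on two chromosomes, is handled naturally because the result is a set of chromosomes.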
The megabase-insert, CEPH YAC library, DNA superpools and subpools (14 ) were used for YAC mapping by PCR. Amplification conditions for each primer pair were identical to the ones used for somatic cell hybrid mapping. YACs were identified by PCR testing of plate, row and column subpools, and confirmed by testing individual YAC DNAs.
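The logic of resolving a YAC address from plate, row and column subpools can be sketched as follows. This Python fragment is illustrative only: the labels are hypothetical, and it assumes that each YAC is present in exactly one plate pool, one row pool and one column pool.

```python
from itertools import product


def candidate_addresses(positive_plates, positive_rows, positive_columns):
    """Return every (plate, row, column) address consistent with the
    positive subpools. One positive per dimension gives a unique address;
    several positives give several candidates to test individually."""
    return sorted(product(positive_plates, positive_rows, positive_columns))


# Unambiguous case: a single positive pool in each dimension
print(candidate_addresses(["P3"], ["C"], [7]))

# Ambiguous case: two plates and two rows are positive, giving four candidates
print(candidate_addresses(["P3", "P9"], ["C", "F"], [7]))
```

In the ambiguous case, PCR on the individual YAC DNAs, as described above, is what distinguishes true positives from coincidental combinations.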
The complete dataset including clone identifiers in library of origin and primer sequences used is available from the World Wide Web CEPH server (URL address: http://www.cephb.fr) with links to the RLDB.2 database (where FB clones can be obtained), and the IMAGE server (where NIB clones can be obtained). Primer sequences and polymorphism data are also available from GDB.
We thank G. Zehetner (MPI, Berlin, Germany), R. Houlgatte and C. Auffray (CNRS UPR41, Villejuif, France), and A. Degavre and D. Caterina (CEPH, Paris) for help in the management of clones and data.
2 Trottier, Y., Lutz, Y., Stevanin, G., Imbert, G., Devys, D., Cancel, G., Saudou, F. et al. (1995) Polyglutamine expansion as a pathological epitope in Huntington's disease protein and four dominant cerebellar ataxias. Nature, 378, 403-405.
3 Servadio, A., Koshy, B., Armstrong, D., Antalffy, B., Orr, H.T. and Zoghbi, H.Y. (1995) Expression analysis of the ataxin-1 protein in tissues from normal and spinocerebellar ataxia type 1 individuals. Nature Genet., 10, 94-98.
4 Yazawa, I., Nukina, N., Hashida, H., Goto, J., Yamada, M. and Kanazawa, I. (1995) Abnormal gene product identified in hereditary dentatorubral-pallidoluysian atrophy (DRPLA) brain. Nature Genet., 10, 99-103.
5 Schalling, M., Hudson, T.J., Buetow, K.H. and Housman, D.E. (1993) Direct detection of novel expanded trinucleotide repeats in the human genome. Nature Genet., 4, 135-139.
6 Gispert, S., Lunkes, A., Santos, N., Orozco, G., Ha-Hao, D., Ratzlaff, T., Aguiar, J. et al. (1995) Localization of the candidate gene D-amino acid oxidase outside the refined 1-cM region of spinocerebellar ataxia 2. Am. J. Hum. Genet., 57, 972-975.
7 Ranum, P.W., Schut, L.J., Lundgren, J.K., Orr, H.T. and Livingston, D.M. (1994) Spinocerebellar ataxia type 5 in a family descended from the grandparents of President Lincoln maps to chromosome 11. Nature Genet., 8, 280-284.
8 Benomar, A., Krols, L., Stevanin, G., Cancel, G., LeGuern, E., David, G., Ouhabi, H., Martin, J.-J. et al. (1995) The gene for autosomal dominant cerebellar ataxia with pigmentary macular dystrophy maps to chromosome 3p12-p21.1. Nature Genet., 10, 84-88.
9 Lindblad, K., Nylander, P.O., De Bruyn, A., Sourey, D., Zander, C., Engstrom, C., Holmgren, G. et al. (1995) Detection of expanded CAG repeats in bipolar affective disorder using the repeat expansion detection (RED) method. Neurobiol. Dis., 2, 55-62.
10 Morris, A.G., Gaitonde, E., McKenna, P.J., Mollon, J.D. and Hunt, D.M. (1995) CAG repeat expansions and schizophrenia: association with disease in females and with early age-at-onset. Hum. Mol. Genet., 4, 1957-1961.
11 O'Donovan, M.C., Guy, C., Craddock, N., Murphy, K.C., Cardno, A.G., Jones, L.A., Owen, M.J. and McGuffin, P. (1995) Expanded CAG repeats in schizophrenia and bipolar disorders. Nature Genet., 10, 380-381.
12 Brown, J., Ashworth, A., Gydesen, S., Sorensen, S., Rossor, M., Hardy, J. and Collinge, J. (1995) Familial non-specific dementia maps to chromosome 3. Hum. Mol. Genet., 4, 1625-1628.
13 Adams, M.D., Kerlavage, A.R., Fleischmann, R.D., Fuldner, R.A., Bult, C.J., Lee, N.H., Kirkness, E.F. et al. (1995) Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature, 377, 3-47.
14 Chumakov, I., Rigault, P., Le Gall, I., Bellanne-Chantelot, C., Billault, A., Guillou, S., Soularue, P. et al. (1995) A YAC contig map of the human genome. Nature, 377, 174-182.
15 Meier-Ewert, S., Maier, E., Ahmadi, A., Curtis, J. and Lehrach, H. (1993) An automated approach to generating expressed sequence catalogues. Nature, 361, 375.
16 Soares, M.B., Bonaldo, M.F., Jelene, P., Su, L., Lawton, L. and Efstratiadis, A. (1994) Construction and characterization of a normalized cDNA library. Proc. Natl Acad. Sci. USA, 91, 9228-9232.
17 Li, S.-H., McInnis, M., Margolis, R.L., Antonarakis, S.E. and Ross, C. (1993) Novel triplet repeat containing genes in human brain: cloning, expression, and length polymorphisms. Genomics, 16, 572-579.
18 Riggins, G.J., Lokey, L.K., Chastain, J.L., Leiner, H.A., Sherman, S.L., Wilkinson, K.D. and Warren, S. (1992) Human genes containing polymorphic trinucleotide repeats. Nature Genet., 2, 186-191.
19 Jiang, J.-X, Deprez, L., Zwathoff, E.C. and Riegman, P.H.J. (1995) Characterization of four novel CAG repeat-containing cDNAs. Genomics, 30, 91-93.
20 Gastier, J.M., Pulido, J.C., Sunden, S., Brody, T., Buetow, K.H., Murray, J., Weber, J.L., Hudson, T.J., Sheffield, V.C. and Duyk, G.M. (1995) Survey of trinucleotide repeats in the human genome: assessment of their utility as genetic markers. Hum. Mol. Genet., 4, 1829-1836.
21 Gastier, J.L., Brody, T., Pulido, J.C., Businga, T., Sunden, S., Hu, X., Maitra, S. et al. (1996) Development of a screening set for new (CAG/CTG)n dynamic mutations. Genomics, 32, 75-85.
22 Berretini, W.H., Ferraro, T.N., Goldin, L.R., Week, D.E., Detera-Wadleigh, S., Nurnberger, J.I. and Gershon, E.S. (1994) Chromosome 18 DNA markers and manic-depressive illness: evidence for a susceptibility gene. Proc. Natl Acad. Sci. USA, 91, 5918-5921.
23 Straub, R.E., MacLean, C.J., O'Neill, F.A., Burke, F., Murphy, B., Duke, F., Shinkwin, R. et al. (1995) A potential vulnerability locus for schizophrenia on chromosome 6p24-22: evidence for a genetic heterogeneity. Nature Genet., 11, 287-293.
24 Schwab, S.G., Albus, M., Hallmayer, J., Honig, S., Borrmann, M., Lichtermann, D., Ebstein, R.P. et al. (1995) Evaluation of a susceptibility gene for schizophrenia on chromosome 6p by multipoint affected sib-pair linkage analysis. Nature Genet., 11, 325-327.
25 Moises, H.W., Yang, L., Kristbjarnason, H., Wiese, C., Byerley, W., Macciardi, F., Arolt, V. et al. (1995) An international two-stage genome-wide search for schizophrenia susceptibility genes. Nature Genet., 11, 321-324.
26 Staden, R. (1994) Staden: managing sequence projects. In Griffin, A.M. and Griffin, H.G. (eds), Methods in Molecular Biology Vol. 25: Computer Analysis of Sequence Data, part II, Humana Press, Totowa, pp. 37-67.
27 Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
28 Gish, W. and States, D.J. (1993) Identification of protein coding regions by database similarity search. Nature Genet., 3, 266-272.
29 Hillier, L. and Green, P. (1991) OSP: a computer program for choosing PCR and DNA sequencing primers. PCR Methods Applications, 1, 124-128.
30 Dubois, B.L. and Naylor, S.L. (1993) Characterization of NIGMS Human/Rodent somatic cell hybrid mapping panel 2 by PCR. Genomics, 16, 315-319.
The sequence of the complete cDNA that contains candidate clone i.8 is now available in GenBank. This sequence (CAGR1, accession no. U38810) is a human homolog of the C. elegans cell fate-determining gene mab-21. In the CAGR1 mRNA, the polymorphic [CAG]n begins 221 bp 5' to the initiation codon and is not translated.
*To whom correspondence should be addressed
+These authors contributed equally to this work
Copyright Oxford University Press, 1996