Human Molecular Genetics, 2001, Vol. 10, No. 7 663-667
© 2001 Oxford University Press
Genome and genetic resources from the Cancer Genome Anatomy Project
1Duke University Medical Center, Durham, NC 27710, USA and 2Cancer Genomics Office, National Cancer Institute, Bethesda, MD 20892, USA
Received 10 January 2001; Revised and Accepted 19 January 2001.
| ABSTRACT |
|---|
The Cancer Genome Anatomy Project (CGAP) is a collaborative network of cancer researchers with a common goal: to decipher the genetic changes that occur during cancer formation and progression. The project brings together several recent technologies capable of high-throughput analysis to help achieve this goal. Automated sequencing of cDNA libraries is a primary focus and is geared towards providing a comprehensive and annotated set of human and mouse transcribed sequences. This effort includes full-length transcript sequence generated by CGAP's new Mammalian Gene Collection initiative. Single nucleotide polymorphisms (SNPs) within human gene sequences (Genetic Annotation Initiative) and chromosomal rearrangements within cancer cells (Cancer Chromosome Aberration Project) are also being cataloged as part of CGAP. Finally, to help determine gene expression patterns related to cancer, CGAP provides a quantitative catalog of data through its SAGEmap initiative. The genome and genetic analysis tools listed in this review are all freely distributed by CGAP (http://cgap.nci.nih.gov/) without restriction.
| INTRODUCTION |
|---|
Cancer is the result of an accumulation of pathologic alterations to a cell's genetic material. To begin to fully understand the molecular changes that occur during oncogenesis, the normal genome and the genetic alterations that result in malignancy must be fully characterized. The Cancer Genome Anatomy Project (CGAP) has as its mission the ambitious task of deciphering the molecular anatomy of the cancerous cell (1,2). Launched in 1997 by the National Cancer Institute, CGAP has taken advantage of new technologies to help complete a catalog of genetic changes related to cancer. Collaborative networks of scientists who share these common goals form this project. Furthermore, a central guiding philosophy of all CGAP initiatives is that all data generated should be immediately available to the scientific community without restrictions. The databases resulting from this project are available through a series of web sites and online tools created to help disseminate this information in an increasingly user-friendly fashion (Fig. 1). Physical clones and libraries generated by CGAP are also made available through a network of distributors. These resources bring high-powered genomic tools to both small and large cancer research laboratories worldwide. It is hoped that the accelerated understanding of cancer that these new tools and approaches bring will move us closer to a more complete molecular understanding of malignant diseases.
| GENE SEQUENCE RESOURCES |
|---|
The initial raw material that has supplied the CGAP effort is sequence data derived from cDNA templates. Large-scale sequencing of cDNA libraries has proven to be a rapid and effective method to access transcribed regions of the human genome (3). The Merck/Washington University EST project made one of the first large-scale efforts to disseminate EST sequence data (4) and CGAP has succeeded this effort with its Tumor Gene Index, contributing over one million human cDNA sequence reads to online databases. For the model organism geneticist, a mouse tumor gene index has recently been started with more than 130 000 EST reads. CGAP-generated sequence data are made immediately available through the CGAP web site or through the NCBI's sequence resources, such as GenBank's dbEST database (http://www.ncbi.nlm.nih.gov/dbEST/index.html) or as part of UniGene sequence clusters (http://www.ncbi.nlm.nih.gov/UniGene/). All of the physical plasmid clones constructed and sequenced as part of CGAP are available through the IMAGE consortium network (http://image.llnl.gov/) and various distributors.
Nearly half of the EST sequences deposited in GenBank are from CGAP. These ESTs combined with other cDNA sequence data yield almost 85 000 different transcript clusters in the present UniGene build. A significant fraction of these clusters are a result of CGAP's efforts to target a wide range of normal and transformed tissues. Due to an emphasis on the molecular characterization of cancer, expressed sequence data has been deposited from 117 different cancerous and 13 different pre-cancerous cell types. This wide range of expressed sequence data provides a means to annotate the human transcriptome and to help detect the expressed portions of genomic data being generated by the Human Genome Project. With the CGAP Gene Finder Tool (http://cgap.nci.nih.gov/Genes), genes or lists of genes can be located by symbol, GenBank numbers, chromosomal location, tissue of origin, function (via curated lists), keyword and eventually by molecular pathway.
Single-pass high-throughput sequencing of cDNA libraries, however, has its limitations. In particular, gaps created by priming libraries from either the 5' or 3' end of clones and errors in the raw data greatly confound efforts to annotate accurate full-length gene sequence. To address the need for a standard set of full-length human and mouse cDNA sequences, the Mammalian Gene Collection was launched in 1999 as a collaboration between CGAP and the Human Genome Project (5). So far, over 7500 full-length human sequences have been annotated.
| SEQUENCE VARIATIONS |
|---|
With the exception of identical twins, no two humans have an identical genomic DNA sequence. Not only is the normal sequence of a transcript important, but also the pattern of normal sequence variation. These polymorphisms provide the basis for genetic trait mapping, including allelic variation that predisposes one to cancer. The CGAP Genetic Annotation Initiative (6) is an effort to catalog single nucleotide polymorphisms (SNPs) in human expressed sequence. A combination of re-sequencing of genes with a probable role in cancer and informatics-based mining of public EST data (7) is used to identify SNPs for CGAP. Mining existing sequence data with greater than 10 reads from the same transcribed region yielded predicted polymorphisms with an 82% confirmation rate (7). This data mining process for the SNPpipeline has been an efficient means to produce a human SNP map. The ease with which large numbers of candidate SNPs can be identified using the SNPpipeline has led to a need for rapid validation of SNPs. The chip-based matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry system has recently been shown to be a rapid means for assaying large sets of SNPs and was used to validate 9115 gene SNPs from the public data (8). Other means, including an electronic dot blot on a semiconductor (9), show promise for rapid discrimination of SNPs.
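The EST-mining idea described above can be illustrated with a minimal Python sketch. It is not the SNPpipeline itself: the function name, the input layout (one string of observed bases per aligned position) and the requirement of at least two reads supporting the minor allele are all assumptions made for illustration. Only the depth filter (more than 10 reads covering the position) comes from the text.

```python
from collections import Counter


def candidate_snps(columns, min_depth=11, min_minor_reads=2):
    """Return (position, major_base, minor_base) for polymorphic-looking columns.

    columns: list of strings, one per aligned position; each string holds the
    base that every overlapping EST read shows at that position.
    min_depth=11 mirrors "greater than 10 reads" in the text;
    min_minor_reads is a hypothetical guard against single-read errors.
    """
    hits = []
    for pos, bases in enumerate(columns):
        # Ignore ambiguous calls such as N.
        counts = Counter(b for b in bases.upper() if b in "ACGT")
        depth = sum(counts.values())
        if depth < min_depth or len(counts) < 2:
            continue
        (major, _), (minor, minor_n) = counts.most_common(2)
        if minor_n >= min_minor_reads:
            hits.append((pos, major, minor))
    return hits


columns = [
    "AAAAAAAAAAAA",  # 12 reads, no variation
    "GGGGGGGGTTTT",  # 12 reads, 8 G and 4 T: reported as a candidate
    "CCCCCCCCCCCT",  # 12 reads, one discordant read: treated as likely error
    "ACAC",          # looks polymorphic, but only 4 reads deep
]
print(candidate_snps(columns))
```

Candidates produced this way would still need experimental validation of the kind described above; a production pipeline would also weigh base quality and alignment context, which this sketch ignores.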
| GENE EXPRESSION |
|---|
The pattern of gene expression changes that occur as a cell progresses from normal to malignant can provide insight into the molecular changes that occur during cancer progression. Although there have been a variety of recent innovations for large-scale determination of gene expression patterns, CGAP has initially adapted two main strategies.
EST-based expression
EST data serve a dual purpose of determining coding nucleic acid sequence and revealing the presence of the sequenced transcripts in the RNA used for library construction. With over 200 different CGAP cDNA libraries sequenced, plus hundreds of other useful public cDNA libraries, there exists useful information on the different transcripts regarding their tissue of origin. To identify genes based on mRNA expression in cDNA libraries, a Boolean-type search can be performed using the CGAP Expression Profiler (Table 1). Using the same raw data, the UniGene Digital Differential Display (http://www.ncbi.nlm.nih.gov/UniGene/ddd.cgi) further takes into account sample size and fractional representation for a statistical treatment of the data. Although the presence of a transcript in a particular library can be revealing, the absolute level of gene expression is lost when cDNA libraries are normalized or subtracted, as is frequently done to aid in discovering rare transcripts. For the non-normalized cDNA libraries it is expensive to generate sufficient numbers of clones for a statistically significant comparison of gene expression levels.
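To make the role of sample size and fractional representation concrete, the following Python sketch compares the EST count for one gene in two library pools with a two-sided Fisher exact test. The choice of test, the function name and the counts in the example are assumptions for illustration; the text states only that a statistical treatment is applied.

```python
from math import comb


def fisher_exact_two_sided(hits_a, total_a, hits_b, total_b):
    """Two-sided Fisher exact p-value for EST counts in two library pools.

    hits_x: ESTs assigned to the gene of interest in pool x
    total_x: all ESTs sequenced from pool x
    """
    hits = hits_a + hits_b
    total = total_a + total_b
    denom = comb(total, hits)

    def prob(k):
        # Hypergeometric probability that k of the gene's ESTs fall in pool A.
        return comb(total_a, k) * comb(total_b, hits - k) / denom

    observed = prob(hits_a)
    lowest = max(0, hits - total_b)
    highest = min(hits, total_a)
    # Sum the probabilities of every split that is no more likely than the
    # observed one (small tolerance absorbs floating-point noise).
    p = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= observed * (1 + 1e-9)
    )
    return min(p, 1.0)


# Hypothetical counts: 12 of 4000 ESTs in a tumor pool versus 2 of 5000 in a normal pool.
p_value = fisher_exact_two_sided(12, 4000, 2, 5000)
print(f"p = {p_value:.4g}")
```

Because the test conditions on the pool sizes, the same raw count carries less weight in a deeply sequenced pool than in a shallow one, which is why sparsely sampled non-normalized libraries rarely reach significance.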
Serial analysis of gene expression
In order to complement EST data and to provide a more efficient means for archiving quantitative expression profiles, CGAP adopted serial analysis of gene expression (SAGE) technology (10), starting in 1998 (11). SAGE is a method to count transcripts efficiently in large numbers. Typically, 50 000 or more transcripts are assayed for a given tissue. This is achieved by isolating only a small portion of the cDNA transcript, known as a SAGE tag, forming concatamers of the tags, and sequencing 20 to 30 tags in one reaction. Tag counts are digitally archived and statistically significant comparisons of expression levels can be made between tag counts derived from different populations of cells. Since SAGE is a sequence-based technology, it was readily adapted for integration into the CGAP pipeline. The digital and absolute nature of this data naturally lends itself to a large-scale collaborative project. SAGE libraries constructed at different times or in different laboratories can be accurately compared, resulting in a powerful cumulative database.
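The tag-counting step can be sketched in a few lines of Python. Several details here are assumptions drawn from the original SAGE protocol rather than from this review: that tags are joined tail to tail into ditags, that ditags are separated in the concatemer by the anchoring-enzyme site CATG, and that each tag is 10 bases long. The function names and the example tags are hypothetical.

```python
from collections import Counter

ANCHOR = "CATG"  # assumed anchoring-enzyme recognition site
TAG_LEN = 10     # assumed tag length in bases

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatemer(read, min_ditag=20, max_ditag=24):
    """Extract SAGE tags from one sequenced concatemer.

    Each fragment between two anchor sites is treated as a ditag: its first
    TAG_LEN bases are one tag, and the reverse complement of its last
    TAG_LEN bases is the other. Fragments of unexpected length are skipped.
    """
    tags = []
    for ditag in read.upper().split(ANCHOR):
        if not (min_ditag <= len(ditag) <= max_ditag):
            continue
        tags.append(ditag[:TAG_LEN])
        tags.append(revcomp(ditag[-TAG_LEN:]))
    return tags


# Build a toy concatemer from three made-up tags.
tag_a, tag_b, tag_c = "AAAAACCCCC", "GGGTTTGGGT", "TTTTTGGGGG"
ditags = [tag_a + revcomp(tag_b), tag_a + revcomp(tag_c)]
read = ANCHOR + ANCHOR.join(ditags) + ANCHOR

tag_counts = Counter(tags_from_concatemer(read))
print(tag_counts)
```

Summing such counters over all sequencing reactions for a library gives the digital tag counts that are archived and compared. The sketch omits quality filtering and does not handle tags that happen to contain the anchor sequence.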
SAGE data generated by CGAP, or deposited by other investigators, can be accessed through the SAGEmap web site (11) (Fig. 2). From a total of 171 692 sequencing runs, over 3.4 million valid transcript tags have been processed from 84 different malignant and normal cell types. Online tools built specifically to handle SAGE data (12) allow users to make statistically based comparisons between libraries to find differentially expressed genes, either by using the xProfiler or by downloading data for local analysis. SAGE tags can be mapped to UniGene clusters via SAGEmap, making the identification of a gene from a differentially expressed tag much easier. The SAGE data generated through this project are also used to create a Digital Northern tool, where the expression level of a particular gene can be determined for each of the tissues used to make SAGE libraries. Expression comparisons based on SAGE have the additional advantage that no normalization to a housekeeping gene or a reference standard is necessary, since absolute levels of transcripts are compared.
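A Digital Northern style lookup can be approximated locally from downloaded tag counts. The Python sketch below scales each library's count for one tag to tags per million, so that libraries of different sizes can be compared directly. The library names, tags and counts are invented for illustration, and the scaling is an assumption about how such a display might be computed, not a description of the SAGEmap implementation.

```python
def tags_per_million(count, library_total):
    """Scale a raw tag count to a library of one million tags."""
    return 1_000_000 * count / library_total


def digital_northern(tag, libraries):
    """Report one tag's abundance in every library, highest first.

    libraries: dict mapping library name -> dict of tag -> raw count.
    Returns a list of (library name, raw count, tags per million).
    """
    rows = []
    for name, counts in libraries.items():
        total = sum(counts.values())
        raw = counts.get(tag, 0)
        rows.append((name, raw, tags_per_million(raw, total)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical miniature libraries; real libraries hold tens of thousands of tags.
libraries = {
    "normal_tissue": {"AAAAACCCCC": 5, "GGGTTTGGGT": 95},
    "tumor_tissue": {"AAAAACCCCC": 30, "GGGTTTGGGT": 170},
}

for name, raw, tpm in digital_northern("AAAAACCCCC", libraries):
    print(f"{name}\t{raw}\t{tpm:.0f}")
```

Dividing by each library's own total is what removes the need for a housekeeping-gene reference: the tag count is already a fraction of all transcripts sampled.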
In addition to a survey of tumor and normal tissues for the construction of SAGE libraries, experimentally manipulated cells and matched controls have been included in order to determine genes involved in fundamental processes of cancer formation. For example, the effects on gene expression of amplifying certain oncogenes or reintroducing a tumor suppressor can be determined by comparing the appropriate libraries on SAGEmap. Changes in gene expression resulting from altering the in vitro environment can be determined for changes in oxygen tension or variations in growth factor concentrations. Libraries designed to determine the genes involved in drug resistance have recently been added to the CGAP project through SAGEmap. Since the expression differences between tumor and normal comparisons are extensive (13) and the functional context of the change is usually unknown, these controlled comparisons help determine gene expression changes that can point to a specific function.
| CHROMOSOMAL REARRANGEMENTS |
|---|
The Cancer Chromosome Aberration Project (CCAP, http://www.ncbi.nlm.nih.gov/CCAP/) is a CGAP-supported initiative designed to help researchers define the structure of chromosomes and to characterize rearrangements that occur during malignant transformation (14). The CCAP plan is to systematically integrate cytogenetic and physical maps of whole human chromosomes. The approach employed is high-resolution fluorescence in situ hybridization of bacterial artificial chromosome (BAC) clones mapped to 1–2 Mb intervals in the human genome; the resulting map can be accessed through the CCAP Clone Maps web page (Table 1). Genetically and physically mapped BAC clones of interest can be identified using this resource and ordered through a commercial distributor.
CCAP has also created an online version of the Mitelman Database (Table 1), used to disseminate a compilation of known chromosomal rearrangements associated with cancer (15). This database was first established prior to the existence of CGAP and has been extremely useful for determining the frequency and structure of recurrent translocations in hematological malignancies, in particular, as well as some solid tumors (16). Over 36 000 different patient cases and their cytogenetic information, representing 97 different histological types of cancers, are all cataloged on this site. The database can be queried by tumor type, patient information and/or cytogenetic characteristics to determine frequencies of balanced and unbalanced translocations.
| CONCLUSIONS |
|---|
CGAP has developed into a central location for some important genetics and genomics tools. By making transcribed human and mouse sequence available, genome-based cancer research, as well as other molecular biology research, is greatly accelerated. Gene expression patterns and chromosomal rearrangements can provide important clues to the nature of molecular aberrations that lead to cancer. SNP-based resources hold the promise of allowing whole-genome genetic mapping of risk-conferring sequence alterations. Knowledge of gene expression patterns or molecular changes may also have utility for the development of therapies. Providing immediate release of high-throughput data via the Internet is an economical approach for providing this valuable data to researchers worldwide.
Gene expression analyses supported in part by CGAP data releases have already started to yield important insights into the molecular mechanisms of tumor formation. A combination of EST library and SAGEmap data mining, followed by experimental confirmation, was recently used to find endothelial-specific genes (17). SAGE expression comparisons performed by St Croix et al. (18) were used to help identify a large set of endothelial-specific genes and determine which are differentially expressed in tumor versus normal tissue. Data derived from the SAGEmap project have also been used to identify tumor markers for glioblastomas (19) and ovarian cancer (20).
The vast amount of data generated by CGAP and other genomics initiatives has generated a need for improved methods of analyzing and mining this data. Methods have been developed for improved comparison of gene expression from cDNA libraries (21) and mining EST libraries for differentially expressed genes (22). Enhanced bioinformatics for viewing or analyzing EST or SAGE data have also been developed (23,24). SAGE data partially derived from CGAP were used to help derive a large-scale analysis of human transcript expression from 3.5 million transcripts (25). A recent bioinformatics analysis of 2.4 million SAGE transcript tags has revealed clustering of expressed transcripts to chromosomal domains and an online tool is provided to observe gene expression by chromosomal region (26). It is likely there will be an accelerated understanding of the cancer genome by applying improved bioinformatics to CGAP and other data sets.
What does the future hold for CGAP? A nearly complete annotation of transcribed human sequences is within the grasp of the scientific community. The informatics challenge of this task is significant, but the combination of a complete human genome sequence and a catalog of transcribed sequences from most tissues will provide the necessary raw materials. An integration of precise gene expression patterns from the major cancers and normal cell types should provide an increasingly powerful tool for cancer research. Genetic variation within these genes provides an opportunity to identify inherited risk-conferring genes, which CGAP plans to exploit through its Genetic Annotation Initiative. Unfortunately, molecular analysis techniques have not yet provided a truly efficient high-throughput means to directly screen for cancer-causing somatic mutations. However, it may be reasonable to contemplate such an effort in the near future based on technological advances. As CGAP continues to mature there will be an increasing emphasis on improved information technology designed specifically to extract as much useful information as possible from the large amount of raw data generated. Finally, it will be up to the individual researchers who use these resources to make advances and discoveries to demonstrate the ultimate utility of CGAP-supported research tools.
| FOOTNOTES |
|---|
+ To whom correspondence should be addressed. Tel: +1 919 684 3250; Fax: +1 919 681 2796; Email: greg.riggins@duke.edu
| REFERENCES |
|---|
1 Strausberg, R.L., Dahl, C.A. and Klausner, R.D. (1997) New opportunities for uncovering the molecular basis of cancer. Nature Genet., 15, 415–416.
2 Strausberg, R.L., Buetow, K.H., Emmert-Buck, M.R. and Klausner, R.D. (2000) The cancer genome anatomy project: building an annotated gene index. Trends Genet., 16, 103–106.
3 Adams, M.D., Soares, M.B., Kerlavage, A.R., Fields, C. and Venter, J.C. (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet., 4, 373–380.
4 Williamson, A.R. (1999) The Merck Gene Index project. Drug Discov. Today, 4, 115–122.
5 Strausberg, R.L., Feingold, E.A., Klausner, R.D. and Collins, F.S. (1999) The mammalian gene collection. Science, 286, 455–457.
6 Clifford, R., Edmonson, M., Hu, Y., Nguyen, C., Scherpbier, T. and Buetow, K.H. (2000) Expression-based genetic/physical maps of single-nucleotide polymorphisms identified by the cancer genome anatomy project. Genome Res., 10, 1259–1265.
7 Buetow, K.H., Edmonson, M.N. and Cassidy, A.B. (1999) Reliable identification of large numbers of candidate SNPs from public EST data. Nature Genet., 21, 323–325.
8 Buetow, K.H., Edmonson, M., MacDonald, R., Clifford, R., Yip, P., Kelley, J., Little, D.P., Strausberg, R., Koester, H., Cantor, C.R. and Braun, A. (2001) High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc. Natl Acad. Sci. USA, 98, 581–584.
9 Gilles, P.N., Wu, D.J., Foster, C.B., Dillon, P.J. and Chanock, S.J. (1999) Single nucleotide polymorphic discrimination by an electronic dot blot assay on semiconductor microchips. Nature Biotechnol., 17, 365–370.
10 Velculescu, V.E., Zhang, L., Vogelstein, B. and Kinzler, K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.
11 Lal, A., Lash, A.E., Altschul, S.F., Velculescu, V., Zhang, L., McLendon, R.E., Marra, M.A., Prange, C., Morin, P.J., Polyak, K. et al. (1999) A public database for gene expression in human cancers. Cancer Res., 59, 5403–5407.
12 Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J. and Altschul, S.F. (2000) SAGEmap: A public gene expression resource. Genome Res., 10, 1051–1060.
13 Zhang, L., Zhou, W., Velculescu, V.E., Kern, S.E., Hruban, R.H., Hamilton, S.R., Vogelstein, B. and Kinzler, K.W. (1997) Gene expression profiles in normal and cancer cells. Science, 276, 1268–1272.
14 Kirsch, I.R., Green, E.D., Yonescu, R., Strausberg, R., Carter, N., Bentley, D., Leversha, M.A., Dunham, I., Braden, V.V., Hilgenfeld, E. et al. (2000) A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome. Nature Genet., 24, 339–340.
15 Mitelman, F., Mertens, F. and Johansson, B. (1997) A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nature Genet., 15, 417–474.
16 Johansson, B., Mertens, F. and Mitelman, F. (1991) Geographic heterogeneity of neoplasia-associated chromosome aberrations. Genes Chromosomes Cancer, 3, 1–7.
17 Huminiecki, L. and Bicknell, R. (2000) In silico cloning of novel endothelial-specific genes. Genome Res., 10, 1796–1806.
18 St Croix, B., Rago, C., Velculescu, V., Traverso, G., Romans, K.E., Montgomery, E., Lal, A., Riggins, G.J., Lengauer, C., Vogelstein, B. and Kinzler, K.W. (2000) Genes expressed in human tumor endothelium. Science, 289, 1197–1202.
19 Loging, W.T., Lal, A., Siu, I.M., Loney, T.L., Wikstrand, C.J., Marra, M.A., Prange, C., Bigner, D.D., Strausberg, R.L. and Riggins, G.J. (2000) Identifying potential tumor markers and antigens by database mining and rapid expression screening. Genome Res., 10, 1393–1402.
20 Hough, C.D., Sherman-Baust, C.A., Pizer, E.S., Montz, F.J., Im, D.D., Rosenshein, N.B., Cho, K.R., Riggins, G.J. and Morin, P.J. (2000) Large-scale serial analysis of gene expression reveals genes differentially expressed in ovarian cancer. Cancer Res., 60, 6281–6287.
21 Stekel, D.J., Git, Y. and Falciani, F. (2000) The comparison of gene expression from multiple cDNA libraries. Genome Res., 10, 2055–2061.
22 Schmitt, A.O., Specht, T., Beckmann, G., Dahl, E., Pilarsky, C.P., Hinzmann, B. and Rosenthal, A. (1999) Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues. Nucleic Acids Res., 27, 4251–4260.
23 Larsson, M., Stahl, S., Uhlen, M. and Wennborg, A. (2000) Expression profile viewer (ExProView): a software tool for transcriptome analysis. Genomics, 63, 341–353.
24 Margulies, E.H. and Innis, J.W. (2000) eSAGE: managing and analysing data generated with serial analysis of gene expression (SAGE). Bioinformatics, 16, 650–651.
25 Velculescu, V.E., Madden, S.L., Zhang, L., Lash, A.E., Yu, J., Rago, C., Lal, A., Wang, C.J., Beaudry, G.A., Ciriello, K.M. et al. (1999) Analysis of human transcriptomes. Nature Genet., 23, 387–388.
26 Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M.-C., van Asperen, R., Boon, K., Voûte, P.A. et al. (2001) The Human Transcriptome Map: clustering of highly expressed genes in chromosomal domains. Science, 291, 1289–1292.