| Human Molecular Genetics | Pages |
A practical guide to orient yourself in the labyrinth of genome databases
Introduction
Resources
The Genome Database
Genetic Location Database (LDB)
Online Mendelian Inheritance in Man (OMIM)
Sequence databases
dbEST
The High Throughput Genomic (HTG) GenBank division
Human Genome Sequencing Index (HGSI)
UniGene-the Human Transcript Map
The radiation hybrid database
In Silico Positional Cloning
cDNA sequence analysis
Genomic sequence analysis
In Silico Positional Candidate
Conclusions
Acknowledgements
References
INTRODUCTION
The identification of genes involved in human genetic disease is one of the main goals of human geneticists. To date, only a limited percentage of the genes causing >5000 Mendelian disorders has been identified (1). In the last few years, the candidate gene and the positional cloning approaches have played a major role in disease gene identification. The candidate gene approach (2) relies on partial knowledge of the disease gene function. The availability of previously identified genes, whose features (e.g. sequence domains, expression pattern, etc.) suggest that they may be implicated in the disease, allows you to predict one of them as a candidate gene for the disease. In positional cloning (2), the isolation of the target gene relies exclusively on the position of the gene in the genome, often without any prior knowledge of the function of its protein product. Since for most genetic diseases no information is available on the function of the defective gene and its product, positional cloning has become the method of choice for disease gene identification. However, positional cloning is still a difficult and time-consuming approach, and recent successes in disease gene identification have relied instead on the positional candidate approach (3), a strategy which combines knowledge of the map position of the disease locus with the availability of candidate genes mapping to the same chromosomal region. This way of identifying genes is rapidly becoming the predominant method, due to the increasing number of available cloned and mapped genes (4).
The process of disease gene identification, regardless of the strategy employed, is facilitated by the progress of the Human Genome Project (5,6). The Human Genome Project is an international effort started in the mid 1980s whose main goal is the sequencing of the entire human genome and the identification of all human genes. Towards this end, the necessary preliminary steps were: (i) the development of detailed physical and genetic maps; and (ii) the cloning of the entire genome in overlapping clones. Since these steps have been almost completely achieved (7-9), the effort is now concentrated on two parallel strategies: large-scale genomic sequencing, and cDNA sequencing and mapping. Sequencing of the human genome (which is expected to be completed by the year 2005) will provide us with more detailed and exhaustive data about the entire catalogue of human genes. However, this project is still far from completion, considering that only 3.9% of the human genome sequence is currently available (http://weber.u.washington.edu/~roach/human_genome_progress2.html). On the other hand, cDNA sequencing, and in particular the expressed sequence tag (EST) approach (10-12), has been more rewarding and undoubtedly has led to a revolutionary change in the strategies used by molecular geneticists for identifying and cloning novel genes.
ESTs are nucleotide sequences generated from the ends of randomly selected cDNA clones. The remarkable expansion of cDNA sequencing efforts in the past 7 years (10-12) has led to the generation of >1 000 000 human ESTs from different tissues and stages. The EST resource currently is being used for the generation of a radiation hybrid (RH) map of human transcripts.
The large amount of information generated by the Human Genome Project has raised the critical issue of data management. At present, there is no comprehensive database which allows easy retrieval of genomic data. In fact, all of the information is spread out in a number of different databases, which are not always cross-referenced. As a consequence, the task of retrieving genomic data is not trivial and can be quite frustrating for a scientist not familiar with bioinformatics. In this review, we will pinpoint some of the main resources available (Table 1) and illustrate the most effective ways to use them with practical examples.
Table 1.
| Resource | URL |
| GDB | http://www.gdb.org/ |
| LDB | http://cedar.genetics.soton.ac.uk/public_html/ |
| OMIM | http://www.ncbi.nlm.nih.gov/Omim/ |
| GenBank | http://www.ncbi.nlm.nih.gov/Web/Genbank/ |
| EMBL/EBI Nucleotide Sequence Database | http://www.ebi.ac.uk/ebi_home.html |
| DNA Data Bank of Japan (DDBJ) | http://www.ddbj.nig.ac.jp/ |
| dbEST | http://www.ncbi.nlm.nih.gov/dbEST/index.html |
| HTGS | http://www.ncbi.nlm.nih.gov/HTGS/ |
| HGSI | http://www.ncbi.nlm.nih.gov/HUGO/ |
| Blast server at NCBI | http://www.ncbi.nlm.nih.gov/BLAST/ |
| UniGene | http://www.ncbi.nlm.nih.gov/UniGene/Hs.Home.html |
| The Human Transcript Map | http://www.ncbi.nlm.nih.gov/genemap/ |
| RHdb | http://www.ebi.ac.uk/RHdb/ |
| Human Physical Mapping Project at Whitehead Institute/MIT | http://www-genome.wi.mit.edu/cgi-bin/contig/phys_map |
| RH mapping at Stanford Human Genome Center | http://www-shgc.stanford.edu/Mapping/index.html |
| EST Assembly Machine | http://gcg.tigem.it/cgi-bin/uniestass.pl |
| ESTBlast | http://www.hgmp.mrc.ac.uk/ESTBlast/ |
| Genotator | http://www-hgc.lbl.gov/projects/genotator.html |
| Nix | http://www.hgmp.mrc.ac.uk/Registered/Webapp/nix/ |
RESOURCES
The Genome Database
During the last decade, the Genome Database (GDB) represented the main repository site for human gene mapping information (13). GDB was created initially at Johns Hopkins University in 1989 by the Howard Hughes Medical Institute. In 1991, the responsibility for funding the project was assumed jointly by the Department of Energy (DOE), the National Institutes of Health and the Japan Science and Technology Agency. A series of mirror sites in many countries helped to ensure international access to the data.
One of the major accomplishments of the project was to capture in electronic form much of the information about human genetics and gene mapping (including human genes, probes, clones and allele frequencies) accumulated by the scientific community during the two decades prior to the Human Genome Project. These data were reviewed and edited by a worldwide group of scientists to assure a high standard of quality. GDB pioneered the use of a World Wide Web (WWW) interface to access the data. However, in the last few years, the complexity of the data has grown, and various attempts to create a user-friendly interface did not prove very successful, thus causing a decrease in the use of this database by a large number of human geneticists.
More recently, the focus of the scientific community has shifted from gene mapping to high-throughput sequencing of both cDNAs and genomic clones. As a consequence, the GDB's primary sponsor (DOE) has decided to terminate the GDB project. The database will continue to be made available to the scientific community, although most data acquisition activities will cease. Even considering its limitations, the scientific community agrees that there is still a need for a database like the GDB which can constantly update and merge a variety of information generated by different genome centres and possibly integrate it with sequencing data.
Genetic Location Database (LDB)
The LDB is a database for constructing fully integrated genetic and physical maps (14). The LDB programme generates an integrated map (known as the summary map) from partial maps of physical, genetic, regional, mouse homology and cytogenetic data. One of the main advantages of this site is that it is very user-friendly.
Online Mendelian Inheritance in Man (OMIM)
The OMIM (1) is a catalogue of human genes and genetic disorders edited by Dr Victor A. McKusick and colleagues at Johns Hopkins and elsewhere, and developed for the WWW by the National Center for Biotechnology Information (NCBI). The database contains textual and reference information. Both a gene map and a morbid map are also available. The OMIM gene map presents the cytogenetic map location of disease genes and other expressed genes described in OMIM. Alternatively, the OMIM morbid map provides the cytogenetic map location of disease genes only.
Sequence databases
The GenBank sequence database at NCBI collects DNA sequences from all available public sources (15). The synchronization with the European Molecular Biology Laboratory (EMBL) Data Library and the DNA Data Bank of Japan provides comprehensive worldwide coverage. GenBank data are accessible through Entrez (16), a retrieval system which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. Sequence similarity searches are offered through the BLAST series of database search programs (17,18).
dbEST
dbEST is a division of GenBank that contains sequence data and other information on ESTs from a number of organisms (19). The latest release of dbEST contains >1 620 000 entries. Human and mouse are the most widely represented organisms in this collection, with 1 011 000 and 322 000 entries, respectively (as of May 19, 1998). ESTs can be retrieved from dbEST using sequence homology searches. These typically are performed using the BLAST server available at NCBI. Using as query either the nucleotide or the amino acid sequence, it is possible to verify whether identical and/or similar ESTs exist in dbEST. ESTs can also be retrieved from dbEST using keywords such as accession number, clone ID, etc.
The High Throughput Genomic (HTG) GenBank division
The HTG Sequences division of GenBank was created to store 'unfinished' genomic sequence data generated by the high-throughput sequencing centres and to make them rapidly available to the scientific community (20).
A typical HTG entry might consist of all the first pass sequence data generated from a single genomic clone which together comprise >2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences and each record includes a note that the sequence data are 'unfinished' and may contain errors. The accession number does not change as sequence records are updated. 'Finished' HTG sequences retain the same accession number, but are moved into the relevant primary GenBank division. Sequence data in the HTG division are available for BLAST homology searches.
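The life cycle of an HTG entry described above can be sketched as a small data structure. This is only an illustrative model: the class, its field names, the accession `AC000000` and the use of `PRI` as the target division are assumptions made for the example, not the actual GenBank record format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HTGRecord:
    """Toy model of one HTG entry (field names are illustrative only)."""
    accession: str            # assigned once and kept through all updates
    contigs: List[str]        # first-pass sequence pieces from a single clone
    division: str = "HTG"     # division the record currently lives in
    finished: bool = False    # False means 'unfinished', may contain errors

    def total_length(self) -> int:
        """Combined length of all sequence pieces."""
        return sum(len(c) for c in self.contigs)

    def gap_count(self) -> int:
        """n separate pieces imply n - 1 gaps between them."""
        return max(len(self.contigs) - 1, 0)

    def update(self, contigs: List[str]) -> None:
        """Replace the sequence data; the accession number is untouched."""
        self.contigs = list(contigs)

    def finish(self, sequence: str, primary_division: str) -> None:
        """Mark the record finished and move it to a primary division."""
        self.contigs = [sequence]
        self.finished = True
        self.division = primary_division


# Hypothetical record: two pieces (one gap), 2800 bases in total.
record = HTGRecord("AC000000", ["ACGT" * 300, "GGCC" * 400])
record.finish("ACGT" * 800, primary_division="PRI")
```

The point of the model is that `accession` is never reassigned: updates and the final move out of the HTG division change only the sequence data, the status and the division.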
Human Genome Sequencing Index (HGSI)
As the Human Genome Project proceeds, it becomes more important to have a means to coordinate and track the sequencing effort. For this purpose, the NCBI created the HGSI database to collect and distribute up-to-date human genome sequencing target information submitted by the sequencing centres. A 'target' is a chromosomal region, delimited by Genethon markers, which a sequencing centre is sequencing or planning to sequence.
UniGene-the Human Transcript Map
The UniGene database contains >42 000 clusters of sequences (as of May 1998), each representing the transcription product of a distinct human gene (21). This number represents ~50% of the current estimate of genes in the human genome. The clusters are generated by comparing both known transcripts and ESTs against each other to determine those likely to be derived from the same gene. Besides sequence information, preliminary expression data, represented by the cDNA sources used to generate the ESTs, are provided for each cluster (Fig. 1).
Figure 1. Example of a UniGene EST cluster. The entry contains mapping and expression information, as well as the list of EST sequences belonging to the same putative transcript.
Based on the UniGene clustering analysis, an international collaborative project was started to map one EST per cluster systematically by RH analysis, providing the public databases with regional mapping information for >16 000 putative human transcripts (22). This mapping information is available both from UniGene and from The Human Transcript Map page. The UniGene page displays links for each of the 23 chromosomes that provide a list of all the clusters that have been identified for a given chromosome and the sequences included in each cluster. As of May 1998, a chromosomal location is provided for ~11 000 clusters, in contrast to the 16 000 mapping assignments previously described (22). This discrepancy is probably due to the merging of different sequence clusters, as also supported by the common finding of multiple coincident mapping assignments for the same UniGene entry (Fig. 1).
The Human Transcript Map allows the retrieval of mapped cDNAs starting either from an interval delimited by two Genethon polymorphic markers or from a selected cytogenetic band. These features make The Human Transcript Map a relevant resource for positional cloning projects (see below). Other gene map consortium sites are available (http://www.ncbi.nlm.nih.gov/SCIENCE96/ResTools.html), in particular at the Whitehead Institute Center for Genome Research (WICGR) and Stanford Human Genome Center (SHGC), which integrate the EST mapping information into more detailed physical and genetic maps.
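The idea of grouping transcripts and ESTs that are likely to derive from the same gene can be illustrated with a toy single-linkage clustering. This is not the UniGene procedure, whose details are not given here. The sketch assumes that two sequences belong together whenever they share one exact k-mer, ignores reverse complements and sequencing errors, and uses made-up sequence names and data.

```python
from itertools import combinations


def share_kmer(a: str, b: str, k: int = 8) -> bool:
    """True if the two sequences have at least one identical k-mer."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers_a for i in range(len(b) - k + 1))


def cluster_sequences(seqs: dict, k: int = 8) -> list:
    """Group sequence IDs by single linkage on shared k-mers (union-find)."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if share_kmer(seqs[a], seqs[b], k):
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical data: est1 overlaps est2, est2 overlaps mrna1, est3 is unrelated.
example = {
    "est1": "ACGTACGTTTGACCAGTA",
    "est2": "TTGACCAGTAGGCTAAGC",
    "mrna1": "GGCTAAGCCCATTTGG",
    "est3": "CCCCCCCCGGGGGGGGAAAA",
}
example_clusters = cluster_sequences(example)
```

Single linkage also shows how clusters can merge over time: est1 and mrna1 share nothing directly, yet end up together once est2 bridges them, which is the kind of merging invoked above to explain the drop in the number of mapped clusters.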
The radiation hybrid database
RH mapping is a somatic cell genetic approach that is well suited for the construction of high-resolution maps of the human genome (23). Using this PCR-based technique, DNA markers such as sequence tagged sites (STSs), genetic markers and ESTs can be ordered along a chromosome and the physical distance between them can be estimated. Since 1995, the European Bioinformatics Institute (EBI) has maintained RHdb, a public database for RH mapping results (24). The main type of entry in the database consists of raw PCR amplification results from a particular marker on a particular panel. Cross-references to a number of related databases are also stored. These include GenBank/EMBL, dbEST, GDB and the databases of the laboratories that have submitted the data. Currently, RHdb is concerned mainly with storing and distributing data, and no maps are provided directly by RHdb. As of May 1998, >44 000 EST mapping assignments have been submitted to RHdb.
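How a distance can be estimated from raw PCR amplification results is easiest to see in a two-point calculation. The sketch below assumes the simple equal-retention model: a break between two markers occurs with probability θ and fragments are retained independently with probability r, so the expected fraction of discordant hybrids is 2θr(1 - r), and the distance in centiRays is -100 ln(1 - θ). The function name and the '1'/'0'/'?' vector encoding are illustrative assumptions, and real maps are built with multipoint methods rather than this pairwise estimate.

```python
import math


def rh_distance(vec_a: str, vec_b: str) -> dict:
    """Two-point estimate of breakage frequency and centiRay distance.

    Each vector holds one character per hybrid cell line of the same panel:
    '1' (amplified), '0' (not amplified) or '?' (ambiguous, skipped).
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must come from the same panel")
    pairs = [(a, b) for a, b in zip(vec_a, vec_b) if a in "01" and b in "01"]
    if not pairs:
        raise ValueError("no informative hybrids")
    n = len(pairs)
    # Average retention frequency r over both markers.
    retention = sum((a == "1") + (b == "1") for a, b in pairs) / (2 * n)
    # Fraction of hybrids in which exactly one marker is retained.
    discordant = sum(a != b for a, b in pairs) / n
    denom = 2 * retention * (1 - retention)
    if denom == 0:
        raise ValueError("markers retained in all or none of the hybrids")
    theta = min(discordant / denom, 1.0)
    centirays = math.inf if theta >= 1.0 else -100 * math.log(1 - theta)
    return {"retention": retention, "theta": theta, "centirays": centirays}


# Hypothetical 10-hybrid panel: the two markers differ in a single hybrid.
estimate = rh_distance("1100110000", "1100100000")
```

For the hypothetical vectors above, one discordant hybrid out of ten with r = 0.35 gives θ ≈ 0.22, i.e. a distance of roughly 25 cR on that panel.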
IN SILICO POSITIONAL CLONING
In silico positional cloning can be performed using two different starting points: EST databases and large-scale genomic sequencing. We will use practical examples to show some effective ways to exploit the bioinformatic resources currently available.
cDNA sequence analysis
EST databases can be used efficiently to identify positional candidate genes within a defined critical region for a human disease. For example, let us assume we want to identify the gene responsible for tibial muscular dystrophy (TMD; MIM 600334). TMD is an autosomal dominant disorder characterized by distal myopathy, mostly confined to the tibial anterior muscles, with late adult onset. The gene recently has been mapped to a 1 cM chromosomal region on the long arm of chromosome 2, between markers D2S148 and D2S2310 (25). By using the HGSI database, it is possible to verify that sequencing efforts on chromosome 2 currently are in progress at Washington University Genome Sequencing Center, but no mapping information is available yet for these sequences.
Before embarking on a classical positional cloning effort, it is important to verify the number of ESTs already known to map within the critical region or in its vicinity. The first resource to analyse is the Human Gene Map database. By using the Map Search option, it is possible to select the chromosome of interest and enter the two flanking markers delimiting the interval. In this example, both Genethon markers delimiting the TMD critical region are present in the Human Gene Map framework map. In other cases, when the flanking markers cannot be found in the framework map, it is necessary to identify adjacent Genethon markers from databases such as GDB, LDB, etc., and to use them as queries in the map search page. The output of the search shows the presence of 49 cDNA markers within the TMD critical region, divided into different bins corresponding to sub-intervals (Fig. 2).
Figure 2. Transcript map of the 2q31 region (D2S148-D2S2310 interval) as retrieved from the Human Transcript Map database. The figure displays only the first five (out of 49) cDNA markers localized within the TMD critical region (25).
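The map search described above amounts to an interval query on an ordered list of markers. The sketch below shows the logic on a toy map: the two flanking markers are those of the TMD interval, whereas the cDNA marker names, the extra framework marker and the marker order are invented for illustration and do not reproduce the Human Gene Map content or its interface.

```python
def markers_in_interval(ordered_map, left, right):
    """Return the cDNA markers lying between two flanking framework markers.

    ordered_map: list of (name, kind) tuples in map order, where kind is
    'framework' for Genethon markers and 'cdna' for transcript markers.
    """
    names = [name for name, _ in ordered_map]
    for flank in (left, right):
        if flank not in names:
            # Mirrors the advice in the text: fall back to an adjacent marker.
            raise KeyError(f"{flank} is not on this map; use an adjacent framework marker")
    i, j = sorted((names.index(left), names.index(right)))
    return [name for name, kind in ordered_map[i + 1:j] if kind == "cdna"]


# Hypothetical map fragment; only D2S148 and D2S2310 come from the text.
toy_map = [
    ("cdna_A", "cdna"),
    ("D2S148", "framework"),
    ("cdna_B", "cdna"),
    ("cdna_C", "cdna"),
    ("D2S_hypothetical", "framework"),
    ("cdna_D", "cdna"),
    ("D2S2310", "framework"),
    ("cdna_E", "cdna"),
]
candidates = markers_in_interval(toy_map, "D2S148", "D2S2310")
```

Sorting the two indices makes the query independent of the order in which the flanking markers are entered.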
Figure 3. Radiation hybrid maps of the 2q31 region derived from the WICGR (A) and from SHGC (B). EST and cDNA markers are indicated in red on the original output of the WICGR map.
Figure 4. Examples of integrated genomic sequence analysis performed with Genotator (A) and Nix (B). Both programs provide a flexible system for automatically running a series of sequence analysis programs (gene prediction and regulatory sequence identification, repeat masking, sequence homology detection, etc.) on genomic sequences. The results are shown in a graphical output with colour-coded sequence annotations for both DNA strands. Retrieval of more detailed information is possible by clicking on each coloured element. Genotator runs on Unix workstations and is available free of charge for academic users. Nix allows the submission of genomic sequences using a WWW interface.
It is important to be aware of a number of limitations present in this first-level analysis. In fact, it is not uncommon to find EST-based STSs with different names corresponding to the same UniGene cluster. This indicates that the 49 cDNA markers identified in the TMD interval do not necessarily correspond to an equivalent number of putative transcripts. The best option at this point is to look at the UniGene database. Using as a query either the name of the mapped cDNA marker or the accession number of a cDNA entry, it is possible to retrieve the corresponding UniGene cluster, together with a list of all related STSs generated by different mapping centres (Fig. 1).
After this first analysis, it would be theoretically conceivable to select the genes which, based on their sequence similarity or their expression profile, represent the most suitable candidates for TMD. However, in the majority of cases, the cDNA markers identified do not correspond to known full-length cDNAs but to EST clusters that are mostly anchored to the 3′ end of the transcripts. Therefore, an extension of the cDNA sequence is needed. This can be performed by again taking advantage of UniGene. However, one limitation of this database is that no sequence contigs are provided for the clusters. This information can be obtained using the EST Assembly Machine at TIGEM (26); by entering the accession number of a human EST, you can retrieve all sequences present in the corresponding UniGene cluster and assemble them using a contig assembly program (CAP) (27). A similar tool, ESTBlast (28), is available at the Human Genome Mapping Project Resource Center (HGMP-RC) server. With both tools, the consensus sequence of the contig can then be extended through repeated cycles of sequence comparison, possibly yielding a full-length transcript.
While the Human Gene Map database provides a general overview of the transcript maps currently available, other Gene Map Consortium sites integrate EST mapping information with more refined physical and genetic maps (Fig. 3). The analysis of different maps shows the presence of discrepancies, some of which are due to the lack of a common STS and RH framework map (different centres use a different STS scaffold to build their maps). Furthermore, the absence of a univocal nomenclature for the mapped ESTs complicates the analysis of the data.
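The repeated cycles of sequence comparison used to extend a contig can be pictured with a toy greedy extension. This sketch is not CAP, the EST Assembly Machine or ESTBlast: it assumes error-free reads on the same strand, requires exact suffix-prefix overlaps, and extends in one direction only. Function names and the test sequence are illustrative.

```python
def overlap_length(left: str, right: str, min_overlap: int = 10) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def extend_contig(seed: str, reads: list, min_overlap: int = 10) -> str:
    """Extend `seed` by repeated cycles of overlap search against `reads`."""
    contig = seed
    remaining = list(reads)
    extended = True
    while extended:
        extended = False
        for read in remaining:
            size = overlap_length(contig, read, min_overlap)
            if size and size < len(read):
                contig += read[size:]   # append only the new bases
                remaining.remove(read)  # each read is used at most once
                extended = True
                break                   # start a new cycle with the longer contig
    return contig


# Hypothetical 60-base transcript cut into a seed and two overlapping reads.
full = "ATGGCGTACCTTAGCAAGGATCCGATTACGCAGGTTCAAGCTTGGCACTGATCGGTAACC"
assembled = extend_contig(full[:25], [full[30:60], full[15:40]])
```

Each cycle lengthens the contig, which in turn exposes new overlaps for the next cycle; the loop stops when no remaining read overlaps the current end.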
Genomic sequence analysis
With the exponential accumulation of information arising from large-scale sequencing projects, the analysis of human genomic sequences is likely to become the method of choice for identifying human genes in the next few years. This approach has already led to the cloning of a significant number of genes responsible for human disease (29,30). However, the identification of transcribed sequences within large genomic sequences is not an easy task and relies on the use of several different bioinformatic tools, such as software for gene prediction, repeat masking, sequence homology detection, regulatory sequence identification, etc. In particular, none of the currently available software programs for gene prediction is completely reliable, thus forcing investigators to use many programs in tandem and to compare the output of each.
The simultaneous use of all these bioinformatic procedures usually leads to very large outputs whose complete analysis is extremely time consuming. Fortunately, some recently developed bioinformatic tools (such as Genotator and Nix) facilitate these analyses by integrating the output of different programs into a single, user-friendly graphical interface (Fig. 4).
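Comparing the output of several gene prediction programs can be reduced, in its simplest form, to asking which regions are supported by more than one program. The sketch below is an illustrative voting scheme, not the method used by Genotator or Nix. It assumes each program reports non-overlapping exon intervals with 1-based inclusive coordinates on the same strand; program names and coordinates are invented.

```python
from collections import defaultdict


def consensus_regions(predictions: dict, min_votes: int = 2) -> list:
    """Return (start, end) regions predicted by at least `min_votes` programs.

    predictions maps a program name to a list of (start, end) intervals,
    1-based and inclusive, all on the same strand of the same sequence.
    """
    # Net change in the number of supporting programs at each position.
    delta_at = defaultdict(int)
    for intervals in predictions.values():
        for start, end in intervals:
            delta_at[start] += 1
            delta_at[end + 1] -= 1

    regions, depth, region_start = [], 0, None
    for pos in sorted(delta_at):
        before = depth
        depth += delta_at[pos]
        if before < min_votes <= depth:
            region_start = pos                       # support reaches the threshold
        elif depth < min_votes <= before:
            regions.append((region_start, pos - 1))  # support drops below it
    return regions


# Hypothetical output of three gene prediction programs on one sequence.
toy_predictions = {
    "program_a": [(100, 200), (400, 500)],
    "program_b": [(150, 220), (600, 650)],
    "program_c": [(120, 210), (420, 480)],
}
supported_by_two = consensus_regions(toy_predictions, min_votes=2)
supported_by_all = consensus_regions(toy_predictions, min_votes=3)
```

Raising `min_votes` trades sensitivity for specificity: the exon reported by only one program at 600-650 is dropped at either threshold, while only the core of the first exon survives when all three programs must agree.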
IN SILICO POSITIONAL CANDIDATE
In many cases, a gene of interest can be identified without any knowledge of its mapping assignment. For instance, it is often possible to identify, by sequence homology searches, human ESTs homologous to genes already characterized in other species (32) and often associated with mutant phenotypes. If one wants to test the involvement of a newly identified human transcript in a human disease, it is often necessary to determine its mapping assignment. Again, prior to performing any experimental procedures such as fluorescence in situ hybridization (FISH) or RH mapping, it is convenient to look at the UniGene database to verify the presence of a UniGene cluster containing the EST of interest. If this is the case, there is a 25% chance (as of May 1998) that the cluster has already been mapped by RH experiments by one of the Mapping Consortium groups. A link to the Human Gene Map provides information on the two Genethon markers flanking the mapping interval, as well as the corresponding cytogenetic band. With this information, it is possible to query the OMIM database to verify if the selected cDNA is a promising positional candidate for any of the disease loci mapping to that particular genomic region.
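The 25% figure quoted above can be checked against the numbers given in the UniGene section. The short calculation below uses only values stated in the text (as of May 1998); the variable names are illustrative.

```python
# Figures quoted in the text for May 1998.
clusters_total = 42_000     # UniGene clusters
clusters_mapped = 11_000    # clusters with a chromosomal location
fraction_of_genes = 0.50    # clusters as a share of the estimated gene number

# Chance that a cluster containing the EST of interest is already mapped.
mapped_fraction = clusters_mapped / clusters_total

# Total gene number implied by "42 000 clusters are ~50% of all genes".
implied_gene_estimate = clusters_total / fraction_of_genes
```

The mapped fraction comes out at about 0.26, consistent with the roughly one-in-four chance mentioned above, and the implied estimate is about 84 000 human genes.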
A practical example is represented by EST F05456 (accession number), which can be identified by virtue of its sequence similarity to potassium channel genes previously identified in other organisms (Drosophila, mouse, rat). This EST corresponds to a novel putative potassium channel in humans. By searching the UniGene database using the EST accession number as a query, we can immediately verify that this novel transcript maps to the long arm of chromosome 14, between markers D14S997 and D14S63 on cytogenetic band 14q23-14q24. By searching the OMIM database, it is possible to determine that this gene may be a positional candidate for two diseases whose loci have been mapped to the same genomic region, namely arrhythmogenic right ventricular dysplasia I (33) and anterior polar cataract (34).
CONCLUSIONS
The examples that we have reported above show that with the correct use of the different genomic databases currently available it is possible to obtain in a few minutes on the computer results that only a few years ago would have required several months of experimental work. Nevertheless, the exponential growth of the information generated by the Human Genome Project will require the development of more and more sophisticated bioinformatic tools and will further stimulate the integration of public genome resources.
ACKNOWLEDGEMENTS
We wish to thank Gyorgy Simon and Alessandro Guffanti for bioinformatic support, and Melissa Smith for preparation of this manuscript. The financial support of the Italian Telethon Foundation is gratefully acknowledged.
REFERENCES
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Last modification: 7 Sep 1998
Copyright©Oxford University Press, 1998.