Human Molecular Genetics, 2003, Vol. 12, No. 9 1037-1044
DOI: 10.1093/hmg/ddg113
© 2003 Oxford University Press
Integration of the cytogenetic map with the draft human genome sequence
Howard Hughes Medical Institute, Department of Computer Science, 321 Baskin Engineering Bldg, University of California, Santa Cruz, CA 95064, USA
Received January 7, 2003; Accepted February 25, 2003
| ABSTRACT |
|---|
Chemical staining of metaphase chromosomes, which results in an alternating dark and light banding pattern, provides a tool by which abnormalities in chromosomes from diseased cells can be identified. The localization of these aberrations to a chromosomal region provides clues as to which gene or genes may contribute to a particular disease. With the sequencing of the human genome, it became critical to determine the positions of these cytogenetic bands within the sequence in order to take advantage of the vast amount of information now anchored to the sequence, especially the locations of genes. The molecular basis of cytogenetic bands is not well understood; therefore, their positions cannot be determined solely from sequence information. We developed a dynamic programming algorithm that employs results from 9500 fluorescence in situ hybridization experiments to approximate the locations of the 850 high-resolution bands in the June 2002 version of the draft human genome sequence. These band predictions support previously identified correlations between band stain intensity and certain structural characteristics of chromosomes, namely GC content, repeat structure content, CpG island density, gene density and degree of condensation.
| INTRODUCTION |
|---|
Until the 1970s, cytogeneticists had only the size of each chromosome and the positions of their centromeres as visual guides to distinguish each chromosome and to study them. In 1968, it was discovered that metaphase chromosomes in plants could be chemically stained with quinacrine mustard or quinacrine dihydrochloride to produce a specific banding pattern (1). In 1970, an analogous banding pattern was produced for human chromosomes (2). This quinacrine banding, or Q-banding, provides landmarks allowing each chromosome to be identified more easily and enabling a more precise study of aberrations. The number of each type of chromosome in a cell can be better ascertained to reveal numerical abnormalities. Insertions, deletions, inversions, rings and translocations can be seen as changes in the normal banding pattern with specific deviations in the banding pattern providing evidence as to the actual DNA sequence affected.
By more precisely characterizing aberrations, cytogeneticists can better investigate the consequences of a particular abnormality. Most numerical abnormalities are lethal, and those few that are not result in abnormal development, such as Down's Syndrome, in which cells have an extra copy of chromosome 21 (3). Structural abnormalities can result in the improper expression of one or more genes that are responsible for, or contribute to, the phenotypes of certain diseases. The most famous example of this is the Philadelphia chromosome that was shown to cause chronic myeloid leukemia (CML) (4,5). In this case, two genes, BCR from chromosome 9 and ABL from chromosome 22, are fused together during a translocation between these chromosomes (6). The disease that results can be treated by a drug, GLEEVEC, designed to inhibit the fusion protein (7,8).
The draft human genome sequence (9) provides precise positions of curated genes from the RefSeq database (10), as well as locations of predicted genes generated computationally (11–15) within each chromosome. The integration of the cytogenetic map with this draft sequence provides cytogeneticists with the necessary link to this molecular-based resource. Given a chromosomal abnormality in a diseased cell where the affected region has been cytogenetically mapped, the corresponding area in the draft sequence can be easily determined, and then investigated for potential disease genes. This is a much more efficient method of identifying genes for further study as compared with more traditional techniques such as positional cloning (16).
The molecular basis for cytogenetic bands is not completely understood. Previous research has shown that staining intensity is highly correlated with certain structural features of the chromosome such as GC content (17–19), the density of LINE and SINE repetitive elements (20–22), CpG island density (23), gene density (24,25), degree of condensation (26,27), and early or late replication during mitosis (18,28). It has been hypothesized that the banding pattern may be related to the scaffold-loop structure of chromosomes (29) or to the differences in GC content between neighboring regions (30). A precise formula for determining cytogenetic band locations based on this information has yet to be determined.
We employ experimental data to predict band locations in the draft sequence. This data consists primarily of results from thousands of fluorescence in situ hybridization (FISH) experiments (31) made available by the BAC Resource Consortium (32). Sequence-related information for each of the large insert clones mapped to a cytogenetic location by FISH can be obtained from the Human BAC Resource website (www.ncbi.nlm.nih.gov/genome/cyto/hbrc.shtml) hosted by the National Center for Biotechnology Information (NCBI). This information consists of one or more of the following: the full sequence of the clone, the sequence of one or both ends of the clone, and STS markers known to be contained in the clone for which primer sequence or full sequence information is available. Each of these provides the ability to determine a position for the clone in the draft human genome sequence, thus providing the necessary link between the cytogenetic map and the sequence.
In addition to the FISH mapped clones, we depend on two other sources of information. First, the length of each band has been estimated visually, and each band has been assigned an estimated percentage of the total bases in each chromosome (33). These percentage lengths provide rough estimates of the sizes of the bands, but due to the inaccuracies in these estimates, the incomplete nature of the draft sequence, and the presence of inaccuracies in the draft sequence, simply calculating band boundaries based on these percentage lengths creates contradictions with other experimental data, as we show in the Results section. Second, centromeres, the variable heterochromatic regions on chromosomes 1, 9 and 16, the short arms of the acrocentric chromosomes 13, 14, 15, 21 and 22, and the variable region at the q-terminus end of chromosome Y are essentially unsequenced and are represented by large gaps in the draft sequence. The locations of these gaps can be easily identified, and each corresponds to one or more specific bands. Thus, these large sequence gaps are used to anchor the cytogenetic map to the draft sequence in these locations.
We developed a dynamic programming algorithm for estimating the boundaries of each of the bands in the draft sequence. Although staining was initially done using quinacrine, staining with a Giemsa (G) dye has become standard, and thus it is this G-banding pattern that we predict. The G-banding pattern is nearly identical to the original Q-bands (34). Our algorithm is described in detail in the Materials and Methods section.
| RESULTS |
|---|
We predicted the locations of the cytogenetic bands at the 850-band resolution in the June 2002 draft human genome sequence. This resolution actually defines 862 bands, and assigns each to a class (34). The pericentromeric region class consists of 48 bands, variable length heterochromatic regions cover 17 bands, and each of the stalks in the five acrocentric chromosomes is assigned one band. The remaining 792 bands are divided into five classes based on their staining intensity. The Gpos100 class consists of the darkest staining bands, with the Gpos75, Gpos50 and Gpos25 classes containing progressively lighter staining G-positive bands. The Gneg class consists of the non-staining G-negative light bands.
Figure 1 shows ideograms for three chromosomes based on our cytogenetic band predictions alongside the standard ideograms for the 850-band resolution. Ideogram comparisons for the full set of chromosomes can be found at the supplementary website www.soe.ucsc.edu/research/compbio/cytobands/. In general, the correspondence between the banding patterns is very good. There are several possible reasons for differences between the sizes of certain bands. First, the standard ideograms are based on visual estimates of the lengths of bands (34), and these may have been under- or overestimated in terms of the actual number of bases. We show later that the lengths of the darkest bands appear to have been consistently underestimated, while the opposite is true of the lighter bands. Second, the draft genome assembly is still not completely accurate, and the artefactual duplication of sequence, the omission of sequence due to an overcollapse of a region, and the incorrect estimation of the size of unsequenced regions can adversely affect the predicted size of a band. Lastly, the FISH data on which these predictions are based may contain errors leading to inaccurate predictions. In fact, we determined that more than 10% of the FISH mapped clones were in direct contradiction with one or more other clones. By this we mean that the relative position of two clones in the cytogenetic map based on FISH mapping disagreed with the relative positions of the same clones in the draft sequence.
These predictions can be viewed as the Chromosome Band annotation track in the UCSC Human Genome Browser (35) at www.genome.ucsc.edu. Analogous predictions are available on all of the assembled genome sequences viewable through this browser. The exact base pair locations can be downloaded from the annotation database listed on the corresponding Downloads page (http://genome.ucsc.edu/downloads.html). The cytogenetic band information in the Ensembl Genome Browser (www.ensembl.org/Homo_sapiens/) is also based on these predictions.
Using either of these genome browsers, researchers can now move directly from cytogenetic experimental results to sequence-based annotations. In the UCSC Human Genome Browser, a band name can simply be entered in the position text box to view information corresponding to this region in the genome, including curated and predicted genes. A user's guide detailing how to use the UCSC Human Genome Browser and its many features can be found at http://genome.cse.ucsc.edu/goldenPath/help/hgTracksHelp.html.
In the process of determining band positions, we identified unique locations for 8926 FISH mapped clones in the June 2002 draft sequence using either the full sequence of the clone, STS marker, and/or BAC end sequence data associated with the clone. The results for 9528 FISH experiments for these clones were obtained from the Human BAC Resource website at www.ncbi.nlm.nih.gov/genome/cyto/hbrc.shtml. (Some clones are mapped by multiple laboratories, thus there are more experimental results than actual clones.) The locations of these FISH mapped clones can be seen in the UCSC Human Genome Browser as the FISH Clones annotation track. A particular clone's location can be found by typing in the name of the clone into the position text box. The user's guide mentioned above provides instructions on how detailed information about each clone can be viewed.
Of the clones whose position could be determined, there are 179 instances in which the chromosome assigned to the clone during the FISH experiment disagreed with the chromosome in which the clone sequence is placed in the draft human genome sequence. Another two clones are located in genomic sequence that could not be reliably placed in this assembly. This leaves 9347 FISH results that are the primary inputs to Bander, our algorithm to estimate the locations of the cytogenetic bands that is fully described in the Materials and Methods section.
Based on our predictions, we calculated the number of concordant FISH-mapped clones, where a clone is concordant if the band assigned by our algorithm to its location in the draft sequence is among the bands to which the clone was FISH mapped. We determined that 8116 of the 9347 clones are concordant (86.8%). Another 669 (7.2%) clones each lie within 1 Mb of a band to which they were FISH mapped, and 193 (2.1%) lie between 1 and 5 Mb from a FISH-mapped band. As described in the Materials and Methods section, we weighted the results from FISH experiments performed at NCI very heavily due to their use of a highly accurate technique (36). As expected, the concordance rate for these clones is very high [1206 of 1233 (97.8%) concordant, 19 (1.6%) within 1 Mb]. Of the 862 actual bands defined at this resolution, 837 had at least one clone FISH mapped to a range including that band. Of these, 826 predicted bands contained at least one concordant clone in the draft sequence, with 795 containing two or more concordant clones.
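The concordance categories used above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: each clone is reduced to a single base pair position, predicted bands are half-open intervals, and the function name `classify_clone`, the band names and the toy coordinates are hypothetical rather than taken from Bander.

```python
def classify_clone(pos, mapped_bands, band_intervals):
    """Classify a FISH clone by its distance (bp) to the nearest band it was mapped to.

    pos            -- clone position in the draft sequence (bp)
    mapped_bands   -- names of the bands the clone was FISH mapped to
    band_intervals -- dict: band name -> (start, end) predicted interval, end exclusive
    """
    best = None
    for band in mapped_bands:
        start, end = band_intervals[band]
        if start <= pos < end:
            dist = 0                      # clone lies inside this band
        elif pos < start:
            dist = start - pos            # clone lies to the left of the band
        else:
            dist = pos - end + 1          # clone lies to the right of the band
        best = dist if best is None else min(best, dist)
    if best == 0:
        return "concordant"
    if best <= 1_000_000:
        return "within 1 Mb"
    if best <= 5_000_000:
        return "1-5 Mb"
    return "more than 5 Mb"


# Toy predicted bands on a hypothetical chromosome arm (coordinates are made up).
toy_bands = {
    "p13": (0, 2_000_000),
    "p12": (2_000_000, 5_000_000),
    "p11": (5_000_000, 9_000_000),
}
```

Counting the clones in each category and dividing by the total number of usable FISH results gives percentages of the kind reported above, for example 8116 / 9347 for the concordant fraction.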
For comparison, we also determined band boundaries using two other methods. For the first, we simply used the ISCN percentage lengths, although still using the centromere positions to anchor pericentromeric bands. The second predicted boundaries such that the maximum number of FISH clones would be concordant, which is basically our method minus the heavy weighting of the NCI FISH mapped clones. Table 1 compares our method with these other two in terms of concordance with all FISH clones, and specifically with NCI-mapped clones. It is easily seen that the ISCN percentage length predictions do not agree as well with the empirical data. Using the maximum concordance method does increase the number of concordant FISH clones, as it is designed to do, but this increase is slight. The percentage of concordant NCI clones is much higher for Bander compared with the other two methods. Careful examination shows that sets of clones mapped at different centers can sometimes be out of register with each other, causing contradictions in the data. Over 10% have contradictory FISH mappings given their placement in the draft assembly. Given the greater reliability of the NCI FISH mapping, Bander's predictions appear to be the most accurate due to its much higher concordance with this set of clones.
We have created plots to show the correspondence between predicted cytogenetic bands and the FISH clone data. The plot for chromosome 12 can be seen in Figure 2. Plots for all chromosomes are available at http://genome.ucsc.edu/goldenPath/mapPlots/.
The percentage of GC bases has been shown to be correlated with the band stain (17–19,30), with low GC content associated with G-positive bands. Analyzing the GC content of our band predictions, given a genome-wide average of 40.8%, shows that this correlation holds strongly for the darkest two classes of G-positive bands, although not quite as well for the lighter staining G-positive bands. GC percentage progressively increases from a low of 36.7±1.1% for the Gpos100 bands, to 38.5±1.8% for Gpos75 bands, 41.2±2.8% for Gpos50 bands, and 43.5±3.6% for Gpos25 bands. While the majority of GC-richest bands are negatively staining Gneg bands, this class of bands is known to be much less homogeneous with respect to GC content (37). The average GC content of the Gneg bands in our predictions is calculated at 42.7±4.6%, which clearly reflects this heterogeneity.
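A per-class GC summary of this kind could be computed along the following lines. This is a hedged sketch, not the procedure actually used: it assumes that ambiguous bases such as N are excluded from the denominator and that the ± values are standard deviations over bands, neither of which is stated in the text. The function names are hypothetical.

```python
from statistics import mean, pstdev


def gc_percent(seq):
    """Percentage of G and C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else float("nan")


def summarize_by_class(bands):
    """Return stain class -> (mean, standard deviation) of per-band GC%.

    bands -- iterable of (stain_class, sequence) pairs, one per predicted band
    """
    per_class = {}
    for stain, seq in bands:
        per_class.setdefault(stain, []).append(gc_percent(seq))
    # Assumption: population standard deviation over the bands of a class.
    return {stain: (mean(v), pstdev(v)) for stain, v in per_class.items()}
```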
Previous studies have shown that G-positive bands are enriched in LINEs and not in SINEs, while G-negative bands have the inverse properties (20–22). SINE and LINE repetitive elements are identified in the draft genome sequence using RepeatMasker (A.F.A. Smit and P. Green, unpublished data). Using this information, we calculate the percentage of bases in each band that is contained in a SINE or LINE with overall genome-wide averages being 13.6% of all bases in SINEs and 21.0% in LINEs. The dark staining Gpos100 and Gpos75 bands adhere to this correlation with 25.1±5.1% LINE/8.4±1.9% SINE and 22.9±4.1% LINE/10.7±3.1% SINE, respectively. The lighter staining Gpos50 and Gpos25, along with the G-negative bands, are more ambiguous. Gpos50 consisted of 20.9±6.0% LINEs and 14.1±5.5% SINEs, almost exactly the genome-wide averages. Gpos25 is 18.1±5.7% LINEs and 18.5±7.7% SINEs showing little difference between the content of each type of repeat. The Gneg bands seemingly contradict these studies with 19.2±6.6% LINEs and 15.7±6.6% SINEs, although the LINE content is below and SINE content above the genome-wide averages. We also calculated the average length of each type of element in the five classes, and found that, among the G-positive classes, the average length of LINE elements increased and of SINE elements decreased as the stain intensity became stronger. The average lengths range between 527 bp for LINEs and 217 bp for SINEs in Gpos100 bands to 374 bp for LINEs and 225 bp for SINEs in Gpos25 bands. For Gneg bands, average LINE length is 400 bp and SINE length is 224 bp. This suggests that there is a better correlation between the lengths of each type of repetitive element and staining intensity rather than total content.
It has been reported that there exists a correlation between gene density and cytogenetic bands (24,25), with G-negative bands containing a much higher number of genes than G-positive bands. A full set of gene sequences and their locations is still being compiled. At present, one of the most complete and accurate collections of genes is the set in the RefSeq database (10), which provides the corresponding mRNA sequences. Alignments of these sequences to the draft sequence were taken from the UCSC Genome Browser. The current set used contains 16 099 distinct mRNA sequences.
Gene density may be calculated in several ways. The simplest is to determine the average number of genes per band for each band type. Doing this, we find that Gpos100 bands contain 11.8 genes per band (gpb), Gpos75 bands 13.8 gpb, Gpos50 bands 16.0 gpb, Gpos25 bands 22.3 gpb, and Gneg 24.4 gpb. Another is to calculate the percentage of bases in each band class that is contained in a coding exon. Results from this calculation show that 0.27% of Gpos100 bands, 0.43% of Gpos75 bands, 0.67% of Gpos50 bands, 1.12% of Gpos25 bands, and 0.97% of Gneg bands fall into coding regions. Both of these statistics support the gene density correlation.
We also explored structural differences between genes in each of the band classes. The number of exons per gene is fairly consistent among the classes with all but Gpos100 averaging close to 10 exons per gene. The Gpos100 bands averaged slightly more at about 11.5 exons per gene. Likewise, the size of the exons did not vary greatly among classes ranging from 235 bp per exon in the Gpos25 bands to 258 bp in the Gpos75 bands. A drastic difference is seen in the sizes of introns, though. The average size of an intron in Gpos100 bands is calculated at 10355 bp, in Gpos75 bands is 8513 bp, in Gpos50 bands is 5792 bp, in Gpos25 is 4177 bp, and in Gneg is 4547 bp.
CpG islands have been shown to be found near coding regions; thus, it is not surprising that a correlation has also been found between their density and the staining intensity of a band (23). Using the annotation available on the UCSC Genome Browser, we calculated the percentage of bases in each band class that are contained in CpG islands. We found that 0.17% of Gpos100 bands, 0.32% of Gpos75 bands, 0.56% of Gpos50 bands, 1.07% of Gpos25 bands and 1.04% of Gneg bands fall into CpG islands. An even stronger correlation is seen when examining the average number of CpG islands found in bands of each class, which is 14.7 for Gpos100 bands, 18.0 for Gpos75 bands, 23.3 for Gpos50 bands, 36.3 for Gpos25 bands, and 43.2 for Gneg bands.
As chromosomes condense during mitosis, it has been shown that regions corresponding to the G-positive bands condense earlier than those regions corresponding to G-negative bands (26,27). An estimate of the size of each band was previously made by measuring the sizes of the bands in stained metaphase chromosomes (33). If G-positive bands condense sooner, then we might expect that differences between the sizes of our predicted bands and these visually estimated sizes would tend to be such that our predictions contain more bases in G-positive bands, and fewer bases in G-negative bands. This is exactly what we found. On average, our predicted Gpos100 bands are 24% longer than what was previously estimated. Likewise, Gpos75 bands are 16% longer, Gpos50 bands are 9% longer, and Gpos25 bands are 4% longer. In contrast, Gneg are 2% shorter than previously estimated. (Note: G-negative bands are longer on average than G-positive bands, and the bands in acrocentric and variable heterochromatic regions are also shorter than previously estimated.) Thus there is consistent evidence that our method is effectively correcting for biases in the visually constructed band sizes.
| DISCUSSION |
|---|
The field of cytogenetics has played and continues to play a central role in human disease research (38). The discovery of the ability to chemically stain metaphase chromosomes provided a technological boost that fueled many important discoveries in the fields of disease and medicine. Though more recently developed experimental techniques such as FISH (31), comparative genomic hybridization (39), and spectral karyotyping (40) allow cytogeneticists to more precisely investigate chromosomal abnormalities, cytogenetic band analysis is still used extensively in research and medical diagnosis.
The sequencing of the human genome has provided the scaffold on which to assemble a wide range of genetic information to the benefit of research in numerous fields. In order for cytogeneticists to fully take advantage of this resource, it became essential to integrate the cytogenetic map with this sequence data to create links to past research and to guide current research. It is unreasonable to expect that band locations can be precisely resolved to a base pair level, so any predictions of band boundaries must contain some amount of error. Our predictions are supported by their agreement with visual band assignments of more than 9000 FISH mapped clones, and many known correlations between the bands and other chromosome features such as GC content, SINE and LINE content, and gene density. With the predicted locations of the cytogenetic bands that we provide, researchers can easily navigate between the physical chromosomes seen through a microscope and the corresponding genome sequence.
| MATERIALS AND METHODS |
|---|
The band boundary problem, or estimating the locations of the cytogenetic bands in the draft sequence, can be viewed as determining an alignment of the cytogenetic map with the draft sequence. We view the problem as determining what sequence should be used to define the extent of each band. Hidden Markov models (HMMs) (41,42) have been used for many biological sequence-related problems such as gene finding (14), protein structure prediction (43), and remote protein homology searching (44). Our algorithm essentially consists of using an HMM to solve this problem, but with a few minor deviations from the strict definition.
In brief, an HMM defines a set of states, a set of symbols, and a set of transitions between states. In each state, a probability distribution is defined over the set of symbols that defines the likelihood of each symbol being emitted by the state. Another distribution defines the probability of transitioning from one state to another. Given an HMM and a sequence of symbols, we can enumerate a path of states where each state in the path emits one symbol, and this sequence of emitted symbols matches the input symbol sequence. The joint probability of observing this symbol sequence using this state path can then be calculated.
More formally, let $X=(x_1,\ldots,x_N)$ be a sequence of symbols, and let $\pi=(\pi_1,\ldots,\pi_N)$ be a path of states. Let $e_{\pi_i}(x_i)$ be the probability of emitting symbol $x_i$ in state $\pi_i$, and let $a_{\pi_i,\pi_{i+1}}$ be the probability of transitioning from state $\pi_i$ to state $\pi_{i+1}$. The probability of producing sequence $X$ using state path $\pi$ is

$$P(X,\pi)=\prod_{i=1}^{N} e_{\pi_i}(x_i)\,a_{\pi_i,\pi_{i+1}}$$

where here $\pi_{N+1}$ represents a special stop state that emits no symbol. A dynamic programming algorithm, called the Viterbi algorithm (45), can be used to calculate the state path $\pi$ such that $P(X,\pi)$ is maximized given a particular sequence of symbols, $X$.
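The Viterbi recursion can be sketched in a few lines of Python. This is a generic textbook version and not the authors' code: it works in log space, uses explicit start probabilities and omits the stop state for brevity. The two-state toy model (`s1`, `s2`, symbols `x`, `y`, `z`) and all of its numbers are made up for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_path, log_prob) for an observation sequence.

    start_p[s], trans_p[s][t] and emit_p[s][x] are ordinary probabilities.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s] = best log-probability of any path that ends in state s
    score = {s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}
    back = []  # one back-pointer table per observation after the first
    for x in obs[1:]:
        prev, score, ptr = score, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + log(trans_p[s][t]))
            score[t] = prev[best] + log(trans_p[best][t]) + log(emit_p[t][x])
            ptr[t] = best
        back.append(ptr)

    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical two-state model used only to exercise the function.
toy_states = ["s1", "s2"]
toy_start = {"s1": 0.6, "s2": 0.4}
toy_trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
toy_emit = {"s1": {"x": 0.5, "y": 0.4, "z": 0.1},
            "s2": {"x": 0.1, "y": 0.3, "z": 0.6}}
toy_path, toy_logp = viterbi(["x", "y", "z"], toy_states, toy_start, toy_trans, toy_emit)
```

Working with log-probabilities avoids numerical underflow on long sequences, which matters when a chromosome is represented by thousands of symbols.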
For the band boundary problem, we define a set of states such that each state is associated with a particular cytogenetic band. The symbols they emit are abstracted blocks of DNA sequence from the draft sequence as described below. Thus, if a particular state emits some block of DNA sequence, this implies that the sequence is contained in the band associated with the state. Given the complete draft sequence, we want to determine the most likely state path based primarily on the information associated with the FISH mapping data. This FISH mapping data consists of clones with a position in the cytogenetic map in the form of a band or range of bands, and a corresponding position in the sequence. The final state path details the state that is used to emit each block of DNA sequence in the chromosome, thus assigning the DNA to a band and defining the boundaries of each band.
Fewer than 9000 FISH mapped clones have been placed on the draft sequence. Therefore, it is unreasonable to expect to resolve the band boundary problem down to an exact base pair. Instead, we divide the draft sequence into contiguous, non-overlapping 100 kb windows. This roughly translates into one FISH mapped clone per three windows. Our input sequence becomes the sequence of these windows for a particular chromosome. The calculated state path based on this input sequence, then, defines in which chromosome band each 100 kb window is predicted to be.
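The windowing step can be sketched as follows. The treatment of a final partial window is an assumption made here, since the text does not say how chromosome ends are handled, and the helper names are hypothetical.

```python
WINDOW = 100_000  # 100 kb, as in the text


def num_windows(chrom_length):
    """Number of contiguous, non-overlapping 100 kb windows covering a chromosome."""
    return -(-chrom_length // WINDOW)  # ceiling division on integers


def window_bounds(i, chrom_length):
    """Half-open [start, end) coordinates of window i (0-based).

    Assumption: the last window is simply truncated at the chromosome end.
    """
    start = i * WINDOW
    return start, min(start + WINDOW, chrom_length)
```

As a rough check on the stated density, a genome of about 3 Gb yields about 30 000 windows, so roughly 9000 placed clones correspond to about one clone per three windows.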
We impose a few constraints on the possible solutions to this problem through the design of the state transitions. We assign only contiguous regions of the draft sequence to a particular band. We also ensure that the band ordering in a chromosome matches the standard order given in the ISCN 1995 manual (34). Further, at least one state for every band must appear in the state path, forcing each band to be assigned part of the draft sequence. To impose these constraints, we create a linear ordering of the band states such that all states associated with a particular band are contiguous, and the total ordering reflects the standard band order. For example, if a chromosome has bands $b_1$, $b_2$, $b_3$, and each band is associated with three states, then the ordering is $b_{1,1}, b_{1,2}, b_{1,3}, b_{2,1}, b_{2,2}, b_{2,3}, b_{3,1}, b_{3,2}, b_{3,3}$. We define transitions between the states such that a transition exists from each state to the immediately succeeding state and to the first state of the following band. In our example, $b_{1,1}$ would have transitions to $b_{1,2}$ and $b_{2,1}$. Similarly, $b_{1,2}$ would have transitions to $b_{1,3}$ and $b_{2,1}$, and $b_{1,3}$ would have a single transition to $b_{2,1}$. Thus, state $b_{1,1}$ must be used, ensuring that band $b_1$ is represented in the solution, but using states $b_{1,2}$ and $b_{1,3}$ is optional, allowing the length of band $b_1$ to be 100 kb (use only $b_{1,1}$), 200 kb (use states $b_{1,1}$ and $b_{1,2}$), or 300 kb (use all three states). The final determination of its length will depend on the information in the sequence.
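The state ordering and allowed transitions from this example can be written out directly. The sketch below is an illustration of the topology only; the function names are hypothetical, states are represented as `(band, j)` tuples, and the transition to the stop state after the last band is left out.

```python
def build_states(bands, states_per_band):
    """Linear ordering of (band, j) states, with j starting at 1 within each band."""
    return [(b, j) for b in bands for j in range(1, states_per_band[b] + 1)]


def build_transitions(bands, states_per_band):
    """Map each state to the list of states it may transition to.

    Each state may move to the next state of the same band (if there is one)
    and to the first state of the following band (if there is one).
    """
    trans = {}
    for idx, b in enumerate(bands):
        n = states_per_band[b]
        for j in range(1, n + 1):
            nxt = []
            if j < n:
                nxt.append((b, j + 1))
            if idx + 1 < len(bands):
                nxt.append((bands[idx + 1], 1))
            trans[(b, j)] = nxt
    return trans


# The three-band example from the text, three states per band.
example_bands = ["b1", "b2", "b3"]
example_counts = {"b1": 3, "b2": 3, "b3": 3}
example_states = build_states(example_bands, example_counts)
example_trans = build_transitions(example_bands, example_counts)
```

Because no transition skips a band or moves backwards, any path through this graph that reaches the last band visits every band once, in order, and assigns each band a contiguous run of windows.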
The lengths of the bands predicted by our algorithm should not deviate drastically from the percentage lengths, even though our analysis has shown that these lengths are not completely accurate. Therefore, we calculate the size of each band based on the percentage length and the size of the chromosome in the draft sequence. We divide this size by 100 000, giving the number of states, where a state corresponds to 100 kb, that would be needed to create a band of this size. For example, if the percentage lengths estimate the size of a band at 2 Mb, we would need to create 20 states. To allow some deviation from the standard percentage length, we triple this number of states to determine the actual number of states used in the model for a particular band. For a 2 Mb band, then, we create 60 states to be associated with it. In effect, this limits the size of a band to at most three times its percentage length size. The final state path, then, consists of only a fraction of the states created.
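The state count per band follows from simple arithmetic, sketched below. Rounding to the nearest whole state and enforcing a minimum of one state are assumptions made for this illustration; the text only states that the size is divided by 100 000 and then tripled. The function name is hypothetical.

```python
WINDOW = 100_000  # one state corresponds to 100 kb


def states_for_band(percent_length, chrom_length, slack=3):
    """Number of model states created for a band.

    percent_length -- band length as a percentage of the chromosome
    chrom_length   -- chromosome size in the draft sequence (bp)
    slack          -- multiplier allowing deviation from the percentage length
    """
    expected_bp = chrom_length * percent_length / 100.0
    expected_states = max(1, round(expected_bp / WINDOW))  # rounding is an assumption
    return slack * expected_states
```

For instance, a band estimated at 2% of a 100 Mb chromosome is 2 Mb long, which gives 20 expected states and 60 states in the model.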
Given the absence of any data, we want a state path that favors predicted band boundaries that more closely reflect the percentage lengths, since this is the only information available. Let $B=(b_1,\ldots,b_M)$ be the bands in a particular chromosome. Let a particular band $b_i$ have a calculated percentage length size of $L_{b_i} \times 100$ kb. We create $3L_{b_i}$ states for this band, as explained above. Let $b_{i,j}$ denote state $j$ for band $b_i$. We define a transition factor $a_{b_{i,j},b_{k,l}}$ between band states $b_{i,j}$ and $b_{k,l}$ as follows:
![]() |
where p is a penalty factor for deviating from the percentage length size. These transition factors are analogous to transition probabilities in HMMs, but we call them factors because they do not define a valid probability distribution over the transitions from a state: the sum of the factors emanating from one state is greater than 1. For our predictions, we set p=0.008.
Instead of estimating an emission probability for each possible 100 kb sequence for each state, we define a score that reflects how likely it is that this 100 kb window is contained within the band associated with a state. This score is based on the FISH mapping data for the 100 kb window. Let $X=(x_1,\ldots,x_N)$ be the sequence of 100 kb windows for a chromosome. Each FISH mapped clone is assigned to a single $x_i$ using the midpoint of the clone's placement on the draft sequence. Let $f_i=(f_{i,1},\ldots,f_{i,F})$ be the FISH mapped clones contained in window $x_i$. For each window $x_i$ and each band $b_j$, we calculate a score $s_{i,b_j}$ based on the evidence provided by $f_{i,k}$ that $x_i$ belongs in band $b_j$. We do not consider each FISH mapped clone equally, but rather assign a weight $w_{i,b_j,k}$ to each clone based on the following:
![]() |
A weight greater than 1.0 indicates that the clone supports the band assignment to some degree, and a weight less than 1.0 indicates a disagreement. FISH experiments performed at NCI use a highly accurate technique (36); therefore, we weight this data such that our band predictions will agree with it unless doing so violates some other constraint, such as band contiguity. We weight data that maps a clone to a single band more than data that maps to a range of bands. Clones in which the band mapping only slightly disagrees are weighted more than those where the mapping is to a band far from $b_j$. The exact values assigned to these weights are somewhat arbitrary, but tests indicate that slight deviations do not significantly affect the outcome. Based on this weighting scheme, we define the score $s_{i,b_j}$ as
![]() |
In order to maintain the correspondence between the large gaps and the bands that correspond to the centromeric and variable heterochromatic regions, we create pseudo-FISH mapped clones positioned in the gaps. We weight these clones such that the 100 kb windows in the large gapped regions will be mapped to the appropriate bands.
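The assignment of each clone to a single window by the midpoint of its placement, as described above, can be sketched as follows. The input layout of `(name, start, end)` tuples, the function name and the toy coordinates are hypothetical; pseudo-clones placed in gaps could be passed through the same function.

```python
from collections import defaultdict

WINDOW = 100_000  # 100 kb windows


def clones_by_window(clones):
    """Group clones by the 100 kb window that contains their midpoint.

    clones -- iterable of (name, start, end) placements on one chromosome (bp)
    Returns a dict: window index (0-based) -> list of clone names.
    """
    by_window = defaultdict(list)
    for name, start, end in clones:
        midpoint = (start + end) // 2
        by_window[midpoint // WINDOW].append(name)
    return dict(by_window)


# Made-up placements for illustration.
toy_clones = [
    ("cloneA", 50_000, 230_000),   # midpoint 140 000 -> window 1
    ("cloneB", 150_000, 170_000),  # midpoint 160 000 -> window 1
    ("cloneC", 390_000, 450_000),  # midpoint 420 000 -> window 4
]
toy_assignment = clones_by_window(toy_clones)
```

Using the midpoint means that a clone spanning a window boundary contributes evidence to exactly one window.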
We use the basic Viterbi algorithm to determine the most likely state path (45,46).
We use our scoring scheme instead of emission probabilities and our transition factors instead of transition probabilities. If π = (π_1, ..., π_N) is the path of band states for a given sequence X = (x_1, ..., x_N) of 100 kb windows, then define the score F(X, π) of this state path as
[Equation: definition of F(X, π)]
where b_{π_i} is the band corresponding to state π_i and a_{π_i,π_{i+1}} is the transition factor between states π_i and π_{i+1}. The Viterbi algorithm finds the state path π* that maximizes this score F(X, π). This path determines predicted band boundaries.
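The dynamic programming step can be sketched in Python as follows. This is a minimal illustration, not the Bander implementation: it assumes that the path score is the product of the window scores and the transition factors (by analogy with the usual HMM path probability), works in log space for numerical stability, and places no constraint on the first or last state. The function names and the `scores` and `trans` inputs are hypothetical.

```python
import math


def safe_log(value):
    """Log that maps a zero factor (forbidden move) to -infinity."""
    return math.log(value) if value > 0 else -math.inf


def viterbi(scores, trans):
    """Return the state path maximising the product of scores and transition factors.

    scores[i][j]: score of window i under band state j (assumed >= 0).
    trans[j][k]:  transition factor from state j to state k (0 = forbidden).
    """
    n_windows, n_states = len(scores), len(scores[0])
    best = [safe_log(s) for s in scores[0]]
    back = []  # back[i - 1][k] = best predecessor of state k at window i
    for i in range(1, n_windows):
        prev, best, ptr = best, [], []
        for k in range(n_states):
            j = max(range(n_states), key=lambda j: prev[j] + safe_log(trans[j][k]))
            best.append(prev[j] + safe_log(trans[j][k]) + safe_log(scores[i][k]))
            ptr.append(j)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: best[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def band_boundaries(path, window=100_000):
    """Collapse a state path into (state, start, end) segments in base pairs."""
    segments, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segments.append((path[start], start * window, i * window))
            start = i
    return segments
```

Band contiguity can be expressed in this sketch by setting `trans[j][k]` to zero unless `k == j` or `k == j + 1`, so that a path can only remain in a band or advance to the next one along the chromosome.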
| ACKNOWLEDGEMENTS |
|---|
We would like to thank the members of the BAC Resource Consortium, especially Barbara Trask, for freely providing the FISH mapping data. We thank Ensembl for providing the ideograms that are based on the band positions generated by our Bander software.
| FOOTNOTES |
|---|
* To whom correspondence should be addressed. Tel: +1 8314595431; Fax: +1 8314594829; Email: booch{at}cse.ucsc.edu
| REFERENCES |
|---|
1. Caspersson, T., Farber, S., Foley, G., Kudynoski, J., Modest, E., Simonsson, E., Wagh, U. and Zech, L. (1968) Chemical differentiation along metaphase chromosomes. Exp. Cell Res., 49, 219–222.
2. Caspersson, T., Lomakka, G. and Zech, L. (1971) The 24 fluorescence patterns of human metaphase chromosomes - distinguishing characters and variability. Hereditas, 67, 89–102.
3. Lejeune, J., Gautier, M. and Turpin, M.R. (1959) Etude des chromosomes somatiques de neuf enfants mongoliens. C.R. Acad. Sci. (Paris), 248, 1721–1722.
4. Nowell, P.C. and Hungerford, D.A. (1960) A minute chromosome in human chronic granulocytic leukemia. Science, 132, 1497–1501.
5. Deininger, M., Goldman, J. and Melo, J. (2000) The molecular biology of chronic myeloid leukemia. Blood, 96, 3343–3356.
6. Rowley, J.D. (1973) A new consistent chromosomal abnormality in chronic myelogenous leukemia identified by quinacrine fluorescence and Giemsa staining. Nature, 243, 290–293.
7. Druker, B. and Lydon, N. (2000) Lessons learned from the development of an ABL tyrosine kinase inhibitor for chronic myelogenous leukemia. J. Clin. Invest., 105, 3–7.
8. Cohen, M., Williams, G., Johnson, J., Duan, J., Gobburu, J., Rahman, A., Benson, K., Leighton, J., Kim, S., Wood, R. et al. (2002) Approval summary for imatinib mesylate capsules in the treatment of chronic myelogenous leukemia. Clin. Cancer Res., 8, 935–942.
9. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
10. Pruitt, K.D. and Maglott, D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucl. Acids Res., 29, 137–140.
11. Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T. et al. (2002) The Ensembl genome database project. Nucl. Acids Res., 30, 38–41.
12. Salamov, A.A. and Solovyev, V.V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res., 10, 516–522.
13. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
14. Kulp, D., Haussler, D., Reese, M. and Eeckman, F. (1996) A generalized hidden Markov model for the recognition of human genes in DNA. In ISMB, Vol. 4. The AAAI Press, Menlo Park, CA.
15. Reese, M., Kulp, D., Tammana, H. and Haussler, D. (2000) Genie - gene finding in Drosophila melanogaster. Genome Res., 10, 529–538.
16. Collins, F.S. (1992) Identifying human disease genes by positional cloning. In The Harvey Lectures, 1990–1991, Vol. 86. Wiley-Liss, New York, pp. 149–164.
17. Comings, D.E. (1978) Mechanisms of chromosome banding and implications for chromosome structure. A. Rev. Genet., 12, 25–46.
18. Holmquist, G., Gray, M., Porter, T. and Jordan, J. (1982) Characterization of Giemsa dark- and light-band DNA. Cell, 31, 121–129.
19. Bernardi, G. (1989) The isochore organization of the human genome. A. Rev. Genet., 23, 637–661.
20. Manuelidis, L. and Ward, D. (1984) Chromosomal and nuclear distribution of the Hind III 1.9 kb human DNA repeat segment. Chromosoma, 91, 28–38.
21. Korenberg, J.R. and Rykowski, M.C. (1988) Human genome organization: Alu, Lines, and the molecular structure of metaphase chromosome bands. Cell, 53, 391–400.
22. Chen, T.L. and Manuelidis, L. (1989) SINEs and LINEs cluster in distinct DNA fragments of Giemsa band size. Chromosoma, 98, 309–316.
23. Craig, J. and Bickmore, W. (1994) The distribution of CpG islands in mammalian chromosomes. Nat. Genet., 7, 376–382.
24. Korenberg, J.R. and Engels, W.R. (1978) Base ratio, DNA content, and quinacrine-brightness of human chromosomes. Proc. Natl Acad. Sci. USA, 75, 3382–3386.
25. Sumner, A., de la Torre, J. and Stuppia, L. (1993) The distribution of genes on chromosomes: A cytological approach. J. Mol. Evol., 117–122.
26. Bak, A., Jorgensen, A. and Zeuthen, J. (1981) Chromosome banding and compaction. Hum. Genet., 57, 199–202.
27. Sen, P. and Sharma, T. (1985) Characterization of G-banded chromosomes of the Indian muntjac and progression of banding patterns through different stages of condensation. Cytogenet. Cell. Genet., 39, 145–149.
28. Goldman, M.A., Holmquist, G.P., Gray, M.C., Caston, L.A. and Nag, A. (1984) Replication timing of genes and middle repetitive sequences. Science, 224, 686–692.
29. Saitoh, Y. and Laemmli, U. (1994) Metaphase chromosome structure: bands arise from differential folding path of the highly AT-rich scaffold. Cell, 76, 609–622.
30. Niimura, Y. and Gojobori, T. (2002) In silico chromosome staining: reconstruction of Giemsa bands from the whole genome sequence. Proc. Natl Acad. Sci. USA, 99, 797–802.
31. Trask, B.J. (1991) Fluorescence in situ hybridization: applications in cytogenetics and gene mapping. Trends Genet., 7, 149–154.
32. The BAC Resource Consortium (2001) Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature, 409, 953–958.
33. Francke, U. (1994) Digitized and differentially shaded human chromosome ideograms for genomic applications. Cytogenet. Cell Genet., 65, 206–219.
34. Mitelman, F. (ed.) (1995) ISCN 1995: An International System for Human Cytogenetic Nomenclature. Karger, New York.
35. Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M. and Haussler, D. (2002) The human genome browser at UCSC. Genome Res., 12, 2291–2300.
36. Kirsch, I., Green, E., Yonescu, R., Strausberg, R., Carter, N. and Bentley, D. (2000) A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome. Nat. Genet., 24, 339–340.
37. Bernardi, G. (1995) The human genome: organization and evolutionary history. A. Rev. Genet., 29, 445–476.
38. Trask, B.J. (2002) Human cytogenetics: 46 chromosomes, 46 years and counting. Nat. Rev. Genet., 3, 769–778.
39. Kallioniemi, A., Kallioniemi, O.-P., Sudar, D., Rutovitz, D., Gray, J.W., Waldman, F. and Pinkel, D. (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 258, 818–821.
40. Schrock, E., du Manoir, S., Veldman, T., Schoell, B., Wienberg, J., Ferguson-Smith, M., Ning, Y., Ledbetter, D., Bar-Am, I., Soenksen, D. et al. (1996) Multi-color spectral karyotyping of human chromosomes. Science, 273, 494–497.
41. Rabiner, L. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–286.
42. Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
43. Karplus, K., Sjolander, K., Barrett, C., Cline, M., Haussler, D., Hughey, R., Holm, L. and Sander, C. (1997) Predicting protein structure using hidden Markov models. Proteins Struct. Funct. Genet., 29(Suppl. 1), 134–139.
44. Karplus, K., Barrett, C. and Hughey, R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846–856.
45. Forney, G. (1973) The Viterbi algorithm. Proc. IEEE, 61, 268–278.
46. Viterbi, A. (1967) Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Informat. Theory, IT-13, 260–269.