| Human Molecular Genetics | Pages |
©1999 Oxford University Press |
Computational methods for the identification of differential and coordinated gene expression
Introduction
Methods
Differential expression studies: pairwise condition comparison
Detecting differential expression in tag sampling experiments
Detecting differential expression in hybridization-based experiments
The role of significance testing and the Bonferroni correction
Multi-conditional gene expression analysis
Detecting differential expression from multi-conditional expression data
The identification of coordinated gene expression: pairwise analysis
Identifying gene expression clusters
Discussion
Coordinated gene expression analysis in functional genomics
Future directions
Acknowledgements
References
Appendix
Computational methods for the identification of differential and coordinated gene expression
Received May 10, 1999; Accepted June 1, 1999
With the first complete `draft' of the human genome sequence expected for Spring 2000, the three basic challenges for today's bioinformatics are more than ever: (i) finding the genes; (ii) locating their coding regions; and (iii) predicting their functions. However, our capacity for interpreting vertebrate genomic and transcript (cDNA) sequences using experimental or computational means very much lags behind our raw sequencing power. If the performances of current programs in identifying internal coding exons are good, the precise 5[prime]->3[prime] delineation of transcription units (and promoters) still requires additional experiments. Similarly, functional predictions made with reference to previously characterized homologues are leaving >50% of human genes unannotated or classified in uninformative categories (`kinase', `ATP-binding', etc.). In the context of functional genomics, large-scale gene expression studies using massive cDNA tag sequencing, two-dimensional gel proteome analysis or microarray technologies are the only approaches providing genome-scale experimental information at a pace consistent with the progress of sequencing. Given the difficulty and cost of characterizing genes one by one, academic and industrial researchers are increasingly relying on those methods to prioritize their studies and choose their targets. The study of expression patterns can also provide some insight into the function, reveal regulatory pathways, indicate side effects of drugs or serve as a diagnostic tool. In this article, I review the theoretical and computational approaches used to: (i) identify genes differentially expressed (across cell types, developmental stages, pathological conditions, etc.); (ii) identify genes expressed in a coordinated manner across a set of conditions; and (iii) delineate clusters of genes sharing coherent expression features, eventually defining global biological pathways.
INTRODUCTION
For a long time, the common view has been that the deciphering of genomic sequence information would mostly be accomplished by means of automated computational methods, implementing a set of rules describing the architecture of genes and a finite catalogue of regulatory elements and functional signatures. With the first complete genome of a multicellular organism in hand (1), several others to follow rapidly (2,3) and a `draft' of the human genome to be completed by Spring 2000 (4), we know that this will not happen. The accomplishments of bioinformatics in the context of higher eukaryote genomes have been humbling and basic analyses such as precisely identifying the intron/exon architecture of genes and the precise boundaries of their transcript(s) are still performed with unacceptable uncertainties (5). The prediction of promoter locations (and properties) is also a notable failure (6-8).
On the other hand, bioinformatics methods become immediately more useful if they can be supplemented with some experimental knowledge (i.e. transcript maps, homologous sequences, etc.). Thus, the present successes of bioinformatics are truly in the realm of `reverse engineering', i.e. decoding the genetic information using some associated experimental insight.
In the context of functional genomics, computational methods were also expected to significantly contribute to the prediction of gene function. Here again, the results have been poor. If rich catalogues of recurrent protein motifs have been designed (9-14), their association with a precise function is often too vague to identify the precise biological pathway in which they operate. Moreover, close to 50% of all newly identified genes do not exhibit a significant similarity with a previously well-characterized homologue or fall into uninformative categories such as `protein kinases' or `transcription factors'.
As genomic sequencing was picking up, methods to monitor the expression of many genes simultaneously were also designed and progressively scaled up to allow genome-wide studies. The technique of `differential display' (15,16) and the generation of expressed sequence tags (ESTs) (17-23) have first been used for the identification of genes exhibiting marked differential expressions across tissues, development stages or normal versus pathological conditions. The original EST approach was then improved by the use of smaller, concatenated and more numerous cDNA tags (24-26). As an alternative to sequencing, cDNAs can also be identified by oligonucleotide fingerprinting (27). More recently, microarrays capable of providing hybridization-based expression monitoring of tens of thousands of genes in parallel have become available, with two main technologies competing: oligonucleotide-grafted chips (28-31) and cDNA-printed glass slides (32-34). The latter are most often used in conjunction with a two-colour fluorescence competitive assay (35,36). Numerous recent review articles have commented on the enormous potential of microarray-based transcriptome studies (see, for example, refs 37-42). However, tag-counting methods are still very popular and important results have been obtained with the EST (43-53) or SAGE (54-59) protocols.
The analysis of gene expression patterns derived from normal and pathological situations is a valuable tool in the discovery of therapeutics targets and diagnostic markers. The recognition of coordinated expression profiles between characterized or anonymous genes also enables inferences about biological pathways and gene functions to be made.
At the moment, the measurement of gene expression using microarrays or cDNA tag sampling appears to be the sole approach to gene characterization capable of matching the speed of sequencing and the scale requirement of functional genomics. In consequence, expression profiles for many genes and from multiple experimental conditions often constitute the main information (besides the sequences themselves) with which to guide the `reverse engineering' process of functional genomics. Thus, a general understanding of the various ways such data can be used becomes central. In this article, I review the concepts and methodologies involved in the interpretation of gene expression profiling experiments. Following the pioneering work of Anderson et al. (60), several recent articles have discussed the bioinformatics of large-scale expression monitoring, emphasizing computational (61-69) or data management (70-77) aspects.
METHODS
Differential expression studies: pairwise condition comparison
In the simplest experimental situation, gene expression is compared between two conditions such as normal versus pathological or control versus drug-treated. The general form of the expression data table is then:
| Condition A | Condition B | |
| Gene 1 | g1A | g1B |
| Gene 2 | g2A | g2B |
| Gene 3 | g3A | g3B |
| Gene... | g...A | g...B |
| Gene n | gnA | gnB |
In the context of EST (17) or SAGE (24) studies, the expression `intensities' gA and gB are cDNA tag counts and the genes listed in the first column are those for which at least one occurrence was observed from at least one of the libraries (A or B). The sampling sizes NA and NB (i.e. the total numbers of sequenced tags) may vary from a few hundred to several tens of thousands depending on the laboratory (or company) sequencing capacity. Accordingly, the number n of observed genes also varies from a few hundred to a few thousand.
In RT-PCR- (64) or hybridization-based (30,32) studies, the expression intensities are derived from absorbency or fluorescence `analog' signals, often normalized to a number of mRNA molecules using a known quantity of exogenous control mRNA. Here again, the total number of simultaneously studied genes ranges from a hundred to several thousands (or tens of thousands).
The lay-out of the above data table recalls a setting for which biologists might think it appropriate to use the traditional [chi]2 2×k significance test. However, this would be incorrect. The purpose of the [chi]2 computation is to test whether conditions A and B significantly differ as a whole, using the entire A and B columns of expression intensities. The question asked through differential expression experiments is different; it is to identify the peculiar genes, the expression of which significantly varies between the two conditions. At the two extremes, ubiquitous genes will exhibit no variation, while `condition-specific' (e.g. tissue-, developmental stage- or disease-specific) genes will only be detected in A or B. In this section I review and discuss the different statistical methods required to mine the expression intensity tables generated from tag sampling or microarray experiments.
Detecting differential expression in tag sampling experiments
Large tag sampling experiments are usually not replicated. This implies that the standard errors associated with each expression measurement cannot be estimated from its dispersion and that none of the standard tests requiring variance estimates (such asStudent's t-test) can be used. Fortunately, the result of randomly sampling tags from a large set of genes is very well approximated by the Poisson distribution, which implicitly provides a built-in estimate of the standard error. In this context, Audic and Claverie (61) have derived a new significance test specifically adapted to the analysis of tag sampling data. Their basic result is quite simple: for two sampling experiments A and B, involving the same total number of tags, the probability of observing gA and gB tags for a gene equally expressed in both conditions is given by:
![]() |
1 |
Small values for p correspond to large differences between gA and gB, unlikely to arise by chance if the gene under scrutiny is expressed at the same level in conditions A and B. Provided that all experimental factors are well replicated, statistically significant discrepancies (such as p << 1%) between the values of gA and gB can thus be used to point out the gene most likely to be differentially expressed.
It is worth noting that the sampling size (i.e. the total number N of generated tags) does not appear in 1 and has no direct influence on the p value. The statistical significance of the differences observed in tag counts only depends on the absolute values gA and gB. This apparent paradox is discussed in Audic and Claverie (61). The form of 1 also indicates that analyzing expression measurements in terms of gA/gB ratios (as is customary in most published works) is not appropriate, as it cannot be related to a confidence estimate. Equation 1 provides a quantitative test of our intuition that a 2-fold increase computed from a gA = 10 versus gB = 20 change in tag counts [p(20|10) = 0.014] is more robust and significant than the same ratio computed from a gA = 1 versus gB = 2 transition [p(2|1) = 0.19]. In fact, a more rigorous significance test requires the use of the cumulative form of 1. A table of [gA, gB] couples corresponding to the usual 5 and 1% significance thresholds is provided in Audic and Claverie (61). In the same work, Audic and Claverie also extended 1 to the more practical case of A versus B comparisons involving different total numbers of tags NA and NB. This significance test was successfully validated using computer simulation on real EST data. As expected, the frequency of genes falsely identified as differentially expressed was found to be less than or equal to the selected significance threshold (i.e. 5% of false identifications when using a significance p value of 5%). The general form of the significance test can be used interactively on a web site at http://igs-server.cnrs-mrs.fr or the source code obtained from the authors.
Fisher's 2 × 2 exact test (78) is also being used to analyse tag sampling experimental data, most notably the Cancer Genome Anatomy project (43). This test is traditionally used for the analysis of 2 × 2 contingency tables arising from treatment versus control experiments. To fit this test, the data corresponding to each gene in the original two-condition expression matrix must be, quite artificially, rewritten as:
| Condition A | Condition B | |
| Gene 1 | g1A | g1B |
| All others | NA - g1A | NB - g1B |
where g1A and g1B are the tag counts associated with gene 1 and NA and NB are the total numbers of tags generated in experiments A and B, respectively. On theoretical grounds, the validity of using Fisher's 2 × 2 exact test in such a setting is not clear. Rigorously, the test requires the sums of columns and rows to be known prior to the experiment. Also, the definition of the `all others' aggregated gene category is logically inconsistent, as it implies that the genes expressed and observed in conditions A and B are the same, which might be a largely incorrect assumption. In practice, however, probability values computed according to Audic and Claverie (61) or from Fisher's test are close, with the latter being slightly too conservative (i.e. a larger expression bias is required to reach a given significance threshold) (61). As for 1, the setting for Fisher's test again emphasizes that the significance of expression changes must be assessed from the tag count values themselves and not from their ratio.
Fisher's exact test is more appropriate when studying the distribution of alternative transcript forms in two different conditions. The data setting then becomes:
| Condition A | Condition B | |
| Short form of gene 1 | gSA | gSB |
| Long form of gene 1 | gLA | gLB |
now corresponding to a traditional `association' experiment (i.e. is a gene form preferentially associated with one of the conditions?) for which Fisher's 2 × 2 exact test is well suited. Such a design has been applied to a large-scale analysis of alternative polyadenylation in human mRNAs (63) using the Merck-sponsored EST (22,23) data set.
Detecting differential expression in hybridization-based experiments
Hybridization-based experiments [or quantitative RT-PCR protocols (79,80)] produce `analog' expression intensities in contrast to the `digital' nature of the tag counting protocols discussed above. The expression data matrix thus consists of real numbers such as:
| Condition A | Condition B | |
| Gene 1 | g1A,g[prime]1A,g[prime][prime]1A | g1B,g[prime]1B,g[prime][prime]1B |
| Gene 2 | g2A,g[prime]2A,g[prime][prime]2A | g2B,g[prime]2B,g[prime][prime]2B |
| Gene 3 | g3A,g[prime]3A,g[prime][prime]3A | g3B,g[prime]3B,g[prime][prime]3B |
| Gene... | g...A,g[prime]...A,g[prime][prime]...A | g...B,g[prime]...B,g[prime][prime]...B |
| Gene n | gnA,g[prime]nA,g[prime][prime]nA | gnB,g[prime]nB,g[prime][prime]nB |
where g, g[prime] and g[prime][prime] denote replicated measurements (here in triplicate) of expression intensities in A versus B conditions, normalized to a common internal control (e.g. exogenous mRNA). By definition, the genes deemed to exhibit significant expression changes will be those for which the absolute difference in the average expression intensities |gB - gA| is much larger than the estimated standard error or computed from the dispersion of g, g[prime] and g[prime][prime] measurements. Multiple independent experiments are thus essential to the assessment of significance with traditional statistical procedures such as the unrelated t-test. For a given confidence level, smaller differences will be required as the number of replicate measurements increases. For experiments performed in duplicate, |gB - gA| has to be larger than 4.3 ( is the estimated standard error) to reach the 5% significance threshold, and larger than 22.3 for the 1% significance level. For experiments performed in triplicate, the requirements are |gB - gA| > 2.8 and |gB - gA| > 5.2 for the 5 and 1% levels, respectively. Most published large-scale studies are quite elusive about measurement reproducibility and the confidence levels of the observed changes in expression are rarely assessed using standard methods. When the information is available, experiments have been done in triplicate (64), in duplicate (32,36) or only partially duplicated (http://cmgm.stanford.edu/~kimlab/ ). Some redundancy (e.g. 20 probes/gene, some probe sets duplicated) is built into the Affimetrix oligonucleotide array technology and directly used by the data acquisition software (GENECHIP; Affymetrix, Santa Clara, CA). However, this does not alleviate the need to assess the dispersion of expression intensities obtained from different chips and different complex mRNA probes.
In the above studies, the traditional methods to assess the statistical significance of the observed differences are not used. Instead, ad hoc thresholding procedures are used, resulting in the elimination of subsets of genes and expression measurements. An all-or-none `reliable' versus `unreliable' classification is thus used in place of a progressive ranking of expression changes according to p values. In the rare cases where the filtering procedure is described in enough detail, it can then be compared with a more traditional significance assessment.
For instance, Schena et al. (32), in their pioneering work on large-scale cDNA microarrays using two-colour fluorescence hybridization, adopted the rule of only retaining expression intensities for which the difference of duplicate measurements did not exceed half their average. Translated into classical statistical terms, this constraint corresponds to:
![]() |
2 |
Then, they classified `differentially expressed' genes as those exhibiting at least a 2-fold change in expression, i.e.:
![]() |
3 |
A straightforward combination of the two previous equations shows that expression changes considered to be significant may include cases for which:
![]() |
4 |
Expression differences of the order of 1.5do not reach the 5% significance threshold (which is 4.3 for duplicate experiments). Thus, the filtering procedure used in this work is not conservative enough. In their simultaneous monitoring of 1000 genes, Schena et al. (32) found 15 genes exhibiting expression ratios of ~2. For purely statistical reasons, we expect that a fraction of those genes might be `false positive' rather than bona fide differentially expressed genes. I noticed that, in their first study with 46 Arabidopsis cDNAs (36), the same authors adopted a more conservative gB/gA [ge] 5 ratio as their threshold for differential expression. Combined with their reproducibility constraint (2), this higher ratio is valid and does ensure that (on average) <2.5% of the calls for differentially expressed genes will be due to random fluctuations.
The role of significance testing and the Bonferroni correction
Large-scale hybridization experiments require numerous manipulations to produce the final expression data matrix. Various calibration steps are for instance needed to ensure the linearity of the fluorescence measurements and internal controls are used to transform fluorescence intensities into number of mRNA molecules. Various controls (e.g. `housekeeping' genes) are also used to verify the consistency of expression intensities obtained from different mRNA pools and different microarrays. At the end, mathematical conversions (using offsets, logarithms, ratios, etc.) are used in the production of the final expression data. These many calibrations and normalization procedures might convey a false sense of security, in particular for protocols such as the elegant two-colour competitive hybridization assays in which error correction mechanisms may appear to be built in. Yet, if one wishes to associate a confidence level with the measured changes, it has to be clear that no data processing or elegant protocol can substitute for the requirement of multiple (at least two) independent determinations of the expression intensities.
Confidence levels offer a rational way to output and interpret the results of large-scale differential expression experiments. As discussed in Audic and Claverie (61), p values constitute an objective measure of the quality of the evidence (that a gene is differentially expressed) and can be used to rank the candidate genes (as in the output of database similarity searches) and to prioritize further analyses. For instance, the confidence levels associated with couples of expression intensities (in number of tags) such as [gA = 10, gB = 26, ratio 2.6] and [gA = 1, gB = 5, ratio 5] point out the former as far better evidence for differential expression. Similarly, the application of significance testing to microarray data would sort out the best candidate genes among confusing combinations of ratios and expression levels. Lowly expressed genes with expression ratio gB/gA [ap] 5 might then become less promising than highly expressed genes with gB/gA [ap] 1.5.
The use of confidence levels is also relative to the number of genes simultaneously tested. Given a (small) probability p that a result will occur by chance (i.e. its significance threshold), there is a probability:
![]() |
5 |
for this result to occur at least once in N independent trials. Thus, if candidate genes are selected on the basis of expression changes significant at the 5% level, a false prediction rate equal to 5% of the total number of assayed genes is expected. For a 1000 gene array, this is 50. Choosing a p value threshold thus corresponds to fixing the level of acceptable risk (i.e. the fraction of false leads an experimenter is willing to tolerate).
Conversely, significance testing can be used in the traditional way, i.e. to `demonstrate' the reality of an observation. In this case, the experimenter will have to apply the so-called `Bonferroni' correction when fixing its significance threshold. This simply consists of imposing a p value such as:
| Np << 1 | 6 |
to ensure the absence of `false positives'. Given the large (and increasing) number of genes tested in microarray or in tag experiments this corresponds to very small p values (e.g. 5%/10 000 = 5 × 10-6). Unfortunately, the constraints on the magnitude of expression changes and on the measurement accuracy required to achieve such a high confidence level might not be experimentally feasible. Strict application of the Bonferroni correction could discard many biologically significant changes.
Multi-conditional gene expression analysis
The various technologies allowing parallel expression monitoring of large sets of genes are now being applied to the study of development and differentiation (43,49,54,64,68,81; http://cmgm.stanford.edu/~kimlab/ ) and of the transcriptional response to various factors in yeast (31,33,56,65,82-85) and mammalian (32,65,86) cells. In this context, the data take the form of a multi-conditional expression intensity matrix:
| Condition A | Condition B C | ondition C | Condition... | Condition Z | |
| Gene 1 | g1A | g1B | g1C | g1... | g1Z |
| Gene 2 | g2A | g2B | g2C | g2... | g2Z |
| Gene 3 | g3A | g3B | g3C | g3... | g3Z |
| Gene... | g...A | g...B | g...C | g... | g...Z |
| Gene n | gnA | gnB | gnC | gn... | gnZ |
where the various A-Z conditions might correspond to a time series (i.e. after stimulation), successive stages of differentiation, various stages of a disease process (cancer), growth conditions or tissue or cell types. The same analysis might also be used to investigate the transcriptional effects of drugs or gene transfer.
As before, the expression intensities g may consist of cDNA tag counts (i.e. each condition corresponds to the sampling of a different library) or analog values obtained from quantitative RT-PCR, microarray-based protocols or even protein two-dimensional gel electrophoresis.
Detecting differential expression from multi-conditional expression data
Obviously, the results of a multi-conditional expression experiment can still be used to identify differentially expressed genes by comparing gene expression levels between any pair of conditions. However, the proper Bonferroni correction will have to be applied to assess the statistical significance of the results. For an expression intensity table with M conditions, the correction factor (i.e. multiplying the probability for a result occurring by chance) is M(M - 1)/2.
Greller and Tobin (67) have proposed a new and robust computational method for the identification of `selective expression' (i.e. a pattern in which the expression is markedly high or markedly low in a single particular condition) from multiple condition expression data. The method combines assessments of the reliability of expression measurements with a statistical test of expression profiles. They consider that measurements in at least 10 different conditions are required to make a reliable assessment of exceptional gene expression intensities.
Beyond the detection of differential expression, however, two new types of analyses can be performed using multi-conditional expression data, namely: (i) the identification of pairs of genes exhibiting coordinated expression; and (ii) the clustering of genes according to their expression profiles.
The identification of coordinated gene expression: pairwise analysis
Each row of a multi-conditional expression matrix corresponds to a gene expression profile, technically a vector in a space of M dimensions (where M is the number of assayed conditions). A given gene is thus represented as:
| gene i = { giA, giB, giC, gi...,..., giZ} |
Two genes can exhibit various forms of `coordinated expression' (Fig. 1). At the qualitative level, they might tend to be expressed together (genes 1 and 3) or exclude each other (gene 2 versus genes 1 and 3). At the quantitative level, their abundance might follow a linear dependency or a more complex relationship (quadratic, sigmoidal, exponential, etc.). A simplified statistical procedure to identify pairs of genes exhibiting correlated expression is described in the Appendix. The basic principle behind all methods is that coordinated expression will be suspected when the expression profiles of two genes are more similar (or more dissimilar) than expected by chance. Coordinated expression is thus inferred through pairwise comparisons of all rows (gene profile vectors) in the expression data matrix.
Figure 1. Example of expression profiles (fictitious data). Gene 1 = {g1A, g1B, g1C, g1D, ..., g1H}, gene 2, gene 3 and gene 4 vectors are represented as profiles using their expression intensities as coordinates for the various conditions (or time points) A-H. The profiles for genes 1 and 3 have a similar overall shape, suggesting a correlated expression. The profile for gene 2 exhibits the opposite variation, suggesting an anti-correlation with genes 1 and 3. Thus, genes 1-3 illustrate coordinated expression patterns. The profile for gene 4 indicates an expression pattern independent of the other three genes.
Various methods can be used to assess the pairwise similarity of gene expression profiles. For tag sampling experiments, for instance, 2 × 2 co-expression contingency tables can be computed from the multi-conditional expression data, such as:
| Gene 2 detected | Gene 2 not detected | |
| Gene 1 detected | In 10 libraries | In 2 libraries |
| Gene 1 not detected | In 2 libraries | In 8 libraries |
For the above example, the application of Fisher's 2 × 2 exact test would indicate that a significant association exists between the occurrence of genes 1 and 2 in the panel of the sampled 22 libraries. The same design is capable of detecting anti-association as well. However, reducing the tag counts to a binary scale (detected versus not detected) is only advisable when the quantitative data are unreliable, such as in the case of normalized libraries.
For both tag sampling and `analog' expression data, Spearman's rank correlation test provides a way to assess the overall shape similarity of two expression profiles. For each gene, the conditions are ranked according to expression intensities. For instance, gene 1 follows the decreasing order C, D, E, B, F, G, A, H (Fig. 1). Genes associated with `parallel' profiles, for instance gene 3, correspond to a similar order: C, E, D, F, B, A, G, H. Spearman's rank correlation rs is simply a linear correlation computed on the ranked expression intensities of genes 1 and 3, as in the table:
| Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 | |
| Gene 1 | g1C | g1D | g1E | g1B | g1F | g1G | g1A | g1H |
| Gene 2 | g3C | g3E | g3D | g3F | g3B | g3A | g3G | g3H |
Kendall's significance test can also be used for the purpose of assessing a correlation in rank order. While the ranking procedure may cause important information loss in the data, it is well suited to the detection of strongly non-linear correlations between two genes.
Nevertheless, the traditional Pearson's linear correlation coefficient gives good results in most cases and has been used in many studies (32,65,84,86-88). This test can detect pairs of genes with similar (gene i = gene j), proportional (gene i [prop] gene j) or opposite (gene i = -[prop] gene j) expression profiles. Pearson's linear correlation coefficient is also associated with a p value which assesses the confidence level for suspected coordinated expression.
The results of a complete pairwise correlation analysis can be summarized in a matrix of `expression similarity' across all genes, such as:
| Gene 1 | Gene 2 | Gene 3 | Gene... | Gene n | |
| Gene 1 | r11 | r12 | r13 | r1... | r1n |
| Gene 2 | r21 | r22 | r23 | r2... | r2n |
| Gene 3 | r31 | r32 | r33 | r3... | r3n |
| Gene... | r...1 | r...2 | r...3 | r... | r...n |
| Gene n | rn1 | rn2 | rn3 | rn... | rnn |
There are a number of ways of visualizing these results. They involve different methods of classifying the genes that exhibit correlated expression patterns into `similarity clusters'. The simplest procedure will use the property of transitivity: if gene i is significantly correlated with gene j and gene j significantly correlated with gene h, then gene i, j and h are put in the same cluster. Unfortunately, this method is very sensitive to the choice of an arbitrary threshold and to the uncertainty of pairwise gene correlation assessment. A better procedure consists of using the concept of Euclidean distance (square root of the sum of the squared differences in each dimension) to transform the above gene correlation table into a matrix of bona fide pairwise distances (88). For instance, the distance between genes 1 and 2 is computed as:
![]() |
7 |
An important point is that the distance d1,2 between gene 1 and gene 2 now takes into account their similarity with all other genes and is no longer computed from a single pairwise comparison. Such a distance is then less sensitive to the random fluctuation of expression measurements. Using Euclidean distances, two genes exhibiting a poor pairwise correlation might still appear close by virtue of their correlation patterns with other genes. Indeed, other types of Euclidean distance can be computed from the multi-conditional expression matrix, for instance by directly using the expression intensities for each condition as coordinates (64).
Once a set of bona fide pairwise distances is available, a number of clustering methods can be applied to reveal subsets of genes obeying similar patterns of expression. These methods are discussed in the next section.
Identifying gene expression clusters
Pairwise gene distance matrices computed from expression analysis or from sequence alignments are similar mathematical objects. Methods traditionally used for molecular phylogeny can thus be used to identify clusters of genes sharing a similar expression pattern. Wen et al. (64), for instance, used the Fitch algorithm (89) in the Phylip package (90,91) to interpret the temporal gene expression of 112 genes during the development of the rat central nervous system. The resulting tree clearly identified four major subsets of genes sharing four types of expression profile. A hierarchical (i.e. tree building) method adapted to the direct clustering of the correlation matrices mentioned above has been used by Eisen et al. (65) to analyse a 12 point time course of the serum response of 8600 human genes and a 75 condition expression study of the whole yeast genome. For other examples of the use of hierarchical clustering see Schena et al. (32) Lashkari et al. (33) and Khan et al. (81).
As pointed out by Tamayo et al. (68), hierarchical methods have their shortcomings and are best suited to situations of actual hierarchical descents (such as in molecular evolution). Fortunately, the design of clustering methods is a well-established field of research and a large number of alternative procedures are possible and can be used.
Principal component analysis (PCA), for instance, can be directly applied to a matrix of multiple condition expression intensities to compute the relative positions of genes and project them into the most discriminant three- or two-dimensional space (64,92). Visual inspection or pattern recognition software can then be used to delineate expression clusters.
Computer scientists are also beginning to specifically adapt classical graph theory-based clustering techniques (93) to the problem of analysing `noisy' expression data. For instance, the corrupted clique-graph model and the cluster affinity search technique (CAST) proposed by Ben-Dor and Yakhini (94) were successfully tested on two large multiple experiment data sets. The HCS algorithm proposed by Hartuv et al. (95) to cluster collections of cDNAs on the basis of their oligonucleotide fingerprints is also a good example of a graph theoretical technique designed to tolerate large stochastic perturbations.
The traditional hierarchical approaches, principal component analyses and the graph theoretical techniques cited above, are all examples of clustering procedures not requiring any assumption of the number of clusters sought.
Other methods, such as the self-organizing map (SOM) procedure adopted by Tamayo et al. (68) require fixation of the number of `nodes' (which will serve as nucleation points for the genes clusters), as well as their initial geometry in the space of the multi-conditional expression intensities. However, SOMs can be computed very quickly, even on a large data set. An iterative procedure can thus be implemented to explore the underlying cluster structure and converge to an optimal partition.
Finally, the most sophisticated clustering methods are those aiming at inferring causal relationships and regulatory mechanisms from multi-conditional expression measurements: the genes are still partitioned into clusters, but the partition now has an internal structure involving inhibitory or activating interactions. Genes belonging to the same cluster are now integrated into a coherent pathway. Initially developed for the analysis of chemical reactions (96) or the interpretation of complex genetic networks (97), such clustering approaches are now adapted to the analysis of large-scale expression experiments and to the modelling (or `reverse engineering') of transcriptional regulatory pathways (98,99). In a recent work, Chen et al. (100) addressed the problem of identifying a small set of candidate regulatory genes from multiple time series of expression measurements. Using Boolean circuits to model biological pathways, Karp et al. (101) are tackling the problem from the other end, by designing algorithms for choosing the most appropriate conditional expression experiments (e.g. gene disruption or external stimulus) that will reveal underlying regulatory networks.
Clustering methods and graphical displays are two closely related aspects of the interpretation of multi-conditional expression experiments. After hierarchical clustering, for instance, trees and colour maps are the most natural representations (64,65,92). A colour map is designed as follows (Fig. 2). Given the gene hierarchy, the rows (i.e. the genes) of the primary data matrix can be re-ordered by placing the genes sharing similar expression profiles next to each other. With the exception of time series, a similar re-ordering procedure can also be applied to the columns (e.g. tissue type, growth condition or pathological samples) most similar in terms of gene expression. The re-ordered primary data table can then be displayed by colouring each cell on the basis of intensity, variability, gene function, etc. Visual inspection of the resulting colour map will often reveal domains of similar or contrasting colours or remarkable shapes, eventually suggesting new regulatory pathways as well as disease or differentiation mechanisms. Standard image processing techniques (e.g. contrast enhancing, boundary detection, etc.) may be used to supplement human natural talent for pattern recognition.
Figure 2. Example of a colour map derived from rice EST data. (A) Colour map generated from a gene expression correlation analysis (88) of publicly available rice EST data (707 genes in 10 cDNA libraries). EST counts are represented according to the colour scale shown below the map. The colour scale has been chosen so as to optimally represent the [5-55] range of EST counts (cells with a value >55 are assigned the colour red). The green and yellow arrows point to groups of genes with specific expression patterns. The green arrow points to a set of genes mostly expressed in library 307 (green shoot, 8 days old) and the yellow arrow to genes mostly expressed in library 193 (etiolated shoot, 8 days old). (B and C) Different magnified regions of the colour map. To the right of each region, genes are shown with their putative identification, if available. To the left of each region, the relevant portion of the tree used to re-order the original data table is shown. Different colour scales were used in (A), (B) and (C), to provide optimal contrast in the display of the EST counts.
DISCUSSION
Coordinated gene expression analysis in functional genomics
The elegant two-hybrid system assay (102) is one of the most popular techniques in functional genomics. Given the cDNA of a protein of interest, this technique allows the identification of other proteins capable of interacting with it directly from a pool of target cDNAs. In this assay, the specific physical interaction between the probe and target proteins is directly used to trigger a reporter gene.
The computational analysis of a multi-conditional gene expression experiment can be seen as an extension of this technique, using the statistical interaction between the expression data of genes, rather than the physical interaction between their products.
The network of interaction revealed by the computational technique may encompass genes involved in the same biological pathway in a non-contiguous manner, as well as genes negatively interfering with each other.
In practice, the detection of correlated/coordinated gene expression nicely complements sequence-based bioinformatics methods in three main ways:
- in assigning a precise biological pathway to a gene of `generic' function (such as transcription factor or kinase);
- in relating an anonymous gene to better characterized genes;
- in revealing unexpected relationship between previously known genes or pathways.
Gene expression correlation analyses might also considerably help sequence-based bioinformatics approaches in the study of eukaryotic promoters. Among genes exhibiting correlated expression patterns across a large panel of biological conditions, a significant fraction is expected to be co-regulated, i.e. responsive to a common set of expression factors. The promoters of these genes should then contain common regulatory elements. Thus, the identification of gene expression clusters constitutes precious accessory information for the `reverse engineering' of the very elusive architecture of eukaryotic promoters. This opportunity did not escape the attention of Brazma et al. (103), who have systematically analysed the upstream regions of yeast genes exhibiting similar expression profiles. Completion of the genomic sequence of the nematode Caenorhabditis elegans should allow a similar study to be run on a multicellular organism, using the large gene expression data set (~150 conditions) generated by Kim and co-workers (http://cmgm.stanford.edu/~kimlab/ ).
It is worth noting that the computational approaches reviewed here can be applied at the protein level. In the search for correlations, cDNA microarrays and gene expression levels are simply replaced by two-dimensional electrophoresis and protein spot intensities (60,104). In the few comparative studies that have been performed, important discrepancies have been noted between expression measurements made at the transcriptome versus the proteome level (104,105). It has been suggested that protein spot signatures correlate better with phenotype than gene expression intensity (104).
Finally, the same computational techniques are also being used in the information intensive, massively parallel, drug screening protocols (106-108). Euclidean distances, hierarchical clustering and colour maps are very familiar concepts in the interpretation of large-scale (e.g. 49 000 compounds tested against 60 cell lines) molecular pharmacology studies (108).
Future directions
I have distinguished three levels in the interpretation of multi-condition gene expression data:
(i) the identification of differentially expressed genes;
(ii) the identification of pairwise gene expression correlations;
(iii) the delineation of gene clusters according to gene expression patterns. The computational methods to accomplish step 1 are well worked out and the complexity of the task is only of the order of N, the total number of genes analysed in the multi-conditional expression experiment.
Methods to perform step 2 are also well defined. Given the a priori unknown mathematical form of the correlation, the best approach would certainly involve the use of a variety of tests, each of them best suited to the recognition of a specific type of dependency (Pearson's, Spearman's, mutual information, etc.) (87). Provided a large number of conditions are tested, it is also conceivable that a relationship more subtle than a simple monotonic dependency (i.e. pairs of genes always going up or down together, or the opposite) might become detectable (e.g. pairs of genes positively correlated in some conditions and negatively correlated in some others). In any case, the identification of pairwise correlation has a complexity of the order of N(N - 1)/2 and is not a computational difficulty for modern computers.
The real challenge remains in the clustering step, for which algorithmic approaches abound, but the best choice is not clear to biologists. For most clustering methods, the complexity is of the order of Nlog(N), and again is not a real computational difficulty. However, experimental errors and the complexity of the underlying regulatory network structure require that arbitrary similarity thresholds or minimal graph connectivity rules are incorporated in the practical implementation of all algorithms. The difficulty with clustering is thus not in the design, or the choice, of a perfect method, but rather in the fact that all algorithms will fail for an unknown fraction of the cases and that there is no simple way to decide which will perform best for an arbitrary (experimental) data set. Bioinformatics research is thus very active in this area, but the prospect is poor that alternative clustering protocols will produce vastly different biological results from the same multi-conditional expression data.
Given the complexity of regulatory networks, it is also not clear whether clustering by traditional methods (PCA, hierarchical clustering, etc.) is adding more to our understanding of biological pathways than the simple knowledge of all pairwise gene expression correlations. In practice, most academic or industrial biologists will be mining multi-conditional gene expression data with an idea in mind and look for correlations with previously defined genes, such as specific tumor suppressors, cytokines or membrane receptors, in a search for surrogate markers or alternative targets. Improving clustering might thus be both difficult and biologically pointless.
The true, more biologically relevant, exciting future of gene expression clustering might thus lie in the more abstract researches attempting a complete reverse engineering of transcriptional regulation networks. Only such approaches have the potential to produce an integrated view of the cell pathways from the intricate combination of individual gene expression patterns. Detailed modelling of signalling pathways and the establishment of causal relationships are required to model developmental mechanisms and elucidate, for example, how commonly used signalling pathways are able to elicit tissue-specific responses in multicellular organisms. However, current works (101,109) in this area are only at the preliminary stage of defining which constraints must be fulfilled to allow a complex regulatory network architecture to be inferred from gene expression patterns. It is thus too early to tell whether this goal will ever be reachable from a reasonable number of experiments.
ACKNOWLEDGEMENTS
I wish to thank C. Abergel, S. Audic and F. Gosse for reading the manuscript and R. Ewing and A. Ben Kahla for allowing the use of their results in Figure 2 prior to publication.
REFERENCES
Appendix
Identifying coordinated gene expression from perfect binary data
We start from a binary (0/+) multi-conditional gene expression matrix such as:
| lib0 | lib1 | lib2 | lib3 | lib4 | lib5 | lib6 | lib7 | lib8 | lib9 | |
| Gene 0 | + | + | + | + | + | 0 | 0 | 0 | 0 | 0 |
| Gene 1 | 0 | 0 | 0 | 0 | 0 | + | + | + | + | + |
| Gene 2 | 0 | + | 0 | + | 0 | 0 | + | 0 | + | 0 |
| Gene 3 | + | 0 | + | 0 | + | 0 | 0 | + | 0 | 0 |
| Gene 4 | 0 | 0 | + | 0 | 0 | + | 0 | + | 0 | + |
| Gene 5 | + | + | + | + | + | 0 | 0 | 0 | 0 | 0 |
| Gene 6 | 0 | 0 | 0 | 0 | 0 | + | + | + | + |







