Human Molecular Genetics Advance Access originally published online on September 14, 2005
Human Molecular Genetics 2005 14(Review Issue 2):R171-R181; doi:10.1093/hmg/ddi335
Interactome: gateway into systems biology
1Center for Cancer Systems Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, 44 Binney Street, Boston, MA 02115, USA and 2Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
* To whom correspondence should be addressed. Tel: +1 6176323802; Fax: +1 6176325739; Email: Michael_cusick{at}dfci.harvard.edu; marc_vidal{at}dfci.harvard.edu; david_hill{at}dfci.harvard.edu
Received August 18, 2005; Accepted September 1, 2005
| ABSTRACT |
|---|
Protein-protein interactions are fundamental to all biological processes, and a comprehensive determination of all protein-protein interactions that can take place in an organism provides a framework for understanding biology as an integrated system. The availability of genome-scale sets of cloned open reading frames has facilitated systematic efforts at creating proteome-scale data sets of protein-protein interactions, which are represented as complex networks or interactome maps. Protein-protein interaction mapping projects that follow stringent criteria, coupled with experimental validation in orthogonal systems, provide high-confidence data sets eminently useful for interrogating developmental and disease mechanisms at a system level as well as elucidating individual protein function and interactome network topology. Although far from complete, currently available maps provide insight into how biochemical properties of proteins and protein complexes are integrated into biological systems. Such maps are also a useful resource to predict the function(s) of thousands of genes.
| SYSTEMATIC MAPPING OF INTERACTOME NETWORKS |
|---|
Most gene products mediate their function within complex networks of interconnected macromolecules. Studies in model organisms suggest that complex macromolecular networks have topological and dynamic properties that reflect biological phenomena (1
We consider the full interactome network as the complete collection of all physical protein-protein interactions that can take place within a cell. Construction of comprehensive sets of protein-protein interactions (interactomes) requires the creation of genome-scale resource collections of open reading frames (ORFeomes) cloned so as to facilitate protein expression, generated iteratively based on improved gene predictions and experimental verification and capturing all expressed isoforms (splice variants and polymorphisms). ORFeomes, as faithful representations of the encoded proteome, provide the starting material for carrying out high-throughput interaction studies that are then validated by orthogonal interaction methods. The resulting interactome maps are regarded as framework information; and by integrating other functional genomic and proteomic data sets, increasingly detailed and reliable biological models can be generated (3).
Model organisms have provided the basis for a systematic characterization of physical protein-protein interactions (interactome mapping). Initial efforts focused on defined biological processes or modules for the yeast Saccharomyces cerevisiae and the worm Caenorhabditis elegans (4,5). Subsequently, proteome-scale interactome mapping projects for eukaryotes have been carried out in yeast, worm and fly (6-10). Current estimates for the complete yeast interactome suggest ~28 000 potential protein interactions, on the basis of experimental and computational analyses (6,7,11-15) along with incorporating literature-curated interactions such as those collected in the MIPS databases (16). So far, the worm and fly interactome maps each contain approximately 5000 high-quality putative interactions derived primarily from high-throughput yeast two-hybrid (Y2H) screens (8-10). These two data sets demonstrate the feasibility of interactome mapping projects for metazoans, and they also illustrate the power of integrating multiple approaches to model biological networks (17). However, to fully understand human biology and the molecular mechanisms underlying diseases such as cancer, systematic experimental mapping of the human interactome itself is necessary.
Although completed genome sequences provide lists of tens of thousands of predicted unique proteins (~25 000 for the human proteome, disregarding splice variants and post-translational modifications), the sequences by themselves do not provide an understanding of the underlying principles of cellular systems. Proteome-scale information is also required at structural, functional and dynamic levels. This information should encompass various molecular networks, such as regulatory, biochemical or protein-protein interaction networks. The initial challenge is the generation of comprehensive network maps, generally depicted as nodes (e.g. proteins, RNAs, DNA binding sites or metabolites) linked by edges corresponding to molecular interactions (e.g. protein-protein interactions, enzymatic reactions, DNA-protein interactions, etc.). For each network map, individual nodes and edges need to be perturbed systematically to help in understanding the logic of the molecular networks involved in any biological process of interest. As biological systems are highly dynamic and fluid, information needs to be obtained on where and when nodes appear or disappear, on where and when edges take place, and on the rewiring of the network as sub-networks appear or disappear during developmental and cell cycle stages.
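The node-and-edge representation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the protein names `P1` to `P5` and the helper functions `build_network`, `remove_node` and `connected_components` are hypothetical, and deleting a node is used here as a simple stand-in for a systematic perturbation.

```python
from collections import defaultdict

def build_network(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return dict(network)

def remove_node(network, node):
    """Return a copy of the network with one node and its edges removed."""
    return {n: nbrs - {node} for n, nbrs in network.items() if n != node}

def connected_components(network):
    """Group nodes into connected components by depth-first search."""
    seen, components = set(), []
    for start in network:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(network[n] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical toy interactions; P1..P5 are placeholder names.
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P3", "P5")]
network = build_network(edges)
perturbed = remove_node(network, "P3")
```

In this toy map, removing the single node `P3` splits one connected network into three pieces, which is the kind of effect a systematic node perturbation is meant to reveal.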
Here, we review recent progress in interactome mapping, emphasizing the need for high-confidence, experimentally derived data sets to drive the construction and use of these maps as frameworks for integrating other genome-scale information, such as genetic interactions, expression profiling and phenotypic analyses.
| FUNCTIONALITY AND MULTIFUNCTIONALITY |
|---|
A significant hindrance to a comprehensive understanding of human biology, encompassing both the individual parts and the integrated whole, is the limited information available for most human genes, beyond the completed DNA sequence of the euchromatic portion of the genome (18
40% with no annotated functional element. In contrast, in S. cerevisiae nearly all genes have an assigned Gene Ontology (GO) term (20
| STANDARDIZED ORF COLLECTIONS |
|---|
The nearly 250 complete genome sequences (18 eukaryotic genomes plus 230 microbial genomes) and the 234 eukaryotic genome sequences in progress or nearing completion, as of June 1, 2005, constitute an enormous, albeit admittedly substantially untapped, wealth of biological information (http://www.ncbi.nlm.nih.gov/Genomes/). However, for nearly all sequenced genomes, the exact gene count as well as the precise exon-intron structure for each protein-coding gene remains incomplete, and with available approaches perhaps indeterminable, even for the well-annotated C. elegans and S. cerevisiae genomes (8
| PROTEINS, COMPLEXES AND NETWORKS |
|---|
A common theme pervading biological investigation is that most proteins generally function as components of complexes that contain other macromolecules to carry out specific biological processes, and networks of interactions connect multiple, different cellular processes (43
With 30% or more of human genes lacking functional annotation, and with just a few thousand human genes well characterized, one pressing need is to obtain comprehensive and unbiased data sets of all potential binary and complex membership interactions. Currently, two experimental methodologies are used for generating genome-scale protein interaction maps at high throughput: high-throughput yeast two-hybrid (HT-Y2H) (61-63) and analysis of protein complexes by affinity purification and mass spectrometry (APMS) (64,65). Yeast two-hybrid (Y2H), as a binary assay, captures direct protein-protein interactions, whereas APMS identifies components of stable complexes. Both assays individually can provide useful information on protein function by employing guilt-by-association and majority rule principles. Some information on the dynamic nature of interactions can be obtained when Y2H and APMS are combined (66) or when interaction data is supplemented by data from expression profiling and phenotypic analyses (3,17).
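As a rough illustration of the guilt-by-association and majority rule principles, the sketch below assigns an uncharacterized protein the most common annotation among its annotated interaction partners. The function name `predict_function`, the protein identifiers and the annotation labels are all hypothetical, and the sketch assumes one annotation per protein, which is a strong simplification.

```python
from collections import Counter

def predict_function(protein, interactions, annotations):
    """Return the most common annotation among a protein's annotated partners.

    Returns None when no partner is annotated or when the top vote is tied.
    """
    partners = set()
    for a, b in interactions:
        if a == protein and b != protein:
            partners.add(b)
        elif b == protein and a != protein:
            partners.add(a)
    votes = Counter(annotations[p] for p in partners if p in annotations)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # no single annotation wins among the partners
    return ranked[0][0]

# Hypothetical data: U1 and U2 are uncharacterized, A, B and C are annotated.
interactions = [("U1", "A"), ("U1", "B"), ("C", "U1"), ("U2", "C")]
annotations = {"A": "DNA repair", "B": "DNA repair", "C": "signaling"}
```

Here `predict_function("U1", interactions, annotations)` returns `"DNA repair"` because two of its three annotated partners carry that label. A protein with several unrelated functions would be poorly served by such a single-label vote.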
An alternative to experimental determination of protein interactions is prediction by various computational genomics approaches (67-70). Computational genomics utilizes information on individual protein interactions taken from publicly available databases such as BIND (71), MIPS (72) and HPRD (73,74), relies on annotations defined by Pfam (19) and GO (20), and combines these data with sequence similarities within genomes and orthologies across genomes for in silico prediction of protein-protein interactions (75-80).
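One simple way to use orthologies across genomes is to transfer an interaction observed in one species to the corresponding orthologous pair in another. The sketch below illustrates only that idea and is not the specific method of any cited study; the function name `transfer_interactions` and the identifiers (`y1`, `h1` and so on) are hypothetical.

```python
def transfer_interactions(source_pairs, orthologs):
    """Predict target-species pairs whose orthologs interact in the source species.

    source_pairs: iterable of (protein, protein) pairs from the source species.
    orthologs: dict mapping a source protein to a list of target orthologs.
    Returns a set of unordered predicted pairs (frozensets).
    """
    predicted = set()
    for a, b in source_pairs:
        for ta in orthologs.get(a, ()):
            for tb in orthologs.get(b, ()):
                predicted.add(frozenset((ta, tb)))
    return predicted

# Hypothetical data: y* are source-species proteins, h* are target orthologs.
source_pairs = [("y1", "y2"), ("y2", "y3")]
orthologs = {"y1": ["h1"], "y2": ["h2a", "h2b"]}
predicted = transfer_interactions(source_pairs, orthologs)
```

The pair involving `y3` yields no prediction because `y3` has no listed ortholog, which shows how incomplete orthology tables translate directly into missed predictions.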
| IDENTIFICATION OF PROTEIN INTERACTIONS BY YEAST TWO HYBRID |
|---|
The Y2H system, as originally described by Fields and Song (81
The canonical Y2H system consists of a separable DNA binding domain (DB) from a transcriptional activator protein (yeast Gal4 or bacterial lexA) fused to protein X, generally referred to as the bait, and a separable transcriptional activation domain (AD) fused to protein Y, termed the prey. When DB-X and AD-Y are co-expressed in the nucleus of yeast cells, X-Y protein-protein interactions reconstitute a functional transcription factor that activates one or more reporter genes (Fig. 1A). Two outstanding virtues of Y2H are: (i) DNA, not protein, is manipulated to study both bait and prey (85) so ORFeome resources are readily employed (8,83); and (ii) it is readily adapted to high-throughput methods (61,83,85-88).
Standard Y2H generally underestimates the number of interactions, because forced subcellular localization of bait and prey in the yeast nucleus may preclude certain interactions from taking place, a particular instance being interactions involving integral membrane proteins. For interactions that require specific post-translational modifications, unless the enzymes responsible for such modification happen to be present in the yeast nucleus, the interaction may not be detectable by Y2H. For these reasons and others, the canonical Y2H has an inherent false-negative rate that limits the number of potential protein-protein interactions (62
| TECHNICAL FALSE-POSITIVES IN Y2H |
|---|
Both experimental and computational methods for identifying protein-protein interactions will exhibit some degree of false-positives. We consider false-positives to be of two distinct classes. Biological false-positives are those in which the interaction can be confirmed by multiple, different methods, but the two proteins are never present in the same cell or subcellular compartment at the same time. These false-positives are nearly impossible to unequivocally identify using interaction assays alone. The second class is technical false-positives that can occur in any experimental system. Early HT-Y2H experiments might have been compromised by a high rate of technical false-positives owing to specific features of the particular Y2H system available at that time. In later and ongoing HT-Y2H experiments the technical false-positive rate is substantially reduced by, among other improvements, incorporating multiple reporter genes to measure transcription activation, employing different DNA sequences for binding by DB in the promoters of the reporter genes, using low copy number vectors, and, importantly, retesting interacting pairs in fresh yeast (62
The initial genome-wide Y2H studies contained significant technical false-positives (86), most likely because the influence of auto-activation was not recognized. The high proportion of false-positives does not necessarily diminish the impact of these studies (86,94,95), but does reflect the challenges faced in moving from highly focused, reductionist approaches to global approaches that attempt to interrogate entire genomes and proteomes in an unbiased way that can capture all potential interactions.
Several computational methods have been developed in an attempt to reduce the impact of DB-X auto-activators. Auto-activators appear as sticky or promiscuous baits that have many interaction partners, which often do not share any common functional annotation. Therefore, criteria such as cut-offs for maximal number of interactors for a given bait protein or common functional annotation among interactors (9,82,96-98) have been used to eliminate putative auto-activators. This strategy reduces the size and complexity of the resulting interactome network, but at a cost, as some of the baits or preys with large numbers of interactors, or with interactors that do not share common functional annotation, may still represent real interactions. Alternatively, auto-activators can be, and should be, removed experimentally by first testing for prey-independent growth on selective media to remove strong auto-activators (93), and by identifying and removing latent auto-activators during the screening process (83,91,99). Once technical false-positives are removed, the quality of the resulting data set is significantly improved (10,83,100).
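The computational cut-off criterion can be sketched as follows. This is a minimal illustration of a maximal-interactor threshold only; the function name `filter_promiscuous_baits`, the bait and prey identifiers and the threshold value are hypothetical, and, as noted above, such a filter can discard real interactions and is no substitute for experimental removal of auto-activators.

```python
from collections import defaultdict

def filter_promiscuous_baits(interactions, max_partners):
    """Drop all pairs whose bait has more than max_partners distinct preys.

    interactions: list of (bait, prey) pairs.
    Returns (kept_pairs, flagged_baits).
    """
    preys_by_bait = defaultdict(set)
    for bait, prey in interactions:
        preys_by_bait[bait].add(prey)
    kept = [(b, p) for b, p in interactions
            if len(preys_by_bait[b]) <= max_partners]
    flagged = sorted(b for b, preys in preys_by_bait.items()
                     if len(preys) > max_partners)
    return kept, flagged

# Hypothetical screen: bait B2 behaves like a "sticky" bait with four preys.
pairs = [("B1", "X1"), ("B1", "X2"),
         ("B2", "X1"), ("B2", "X2"), ("B2", "X3"), ("B2", "X4")]
kept, flagged = filter_promiscuous_baits(pairs, max_partners=3)
```

With the arbitrary threshold of three, every pair involving `B2` is dropped, including any that might be genuine, which is the loss of real interactions that the text cautions against.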
| THOUSANDS OF INTERACTIONS: ON THE WAY TO PROTEOME-SCALE INTERACTOMES |
|---|
Large-scale Y2H screens for Helicobacter pylori (82
The two groups that conducted HT-Y2H for yeast used the same ~6000 ORFs as baits, but there was less than 15% overlap in detected interaction pairs (6,7). The low overlap has raised the concern that Y2H is inherently noisy (98,102). Moreover, each of the two screens had less than 13% overlap with a literature set of high-confidence interactions compiled from single-gene analyses (2,7). Although the genome-wide Drosophila screen reported ~5000 high-quality interactions (9), two smaller, focused screens for Drosophila showed minimal overlap with this larger study (103,104), echoing the situation seen for the two yeast studies. This low overlap suggests that high-throughput screens should be performed more than once to recover the maximum number of interaction pairs.
It is assumed that the published literature on protein interactions encompasses the most comprehensive and best-annotated information available. However, the various curated data sets compiled from the literature show limited overlap with each other (2,74,83,105). The limited overlap among various data sets is likely due to several factors, including uneven sampling, high false-negative rates in experimental and computational protocols, and irreproducibility within and between different experimental and computational approaches (74).
| VALIDATION OF Y2H BY ORTHOGONAL ASSAYS |
|---|
Although many interesting and novel interacting pairs have been obtained in the various global Y2H screens, any particular result should be viewed with caution until validated by another distinct method. For the yeast, bacterial and fly screens, computational methods were used to gather sets of high-confidence interactions (96
For the C. elegans and human HT-Y2H studies, stringent criteria were imposed before and during the screen (10,83). By rigorously eliminating auto-activators and by systematically retesting all interaction pairs, experimentally derived high-confidence Y2H data sets were obtained. To further validate Y2H interaction pairs experimentally, a co-affinity purification (co-AP) strategy was used, taking advantage of the respective C. elegans and human ORFeome collections (Fig. 2) (8,83). The orthogonal co-AP assay requires that the interacting pairs, expressed in mammalian cells (as opposed to yeast nuclei with the Y2H), form stable complexes that can be isolated and analyzed by immunoassay. With both the worm and human data sets, the collection of high-confidence Y2H interactions showed an ~80% validation rate by co-AP (10,83). A recent effort at examining protein complexes biochemically based on the H. pylori Y2H data set achieved a comparable cross-validation rate (107). Thus, experimental orthogonal approaches demonstrate that these HT-Y2H interaction data sets contain mostly highly reliable interactions.
| IDENTIFICATION OF INTERACTING PROTEINS BY MASS SPECTROMETRY |
|---|
There are two basic strategies for determination of protein membership in a complex, direct (purification of a stable complex and elucidation of the components of the complex by mass spectrometry) and co-AP (purification of a complex by virtue of an affinity tag placed on one of its components, then elucidation of the components of the complex by mass spectrometry). Direct analysis can identify novel members of stable complexes (108
In two distinct studies of the yeast proteome by APMS (11,12) there was limited overlap among the common complexes identified, and the reproducibility was approximately 70%. Interestingly, 98 complexes were previously characterized, whereas 134 complexes were novel (11), which demonstrates the ability of APMS to identify novel interactions. The fact that interactions previously characterized in the processes of DNA damage response and signal transduction were not all recovered could be indicative of an experimental or technical bias in the basic methodology (11). Alternatively, because many complexes involve very transient interactions and/or individual components are not readily detectable owing to low expression, APMS will underestimate the extent of complex co-membership. The lack of overlap between the two studies, along with the identification of specific proteins shared among distinct complexes, has also raised the issue of false-positives. However, a systematic analysis suggests that a majority of novel and shared components are likely to be biologically relevant (108), which means that APMS is a reliable method for identifying novel components of complexes and for assigning putative function based on guilt-by-association and majority rule criteria.
| COMPLEMENTARITY OF APMS AND Y2H |
|---|
APMS presumably identifies interactions that occur in the native cellular environment, provided that temporal and spatial expression of the baits is normal, although purification of complexes can lead to both loss of real interactions and gain of spurious ones. In contrast, with Y2H all interactions occur in the heterologous environment of the yeast nucleus. With APMS, expression of proteins in their normal cell or tissue may allow identification of post-translational modifications, but the heterologous expression in Y2H generally precludes this. However, APMS may miss weak, transient associations, whereas Y2H is a better choice for such interactions (86
| MULTIFUNCTIONALITY: REALITY DISGUISED AS FALSE-POSITIVES |
|---|
As described already, technical false-positives can be eliminated from high-throughput screens by rigorous testing and validation. However, even when technical false-positives are fully removed there can still remain interactions that are reproducibly real but do not appear to make sense based on existing functional annotation of either or both/all interactors. These may well be biological false-positives that arise from the forced out-of-context expression of baits and preys in Y2H or from overexpression of affinity-tagged proteins in APMS. Nevertheless, because an interaction does not appear to make sense does not necessarily mean that it is a false-positive. There is a growing list of moonlighting proteins performing multiple, apparently unrelated functions. Many moonlighting proteins are metabolic enzymes with additional functional activity, although numerous signaling, structural and nucleic acid binding proteins are also multifunctional (21
The existence of moonlighting proteins argues that Y2H and APMS results that initially appear to be biologically irrelevant should not be dismissed out-of-hand. Instead, novel interactions found by unbiased approaches such as Y2H and APMS may provide clues on new functional annotations for well-characterized proteins and potential multiple functions for previously uncharacterized proteins. An uncharacterized protein may connect different functions as either a moonlighting protein or through interactions with other uncharacterized proteins (105). Thus, simple reliance on conserved domains and majority rule assessments may lead to incomplete functional annotation of both characterized and uncharacterized proteins.
| BUILD AS YOU GO: INTERACTOME WALKING |
|---|
In interactome walking, interaction partners identified from an initial screen are subsequently used as baits in secondary and tertiary rounds of screening, a strategy readily accomplished by accessing cloned ORFeome collections. Interactome walking has been implemented for the DNA replication network of Bacillus subtilis (113
| COMPUTATIONAL ASSESSMENT OF HIGH-THROUGHPUT DATA SETS |
|---|
Efforts to assess biological relevance and/or reliability of HT-Y2H and HT-APMS interactome data sets have utilized phylogenetic conservation, subcellular localization patterns, co-expression correlations, and network topology (94
As none of the compiled data sets of experimentally derived protein-protein interactions constitutes a truly comprehensive data set for any eukaryotic organism (71-74,83,105), and given the high false-negative rate intrinsic to Y2H and APMS, computational prediction of interactions has also been undertaken. Computational prediction efforts have utilized conserved protein and DNA sequences, functional annotations based on GO, co-localization or homologous interactions in other species, and are complemented by mathematical modeling employing strategies such as Bayesian networks (75-78,80,119-122). The daunting challenges faced by purely computational approaches are to effectively incorporate the nearly 40% of human genes with minimal or no annotation, and to incorporate multifunctional proteins, particularly when only one functional activity is established. Recent analyses of multifunctional proteins demonstrate that computational predictions can accurately identify multiple functions but also have an inherent false-positive rate (123). Computational prediction utilizing subcellular localization data avoids potential false-positives but can lead to increased false-negatives when a protein is multifunctional and/or active in multiple subcellular compartments or complexes (48,49,108).
| PROTEIN INTERACTION NETWORKS: THE INTERACTOME |
|---|
The most comprehensive interaction maps currently available are actually compilations of high-throughput Y2H and APMS appended to an assemblage of literature-reported interactions curated in various interaction databases (10
Regardless of the data source, be it APMS or Y2H, or genome-scale or smaller scale, many interconnections between distinct biological processes are noted (6-10,82,129). Many of these interconnections involve novel or otherwise unknown proteins, thereby suggesting functions for these proteins. For example, a Y2H study in worm (129) identified many novel DNA damage response genes (Fig. 3). Moreover, proteins of known function are found in interaction clusters that are of completely different function, suggesting potentially new functions for these known proteins (48,49,130,131). Some of the interacting proteins were partially characterized previously and have putative functions in line with the function of the known cluster (113,132), but most are uncharacterized. In addition, every global-scale Y2H screen has identified novel interactions not previously observed by focused examination of individual protein clusters.
| SYSTEMATIC MAPPING AND INTEGRATION OF DATA |
|---|


