Chapter Nine - SHOPIN: Semantic Homogeneity Optimization in Protein Interaction Networks

https://doi.org/10.1016/bs.apcsb.2015.07.004Get rights and content

Abstract

Protein interaction networks (PINs) are argued to be the richest source of hidden knowledge of the intrinsic physical and/or functional meanings of the involved proteins. We propose a novel method for computational protein function prediction based on semantic homogeneity optimization in PIN (SHOPIN). The SHOPIN method creates graph representations of the PIN augmented by inclusion of the semantics of the proteins and their interacting contexts. Network wide semantic relationships, modeled using random walks, are used to map the augmented PIN graphs in a new semantic metric space. The method produces a hierarchical partitioning of the PIN optimal in terms of semantic homogeneity by iterative optimization of the ratio of between clusters dissimilarities and within clusters similarities in the new semantic metric space. Function prediction is done using cluster wide-hierarchy high function enrichment. Results validate the rationale of the SHOPIN method placing it right next to state-of-the-art approaches performance wise.

Introduction

Proteins serve as building blocks and functional components of every living cell. These macromolecules take part in some of the most important functions in an organism, such as constitution of the organs, the catalysis of biochemical reactions (necessary for metabolism), and the maintenance of the cellular environment. Thus, the knowledge of protein's functions plays an important role in а broad range of processes like development of new drugs, better understanding of diseases, invention of synthetic biochemicals, etc. Early approaches for computational protein function prediction (Weaver, 2002) were low-throughput because of the enormous experimental and human effort that was required for analyzing a specific gene or protein. The breakthrough of high-throughput experimental technologies resulted in producing wide variety of useful data, ranging from simple protein sequences to complex proteomic data, such as gene expression data sets and protein interaction networks (PINs). However, these improvements come at a price of having a vast amount of proteins with unknown functions. The main challenge lies in developing cost-effective procedures to analyze such massive data which uncover their intrinsic physical and/or functional meaning. This is prompting the design of computational models capable of discovering hidden knowledge within the high-throughput proteomic data as one of the important challenges of the postgenomic era.

The concept of protein function is highly context-sensitive and is not clearly defined. It can be interpreted as an umbrella term for all the cellular, molecular, or physiological activities that a protein is involved in. There are different notational schemes for the protein function that can be organized either in a hierarchical structure (EC (Webb, 1992), SCOP (Murzin, Brenner, Hubbard, & Chothia, 1995), CATH (Orengo et al., 1997)) or in a graph-like structure (Gene Ontology (GO); Ashburner et al., 2000, Gene Ontology Consortium, 2006). GO is a structured and controlled vocabulary, based on solid computer science and biological principles, for describing gene and protein products and as such is an important step in the computational protein function prediction. GO characterizes proteins in three major aspects stored as separate ontologies within the GO: biological process, molecular function, and cellular component. Each ontology consists of a set of terms (GO terms), connected to each other in a directed acyclic graph. At this moment, GO can be recognized as the most applied functional annotation scheme across a wide variety of biological data (Jensen et al., 2003, Letovsky and Kasif, 2003) and as such is the scheme considered in our research.

The computational prediction of protein function is a challenging problem, and can be characterized by several key difficulties (Pandey, Kumar, Steinbach, & Meyers, 2012): (1) the number of function terms is rather large, (2) each protein can be annotated with several terms, and (3) the terms are hierarchically organized and are unbalanced. Taking into account the nature of the underlying proteomic data being analyzed, various computational models have been proposed to address one or more of these issues (Cesa-Bianchi et al., 2012, Jones et al., 2014, Pandey et al., 2012, Radivojac et al., 2013). The effectiveness of the approaches depends on the input data, i.e., the hidden knowledge that can be discovered from the data.

Protein–protein interaction (PPI) data produced by high-throughput techniques, such as microarray coexpression analysis and yeast 2-hybrid experiments, is one of the most commonly used sources for enriched representation of the biological data for an organism, since these are fundamental to almost all biological processes (Hakes, Lovell, Oliver, & Robertson, 2007). The PPI data have the nature of networks having proteins as nodes and pairwise interactions as links between the nodes. These networks are referred to as PINs. Such network representation provides a richer data format for protein function prediction compared to protein's sequence or structure data alone. The protein function prediction in terms of PIN can be seen as the process of understanding the proteins context in the PIN. This process can be augmented by taking into account the content and/or topology of the network and thus perform a more informed decision on the potential function(s) of a query protein. According to the approach being applied, PIN-based function prediction can be categorized as: (1) neighborhood-based (Hishigaki et al., 2001, McDermott et al., 2005, Schwikowski et al., 2000), where protein annotations are transferred from the most “dominant” annotations among the protein's neighbors; (2) global optimization-based (Deng et al., 2003, Nabieva et al., 2005), where the neighboring proteins may not contain enough information, so the functions of the query protein are inferred from the indirectly connected proteins; (3) clustering-based (Brandes et al., 2003, Dunn et al., 2005), where protein's functions are taken from the functional majority of the module (cluster) where the query protein belongs; and (4) association-based (Hu et al., 2005, Tan et al., 2005), similar to clustering approaches, but here functional modules are hypothesized from frequently occurring sets of interactions in PINs of protein complexes.

Clustering-based approaches are perhaps the most common approaches for global network analysis, since studying proteins in isolation is not enough for full understanding of the cell—(clusters of) interactions need to be delineated as well, since proteins work with other proteins to regulate and support each other for specific functions (Von Mering et al., 2002). Traditional graph-based clustering methods use different metrics to partition an unweighted PIN, but unfortunately they produce few giant core clusters with many tiny ones (Barabasi & Oltvai, 2004). This can be avoided by weighting the PIN according to its topological properties (Arnau et al., 2005, Friedel and Zimmer, 2006, Pereira-Leal et al., 2004, Rives and Galitski, 2003). Integration of additional PPI information in the calculation of the PIN weights is used in Luo et al. (2007), while more sophisticated weighting schemes combined with optimization techniques can be found in Bader and Hogue (2003), Enright, Van Dongen, and Ouzounis (2002), and King, Pržulj, and Jurisica (2004). Neighborhood density was used in Bader and Hogue (2003) to assign weights to nodes and then to find densely connected regions by outward traversal starting from a highly weighted protein. The numbers of intercluster and intracluster edges are used in King et al. (2004) to compute a cost function that is optimized by a local search algorithm. Random walks and Markov chains are combined in Enright et al. (2002) to simulate a flow on the graph during which currents in edges are strengthened or reduced which leads to partitioning in the graph. An insight view of the inherent organization of protein complexes can be provided by the core-attachment structure (Gavin et al., 2006). This characteristic has been successfully applied in the methods such as COACH (Wu, Li, Kwoh, & Ng, 2009) and CORE (Leung, Xiang, Yiu, & Chin, 2009) for detecting protein complexes from PINs.

Besides the topology-based weights, the edges of the PIN can be weighted using semantic information contained within its nodes. A protein in a PIN is annotated with one or more functional GO terms. These GO term sets can be used to infer semantic relationships between the proteins in a PIN. Most recent approaches to computational protein function prediction exploit the concept of semantic similarity measures to facilitate the decision-making process. PROCOMOSS (Mukhopadhyay, Ray, & De, 2012) uses a multiobjective evolutionary approach in which graphical properties as well as biological properties based on GO semantic similarity measure are considered as objective functions for detecting protein complexes in a PIN. CSO (Zhang et al., 2013) can effectively take advantage of the correlation between frequent GO annotation sets and the dense subgraph for protein complex prediction clustering. Modularity optimization algorithms are employed on a semantically enriched PIN in Trivodaliev, Bogojeska, and Kocarev (2014). Results presented in these papers are proof of the significance of the PIN contents for the clustering performance and for the function prediction.

Here, we present a novel clustering method for computational protein function prediction based on semantic homogeneity optimization in the PIN. We refer to this approach as semantic homogeneity optimization in PIN (SHOPIN). The aims of the proposed method are manifold. We give different graph representations of the PIN each having a different level of inclusion of the semantics of a protein and its interacting context in the PIN. We further model the network wide semantic relationship by mapping the previously defined graph representations into a new semantic metric space using random walks. These two steps are general and can be employed in any given clustering-based computational protein function prediction. In our approach, the clusters of the graph representing the PIN are produced by iteratively increasing the ratio of between clusters dissimilarities and within clusters similarities, and since we are clustering in the semantic metric space the produced result will be optimal in terms of semantic homogeneity of the final clusters. Once the final clusters are obtained, the protein function prediction is performed using cluster-based function enrichment. The performance of the SHOPIN method is evaluated in terms of its ability to correctly predict an annotation term set for a query protein. This evaluation is done using a leave-one-out method which iterates through all proteins of the PIN, having a single protein as a query protein per iteration. Each iteration predicts GO terms (protein functions) for the query protein and predictions are compared with the a priori knowledge about the query protein annotation GO term set.

Section snippets

Materials and Methods

In Fig. 1, we give the overview of the framework for our SHOPIN method consisting of four consecutive steps. As previously stated, each of these steps is independent and in this proposed architecture one can substitute any step with any other approach/algorithm. We first give information about the data used in our experiments, and then define the details used in our implementation of the framework.

Results and Discussion

In order to evaluate the performance of our method, we tested and compared it with several state-of-the-art clustering algorithms. The competing clustering algorithms we tested against are the most recently developed methods which are classified as having the best performances (Lancichinetti & Fortunato, 2009). The FC (Clauset, Newman, & Moore, 2004) and the BGLL (Blondel et al., 2008) algorithms infer the community structure of a network by greedy optimization of the modularity function (

Conclusions

High-throughput technologies produce vast amounts of useful data about proteins and their role in the complex mechanisms of the cell at an unprecedented rate, imposing as imperative the creation of cost-effective computational analysis of such data. PINs are paramount for the computational protein function prediction. This process of understanding the context of a protein in the PIN naturally exploits the graph properties of the PIN. We present a SHOPIN that exploits both the content and the

References (61)

  • T. Fawcett

    An introduction to ROC analysis

    Pattern Recognition Letters

    (2006)
  • A.G. Murzin et al.

    SCOP: A structural classification of proteins database for the investigation of sequences and structures

    Journal of Molecular Biology

    (1995)
  • C.A. Orengo et al.

    CATH—A hierarchic classification of protein domain structures

    Structure

    (1997)
  • V. Arnau et al.

    Iterative cluster analysis of protein interaction data

    Bioinformatics

    (2005)
  • M. Ashburner et al.

    Gene ontology: Tool for the unification of biology

    Nature Genetics

    (2000)
  • G.D. Bader et al.

    BIND—A data specification for storing and describing biomolecular interactions, molecular complexes and pathways

    Bioinformatics

    (2000)
  • G.D. Bader et al.

    An automated method for finding molecular complexes in large protein interaction networks

    BMC Bioinformatics

    (2003)
  • A.L. Barabasi et al.

    Network biology: Understanding the cell's functional organization

    Nature Reviews. Genetics

    (2004)
  • V.D. Blondel et al.

    Fast unfolding of communities in large networks

    Journal of Statistical Mechanics: Theory and Experiment

    (2008)
  • U. Brandes et al.

    Experiments on graph clustering algorithms

    (2003)
  • B.J. Breitkreutz et al.

    The GRID: The general repository for interaction datasets

    Genome Biology

    (2003)
  • N. Cesa-Bianchi et al.

    Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference

    Machine Learning

    (2012)
  • A. Chatr-Aryamontri et al.

    MINT: The molecular INTeraction database

    Nucleic Acids Research

    (2007)
  • A. Clauset et al.

    Finding community structure in very large networks

    Physical Review E

    (2004)
  • M. Deng et al.

    Prediction of protein function using protein–protein interaction data

    Journal of Computational Biology

    (2003)
  • R. Dunn et al.

    The use of edge-betweenness clustering to investigate biological function in protein interaction networks

    BMC Bioinformatics

    (2005)
  • S.S. Dwight et al.

    Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO)

    Nucleic Acids Research

    (2002)
  • A.J. Enright et al.

    An efficient algorithm for large-scale detection of protein families

    Nucleic Acids Research

    (2002)
  • T.S. Evans et al.

    Line graphs, link partitions, and overlapping communities

    Physical Review E

    (2009)
  • T.S. Evans et al.

    Line graphs of weighted networks for overlapping communities

    The European Physical Journal B

    (2010)
  • C.C. Friedel et al.

    Inferring topology from clustering coefficients in protein–protein interaction networks

    BMC Bioinformatics

    (2006)
  • A.C. Gavin et al.

    Proteome survey reveals modularity of the yeast cell machinery

    Nature

    (2006)
  • Gene Ontology Consortium

    The gene ontology (GO) project in 2006

    Nucleic Acids Research

    (2006)
  • U. Güldener et al.

    MPact: The MIPS protein interaction resource on yeast

    Nucleic Acids Research

    (2006)
  • L. Hakes et al.

    Specificity in protein interactions and its relationship with sequence diversity and coevolution

    Proceedings of the National Academy of Sciences

    (2007)
  • H. Hishigaki et al.

    Assessment of prediction accuracy of protein function from protein–protein interaction data

    Yeast

    (2001)
  • Y. Ho et al.

    Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry

    Nature

    (2002)
  • H. Hu et al.

    Mining coherent dense subgraphs across massive biological networks for functional discovery

    Bioinformatics

    (2005)
  • T. Ito et al.

    A comprehensive two-hybrid analysis to explore the yeast protein interactome

    Proceedings of the National Academy of Sciences

    (2001)
  • I. Ivanoska et al.

    Protein function prediction using semantic driven K-medoids clustering algorithm

    International Journal of Machine Learning and Computing

    (2014)
  • Cited by (2)

    View full text