SHOPIN: Semantic Homogeneity Optimization in Protein Interaction Networks

doi:10.1016/bs.apcsb.2015.07.004

Advances in Protein Chemistry and Structural Biology

Volume 101, 2015, Pages 323-349

https://doi.org/10.1016/bs.apcsb.2015.07.004 Get rights and content

Abstract

Protein interaction networks (PINs) are argued to be the richest source of hidden knowledge of the intrinsic physical and/or functional meanings of the involved proteins. We propose a novel method for computational protein function prediction based on semantic homogeneity optimization in PIN (SHOPIN). The SHOPIN method creates graph representations of the PIN augmented by inclusion of the semantics of the proteins and their interacting contexts. Network wide semantic relationships, modeled using random walks, are used to map the augmented PIN graphs in a new semantic metric space. The method produces a hierarchical partitioning of the PIN optimal in terms of semantic homogeneity by iterative optimization of the ratio of between clusters dissimilarities and within clusters similarities in the new semantic metric space. Function prediction is done using cluster wide-hierarchy high function enrichment. Results validate the rationale of the SHOPIN method placing it right next to state-of-the-art approaches performance wise.

Introduction

Proteins serve as building blocks and functional components of every living cell. These macromolecules take part in some of the most important functions in an organism, such as constitution of the organs, the catalysis of biochemical reactions (necessary for metabolism), and the maintenance of the cellular environment. Thus, the knowledge of protein's functions plays an important role in а broad range of processes like development of new drugs, better understanding of diseases, invention of synthetic biochemicals, etc. Early approaches for computational protein function prediction (Weaver, 2002) were low-throughput because of the enormous experimental and human effort that was required for analyzing a specific gene or protein. The breakthrough of high-throughput experimental technologies resulted in producing wide variety of useful data, ranging from simple protein sequences to complex proteomic data, such as gene expression data sets and protein interaction networks (PINs). However, these improvements come at a price of having a vast amount of proteins with unknown functions. The main challenge lies in developing cost-effective procedures to analyze such massive data which uncover their intrinsic physical and/or functional meaning. This is prompting the design of computational models capable of discovering hidden knowledge within the high-throughput proteomic data as one of the important challenges of the postgenomic era.

The concept of protein function is highly context-sensitive and is not clearly defined. It can be interpreted as an umbrella term for all the cellular, molecular, or physiological activities that a protein is involved in. There are different notational schemes for the protein function that can be organized either in a hierarchical structure (EC (Webb, 1992), SCOP (Murzin, Brenner, Hubbard, & Chothia, 1995), CATH (Orengo et al., 1997)) or in a graph-like structure (Gene Ontology (GO); Ashburner et al., 2000, Gene Ontology Consortium, 2006). GO is a structured and controlled vocabulary, based on solid computer science and biological principles, for describing gene and protein products and as such is an important step in the computational protein function prediction. GO characterizes proteins in three major aspects stored as separate ontologies within the GO: biological process, molecular function, and cellular component. Each ontology consists of a set of terms (GO terms), connected to each other in a directed acyclic graph. At this moment, GO can be recognized as the most applied functional annotation scheme across a wide variety of biological data (Jensen et al., 2003, Letovsky and Kasif, 2003) and as such is the scheme considered in our research.

The computational prediction of protein function is a challenging problem, and can be characterized by several key difficulties (Pandey, Kumar, Steinbach, & Meyers, 2012): (1) the number of function terms is rather large, (2) each protein can be annotated with several terms, and (3) the terms are hierarchically organized and are unbalanced. Taking into account the nature of the underlying proteomic data being analyzed, various computational models have been proposed to address one or more of these issues (Cesa-Bianchi et al., 2012, Jones et al., 2014, Pandey et al., 2012, Radivojac et al., 2013). The effectiveness of the approaches depends on the input data, i.e., the hidden knowledge that can be discovered from the data.

Protein–protein interaction (PPI) data produced by high-throughput techniques, such as microarray coexpression analysis and yeast 2-hybrid experiments, is one of the most commonly used sources for enriched representation of the biological data for an organism, since these are fundamental to almost all biological processes (Hakes, Lovell, Oliver, & Robertson, 2007). The PPI data have the nature of networks having proteins as nodes and pairwise interactions as links between the nodes. These networks are referred to as PINs. Such network representation provides a richer data format for protein function prediction compared to protein's sequence or structure data alone. The protein function prediction in terms of PIN can be seen as the process of understanding the proteins context in the PIN. This process can be augmented by taking into account the content and/or topology of the network and thus perform a more informed decision on the potential function(s) of a query protein. According to the approach being applied, PIN-based function prediction can be categorized as: (1) neighborhood-based (Hishigaki et al., 2001, McDermott et al., 2005, Schwikowski et al., 2000), where protein annotations are transferred from the most “dominant” annotations among the protein's neighbors; (2) global optimization-based (Deng et al., 2003, Nabieva et al., 2005), where the neighboring proteins may not contain enough information, so the functions of the query protein are inferred from the indirectly connected proteins; (3) clustering-based (Brandes et al., 2003, Dunn et al., 2005), where protein's functions are taken from the functional majority of the module (cluster) where the query protein belongs; and (4) association-based (Hu et al., 2005, Tan et al., 2005), similar to clustering approaches, but here functional modules are hypothesized from frequently occurring sets of interactions in PINs of protein complexes.

Clustering-based approaches are perhaps the most common approaches for global network analysis, since studying proteins in isolation is not enough for full understanding of the cell—(clusters of) interactions need to be delineated as well, since proteins work with other proteins to regulate and support each other for specific functions (Von Mering et al., 2002). Traditional graph-based clustering methods use different metrics to partition an unweighted PIN, but unfortunately they produce few giant core clusters with many tiny ones (Barabasi & Oltvai, 2004). This can be avoided by weighting the PIN according to its topological properties (Arnau et al., 2005, Friedel and Zimmer, 2006, Pereira-Leal et al., 2004, Rives and Galitski, 2003). Integration of additional PPI information in the calculation of the PIN weights is used in Luo et al. (2007), while more sophisticated weighting schemes combined with optimization techniques can be found in Bader and Hogue (2003), Enright, Van Dongen, and Ouzounis (2002), and King, Pržulj, and Jurisica (2004). Neighborhood density was used in Bader and Hogue (2003) to assign weights to nodes and then to find densely connected regions by outward traversal starting from a highly weighted protein. The numbers of intercluster and intracluster edges are used in King et al. (2004) to compute a cost function that is optimized by a local search algorithm. Random walks and Markov chains are combined in Enright et al. (2002) to simulate a flow on the graph during which currents in edges are strengthened or reduced which leads to partitioning in the graph. An insight view of the inherent organization of protein complexes can be provided by the core-attachment structure (Gavin et al., 2006). This characteristic has been successfully applied in the methods such as COACH (Wu, Li, Kwoh, & Ng, 2009) and CORE (Leung, Xiang, Yiu, & Chin, 2009) for detecting protein complexes from PINs.

Besides the topology-based weights, the edges of the PIN can be weighted using semantic information contained within its nodes. A protein in a PIN is annotated with one or more functional GO terms. These GO term sets can be used to infer semantic relationships between the proteins in a PIN. Most recent approaches to computational protein function prediction exploit the concept of semantic similarity measures to facilitate the decision-making process. PROCOMOSS (Mukhopadhyay, Ray, & De, 2012) uses a multiobjective evolutionary approach in which graphical properties as well as biological properties based on GO semantic similarity measure are considered as objective functions for detecting protein complexes in a PIN. CSO (Zhang et al., 2013) can effectively take advantage of the correlation between frequent GO annotation sets and the dense subgraph for protein complex prediction clustering. Modularity optimization algorithms are employed on a semantically enriched PIN in Trivodaliev, Bogojeska, and Kocarev (2014). Results presented in these papers are proof of the significance of the PIN contents for the clustering performance and for the function prediction.

Here, we present a novel clustering method for computational protein function prediction based on semantic homogeneity optimization in the PIN. We refer to this approach as semantic homogeneity optimization in PIN (SHOPIN). The aims of the proposed method are manifold. We give different graph representations of the PIN each having a different level of inclusion of the semantics of a protein and its interacting context in the PIN. We further model the network wide semantic relationship by mapping the previously defined graph representations into a new semantic metric space using random walks. These two steps are general and can be employed in any given clustering-based computational protein function prediction. In our approach, the clusters of the graph representing the PIN are produced by iteratively increasing the ratio of between clusters dissimilarities and within clusters similarities, and since we are clustering in the semantic metric space the produced result will be optimal in terms of semantic homogeneity of the final clusters. Once the final clusters are obtained, the protein function prediction is performed using cluster-based function enrichment. The performance of the SHOPIN method is evaluated in terms of its ability to correctly predict an annotation term set for a query protein. This evaluation is done using a leave-one-out method which iterates through all proteins of the PIN, having a single protein as a query protein per iteration. Each iteration predicts GO terms (protein functions) for the query protein and predictions are compared with the a priori knowledge about the query protein annotation GO term set.

Section snippets

Materials and Methods

In Fig. 1, we give the overview of the framework for our SHOPIN method consisting of four consecutive steps. As previously stated, each of these steps is independent and in this proposed architecture one can substitute any step with any other approach/algorithm. We first give information about the data used in our experiments, and then define the details used in our implementation of the framework.

Results and Discussion

In order to evaluate the performance of our method, we tested and compared it with several state-of-the-art clustering algorithms. The competing clustering algorithms we tested against are the most recently developed methods which are classified as having the best performances (Lancichinetti & Fortunato, 2009). The FC (Clauset, Newman, & Moore, 2004) and the BGLL (Blondel et al., 2008) algorithms infer the community structure of a network by greedy optimization of the modularity function (

Conclusions

High-throughput technologies produce vast amounts of useful data about proteins and their role in the complex mechanisms of the cell at an unprecedented rate, imposing as imperative the creation of cost-effective computational analysis of such data. PINs are paramount for the computational protein function prediction. This process of understanding the context of a protein in the PIN naturally exploits the graph properties of the PIN. We present a SHOPIN that exploits both the content and the

References (61)

T. Fawcett
An introduction to ROC analysis
Pattern Recognition Letters
(2006)
A.G. Murzin et al.
SCOP: A structural classification of proteins database for the investigation of sequences and structures
Journal of Molecular Biology
(1995)
C.A. Orengo et al.
CATH—A hierarchic classification of protein domain structures
Structure
(1997)
V. Arnau et al.
Iterative cluster analysis of protein interaction data
Bioinformatics
(2005)
M. Ashburner et al.
Gene ontology: Tool for the unification of biology
Nature Genetics
(2000)
G.D. Bader et al.
BIND—A data specification for storing and describing biomolecular interactions, molecular complexes and pathways
Bioinformatics
(2000)
G.D. Bader et al.
An automated method for finding molecular complexes in large protein interaction networks
BMC Bioinformatics
(2003)
A.L. Barabasi et al.
Network biology: Understanding the cell's functional organization
Nature Reviews. Genetics
(2004)
V.D. Blondel et al.
Fast unfolding of communities in large networks
Journal of Statistical Mechanics: Theory and Experiment
(2008)
U. Brandes et al.
Experiments on graph clustering algorithms
(2003)

B.J. Breitkreutz et al.

The GRID: The general repository for interaction datasets

Genome Biology

(2003)

N. Cesa-Bianchi et al.

Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference

Machine Learning

(2012)

A. Chatr-Aryamontri et al.

MINT: The molecular INTeraction database

Nucleic Acids Research

(2007)

A. Clauset et al.

Finding community structure in very large networks

Physical Review E

(2004)

M. Deng et al.

Prediction of protein function using protein–protein interaction data

Journal of Computational Biology

(2003)

R. Dunn et al.

The use of edge-betweenness clustering to investigate biological function in protein interaction networks

BMC Bioinformatics

(2005)

S.S. Dwight et al.

Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO)

Nucleic Acids Research

(2002)

A.J. Enright et al.

An efficient algorithm for large-scale detection of protein families

Nucleic Acids Research

(2002)

T.S. Evans et al.

Line graphs, link partitions, and overlapping communities

Physical Review E

(2009)

T.S. Evans et al.

Line graphs of weighted networks for overlapping communities

The European Physical Journal B

(2010)

C.C. Friedel et al.

Inferring topology from clustering coefficients in protein–protein interaction networks

BMC Bioinformatics

(2006)

A.C. Gavin et al.

Proteome survey reveals modularity of the yeast cell machinery

Nature

(2006)

Gene Ontology Consortium

The gene ontology (GO) project in 2006

Nucleic Acids Research

(2006)

U. Güldener et al.

MPact: The MIPS protein interaction resource on yeast

Nucleic Acids Research

(2006)

L. Hakes et al.

Specificity in protein interactions and its relationship with sequence diversity and coevolution

Proceedings of the National Academy of Sciences

(2007)

H. Hishigaki et al.

Assessment of prediction accuracy of protein function from protein–protein interaction data

Yeast

(2001)

Y. Ho et al.

Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry

Nature

(2002)

H. Hu et al.

Mining coherent dense subgraphs across massive biological networks for functional discovery

Bioinformatics

(2005)

T. Ito et al.

A comprehensive two-hybrid analysis to explore the yeast protein interactome

Proceedings of the National Academy of Sciences

(2001)

I. Ivanoska et al.

Protein function prediction using semantic driven K-medoids clustering algorithm

International Journal of Machine Learning and Computing

(2014)

Cited by (2)

Comparative analysis of network embeddings for functional annotation in protein interaction networks
2020, 2020 43rd International Convention on Information, Communication and Electronic Technology, MIPRO 2020 - Proceedings
Deep Learning the Protein Function in Protein Interaction Networks
2018, Communications in Computer and Information Science

View full text

Chapter Nine - SHOPIN: Semantic Homogeneity Optimization in Protein Interaction Networks

Abstract

Introduction

Section snippets

Materials and Methods

Results and Discussion

Conclusions

Pattern Recognition Letters

Journal of Molecular Biology

Structure

Iterative cluster analysis of protein interaction data

Bioinformatics

Gene ontology: Tool for the unification of biology

Nature Genetics

BIND—A data specification for storing and describing biomolecular interactions, molecular complexes and pathways

Bioinformatics

An automated method for finding molecular complexes in large protein interaction networks

BMC Bioinformatics

Network biology: Understanding the cell's functional organization

Nature Reviews. Genetics

Fast unfolding of communities in large networks

Journal of Statistical Mechanics: Theory and Experiment

Experiments on graph clustering algorithms

The GRID: The general repository for interaction datasets

Genome Biology

Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference

Machine Learning

MINT: The molecular INTeraction database

Nucleic Acids Research

Finding community structure in very large networks

Physical Review E

Prediction of protein function using protein–protein interaction data

Journal of Computational Biology

The use of edge-betweenness clustering to investigate biological function in protein interaction networks

BMC Bioinformatics

Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO)

Nucleic Acids Research

An efficient algorithm for large-scale detection of protein families

Nucleic Acids Research

Line graphs, link partitions, and overlapping communities

Physical Review E

Line graphs of weighted networks for overlapping communities

The European Physical Journal B

Inferring topology from clustering coefficients in protein–protein interaction networks

BMC Bioinformatics

Proteome survey reveals modularity of the yeast cell machinery

Nature

The gene ontology (GO) project in 2006

Nucleic Acids Research

MPact: The MIPS protein interaction resource on yeast

Nucleic Acids Research

Specificity in protein interactions and its relationship with sequence diversity and coevolution

Proceedings of the National Academy of Sciences

Assessment of prediction accuracy of protein function from protein–protein interaction data

Yeast

Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry

Nature

Mining coherent dense subgraphs across massive biological networks for functional discovery

Bioinformatics

A comprehensive two-hybrid analysis to explore the yeast protein interactome

Proceedings of the National Academy of Sciences

Protein function prediction using semantic driven K-medoids clustering algorithm

International Journal of Machine Learning and Computing