A sequential Monte Carlo approach to gene expression deconvolution

Oyetunji E. Ogundijo; Xiaodong Wang

doi:10.1371/journal.pone.0186167

Abstract

High-throughput gene expression data are often obtained from pure or complex (heterogeneous) biological samples. In the latter case, data obtained are a mixture of different cell types and the heterogeneity imposes some difficulties in the analysis of such data. In order to make conclusions on gene expresssion data obtained from heterogeneous samples, methods such as microdissection and flow cytometry have been employed to physically separate the constituting cell types. However, these manual approaches are time consuming when measuring the responses of multiple cell types simultaneously. In addition, exposed samples, on many occasions, end up being contaminated with external perturbations and this may result in an altered yield of molecular content. In this paper, we model the heterogeneous gene expression data using a Bayesian framework, treating the cell type proportions and the cell-type specific expressions as the parameters of the model. Specifically, we present a novel sequential Monte Carlo (SMC) sampler for estimating the model parameters by approximating their posterior distributions with a set of weighted samples. The SMC framework is a robust and efficient approach where we construct a sequence of artificial target (posterior) distributions on spaces of increasing dimensions which admit the distributions of interest as marginals. The proposed algorithm is evaluated on simulated datasets and publicly available real datasets, including Affymetrix oligonucleotide arrays and national center for biotechnology information (NCBI) gene expression omnibus (GEO), with varying number of cell types. The results obtained on all datasets show a superior performance with an improved accuracy in the estimation of cell type proportions and the cell-type specific expressions, and in addition, more accurate identification of differentially expressed genes when compared to other widely known methods for blind decomposition of heterogeneous gene expression data such as Dsection and the nonnegative matrix factorization (NMF) algorithms. MATLAB implementation of the proposed SMC algorithm is available to download at https://github.com/moyanre/smcgenedeconv.git.

Citation: Ogundijo OE, Wang X (2017) A sequential Monte Carlo approach to gene expression deconvolution. PLoS ONE 12(10): e0186167. https://doi.org/10.1371/journal.pone.0186167

Editor: Lars Kaderali, Universitatsmedizin Greifswald, GERMANY

Received: February 26, 2017; Accepted: September 26, 2017; Published: October 19, 2017

Copyright: © 2017 Ogundijo, Wang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The Affymetrix dataset is available for download from: https://github.com/moyanre/smcgenedeconv/blob/master/Affymetrix_dataset.mat. The GEO NCBI datasets with GSE11058 and GSE19830 are available for download from: https://www.ncbi.nlm.nih.gov/geo/.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Gene expression measurement technologies, for example, deoxyribonucleic acid (DNA) microarray, have made it possible to conduct simultaneous expression measurements from thousands of genes on a genome-wide scale [1–4]. Gene expression data obtained from pure samples, comprising of a single cell type, can be analyzed to yield a significant amount of information. For instance, measuring gene expression levels in different conditions may prove useful in medical diagnosis, treatment prescription, drug design [5, 6] and most importantly in the identification of genes that are differentially expressed between groups of samples [7], such as tumor versus non-tumor tissues [8].

However, in heterogeneous samples, where more than one cell types are present, drawing any reasonable conclusion is a difficult task because each of the cell types in the sample will contribute differently to the measured expression of a given gene [9]. In some cases, manual methods such as laser microdissection (LMD) [10] and flow cytometry [11] are employed to isolate cells of interest from the complex mixtures. In spite of that, there are some limitations in using these techniques. For instance, they are very expensive and often come with low cell throughput rate [12–14], resulting in a drastic reduction in the yield of biological contents.

In the literature, different computational methods have been proposed for the deconvolution of gene expression data from heterogeneous biological samples, and these methods can be loosely grouped into two categories: either deterministic or probabilistic. Of the two, the deterministic approach is more popular. For instance, in addition to the gene expression data, if the information about the cell-type specific gene expression profiles is available, proportions of cellular types can be estimated [15], for example, via linear regression [16–18], a very common technique for analyzing biological data [19]. On the other hand, if in addition to the gene expression data, cellular proportions are known, then with linear regression, cell-type specific gene expression profiles can be estimated [7, 20, 21]. Further, [22–24] investigated the efficacy of the nonnegative matrix factorization (NMF) algorithms [25, 26] for the “blind” deconvolution of gene expression data in the presence of additional constraints, for example, some prior biological knowledge [22, 23]. Moreover, [27] proposed a probabilistic approach based on the Markov chain Monte Carlo (MCMC) method, assuming an availability of a good initial estimate of the cell type proportions. All the approaches mentioned so far, either deterministic or probabilistic, made one or more assumptions about the availability, either precise or a rough estimate, of the cell type proportions or the cell-type specific profiles. But in reality, often times, all we have is the heterogeneous gene expression data.

In this paper, we propose a new probabilistic method, sequential Monte Carlo (SMC) sampler [28–31] for static models to estimate the cell type proportions and the cell-type specific expression profiles, given the heterogeneous gene expression data. Specifically, we model the heterogeneous gene expression data using a Bayesian framework where the cell-type specific expression profiles and the cell type proportions are the unknown model parameters. We seek to approximate, in an efficient way, the posterior distributions of all the unknown model parameters by a set of weighted samples (particles) from which their respective point estimates can be obtained. Bayesian inference is an important area in the analyses of biological data [32, 33] as it provides a complete picture of the uncertainty in the estimation of the unknown parameters of a model given the data and the prior distributions for all the unknown model parameters.

In particular, the SMC method is a class of sampling algorithms which combines importance sampling and resampling [34, 35]. More importantly, the SMC framework for static models is very similar to the sequential importance sampling (resampling) (SIS) procedure for dynamic models [34], the only difference being the framework under which the samples are propagated and this results in differences in the calculation of the weights of the samples. In general, SMC allows us to treat, in a principled way, any type of probability distribution, nonlinearity and non-stationarity [36, 37]. It is easy to implement and applicable to very general settings. As noted in [28], SMC algorithms address some of the major shortcomings of the MCMC-based algorithms: (i) diagnosing convergence of a Markov chain (ii) requirement of burn-in period, and (iii) MCMC algorithms getting trapped in local modes if the target distribution is highly multi-modal. In addition, in big data analyses, unlike the MCMC approach, SMC algorithms can be parallelized to reduce the computational time [28].

We compared the proposed SMC method with existing methods, including Dsection algorithm in [27] that is based on the MCMC approach and the recently proposed probabilistic nonnegative matrix factorization (PNMF) algorithm [38], a stochastic version of the deterministic NMF framework that takes into account the stochastic nature of the gene expression data. Overall, in terms of the accuracy of estimates of cell type proportions, cell-type specific gene expressions, and in addition, in the identification of differentially expressed genes, the proposed method demonstrated a superior performance. More importantly, the proposed method does not require that we have an initial estimate of the cell type proportions or the cell-type specific expression profiles.

The remainder of this paper is organized as follows. In Section 2, we present the Materials and Methods. In Section 3, we investigate the performance of the proposed method using simulated datasets artificially obtained from downloaded pure tissues expression profiles and heterogeneous (impure) samples downloaded from Affymetrix oligonucleotide arrays and GEO NCBI websites, the set of data that have been employed to assess the performance of deconvolution algorithms. Finally, Section 4 concludes the paper.

In this paper, we use the following notations:

p(⋅) and p(⋅|⋅) denote a probability and a conditional probability density functions, respectively.
denotes the Gaussian probability density function with mean μ, precision λ and variance λ⁻¹.
Gamma(α, β) denotes the Gamma probability density function with shape parameter α and rate parameter β.
denotes a uniform distribution with support x ∈ [a, b].
x and x^T denote a column vector and its transpose, respectively.
X and denote a matrix and its estimate, respectively.

Materials and methods

Let Y be an I × J gene expression matrix obtained from tissue samples with heterogeneous population, where I denotes the number of probes (or genes) in the measurements and J denotes the total number of samples present. We assume that the number of cell types, K, in the samples is known and each sample has the same number of cell types present, but in varying percentages. Although, modeling the relationship between the expression value of pure and mixed samples is not strictly linear, linearity has proved to be a reasonable and valid assumption in gene expression deconvolution [7, 16, 27, 39]. As such, we follow the linear modeling approach in analyzing the tissue samples. Denoting the indices of cell type, tissue sample and gene by k, j and i, respectively, then the expression value of gene i in sample j is the sum of its expressions in all K cell types, i.e., (1) where x_ik denotes the specific expression of gene i in cell type k, m_kj denotes the proportion of cell type k in sample j and e_ij is an additive Gaussian distributed noise with zero mean and precision λ (inverse of variance). Instead of one gene at a time, if all the genes are considered at once, then (1) can be written in a matrix form as follows: (2) where Y denotes the I × J matrix of gene expression measurement from heterogeneous samples, X denotes the unknown I × K matrix of expression levels of the genes in all the cell types (pure cell type expression signatures), M denotes the unknown K × J matrix of cell type proportions and E is the additive noise matrix of dimension I × J. Note that all elements of M are non-negative and each column sums to 1.

The goal of the inference is to obtain an estimate of the unknown matrices X and M, which are the cell-type specific signatures and the cellular proportions, respectively and in addition, an estimate of the precision λ, given the heterogeneous gene expression matrix Y. To do this, we define a data generating model, impose prior distributions on all the unknown model parameter, derive the sequence of target distributions for all the model parameters and finally, present the SMC algorithm that estimates, in an efficient manner, the posterior distributions of all the unknown model parameters.

Likelihood function

As shown in (1), the data point for probe i in sample j i.e., y_ij, is modeled as a sum of the cell-type specific expressions of probe i for all cell types, i.e. the i^th row of matrix X, denoted by x_i,:, weighted by the proportions of all cell types in sample j, i.e., the j^th column of matrix M, denoted by m_:,j plus an additive Gaussian distributed noise, e_ij i.e., (3) Further, if we assume independent and identically distributed (IID) measurements for the data points in matrix Y, then the joint data likelihood function can be written as: (4) where θ = {λ, x_ik, m_kj: i = 1, …, I, j = 1, …, J, k = 1, …, K} are the unknown parameters of the model that will be estimated.

Prior densities for all model parameters

Here, we present the prior distributions for all the unknown parameters in the model in (4). With the prior distributions accurately specified and with the model in (4), we can obtain the sequence of target distributions for all the unknown model parameters.

Prior densities for the cell-type specific expressions.

We model the specific expression of gene i in cell type k, x_ik with a Gaussian distribution, i.e., , where μ_ik and ν_ik are the mean and precision, respectively, and are assumed known [27, 38]. Gaussian distribution is preferred so as to make use of the property of conjugate priors, i.e., the sequence of target distributions will remain Gaussian given that the prior and the likelihood distributions are Gaussian [40]. Detailed derivations of the sequence of target distributions and the choice of μ_ik and ν_ik are discussed in S1 Supplementary Material.

Prior densities for the cell type proportions.

We impose a Gaussian distribution on the proportion of cell type k in sample j, m_kj i.e, , where μ_kj and ν_kj are the mean and precision, respectively, and are assumed known [38]. Although, other distributions can be considered, surprisingly, Gaussian distribution performs well in our experiments. Detailed derivations of the sequence of target distributions and the how μ_kj and ν_kj are picked are discussed in S1 Supplementary Material.

Prior density for the precision.

Gamma prior is placed on the inverse of the noise variance (precision), i.e, λ ∼ Gamma(α, β), with α and β assumed known. The choice of Gamma prior distribution ensures that the sequence of target distributions for the precision parameter will be Gamma distributions (conjugate prior property), given that the likelihood is a Gaussian distribution [40]. Detailed derivations of the sequence of target distributions and the choice of α and β are discussed in S1 Supplementary Material.

Sequential Monte Carlo samplers for Bayesian inference

General principle of SMC samplers.

Before we introduce the SMC sampler algorithm for gene expression decomposition, we will succinctly describe the general principle of SMC samplers in Bayesian inference settings [28–30]. Denote the prior distribution, the likelihood function and the posterior distribution in a Bayesian inference setup as p(θ), p(Y|θ) and p(θ|Y), respectively. Using the Bayes rule, the posterior distribution can be written as a function of the prior distribution and the likelihood function as follows: (5) where Z = ∫_Θ p(θ)p(Y|θ)dθ, a constant with respect to θ, is referred to as the evidence. With SMC samplers, rather than sampling from the posterior distribution p(θ|Y) in (5), a sequence of intermediate target distributions, , are designed, that transitions smoothly from the prior distribution, i.e., π₁ = p(θ), which is usually easier to sample from, and gradually introduce the effect of the likelihood so that in the end, we have π_T = p(θ|Y) which is the posterior distribution of interest [28, 29]. For such sequence of intermediate distributions, a natural choice is the likelihood tempered target sequence [28, 41]: (6) where is a non-decreasing temperature schedule with ϵ₁ = 0 and ϵ_T = 1, is the unnormalized target distribution and is the evidence at time t.

Next, we transform this problem in the standard SMC filtering framework [34, 35] by defining a sequence of joint target distributions up to and including time t, which admits π_t as marginals as follows: (7) where the artificial kernels are referred to as the backward Markov kernels, i.e., denotes the probability density of moving back from θ_t+1 to θ_t [28, 29, 42]. However, it is often difficult to sample directly from the joint target distribution in (7). Instead, samples are obtained from another distribution, known as the importance distribution, with a support that includes the support of [34]. Thus, we define the importance distribution at time t, q_t(θ_1:t) as follows: (8) where are the Markov transition kernels or forward kernels, i.e., denotes the probability density of moving from θ_t−1 to θ_t [28, 29].

Given that at time t − 1, we desire to obtain N random samples from the target distribution in (7), but as discussed earlier, it is difficult to sample from the target distribution and instead, we obtain the samples from the importance distribution in (8). Following the principle of importance sampling, we then correct for the discrepancy between the target and the importance distributions by calculating the importance weights [34]. The unnormalized weights associated with the N samples are obtained as follows: (9) and the normalized weights are calculated as: As such, the set of weighted samples approximates the joint target distribution . To obtain an approximation to the joint target distribution at time t, i.e, , the samples are first propagated to the next target distribution using a forward Markov kernel to obtain the set of particles . Similar to (9), we then correct for the discrepancy between the importance distribution and the target distribution at time t. Thus, the unnormalized weights at time t are calculated as follows: (10) from (9), we have from the definitions of π_t and π_t−1 in (6) and noticing that Z_t and Z_t−1 are constants with respect to and , then where are the unnormalized weights at time t − 1, given in (9) and , the unnormalized incremental weights, calculated as (11)

Resampling procedure.

In the SMC procedure described above, after some iterations, all samples except one will have very small weights, a phenomenon referred to as degeneracy in the literature. It is unavoidable as it has been shown that the variance of the importance weights increases over time [34]. An adaptive way to check this is by computing the effective sample size (ESS) as follows: [43]. To avoid degeneracy, one performs resampling when the ESS is significantly less than the number of samples, discarding the ineffective samples and then multiply the effective ones [37, 44]. In all our experiments, we performed resampling when the ESS is less than N/10 [45]. The resampling procedure is briefly summarized as follows:

Interpret each weight as the probability of obtaining the sample index n in the set .
Draw N samples from the discrete probability distribution and replace the old sample set with this new one.
Set all weights to the constant value .

Target distributions, forward and backward kernels specification for gene expression deconvolution.

In (6)–(8), we need to specify the exact form of the sequence of target distributions , the forward kernels, and the backward kernels for the problem of gene expression deconvolution.

Sequence of target distributions and forward kernels: As earlier discussed, we are interested in the likelihood tempered target sequence in (6). Here, we present the sequence of target distributions for all the parameters in the model presented in (4). Details of the derivations are in S1 Supplementary Material. Define then the sequence of target distributions for the cell type proportions are: (12) the sequence of target distributions for the cell-type specific expressions are given as: (13) and finally, the sequence of target distributions for the precision are given as: (14) The optimal forward Markov kernel, in the sense of minimizing the variance of the importance weights is [28, 29]. In general, if π_t is not available in closed form (non-conjugate priors), then an MCMC kernel of invariant distribution π_t will be used for (Metropolis-Hastings MCMC). Fortunately, in our model, we are able to compute the sequence analytically as shown in (12)–(14).
Sequence of backward kernels: In order to obtain a good performance, the backward kernel is optimized with respect to the forward kernel as this choice will affect the variance of the importance weights. Hence, the following is employed [28, 30]: (15) since it generally represents a good approximation of the optimal backward kernel when the discrepancy between π_t and π_t−1 is small [29, 31]. Thus, the unnormalized incremental weights in (11) become: (16) where ϵ_t − ϵ_t−1 is the step length of the cooling schedule of the likelihood at time t. The derivation of the exact analytical expression in (16) for the gene expression deconvolution problem is presented in S1 Supplementary Material.

Finally, since the unnormalized incremental weights in (16) at time t does not depend on the particle values at time t but just on the previous particle set, the particles should be sampled after the weights have been computed and after the particle approximation has possibly been resampled [28].

SMC sampler algorithm for gene expression deconvolution

1. Input: Heterogeneous gene expression matrix Y, α, β, {μ_kj, ν_kj: k = 1, …, K, j = 1, …, J}, {μ_ik, ν_ik: i = 1, …, I, k = 1, …, K}, and the temperature schedule 0 = ϵ₁ < ϵ₂…<ϵ_T = 1 (See the S1 Supplementary Material for the initial values).

2. Set t = 1

for n = 1: N

Take a sample from Gamma(α, β).

for k = 1: K

for j = 1: J

Take a sample from .

end

for i = 1: I

for k = 1: K

Take a sample from .

end

Set .

3. for t = 2: T repeat the following steps:

• Compute the unnormalized weights as follows using (16):

• Normalization of the weights:

• Compute and resample if ESS < N/10.

• Propagation of particles:

for n = 1: N

Take a sample from π_t(λ|⋅) in (14).

for k = 1: K

for j = 1: J

Take a sample from π_t(m_kj|⋅) in (12).

end

for i = 1: I

for k = 1: K

Take a sample from π_t(x_ik|⋅) in (13).

end

4. Compute the estimate of the parameters as follows: (17) then the estimates of the cell type proportions matrix , cell-type specific expression matrix and the precision are obtained from for further analyses (Note that each column of is re-scaled to sum to unity).

Results

Ground-truth for variables

We assessed the performance of the proposed method, which we will refer to as the SMC method, on both simulated dataset and datasets that contain real mixed samples. For ease of exposition, denote , where matrix Y_total is the downloaded matrix of pure and mixed gene expressions, matrix Y is the gene expression for the heterogeneous/mixed samples and is the gene expression matrix for the pure samples (the expression profile of each sample often come in multiplicity, e.g., technical replicates). First, we compared the estimates of the cell types proportion and the cell-type specific expression matrices with some existing methods and secondly, we went further to test the ability of the proposed method to identify differentially expressed genes. Next, we present the “ground-truth” for all the unknown variables in our analyses. Unless otherwise stated, all the datasets used in the analyses are not log transformed.

Ground-truth for the cell types proportions and the cell-type specific expression profiles (matrices M and X).

For all datasets, “ground-truth” is available for the cell type proportions matrix M. For the pure cell-type expression signatures, matrix X, “ground-truth” is computed from the matrix , the gene expression for the pure samples. Denote , where , is the gene expression matrix that contains replicate samples from pure cell type k, then, x_ik is computed as the mean of row i in matrix , that is, the mean expression for gene i across samples that contain only cell type k.

List of differentially and non-differentially expressed genes.

We produced the “ground-truth” for the list of differentially expressed and non-differentially expressed genes from the “ground-truth” for the cell-type expression signatures, matrix X, using the fold change rule (Although, the median fold change proposed in [46] is theoretically a slightly better alternative to the mean fold change, empirical results from both method are similar for all our datasets. More so, mean fold change is better suited to our purpose because in the end, we estimate the mean expression for each cell type [47]). For gene i, the fold change between cell types r and u is defined as: FC_i = max(x_ir, x_iu)/min(x_ir, x_iu), where x_ir and x_iu are the specific expressions of gene i in cell types r and u, r, u ∈ {1, …, K} [46–48]. Thus, given the specific expressions of gene i in cell types r, u ∈ {1, …, K}, if FC_i > 2, gene i is said to be differentially expressed in the two cell types, otherwise no difference in expressions [49].

Cell types mapping and marker probesets.

Estimates of the cell-type specific expression profiles obtained from any blind decomposition algorithm require mapping to the correct cell types [22]. As such, marker probesets are often employed to perform the mapping of the estimated profiles to the true cell types. However, gene expression data are generated with different technologies (microarrays and RNA-seq) using equipment from different manufacturers (e.g. Affymetrix, Illumina etc.). To avoid discrepancies that may arise in using probeset marker lists from another source due to probe annotation [50, 51], we defined the list of marker probesets used in our experiments from the gene expression measurements of pure cell types/tissues samples, i.e. matrix and matrix X, following the procedures highlighted in [22]. Details of how the marker probesets are defined and the mapping of the estimated profiles to the true cell types are discussed in S1 Supplementary Material.

Metrics for comparing results.

Notice that the mapping of estimated cell-type profiles to the true cell types also rearranges the rows of the estimated proportions, matrix . Now, to compare the estimated variables with the true values, we compared the average mean absolute difference for the simulated datasets and then calculated the Pearson correlation coefficient (r) between the true value and the estimated value for the real data.

In addition, we tested if the proposed SMC method can identify differentially expressed genes between cell types. Given the “ground-truth” for the truly differentially and non-differentially expressed genes, we computed, for each probeset, the expression fold change between the columns of the estimated cell-type gene expression profiles, matrix . Specifically, between any two columns of matrix and for each probeset (and if cell type 1 is upregulated when compared to cell type 2 or vice-versa, separately), we computed the following by varying the fold change threshold from 1 to 5 in step of 0.25: true positives (TP), the number of correctly identified probes that are truly differentially expressed; false positives (FP), the number of non-differentially expressed probes but incorrectly identified as differentially expressed genes; false negatives (FN), the number of truly differentially expressed genes but incorrectly identified as non-differentially expressed probes, and true negatives (TN), the number of correctly identified non-differentially expressed probes. Further, we computed the sensitivity or true positive rate (TPR) = TP/(TP+FN) and the false positive rate (FPR), also defined as 1− specificity = FP/(FP+TN). With the TPR and the FPR for the different threshold values, we generated the receiver operating characteristic curves (ROC) for all pairs of cell types. Area under the ROC (AUROC) is obtained for each plot. High value of AUROC (maximum is 1) indicates that the deconvolution method is specific and sensitive in identifying differentially expressed probeset.

In addition, to compare our method with other existing gene expression deconvolution methods that require same set of input data, we analyzed the datasets with two other methods: another sampling algorithm developed by [27] which we will refer to as the MCMC method and a recently developed probabilistic version of NMF [38] which we will refer to as the PNMF method. Although, the MCMC method assumes that a rough estimate of the mixing proportions might be available, in some cases, in addition to the gene expression data, we initialized all methods with equal cell type proportion in order to produce a fair comparison of the results. Also, for the NMF method, cell-type specific gene expression profiles, matrix X is initialized by drawing its entries from a uniform distribution .

Simulated dataset

To test the proposed algorithm on simulated data, we created heterogeneous gene expression datasets with varying number of samples from pure tissue samples. Specifically, we downloaded the gene expression measurements (tissue specific gene expression data) from the publicly available dataset series GSE1133, from the GEO website [52] for human lung, heart and liver. Data preprocessing, that is, background adjustment, normalization, and summarization were done with robust multi-array average (RMA) procedure [53]. For the cell type proportion matrix M, each column of the matrix is generated from a Dirichlet distribution. Heterogeneous gene expression measurement is then created by multiplying the tissue specific gene expression profiles, matrix X by the simulated cell type proportions, matrix M. Finally, normally distributed noise with mean zero and variance that is equal to the global variance in gene expression between duplicate samples in GSE1133, is added. Then, we created heterogeneous gene expression data, matrix Y that comprises of 10, 15, 20, 25, 30, 35 and 40 samples, respectively.

With each sample size, we made 25 experimental runs with each of the proposed SMC algorithm, MCMC method and the PNMF method. For each of the methods and a sample size, we record the mean absolute difference (MAD) between the true cell type proportions and the estimated cell type proportions after each experimental run and average MAD was computed after 25 runs. The results for the average and the standard deviation of MAD for the three methods and all the sample sizes are presented in Figs 1 and 2. In addition, for each sample size, we took the average of the estimated standard deviations over the 25 experimental runs. For each sample size, we showed, in Fig 3, a scatter plot of the standard deviations for the SMC and the MCMC methods (PNMF algorithm returned only the maximum a posteriori (MAP) estimates). Overall, the proposed SMC method outperforms its two other counterparts across all the sample sizes, in terms of the accuracy of the estimates. In addition, it can be seen that as the number of sample sizes goes up, estimates of model parameters also improve.

Download:

Fig 1. Plot of average MAD for different sample size.

Plot of average MAD calculated from varying the sample size for all the methods (simulated datasets).

https://doi.org/10.1371/journal.pone.0186167.g001

Download:

Fig 2. Plot of standard deviation of MAD.

Plot of standard deviation of MAD for all the methods (simulated datasets).

https://doi.org/10.1371/journal.pone.0186167.g002

Download:

Fig 3. Plot of standard deviation of parameter estimates.

Standard deviation of the estimates obtained from the proposed SMC and MCMC methods.

https://doi.org/10.1371/journal.pone.0186167.g003

Moreover, we investigated how much the results obtained from the proposed SMC algorithm depends on the choice of the prior distributions. Specifically, we considered a Dirichlet distribution for modeling each column of the cell type proportions (non-conjugate prior), matrix M. With this choice of prior distribution, the sequence of target distributions π_t for the mixture proportions are no more in closed form as we have in (12). Thus, to propagate the particles after the resampling procedure in the proposed SMC algorithm, we employed an Metropolis-Hastings MCMC kernel of invariant distribution π_t [28]. For each particle, we ran 10 chains and the last iteration is chosen as the propagated particle. On the GSE1133 dataset with 10 samples and 500 randomly chosen genes, the results obtained for the conjugate and the non-conjugate prior distributions (Dirichlet distributions) are shown in Table 1. Particularly, we recorded the correlation coefficient (r) and the runtime for the two cases on a 3.5 Ghz Intel 8 processors running MATLAB. From Table 1, the two cases yielded similar results in terms of the accuracy of the estimates, but the algorithm implemented with the non-conjugate priors is slower than its counterpart with conjugate priors. This is due to the fact that the MCMC kernel used in propagating the particles ran multiple iterations for each particle, and the similarity in the results is because the MCMC kernel used has an invariant distribution π_t, where the particles are sampled from.

Download:

Table 1. Effect of the choice of priors for the proposed SMC algorithm.

https://doi.org/10.1371/journal.pone.0186167.t001

Lastly, on the same dataset, we performed experiments with the MCMC method and the PNMF algorithm. In particular, the MCMC was run with chain length of 40,000, with the initial 20000 as burn-in and a thinning interval of 20. The results are shown in Table 2

Download:

Table 2. Runtime of different methods on the same dataset.

https://doi.org/10.1371/journal.pone.0186167.t002

Affymetrix dataset: 2 cell types

Next, we evaluated the performance of the proposed SMC algorithm on a tissue mixture oligonucleotide microarray probe-level dataset from Affymetrix previously analyzed by [27]. Data preprocessing were done by the RMA procedure [53]. This dataset, Y_total, consists of heterogeneous expressions from human brain and heart cells. There are 33 samples and each sample comprises of specific proportions of the two distinct cell types. The true mixture proportions are shown in Table A in S1 Supplementary Material where the samples are designated S1,…,S33 for sample 1,…,sample 33, respectively. Samples S1—S3 and S31—S33, samples from the pure cell types, constitute the matrix , for approximating the “ground-truths” for the cell-type expression profiles (matrix X), marker probesets and the list of truly differentially expressed and non-differentially expressed genes. Samples S4—S30 constitute the heterogeneous gene expression matrix Y that was analyzed.

First, we analyzed the heterogeneous gene expression matrix Y with the SMC method and the plot of the estimated proportions, matrix versus the true proportions, matrix M is shown in Fig 4 with the Pearson correlation coefficient, r = 0.99. In Table 3, we record the correlation between the true and the estimated cell-type specific expression profiles for all the cell types. Further, we test the power of the SMC method to detect truly differentially expressed and non-differentially expressed genes between cell types. Figs 5 and 6 show the ROCs generated with the SMC method and the AUROC for each plot is recorded in Table 3. Moreover, we analyzed the same dataset with the MCMC method and the PNMF algorithms and the results are presented in Figs 7 and 8, and in Table 3. The results obtained and presented in Table 3 show that the proposed SMC method accurately estimates cell type proportions, cell-type specific expressions and in fact, more specific in identifying the differentially expressed genes when compared to the other two methods.

Download:

Fig 4. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the proposed SMC method (affymetrix dataset).

https://doi.org/10.1371/journal.pone.0186167.g004

Download:

Fig 5. Brain > Heart.

ROC plot obtained from the proposed SMC method for brain vs. heart cell types, brain upregulated (affymetrix dataset).

https://doi.org/10.1371/journal.pone.0186167.g005

Download:

Fig 6. Heart > Brain.

ROC plot obtained from the proposed SMC method for brain vs. heart cell types, heart upregulated (affymetrix dataset).

https://doi.org/10.1371/journal.pone.0186167.g006

Download:

Fig 7. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the MCMC method (affymetrix dataset).

https://doi.org/10.1371/journal.pone.0186167.g007

Download:

Fig 8. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the PNMF method (affymetrix dataset).

https://doi.org/10.1371/journal.pone.0186167.g008

Download:

Table 3. Pearson correlation coefficient (r) and AUROC for the affymetrix dataset (AUROC in columns 3 and 4).

https://doi.org/10.1371/journal.pone.0186167.t003

GEO series GSE19830 dataset: 3 cell types

In the mixture experiment by [7], tissue samples from the liver, brain and lung of a single rat were analyzed using Affymetrix expression arrays. Biospecimens from the three different tissues were mixed in different proportions (mixture proportion of each sample is shown in Table B in S1 Supplementary Material). The data consists of 11 different mixtures, each mixture with 3 technical replicates. In addition, there are 9 samples for the pure tissues (S1—S9), 3 technical replicates for each pure tissue type. We downloaded the dataset from the NCBI GEO website and performed data preprocessing with the RMA.

We analyzed the heterogeneous gene expression matrix with the SMC method and the plot of the estimated proportions, matrix versus the true proportions, matrix M is shown in Fig 9 with the Pearson correlation coefficient, r = 0.99 (similar results are obtained for the MCMC and the PNMF methods in Figs 10 and 11, respectively). In addition, we record the correlation between the true and the estimated cell-type specific expression profiles in Table 4. Next, on this dataset, we test the power of the SMC method to detect truly differentially expressed and non-differentially expressed genes between cell types. Figs 12, 13 and 14 (and Fig A in S1 Supplementary Material) show the ROCs generated with the SMC method and the AUROC for each plot is recorded in Table 5. Moreover, we analyzed same dataset with the MCMC method and the PNMF algorithm and the results for the correlations and AUROC are presented in Tables 4 and 5, respectively. The results obtained show that the proposed SMC method accurately estimates cell type proportions, cell-type specific expressions and in fact, more specific in identifying the differentially expressed and non-differentially expressed genes when compared to the two other methods.

Download:

Fig 9. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the proposed SMC method (GSE19830 dataset).

https://doi.org/10.1371/journal.pone.0186167.g009

Download:

Fig 10. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the MCMC method (GSE19830 dataset).

https://doi.org/10.1371/journal.pone.0186167.g010

Download:

Fig 11. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the PNMF method (GSE19830 dataset).

https://doi.org/10.1371/journal.pone.0186167.g011

Download:

Fig 12. Liver > Brain.

ROC plot obtained from the proposed SMC method for liver vs. brain cell types, liver upregulated (GSE19830 dataset).

https://doi.org/10.1371/journal.pone.0186167.g012

Download:

Fig 13. Liver > Lung.

ROC plot obtained from the proposed SMC method for liver vs. lung cell types, liver upregulated (GSE19830 dataset).

https://doi.org/10.1371/journal.pone.0186167.g013

Download:

Fig 14. Brain > Lung.

ROC plot obtained from the proposed SMC method for brain vs. lung cell types, brain upregulated (GSE19830 dataset).

https://doi.org/10.1371/journal.pone.0186167.g014

Download:

Table 4. Pearson correlation coefficient (r) for the GSE19830 dataset.

https://doi.org/10.1371/journal.pone.0186167.t004

Download:

Table 5. AUROC for the GSE19830 dataset.

https://doi.org/10.1371/journal.pone.0186167.t005

GEO series GSE11058 dataset: 4 cell types

In the real mixtures with 2 and 3 cell types, expression differences between different cell types are relatively higher compared to the expression differences between cell types within a tissue sample. Hence, we tested the proposed algorithm on real tissue samples that are composed of cell types with gene expression profiles that are more similar to each other. Specifically, we analyzed a publicly available dataset from the GEO series GSE11058, downloaded from the NCBI GEO [54] and data preprocessing was done by RMA. Each heterogeneous sample in the data comprises of 4 different cell lines of immune origin, namely: Jurkat (J), IM-9 (I), Raji (R) and THP-1 (T). In total, there are 24 samples in the dataset, that is, triplicates of each pure cell type and four different mixtures for which the relative proportions of each cell type are known, as shown in Table C in S1 Supplementary Material where samples are designated S1,…,S24 for sample 1,…,sample 24, respectively. The first 12 samples, samples from pure cell types constitute the matrix , which is used for approximating the “ground-truths” for the cell-type expression profiles (matrix X), marker probesets and the list of truly differentially expressed and non-differentially expressed genes.

Samples S13—S24 constitute the heterogeneous gene expression matrix Y that we analyzed with the proposed SMC method, the MCMC method and the PNMF method. Figs 15, 16 and 17 and Table 6 show the correlation values obtained between the estimated cellular proportions and the true proportions, and then the estimated cell-type specific expression profiles and the true expression profiles. In addition, AUROC for all methods is shown in Table 7 and the ROC plots obtained for the proposed SMC method are shown in Figs 18, 19 and 20 and in Figs B and C in S1 Supplementary Material. Again, the SMC method outperformed the MCMC method and the PNMF method in terms of the accuracy of the cellular proportions estimates and the cell-type specific expression estimates, and finally, in identifying differentially and non-differentially expressed genes.

Download:

Fig 15. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the proposed SMC method (GSE11058 dataset).

https://doi.org/10.1371/journal.pone.0186167.g015

Download:

Fig 16. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the proposed MCMC method (GSE11058 dataset).

https://doi.org/10.1371/journal.pone.0186167.g016

Download:

Fig 17. Plot of proportions.

Plot of the true proportions vs. estimated proportions obtained from the proposed PNMF method (GSE11058 dataset).

https://doi.org/10.1371/journal.pone.0186167.g017

Download:

Fig 18. Jurkat > IM-9.

ROC plot obtained from the proposed SMC method for Jurkat vs. IM-9 cell types, Jurkat upregulated (GSE11058 dataset).

https://doi.org/10.1371/journal.pone.0186167.g018

Download:

Fig 19. Jurkat > Raji.

ROC plot obtained from the proposed SMC method for Jurkat vs. Raji cell types, Jurkat upregulated (GSE11058 dataset).

https://doi.org/10.1371/journal.pone.0186167.g019

Download:

Fig 20. Jurkat > THP-1.

ROC plot obtained from the proposed SMC method for Jurkat vs. THP-1 cell types, Jurkat upregulated (GSE11058 dataset).

https://doi.org/10.1371/journal.pone.0186167.g020

Download:

Table 6. Pearson correlation coefficient (r) for the GSE19830 dataset.

https://doi.org/10.1371/journal.pone.0186167.t006

Download:

Table 7. AUROC for the GSE19830 dataset.

https://doi.org/10.1371/journal.pone.0186167.t007

Discussion

In this paper, we modeled the heterogeneous gene expression data using a Bayesian framework. Specifically, we modeled the expression of a gene in each sample as the sum of expressions of that gene in all the constituting cell types in the sample, weighted by the proportions of all cell types in the sample plus an additive Gaussian noise.

We proposed an efficient SMC algorithm, a novel Bayesian approach that is based on sampling technology suited for approximating the posterior distributions of complex model parameters. In this paper, we obtained the estimates of the cellular proportions (matrix M) and the cell-type specific expression profiles (matrix X) from the heterogeneous gene expression data. Further, the estimated expression profiles are used to identify genes that are differentially expressed which is one of the major reasons for carrying out gene expression deconvolution analysis. In addition to the identification of the differentially expressed genes, performing the complete gene expression deconvolution is an attractive method that provides an alternative to the very expensive and time consuming manual approaches like LCM and flow cytometry for separating cells which often lead to an altered cell-type specific gene expression profiles. Unlike some previously proposed methods for gene expression data deconvolution, our method does not rely on any prior knowledge of the cell type proportions or the cell-type specific gene expression profiles.

In testing the performance of the proposed SMC method, we evaluated the method on simulated datasets and publicly available real datasets. From the results obtained in all the experiments, the proposed SMC method demonstrated a superior performance in terms of accuracy of the estimated model parameters and also in identifying differentially expressed genes as shown in the Results Section and in the S1 Supplementary Material, when compared to the two other methods.

Moreover, in mapping the estimated cell-type specific profiles (matrix ) to the true cell types, we defined a set of marker probesets which were defined from the gene expression data from pure samples, matrix . Although in the real settings, we have no access to these pure samples, a small number of cell-type specific markers are often available, for instance, [55] identified a set of markers for different immune subsets.

Finally, it was shown that PNMF and the MCMC methods are faster than the SMC method in terms of computational speed. However, when there is an option of parallelization of computational resources, the SMC method can be considerably improved in terms of the computational time.

Supporting information

S1 Supplementary Material. Supplementary Material for “A Sequential Monte Carlo Approach to Gene Expression Deconvolution”.

https://doi.org/10.1371/journal.pone.0186167.s001

(PDF)

References

1. Chu Y, Corey DR. RNA sequencing: platform selection, experimental design, and data interpretation. Nucleic acid therapeutics. 2012;22(4):271–274. pmid:22830413
- View Article
- PubMed/NCBI
- Google Scholar
2. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics. 2009;10(1):57–63. pmid:19015660
- View Article
- PubMed/NCBI
- Google Scholar
3. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997;278(5338):680–686. pmid:9381177
- View Article
- PubMed/NCBI
- Google Scholar
4. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al. Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular biology of the cell. 1998;9(12):3273–3297. pmid:9843569
- View Article
- PubMed/NCBI
- Google Scholar
5. Mischel PS, Cloughesy TF, Nelson SF. DNA-microarray analysis of brain cancer: molecular classification for therapy. Nature Reviews Neuroscience. 2004;5(10):782–792. pmid:15378038
- View Article
- PubMed/NCBI
- Google Scholar
6. Hanai T, Hamada H, Okamoto M. Application of bioinformatics for DNA microarray data to bioscience, bioengineering and medical fields. Journal of bioscience and bioengineering. 2006;101(5):377–384. pmid:16781465
- View Article
- PubMed/NCBI
- Google Scholar
7. Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, et al. Cell type–specific gene expression differences in complex tissues. Nature methods. 2010;7(4):287–289. pmid:20208531
- View Article
- PubMed/NCBI
- Google Scholar
8. Meng T, Chen H, Sun M, Wang H, Zhao G, Wang X. Identification of differential gene expression profiles in placentas from preeclamptic pregnancies versus normal pregnancies by DNA microarrays. Omics: a journal of integrative biology. 2012;16(6):301–311. pmid:22702245
- View Article
- PubMed/NCBI
- Google Scholar
9. Cleator SJ, Powles TJ, Dexter T, Fulford L, Mackay A, Smith IE, et al. The effect of the stromal component of breast tumours on prediction of clinical outcome using gene expression microarray analysis. Breast Cancer Research. 2006;8(3):1.
- View Article
- Google Scholar
10. Espina V, Heiby M, Pierobon M, Liotta LA. Laser capture microdissection technology. Expert review of molecular diagnostics. 2007;7(5):647–657. pmid:17892370
- View Article
- PubMed/NCBI
- Google Scholar
11. Fulwyler MJ. Electronic separation of biological cells by volume. Science. 1965;150(3698):910–911. pmid:5891056
- View Article
- PubMed/NCBI
- Google Scholar
12. Yachida S, Jones S, Bozic I, Antal T, Leary R, Fu B, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature. 2010;467(7319):1114–1117. pmid:20981102
- View Article
- PubMed/NCBI
- Google Scholar
13. Frumkin D, Wasserstrom A, Itzkovitz S, Harmelin A, Rechavi G, Shapiro E. Amplification of multiple genomic loci from single cells isolated by laser micro-dissection of tissues. BMC biotechnology. 2008;8(1):1.
- View Article
- Google Scholar
14. Bhattacherjee V, Mukhopadhyay P, Singh S, Roberts EA, Hackmiller RC, Greene RM, et al. Laser capture microdissection of fluorescently labeled embryonic cranial neural crest cells. Genesis. 2004;39(1):58–64. pmid:15124228
- View Article
- PubMed/NCBI
- Google Scholar
15. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, et al. Robust enumeration of cell subsets from tissue expression profiles. Nature methods. 2015;12(5):453–457. pmid:25822800
- View Article
- PubMed/NCBI
- Google Scholar
16. Gong T, Hartmann N, Kohane IS, Brinkmann V, Staedtler F, Letzkus M, et al. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PloS one. 2011;6(11):e27156. pmid:22110609
- View Article
- PubMed/NCBI
- Google Scholar
17. Clarke J, Seo P, Clarke B. Statistical expression deconvolution from mixed tissue samples. Bioinformatics. 2010;26(8):1043–1049. pmid:20202973
- View Article
- PubMed/NCBI
- Google Scholar
18. Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PloS one. 2009;4(7):e6098. pmid:19568420
- View Article
- PubMed/NCBI
- Google Scholar
19. Ogundijo OE, He D, Parida L. Performance evaluation of different encoding strategies for quantitative genetic trait prediction. In: Computational Advances in Bio and Medical Sciences (ICCABS), 2015 IEEE 5th International Conference on. IEEE; 2015. p. 1–6.
20. Lähdesmäki H, Dunmire V, Yli-Harja O, Zhang W, et al. In silico microdissection of microarray data from heterogeneous cell populations. Bmc Bioinformatics. 2005;6(1):1.
- View Article
- Google Scholar
21. Jacobsen M, Repsilber D, Gutschmidt A, Neher A, Feldmann K, Mollenkopf HJ, et al. Deconfounding microarray analysis. Methods of information in medicine. 2006;45(5):557–563. pmid:17019511
- View Article
- PubMed/NCBI
- Google Scholar
22. Gaujoux R, Seoighe C. Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study. Infection, Genetics and Evolution. 2012;12(5):913–921. pmid:21930246
- View Article
- PubMed/NCBI
- Google Scholar
23. Venet D, Pecasse F, Maenhaut C, Bersini H. Separation of samples into their constituents using gene expression data. Bioinformatics. 2001;17(suppl 1):S279–S287. pmid:11473019
- View Article
- PubMed/NCBI
- Google Scholar
24. Repsilber D, Kern S, Telaar A, Walzl G, Black GF, Selbig J, et al. Biomarker discovery in heterogeneous tissue samples-taking the in-silico deconfounding approach. BMC bioinformatics. 2010;11(1):1.
- View Article
- Google Scholar
25. Lee DD, Seung HS. Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems; 2001. p. 556–562.
26. Kim H, Park H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM journal on matrix analysis and applications. 2008;30(2):713–730.
- View Article
- Google Scholar
27. Erkkilä T, Lehmusvaara S, Ruusuvuori P, Visakorpi T, Shmulevich I, Lähdesmäki H. Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics. 2010;26(20):2571–2577. pmid:20631160
- View Article
- PubMed/NCBI
- Google Scholar
28. Nguyen TLT, Septier F, Peters GW, Delignon Y. Efficient sequential Monte-Carlo samplers for Bayesian inference. IEEE Transactions on Signal Processing. 2016;64(5):1305–1319.
- View Article
- Google Scholar
29. Del Moral P, Doucet A, Jasra A. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006;68(3):411–436.
- View Article
- Google Scholar
30. Peters GW, Fan Y, Sisson SA. On sequential Monte Carlo, partial rejection control and approximate Bayesian computation. Statistics and Computing. 2012;22(6):1209–1222.
- View Article
- Google Scholar
31. Peters GW. Topics in sequential Monte Carlo samplers. M sc, University of Cambridge, Department of Engineering. 2005;.
32. Ogundijo OE, Elmas A, Wang X. Reverse engineering gene regulatory networks from measurement with missing values. EURASIP Journal on Bioinformatics and Systems Biology. 2017;2017(1):2. pmid:28127303
- View Article
- PubMed/NCBI
- Google Scholar
33. Wilkinson DJ. Bayesian methods in bioinformatics and computational systems biology. Briefings in bioinformatics. 2007;8(2):109–116. pmid:17430978
- View Article
- PubMed/NCBI
- Google Scholar
34. Doucet A, De Freitas N, Gordon N. Sequential Monte Carlo methods in practice Springer. New York. 2001;.
35. Doucet A, Godsill S, Andrieu C. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and computing. 2000;10(3):197–208.
- View Article
- Google Scholar
36. Kitagawa G. A self-organizing state-space model. Journal of the American Statistical Association. 1998; p. 1203–1215.
- View Article
- Google Scholar
37. Kitagawa G. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of computational and graphical statistics. 1996;5(1):1–25.
- View Article
- Google Scholar
38. Bayar B, Bouaynaya N, Shterenberg R. Probabilistic non-negative matrix factorization: theory and application to microarray data analysis. Journal of bioinformatics and computational biology. 2014;12(01):1450001. pmid:24467759
- View Article
- PubMed/NCBI
- Google Scholar
39. Stuart RO, Wachsman W, Berry CC, Wang-Rodriguez J, Wasserman L, Klacansky I, et al. In silico dissection of cell-type-associated patterns of gene expression in prostate cancer. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(2):615–620. pmid:14722351
- View Article
- PubMed/NCBI
- Google Scholar
40. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. vol. 2. Chapman & Hall/CRC Boca Raton, FL, USA; 2014.
41. Neal RM. Annealed importance sampling. Statistics and Computing. 2001;11(2):125–139.
- View Article
- Google Scholar
42. Fearnhead P, Taylor BM, et al. An adaptive sequential Monte Carlo sampler. Bayesian analysis. 2013;8(2):411–438.
- View Article
- Google Scholar
43. Liu JS, Chen R. Blind deconvolution via sequential imputations. Journal of the american statistical association. 1995;90(430):567–576.
- View Article
- Google Scholar
44. Arulampalam MS, Maskell S, Gordon N, Clapp T. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on signal processing. 2002;50(2):174–188.
- View Article
- Google Scholar
45. Särkkä S. Bayesian filtering and smoothing. vol. 3. Cambridge University Press; 2013.
46. Andrew H, Florence G, Kibria GB. Methods for Identifying Differentially Expressed Genes: An Empirical Comparison. Journal of Biometrics & Biostatistics. 2015;6(5):1.
- View Article
- Google Scholar
47. Zhong Y, Wan YW, Pang K, Chow LM, Liu Z. Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC bioinformatics. 2013;14(1):1.
- View Article
- Google Scholar
48. Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS letters. 2004;573(1-3):83–92. pmid:15327980
- View Article
- PubMed/NCBI
- Google Scholar
49. McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25(6):765–771. pmid:19176553
- View Article
- PubMed/NCBI
- Google Scholar
50. Draghici S, Khatri P, Eklund AC, Szallasi Z. Reliability and reproducibility issues in DNA microarray measurements. TRENDS in Genetics. 2006;22(2):101–109. pmid:16380191
- View Article
- PubMed/NCBI
- Google Scholar
51. Carter SL, Eklund AC, Mecham BH, Kohane IS, Szallasi Z. Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces cross-platform inconsistencies in cancer-associated gene expression measurements. BMC bioinformatics. 2005;6(1):1.
- View Article
- Google Scholar
52. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(16):6062–6067. pmid:15075390
- View Article
- PubMed/NCBI
- Google Scholar
53. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–264. pmid:12925520
- View Article
- PubMed/NCBI
- Google Scholar
54. Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, et al. NCBI GEO: archive for functional genomics data sets—10 years on. Nucleic acids research. 2011;39(suppl 1):D1005–D1010. pmid:21097893
- View Article
- PubMed/NCBI
- Google Scholar
55. Abbas A, Baldwin D, Ma Y, Ouyang W, Gurney A, Martin F, et al. Immune response in silico (IRIS): immune-specific genes identified from a compendium of microarray expression data. Genes and immunity. 2005;6(4):319–331. pmid:15789058
- View Article
- PubMed/NCBI
- Google Scholar

[ref1] 1. Chu Y, Corey DR. RNA sequencing: platform selection, experimental design, and data interpretation. Nucleic acid therapeutics. 2012;22(4):271–274. pmid:22830413
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews genetics. 2009;10(1):57–63. pmid:19015660
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997;278(5338):680–686. pmid:9381177
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al. Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular biology of the cell. 1998;9(12):3273–3297. pmid:9843569
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Mischel PS, Cloughesy TF, Nelson SF. DNA-microarray analysis of brain cancer: molecular classification for therapy. Nature Reviews Neuroscience. 2004;5(10):782–792. pmid:15378038
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Hanai T, Hamada H, Okamoto M. Application of bioinformatics for DNA microarray data to bioscience, bioengineering and medical fields. Journal of bioscience and bioengineering. 2006;101(5):377–384. pmid:16781465
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, et al. Cell type–specific gene expression differences in complex tissues. Nature methods. 2010;7(4):287–289. pmid:20208531
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. Meng T, Chen H, Sun M, Wang H, Zhao G, Wang X. Identification of differential gene expression profiles in placentas from preeclamptic pregnancies versus normal pregnancies by DNA microarrays. Omics: a journal of integrative biology. 2012;16(6):301–311. pmid:22702245
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref9] 9. Cleator SJ, Powles TJ, Dexter T, Fulford L, Mackay A, Smith IE, et al. The effect of the stromal component of breast tumours on prediction of clinical outcome using gene expression microarray analysis. Breast Cancer Research. 2006;8(3):1.
View Article
Google Scholar

[34] View Article

[35] Google Scholar

[ref10] 10. Espina V, Heiby M, Pierobon M, Liotta LA. Laser capture microdissection technology. Expert review of molecular diagnostics. 2007;7(5):647–657. pmid:17892370
View Article
PubMed/NCBI
Google Scholar

[37] View Article

[38] PubMed/NCBI

[39] Google Scholar

[ref11] 11. Fulwyler MJ. Electronic separation of biological cells by volume. Science. 1965;150(3698):910–911. pmid:5891056
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref12] 12. Yachida S, Jones S, Bozic I, Antal T, Leary R, Fu B, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature. 2010;467(7319):1114–1117. pmid:20981102
View Article
PubMed/NCBI
Google Scholar

[45] View Article

[46] PubMed/NCBI

[47] Google Scholar

[ref13] 13. Frumkin D, Wasserstrom A, Itzkovitz S, Harmelin A, Rechavi G, Shapiro E. Amplification of multiple genomic loci from single cells isolated by laser micro-dissection of tissues. BMC biotechnology. 2008;8(1):1.
View Article
Google Scholar

[49] View Article

[50] Google Scholar

[ref14] 14. Bhattacherjee V, Mukhopadhyay P, Singh S, Roberts EA, Hackmiller RC, Greene RM, et al. Laser capture microdissection of fluorescently labeled embryonic cranial neural crest cells. Genesis. 2004;39(1):58–64. pmid:15124228
View Article
PubMed/NCBI
Google Scholar

[52] View Article

[53] PubMed/NCBI

[54] Google Scholar

[ref15] 15. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, et al. Robust enumeration of cell subsets from tissue expression profiles. Nature methods. 2015;12(5):453–457. pmid:25822800
View Article
PubMed/NCBI
Google Scholar

[56] View Article

[57] PubMed/NCBI

[58] Google Scholar

[ref16] 16. Gong T, Hartmann N, Kohane IS, Brinkmann V, Staedtler F, Letzkus M, et al. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PloS one. 2011;6(11):e27156. pmid:22110609
View Article
PubMed/NCBI
Google Scholar

[60] View Article

[61] PubMed/NCBI

[62] Google Scholar

[ref17] 17. Clarke J, Seo P, Clarke B. Statistical expression deconvolution from mixed tissue samples. Bioinformatics. 2010;26(8):1043–1049. pmid:20202973
View Article
PubMed/NCBI
Google Scholar

[64] View Article

[65] PubMed/NCBI

[66] Google Scholar

[ref18] 18. Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PloS one. 2009;4(7):e6098. pmid:19568420
View Article
PubMed/NCBI
Google Scholar

[68] View Article

[69] PubMed/NCBI

[70] Google Scholar

[ref19] 19. Ogundijo OE, He D, Parida L. Performance evaluation of different encoding strategies for quantitative genetic trait prediction. In: Computational Advances in Bio and Medical Sciences (ICCABS), 2015 IEEE 5th International Conference on. IEEE; 2015. p. 1–6.

[ref20] 20. Lähdesmäki H, Dunmire V, Yli-Harja O, Zhang W, et al. In silico microdissection of microarray data from heterogeneous cell populations. Bmc Bioinformatics. 2005;6(1):1.
View Article
Google Scholar

[73] View Article

[74] Google Scholar

[ref21] 21. Jacobsen M, Repsilber D, Gutschmidt A, Neher A, Feldmann K, Mollenkopf HJ, et al. Deconfounding microarray analysis. Methods of information in medicine. 2006;45(5):557–563. pmid:17019511
View Article
PubMed/NCBI
Google Scholar

[76] View Article

[77] PubMed/NCBI

[78] Google Scholar

[ref22] 22. Gaujoux R, Seoighe C. Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study. Infection, Genetics and Evolution. 2012;12(5):913–921. pmid:21930246
View Article
PubMed/NCBI
Google Scholar

[80] View Article

[81] PubMed/NCBI

[82] Google Scholar

[ref23] 23. Venet D, Pecasse F, Maenhaut C, Bersini H. Separation of samples into their constituents using gene expression data. Bioinformatics. 2001;17(suppl 1):S279–S287. pmid:11473019
View Article
PubMed/NCBI
Google Scholar

[84] View Article

[85] PubMed/NCBI

[86] Google Scholar

[ref24] 24. Repsilber D, Kern S, Telaar A, Walzl G, Black GF, Selbig J, et al. Biomarker discovery in heterogeneous tissue samples-taking the in-silico deconfounding approach. BMC bioinformatics. 2010;11(1):1.
View Article
Google Scholar

[88] View Article

[89] Google Scholar

[ref25] 25. Lee DD, Seung HS. Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems; 2001. p. 556–562.

[ref26] 26. Kim H, Park H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM journal on matrix analysis and applications. 2008;30(2):713–730.
View Article
Google Scholar

[92] View Article

[93] Google Scholar

[ref27] 27. Erkkilä T, Lehmusvaara S, Ruusuvuori P, Visakorpi T, Shmulevich I, Lähdesmäki H. Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics. 2010;26(20):2571–2577. pmid:20631160
View Article
PubMed/NCBI
Google Scholar

[95] View Article

[96] PubMed/NCBI

[97] Google Scholar

[ref28] 28. Nguyen TLT, Septier F, Peters GW, Delignon Y. Efficient sequential Monte-Carlo samplers for Bayesian inference. IEEE Transactions on Signal Processing. 2016;64(5):1305–1319.
View Article
Google Scholar

[99] View Article

[100] Google Scholar

[ref29] 29. Del Moral P, Doucet A, Jasra A. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006;68(3):411–436.
View Article
Google Scholar

[102] View Article

[103] Google Scholar

[ref30] 30. Peters GW, Fan Y, Sisson SA. On sequential Monte Carlo, partial rejection control and approximate Bayesian computation. Statistics and Computing. 2012;22(6):1209–1222.
View Article
Google Scholar

[105] View Article

[106] Google Scholar

[ref31] 31. Peters GW. Topics in sequential Monte Carlo samplers. M sc, University of Cambridge, Department of Engineering. 2005;.

[ref32] 32. Ogundijo OE, Elmas A, Wang X. Reverse engineering gene regulatory networks from measurement with missing values. EURASIP Journal on Bioinformatics and Systems Biology. 2017;2017(1):2. pmid:28127303
View Article
PubMed/NCBI
Google Scholar

[109] View Article

[110] PubMed/NCBI

[111] Google Scholar

[ref33] 33. Wilkinson DJ. Bayesian methods in bioinformatics and computational systems biology. Briefings in bioinformatics. 2007;8(2):109–116. pmid:17430978
View Article
PubMed/NCBI
Google Scholar

[113] View Article

[114] PubMed/NCBI

[115] Google Scholar

[ref34] 34. Doucet A, De Freitas N, Gordon N. Sequential Monte Carlo methods in practice Springer. New York. 2001;.

[ref35] 35. Doucet A, Godsill S, Andrieu C. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and computing. 2000;10(3):197–208.
View Article
Google Scholar

[118] View Article

[119] Google Scholar

[ref36] 36. Kitagawa G. A self-organizing state-space model. Journal of the American Statistical Association. 1998; p. 1203–1215.
View Article
Google Scholar

[121] View Article

[122] Google Scholar

[ref37] 37. Kitagawa G. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of computational and graphical statistics. 1996;5(1):1–25.
View Article
Google Scholar

[124] View Article

[125] Google Scholar

[ref38] 38. Bayar B, Bouaynaya N, Shterenberg R. Probabilistic non-negative matrix factorization: theory and application to microarray data analysis. Journal of bioinformatics and computational biology. 2014;12(01):1450001. pmid:24467759
View Article
PubMed/NCBI
Google Scholar

[127] View Article

[128] PubMed/NCBI

[129] Google Scholar

[ref39] 39. Stuart RO, Wachsman W, Berry CC, Wang-Rodriguez J, Wasserman L, Klacansky I, et al. In silico dissection of cell-type-associated patterns of gene expression in prostate cancer. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(2):615–620. pmid:14722351
View Article
PubMed/NCBI
Google Scholar

[131] View Article

[132] PubMed/NCBI

[133] Google Scholar

[ref40] 40. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. vol. 2. Chapman & Hall/CRC Boca Raton, FL, USA; 2014.

[ref41] 41. Neal RM. Annealed importance sampling. Statistics and Computing. 2001;11(2):125–139.
View Article
Google Scholar

[136] View Article

[137] Google Scholar

[ref42] 42. Fearnhead P, Taylor BM, et al. An adaptive sequential Monte Carlo sampler. Bayesian analysis. 2013;8(2):411–438.
View Article
Google Scholar

[139] View Article

[140] Google Scholar

[ref43] 43. Liu JS, Chen R. Blind deconvolution via sequential imputations. Journal of the american statistical association. 1995;90(430):567–576.
View Article
Google Scholar

[142] View Article

[143] Google Scholar

[ref44] 44. Arulampalam MS, Maskell S, Gordon N, Clapp T. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on signal processing. 2002;50(2):174–188.
View Article
Google Scholar

[145] View Article

[146] Google Scholar

[ref45] 45. Särkkä S. Bayesian filtering and smoothing. vol. 3. Cambridge University Press; 2013.

[ref46] 46. Andrew H, Florence G, Kibria GB. Methods for Identifying Differentially Expressed Genes: An Empirical Comparison. Journal of Biometrics & Biostatistics. 2015;6(5):1.
View Article
Google Scholar

[149] View Article

[150] Google Scholar

[ref47] 47. Zhong Y, Wan YW, Pang K, Chow LM, Liu Z. Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC bioinformatics. 2013;14(1):1.
View Article
Google Scholar

[152] View Article

[153] Google Scholar

[ref48] 48. Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS letters. 2004;573(1-3):83–92. pmid:15327980
View Article
PubMed/NCBI
Google Scholar

[155] View Article

[156] PubMed/NCBI

[157] Google Scholar

[ref49] 49. McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics. 2009;25(6):765–771. pmid:19176553
View Article
PubMed/NCBI
Google Scholar

[159] View Article

[160] PubMed/NCBI

[161] Google Scholar

[ref50] 50. Draghici S, Khatri P, Eklund AC, Szallasi Z. Reliability and reproducibility issues in DNA microarray measurements. TRENDS in Genetics. 2006;22(2):101–109. pmid:16380191
View Article
PubMed/NCBI
Google Scholar

[163] View Article

[164] PubMed/NCBI

[165] Google Scholar

[ref51] 51. Carter SL, Eklund AC, Mecham BH, Kohane IS, Szallasi Z. Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces cross-platform inconsistencies in cancer-associated gene expression measurements. BMC bioinformatics. 2005;6(1):1.
View Article
Google Scholar

[167] View Article

[168] Google Scholar

[ref52] 52. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(16):6062–6067. pmid:15075390
View Article
PubMed/NCBI
Google Scholar

[170] View Article

[171] PubMed/NCBI

[172] Google Scholar

[ref53] 53. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4(2):249–264. pmid:12925520
View Article
PubMed/NCBI
Google Scholar

[174] View Article

[175] PubMed/NCBI

[176] Google Scholar

[ref54] 54. Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, et al. NCBI GEO: archive for functional genomics data sets—10 years on. Nucleic acids research. 2011;39(suppl 1):D1005–D1010. pmid:21097893
View Article
PubMed/NCBI
Google Scholar

[178] View Article

[179] PubMed/NCBI

[180] Google Scholar

[ref55] 55. Abbas A, Baldwin D, Ma Y, Ouyang W, Gurney A, Martin F, et al. Immune response in silico (IRIS): immune-specific genes identified from a compendium of microarray expression data. Genes and immunity. 2005;6(4):319–331. pmid:15789058
View Article
PubMed/NCBI
Google Scholar

[182] View Article

[183] PubMed/NCBI

[184] Google Scholar

Figures

Abstract

Introduction

Materials and methods

Likelihood function

Prior densities for all model parameters

Prior densities for the cell-type specific expressions.

Prior densities for the cell type proportions.

Prior density for the precision.

Sequential Monte Carlo samplers for Bayesian inference

General principle of SMC samplers.

Resampling procedure.

Target distributions, forward and backward kernels specification for gene expression deconvolution.

SMC sampler algorithm for gene expression deconvolution

Results

Ground-truth for variables

Ground-truth for the cell types proportions and the cell-type specific expression profiles (matrices M and X).

List of differentially and non-differentially expressed genes.

Cell types mapping and marker probesets.

Metrics for comparing results.

Simulated dataset

Affymetrix dataset: 2 cell types

GEO series GSE19830 dataset: 3 cell types

GEO series GSE11058 dataset: 4 cell types

Discussion

Supporting information

S1 Supplementary Material. Supplementary Material for “A Sequential Monte Carlo Approach to Gene Expression Deconvolution”.

References