DSection manual

Overview

Tissue heterogeneity, arising from multiple cell types, is a major confounding factor in experiments that focus on studying cell types, e.g. their expression profiles, in isolation. Although sample heterogeneity can be addressed by manual microdissection, prior to conducting experiments, computational treatment on heterogeneous measurements have become a reliable alternative to perform this microdissection in silico. Favoring computation over manual purification has its advantages, such as time consumption, measuring responses of multiple cell types simultaneously, keeping samples intact of external perturbations and unaltered yield of molecular content.

Model description
DSection is a probabilistic model, devised for estimating cell-type and experimental condition specific expression profiles in heterogeneous tissues. Unlike existing models, DSection assumes that prior knowledge on cell type proportions in the tissues contains uncertainty, and with Markov chain Monte Carlo sampling techniques improving proportion estimation -- along with better overall model fit -- has been demonstrated.

For gene i in tissue j under experimental condition c(j), DSection for expression measurement yij is

yij = ∑xtic(j)ptj + εij,     εij ~ N(0,1/λi),

where the sum is carried over all cell types t = 1,2,...,T; xtic(j) is the pure cell-type and experimental condition specific expression of gene i, and ptj is the cell-type proportion.

Parameter estimation
DSection is built upon a hybrid Gibbs and Metropolis-Hastings sampler that is used for estimating pure cell-type and experimental condition specific expression profiles (xtic's), cell-type proportions (ptj's), and precisions (λi's). Thus, for all these parameters priors densities reflecting their assumed uncertainty need to be assigned. We assign the following prior densities:

xtic ~ N(μtic,1/ν),     pj ~ Dirichlet(w0p0j),     λi ~ Gamma(α,β),

where w0 is a positive weight reflecting the confidence on prior cell-type proportions, pj; see more details about justification and use of prior densities on the article and supplementary material (derivations for all posterior densities are provided).

Inputs

Example
Consider the following hypothetical situation. You are given expression data of eight prostate tissue samples containing varying degrees of cancer-, epithelial-, and stromal cells. You know that sample heterogeneity is surely affecting the expression profiles, as per H&E staining tissues were known to be highly heterogeneous, and expression is known to be cell-type dependent. Moreover, four of the samples are controls (not treated) and the other four were from patients that were given a specific prostate cancer suppressor drug, so a natural experimental design would aim for testing for differential expression between both cell-types and treatments. To correct for the confounding effects of sample heterogeneity, a qualified pathologist has assessed cell-type contents in each of the samples based on available H&E stained images, and DSection is used for computationally reconstructing cell-type specific expression profiles of the two genes of interest, "EGFR" and "AR". How the expression and prior cell-type proportion data are given for DSection, it is discussed next.

Formatting expression data
The user is responsible of uploading two tab-delimited data files, one for expression data and one for prior cell-type proportions. Rows contain probe identifiers (e.g. gene symbols) and columns experimental conditions as headers (column containing lines starting with // are comments and are not included in the data table):
Control Control Treatment Treatment Treatment Treatment Control Control //"Control" and "Treatment" are used as unique identifiers for the two experimental conditions
EGFR 1.5 2.3 3.3 2.5 4.1 4.1 0.9 0.8 //Expression profile of "EGFR" over all tissue samples
AR 3.3 2.5 1.1 1.2 3.1 1.3 6.1 3.5 //Expression profile of "AR" over all tissue samples

Formatting proportion data
Data table containing prior cell-type proportions is formatted in the similar fashion; rows contain cell-type identifiers and columns experimental conditions as headers (make sure the column identifiers match with those in the expression table and are in the same order):
Control Control Treatment Treatment Treatment Treatment Control Control //"Control" and "Treatment" need to be in the same order as in the expression data table
Cancer 0.4 0.4 0.2 0.1 0.4 0.1 0.7 0.2 //"Cancer" is used as unique cell-type idntifier
Epithelium 0.2 0.1 0.6 0.1 0.3 0.3 0.0 0.5 //"Epithelium" is used as unique cell-type idntifier
Stroma 0.4 0.5 0.2 0.8 0.3 0.6 0.3 0.3 //"Stroma" is used as unique cell-type idntifier

Sampler parameters
As said, the unknown parameters in the model are estimated with a hybrid Gibbs- and Metropolis-Hastings sampler, one type of a Markov chain sampler. From Markov chain sampling theory, parameter estimates (here: sample means, also known as Markov chain Monte Carlo estimates) are extracted from posterior distributions of the respective parameter space; and in order to attain that posterior distribution from practically any initialization of the sampler, some amount of samples, also known as burn-in samples, need to be discarded. We denote the amount of burn-in by B, and amount of posterior sampling by S; together, B+S is the total number of samples drawn, but only last S samples are used for estimating parameters.

Furthermore, Metropolis-Hastings part of the hybrid sampler requires one extra parameter, transition weight (wp), that defines how correlated newly drawn samples are with respect to old samples; high wp guarantees high correlation (in Bayesian terms: small mixing fraction of the chain) and vice versa.

Unless you are familiar with Bayesian models and sampling theory, we suggest to leave B, S, and wp to their default values (2000, 5000, and 100, respectively).

Outputs

Estimated cell-type and experimental condition specific expression profiles (E), standard deviations (S), and samples sizes (N)
Output table is formatted so that a triplet of columns (E = "Expression estimate", S = "Standard deviation", N = "Sample size") represents one cell-type and experimental condition pair; for clarity, triplets are also labeled with color in this example. Cell-type and experimental condition identifier pairs are put inside parentheses; e.g., expression estimates of cancer cells in control samples are denoted by "E(C,C)", and standard deviations of epithelial cells in treated samples by "S(E,T)".
E(C,C) S(C,C) N(C,C) E(E,C) S(E,C) N(E,C) E(S,C) S(S,C) N(S,C) E(C,T) S(C,T) N(C,T) E(E,T) S(E,T) N(E,T) E(S,T) S(S,T) N(S,T)
EGFR 1.5 0.2 2 1.8 0.4 2 1.5 2.3 2 2.5 4.1 2 0.8 0.8 2 0.8 0.8 2
AR 3.3 0.6 2 3.3 0.5 2 2.5 1.1 2 3.1 0.9 2 1.3 0.2 2 1.3 0.1 2

Estimated cell-type proportions
A table of the same type as the prior cell-type proportions are in is outputted.

Statistical testing

In this setting, a natural statistical test is to assess whether expression of a gene is different between cell-types and/or experimental conditions. Suppose one is interested in finding how expression of AR depends on the cell-type and, more interestingly, cancer treatment; one example of how it could be done is described next.

Future implementations of DSection may include statistical testing, at the moment that functionality is not available.

Two-sample t-test
The two-sample test statistics is

T("case1" vs "case2") = (μ12) / {sqrt[((n1-1)s12+((n2-1)s22) / (n1+n2-2)]sqrt[1/n1+1/n2]};

thus, if AR expression in cancer cells for control samples (E(C,C)) were tested against AR expression in cancer cells for treated samples (E(C,T)), the test statistic would in this hypothetical example be

T(E(C,C) vs E(C,T)) = 0.2 / sqrt[1.17/2] = 0.26148818,

which at 95% significance level would not indicate statistically significant differential expression between the two cases.

Estimating false discovery rate
In real situations multiple statistical tests are carried out simultaneously. In the previous hypothetical example for instance, statistical testing between all occurring cell-type and experimental condition pairs could be considered, thus creating (6 nCr 2) = (6*5)/(1*2) = 15 p-values per gene, and with the two genes the total number of tests is therefore 30. This of course is not much, but assuming there are thousands of genes whose differential expression between cell-types and experimental conditions is to be assessed, it is apparent that some techniques for estimating false discovery rate could be applied, for better studying statistically significant differential expression by controlling the amount of false positives.

Based on our experimentation with both simulated and real data, Storey's fdr estimation with q-values gives reasonable results; whether Storey's method will work or not can rather straightforwardly be assessed by looking at the corresponding p-value histograms (thus, for reliable assessment, lots of p-values are required). For Storey's method to work, the histogram should look flat except that, if differential expression indeed is present, near p = 0 should be a distinguishable peak. However, the presence or absence of such peak signifying differential expression is not required for Storey's method to work; in such cases where the peak is absent, subsequent fdr estimates would be close to 1, i.e., chances for making false positive calls is close to 100%.