Model description
DSection is a probabilistic model, devised for estimating cell-type and experimental condition specific expression profiles in heterogeneous tissues. Unlike existing models, DSection assumes that prior knowledge on cell type proportions in the tissues contains uncertainty, and with Markov chain Monte Carlo sampling techniques improving proportion estimation -- along with better overall model fit -- has been demonstrated.
For gene i in tissue j under experimental condition c(j), DSection for expression measurement yij is
yij = ∑xtic(j)ptj + εij, εij ~ N(0,1/λi),
where the sum is carried over all cell types t = 1,2,...,T; xtic(j) is the pure cell-type and experimental condition specific expression of gene i, and ptj is the cell-type proportion.
Parameter estimation
DSection is built upon a hybrid Gibbs and Metropolis-Hastings sampler that is used for estimating pure cell-type and experimental condition specific expression profiles (xtic's), cell-type proportions (ptj's), and precisions (λi's). Thus, for all these parameters priors densities reflecting their assumed uncertainty need to be assigned. We assign the following prior densities:
xtic ~ N(μtic,1/ν), pj ~ Dirichlet(w0p0j), λi ~ Gamma(α,β),
where w0 is a positive weight reflecting the confidence on prior cell-type proportions, pj; see more details about justification and use of prior densities on the article and supplementary material (derivations for all posterior densities are provided).
Formatting expression data
The user is responsible of uploading two tab-delimited data files, one for expression data and one for prior cell-type proportions. Rows contain probe identifiers (e.g. gene symbols) and columns experimental conditions as headers (column containing lines starting with // are comments and are not included in the data table):
Control | Control | Treatment | Treatment | Treatment | Treatment | Control | Control | //"Control" and "Treatment" are used as unique identifiers for the two experimental conditions | |
---|---|---|---|---|---|---|---|---|---|
EGFR | 1.5 | 2.3 | 3.3 | 2.5 | 4.1 | 4.1 | 0.9 | 0.8 | //Expression profile of "EGFR" over all tissue samples |
AR | 3.3 | 2.5 | 1.1 | 1.2 | 3.1 | 1.3 | 6.1 | 3.5 | //Expression profile of "AR" over all tissue samples |
Formatting proportion data
Data table containing prior cell-type proportions is formatted in the similar fashion; rows contain cell-type identifiers and columns experimental conditions as headers (make sure the column identifiers match with those in the expression table and are in the same order):
Control | Control | Treatment | Treatment | Treatment | Treatment | Control | Control | //"Control" and "Treatment" need to be in the same order as in the expression data table | |
---|---|---|---|---|---|---|---|---|---|
Cancer | 0.4 | 0.4 | 0.2 | 0.1 | 0.4 | 0.1 | 0.7 | 0.2 | //"Cancer" is used as unique cell-type idntifier |
Epithelium | 0.2 | 0.1 | 0.6 | 0.1 | 0.3 | 0.3 | 0.0 | 0.5 | //"Epithelium" is used as unique cell-type idntifier |
Stroma | 0.4 | 0.5 | 0.2 | 0.8 | 0.3 | 0.6 | 0.3 | 0.3 | //"Stroma" is used as unique cell-type idntifier |
Sampler parameters
As said, the unknown parameters in the model are estimated with a hybrid Gibbs- and Metropolis-Hastings sampler, one type of a Markov chain sampler. From Markov chain sampling theory, parameter estimates (here: sample means, also known as Markov chain Monte Carlo estimates) are extracted from posterior distributions of the respective parameter space; and in order to attain that posterior distribution from practically any initialization of the sampler, some amount of samples, also known as burn-in samples, need to be discarded. We denote the amount of burn-in by B, and amount of posterior sampling by S; together, B+S is the total number of samples drawn, but only last S samples are used for estimating parameters.
Furthermore, Metropolis-Hastings part of the hybrid sampler requires one extra parameter, transition weight (wp), that defines how correlated newly drawn samples are with respect to old samples; high wp guarantees high correlation (in Bayesian terms: small mixing fraction of the chain) and vice versa.
Unless you are familiar with Bayesian models and sampling theory, we suggest to leave B, S, and wp to their default values (2000, 5000, and 100, respectively).
E(C,C) | S(C,C) | N(C,C) | E(E,C) | S(E,C) | N(E,C) | E(S,C) | S(S,C) | N(S,C) | E(C,T) | S(C,T) | N(C,T) | E(E,T) | S(E,T) | N(E,T) | E(S,T) | S(S,T) | N(S,T) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EGFR | 1.5 | 0.2 | 2 | 1.8 | 0.4 | 2 | 1.5 | 2.3 | 2 | 2.5 | 4.1 | 2 | 0.8 | 0.8 | 2 | 0.8 | 0.8 | 2 |
AR | 3.3 | 0.6 | 2 | 3.3 | 0.5 | 2 | 2.5 | 1.1 | 2 | 3.1 | 0.9 | 2 | 1.3 | 0.2 | 2 | 1.3 | 0.1 | 2 |
Estimated cell-type proportions
A table of the same type as the prior cell-type proportions are in is outputted.
Two-sample t-test
The two-sample test statistics is
T("case1" vs "case2") = (μ1-μ2) / {sqrt[((n1-1)s12+((n2-1)s22) / (n1+n2-2)]sqrt[1/n1+1/n2]};
thus, if AR expression in cancer cells for control samples (E(C,C)) were tested against AR expression in cancer cells for treated samples (E(C,T)), the test statistic would in this hypothetical example be
T(E(C,C) vs E(C,T)) = 0.2 / sqrt[1.17/2] = 0.26148818,
which at 95% significance level would not indicate statistically significant differential expression between the two cases.
Estimating false discovery rate
In real situations multiple statistical tests are carried out simultaneously. In the previous hypothetical example for instance, statistical testing between all occurring cell-type and experimental condition pairs could be considered, thus creating (6 nCr 2) = (6*5)/(1*2) = 15 p-values per gene, and with the two genes the total number of tests is therefore 30. This of course is not much, but assuming there are thousands of genes whose differential expression between cell-types and experimental conditions is to be assessed, it is apparent that some techniques for estimating false discovery rate could be applied, for better studying statistically significant differential expression by controlling the amount of false positives.
Based on our experimentation with both simulated and real data, Storey's fdr estimation with q-values gives reasonable results; whether Storey's method will work or not can rather straightforwardly be assessed by looking at the corresponding p-value histograms (thus, for reliable assessment, lots of p-values are required). For Storey's method to work, the histogram should look flat except that, if differential expression indeed is present, near p = 0 should be a distinguishable peak. However, the presence or absence of such peak signifying differential expression is not required for Storey's method to work; in such cases where the peak is absent, subsequent fdr estimates would be close to 1, i.e., chances for making false positive calls is close to 100%.