This case study analyses the HYE100 hybrid proteome samples from Navarro et al (2016). Two samples, A and B, were prepared by mixing human, yeast and E. coli tryptic peptides in specified proportions. The human peptides was obtained from HeLa cells, which is a cervical cancer cell line. Sample A was composed of 67% human, 30% yeast, and 3% E. coli, while sample B was composed of 67% human, 3% yeast, and 30% E. coli. The human proteins should not be differentially expressed between A and B, while yeast proteins should be 10-fold up-regulated in A, and E. coli proteins should be 10-fold up-regulated in B. Three technical replicates were prepared of A and B, to make a two group experiment with \(n=3\) samples in each group.
Demichev et al (2020) re-processed the mass spectrometry data using DIA-NN version 1, and the DIA-NN report file with the precursor quantifications can be downloaded from https://doi.org/10.17605/OSF.IO/6G3UX.
A limpa analysis of the full dataset has been presented previously. In the new analysis here, we make the differential expression unbalanced between groups A and B by removing all but one of the yeast proteins from the dataset. The yeast protein that is kept is D3UEH2, which has 39 precursors in the datasets, all of which are entirely missing in the B group samples. In the full data analysis, D3UEH2 was the most significant down-regulated protein that was entirely missing in the B group. The protein is consistently detected in the A samples, so is very clearly differentially expressed. Any good DE method should detect it as down-regulated in the B samples, yet we show here that all the imputation methods find it to be not DE or, even worse, for it to be highly up-regulated instead of down-regulated.
The peptide precursor intensities are provided in a tab-delimited report file output by DIA-NN, which we can read into limpa using readDIANN():
> library(limpa)
> y.peptide <- readDIANN("Fig2-diann.tsv", run.column="File.Name",
+ q.columns=c("Q.Value","Protein.Q.Value"),
+ extra.columns="Protein.Group", format="tsv")
> dim(y.peptide)
[1] 41919 6
The samples are in two groups.
> Replicate <- factor(c(1,1,2,2,3,3))
> Group <- factor(c("A","B","A","B","A","B"))
> colnames(y.peptide) <- paste(Group, Replicate, sep=".")
All the E. coli proteins are up-regulated in the B group while the yeast proteins are down-regulated.
Add species annotation:
> PG <- y.peptide$genes$Protein.Group
> n <- nchar(PG)
> y.peptide$genes$Species <- substring(PG,n-4L,n)
> table(y.peptide$genes$Species)
ECOLI HUMAN YEAS8
12260 16882 12777
For the purpose of this example, we remove all the yeast proteins but one from the dataset. This makes a dataset where all the DE proteins are up-regulated except for one down-regulated protein.
> D3UEH2 <- unique(grep("D3UEH2",y.peptide$gene$Protein.Group, value=TRUE))
> i <- (y.peptide$genes$Species %in% c("HUMAN","ECOLI")) | y.peptide$gene$Protein.Group==D3UEH2
> y.peptide.ub <- y.peptide[i,]
> table(y.peptide.ub$genes$Species)
ECOLI HUMAN YEAS8
12260 16882 39
The yeast protein D2UEH2 that is retained, has 39 precursors in the dataset, which are mostly observed in the A group samples but all of which are entirely missing the B samples. The pattern of missing values provides very strong evidence of down-regulation, and any good DE method that makes full use of the available information should be able to detect it as strongly down-regulated.
> plotPeptides(y.peptide.ub, y.peptide.ub$genes$Protein.Group==D3UEH2,
+ main="D3UEH2 Precursors", step.down=1)
limpa has no trouble with the unbalanced dataset. The estimated detection probability curve is similar to that for the full dataset:
> dpcest <- dpcCN(y.peptide.ub)
Iter 1: dpc = -2.3863 0.7224
Iter 2: dpc = -2.3083 0.7025
> plotDPC(dpcest)
limpa estimates D3UEH2 to be highly down-regulated in the B samples, driven by the non-random pattern of missing values.
The protein log-intensity is much more precise in the A samples, reflecting the consistent observed values, and lower but more uncertain in the B samples, reflecting the missing values:
> y <- dpcQuant(y.peptide.ub, dpc=dpcest)
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 1000 Peptides: 8979
Proteins: 2000 Peptides: 17427
Proteins: 3000 Peptides: 24426
Proteins: 4000 Peptides: 28117
Proteins: 4547 Peptides: 29181
> plotProtein(y, D3UEH2)
The DE results for the retained proteins remain much the same as they were for the full dataset, and the yeast protein in particular remains highly significant and down-regulated.
> design <- model.matrix(~Group)
> fit <- dpcDE(y, design, plot=TRUE)
> fit <- eBayes(fit)
> tab <- topTable(fit, coef=2, n=Inf)
> head(tab,10)
Species NPeptides PropObs logFC AveExpr t P.Value adj.P.Val B
1/sp|P0A7N9|RL33_ECOLI ECOLI 2 1.0000 3.288 8.467 115.32 1.256e-15 3.968e-12 26.60
1/sp|P0A910|OMPA_ECOLI ECOLI 25 0.8400 3.299 7.846 111.20 1.745e-15 3.968e-12 26.35
1/sp|P69908|DCEA_ECOLI ECOLI 2 1.0000 3.389 8.296 106.15 2.655e-15 4.023e-12 25.86
1/sp|P0A8V2|RPOB_ECOLI ECOLI 78 0.7286 3.300 5.070 99.57 4.729e-15 5.376e-12 25.30
1/sp|P02768|ALBU_HUMAN HUMAN 2 1.0000 -2.552 9.525 -95.85 6.672e-15 5.545e-12 25.06
1/sp|P0A6F5|CH60_ECOLI ECOLI 65 0.7333 3.504 5.739 93.00 8.760e-15 5.545e-12 24.68
1/sp|P0A6M8|EFG_ECOLI ECOLI 49 0.8844 3.247 6.735 91.40 1.024e-14 5.545e-12 24.64
1/sp|P0A9Q7|ADHE_ECOLI ECOLI 81 0.7675 3.269 6.082 90.26 1.148e-14 5.545e-12 24.50
1/sp|P0A8T7|RPOC_ECOLI ECOLI 74 0.7455 3.185 5.331 89.65 1.219e-14 5.545e-12 24.42
1/sp|P0AB71|ALF_ECOLI ECOLI 37 0.7838 3.510 6.042 90.22 1.153e-14 5.545e-12 24.35
> tab[D3UEH2,]
Species NPeptides PropObs logFC AveExpr t P.Value adj.P.Val B
1/tr|D3UEH2|D3UEH2_YEAS8 YEAS8 39 0.4274 -6.574 1.587 -9.559 2.786e-05 0.0001836 2.359
We now demonstrate how general-purpose imputation methods fail to assess the missing value pattern correctly. We start with impSeq, which was recommended as the top-ranked imputation method by Wang et al (2020).
> library(rrcovNA)
> y.peptide.impSeq <- impSeq(y.peptide.ub$E)
> plotPeptides(y.peptide.impSeq[y.peptide.ub$genes$Protein.Group==D3UEH2,],
+ main="D3UEH2 precursors after impSeq imputation")
We see that impSeq has made the D3UEH2 precursors consistently up-regulated instead of down-regulated, reversing the true direction of change.
Any downstream DE analysis would conclude, incorrectly, that D3UEH2 is significantly up-regulated in the B samples.
Next we try BPCA, which was recommended by Dabke et al (2021). We set the number of principal components to 5, one less than the number of samples, as recommended by Dabke et al (2021) and Oba et al (2003), but the number chosen makes little difference to the results shown here.
> library(pcaMethods)
> y.peptide.BPCA <- completeObs(pca(y.peptide.ub$E, method="bpca", nPcs=5))
Step Number : 10
Increase in precision : 1.609
----------
Step Number : 20
Increase in precision : 0.007487
----------
Step Number : 30
Increase in precision : 0.0067
----------
Step Number : 40
Increase in precision : 0.004296
----------
Step Number : 50
Increase in precision : 0.003474
----------
Step Number : 60
Increase in precision : 0.003043
----------
Step Number : 70
Increase in precision : 0.00225
----------
Step Number : 80
Increase in precision : 0.00142
----------
Step Number : 90
Increase in precision : 0.0008257
----------
Step Number : 100
Increase in precision : 0.0004789
----------
> plotPeptides(y.peptide.BPCA[y.peptide.ub$genes$Protein.Group==D3UEH2,],
+ main="D3UEH2 precursors after BCPA imputation")
We see that, similarly to impSeq, BPCA has made most of the D3UEH2 precursors up-regulated instead of down-regulated.
Next we try random forests (RF), which was recommended by Jin et al (2021).
Note that we transpose the log-intensity matrix because missForest assumes samples to be rows instead of columns.
> library(missForest)
> y.peptide.RF <- t(missForest(t(y.peptide.ub$E))$ximp)
> plotPeptides(y.peptide.RF[y.peptide.ub$genes$Protein.Group==D3UEH2,],
+ main="D3UEH2 precursors after RF imputation")
We see that RF imputes B sample values that are similar to the A samples, so that, in any downstream DE analysis, D3UEH2 would not be significantly DE.
RF imputation reduces the amount of DE over the dataset generally, not just for the yeast protein.
We note also the random forests imputation step is far more time consuming than another of the other analyses shown on this page. BPCA is the second slowest, while limpa and all the methods are quite fast.
The v2-mnar imputation method (Hediyeh-Zadeh et al 2023) can’t be run on this dataset because it requires all precursors have to have least four non-missing observations. It is therefore unable to detect DE for any protein that is entirely missing in either the A or B groups.
This data example illustrates problems that are shared by many imputation methods. Like many other imputation methods, impSeq, BPCA and RF are based on statistical models that assume that missing values are missing at random, an assumption that is not realistic for mass spectrometry data. The assumption has two consequences. First, the imputation methods are unable to gain information about DE from the missing value pattern. Second, and worse, the imputation methods infer patterns from the majority of precursors instead of for each precursor or each protein specifically. Hence, the results for an atypical protein (such as D3UEH2 in our example) can be reversed to agree with the majority instead of agreeing with the actual data for that specific protein. In our opinion, this behaviour is scientifically indefensible, because it can lead to false conclusions about specific proteins.
Ironically, if the missing values really were missing at random, then the optimal DE strategy would be to simply ignore the missing values. If missing values were MAR, then imputation, no matter how mathematically sophisticated, could only add noise to the data and will cause DE methods to incorrectly assess the information in the data.
Our analysis here has not included imputation methods that assume informative missing value models, but such methods have their own problems.
In our opinion, imputation methods should not be used for DE analysis of mass spectrometry data because they have the effect of falsifying the data. The strategy of imputing missing values and then analysing the data as if the precursors or proteins were full observed is inherently flawed.
Dabke K, Kreimer S, Jones MR, Parker SJ (2021). A simple optimization workflow to enable precise and accurate imputation of missing values in proteomic data sets. Journal of Proteome Research 20(6), 3214-3229.
Hediyeh-Zadeh S, Webb AI, Davis MJ (2023). MsImpute: estimation of missing peptide intensity data in label-free quantitative mass spectrometry. Molecular & Cellular Proteomics 22(8), 100558.
Jin L, Bi Y, Hu C, Qu J, Shen S, Wang X, Tian Y (2021). A comparative study of evaluating missing value imputation methods in label-free proteomics. Scientific Reports 11(1), 1760.
Oba, S., Sato, M. A., Takemasa, I., Monden, M., Matsubara, K. I., & Ishii, S. (2003). A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16), 2088-2096.
Wang S, Li W, Hu L, Cheng J, Yang H, Liu Y (2020). NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses. Nucleic Acids Research 48(14), e83.