Contents

1 The DPC plot doesn’t look like a good fit. Can I still go ahead with the analysis?

It is actually very hard to estimate the DPC from observed data, because the curve is a function of unobserved intensities rather than a function of the observed intensities. Datasets with outlier samples or a high proportion of DE peptides make it especially hard to estimate the curve.

We suggest dpcCN() instead of dpc() with noisy data, because it is more robust to outlier samples. It will give quite a different looking plot than dpc() because it plots the curve vs estimated intensities instead of observed intensities. The DPC slope should generally be in the range of 0.1–1.0. We think it depends on the technology and quantification software settings but doesn’t vary much from dataset to dataset for the same technology. A range of 0.7-0.9 is generally appropriate for DIA-NN data with match-between-runs. If dpcCN() doesn’t give a value in that range, it would be reasonable to run dpcQuant() with a preset value dpc.slope=0.7.

The visual fit does not need to be good for the analysis to remain valid. The limpa analysis is not fragile and will continue to give reasonable results even if the DPC slope is somewhat misspecified. If dpcQuant() is run with a DPC slope that is lower than the true value, then the analysis will become more conservative but it won’t fall over, and it will remain more powerful than simply ignoring the missing values.

2 Can limpa be used for phosphoproteomics or to analyse other post-translational modifications (PTMs)?

Yes, limpa can operate on a matrix of log-intensities where each row corresponds to a protein modification and each column is a sample. For PTM analysis, each row is a unique feature to be retained for downstream analysis, so one runs dpcQuantByRow() instead of dpcQuant(). Otherwise, the limpa pipeline remains much the same.

Note that limpa is not designed to analyse input-normalized PTM site quantities or PTM stoichiometry proportions from Spectronaut v20. Quantities input to limpa should be feature intensities rather than ratios of intensities.

3 Can limpa be used to analyse differential isoform usage?

limpa can be used in conjunction with limma’s diffSplice() function to detect changes in the relative expression of different peptides for the same protein (differential peptide usage). An example analysis is shown here: https://smythlab.github.io/limpa/DiffPeptideUsage.html

4 Is limpa equivalent to imputation?

No, the main limpa pipeline is not a type of imputation. limpa produces a complete log-intensity matrix without NAs, but it does so by estimating protein expression parameters (log2-intensities) rather than by imputation. limpa does not treat the estimated expression parameters as if they were regular observations.

limpa avoids several problems that are endemic to imputation methods. First, it doesn’t increase the error rate by over-stating the amount of information available in the observations. Second, it evaluates DE for each protein based on the data (the observed intensities and the pattern of missing values) for that particular protein. It does not infer that a protein might be up or down regulated simply because other proteins in the same dataset are up- or down-regulated (as is done by imputation methods based on MAR models, see https://smythlab.github.io/limpa/HYE100-Unbalanced.html). Third, it does not insert random numbers, so results are completely reproducible and do not depend on an arbitrary random seed.

The limpa package does include a novel imputation function, however, called imputeByExpTilt(), which is used to get starting values when estimating the limpa model hyperparameters.

5 Can I reproduce limpa’s DE analysis using the matrix of protein quantifications from dpcQuant?

dpcQuant() produces a matrix of protein log-intensities, together with standard errors and observation counts for each value, and limpa needs all three of these matrices (intensities, SEs and observation counts) for the DE analysis. It would not be quite correct to coerce the dpcQuant() output to a matrix and to input it to a regular limma analysis, ignoring the SEs and observation counts, because that would result in some loss of power and possibly a loss of error rate control. It is necessary to use dpcDE() on the intact output object from dpcQuant() to get the full benefit of limpa.

6 Can I use the matrix of protein quantifications for plots or clustering or other analyses?

limpa produces protein quantifications and also undertakes a specialized DE analysis. The DE analysis cannot be reproduced using the protein quantifications alone, but the protein quantifications are nevertheless intended for summary plotting purposes and for export to other downstream analyses that require complete data, like clustering or heatmaps or network analysis. The limpa protein quants are intended to replace maxLFQ and are intended to make imputation unnecessary. Using the limpa protein quants without the standard errors is not optimal but, even so, we think the limpa quants are nevertheless a better export product than other options.

The situation is similar for other specialized DE software tools for both RNA-seq and mass spectrometry. In the RNA-seq world, limma-voom, edgeR and DESeq2 all produce expression summaries for each gene in each sample, but one cannot reproduce their DE analyses using the expression summaries. In each case, the expression summaries are used for graphical and export purposes. In each case, the expression summaries are complete and include non-NA values even for genes that are entirely undetected in a sample or a group.

In the mass spectrometry world, limpa, MSqRob and MSstats all produce protein quantifications from the precursor or fragment data but, in each case, the DE analysis done by those tools cannot be reproduced using the protein quants alone. (The quants from MSqRob and MSstats will contain some NAs though unless used in conjunction with imputation.)

7 limpa produces results even for proteins that are completely undetected in a group. How is that possible?

It is important for limpa to produce results even for proteins that are entirely missing in a group. The fact that a protein is entirely undetected in an experimental condition is the strongest possible evidence that the protein is down-regulated in that group. Non-detection does not mean that the protein is necessarily absent, but does provide very strong evidence that the expression level is low. Putting protein quants for non-detected proteins to NA would make it impossible to detect DE for those proteins that actually have the largest logFCs.

limpa produces large standard errors for proteins that are completely missing. limpa will generate significant DE when a protein is missing in one group but highly expressed in another. The results will not be significant however if the protein is missing in one group and lowly expressed in the other, where high and low expression is defined relative to the DPC. This seems to us to be correct behavior.

8 limpa sometimes quantifies the same protein to identical values in different samples. Why does limpa do that and does it cause a problem for downstream analyses?

If a protein is entirely missing in several samples (i.e., all the precursors for that protein are missing in those samples), then it will be quantified by dpcQuant() to exactly the same value for all of those samples. This is because identical input should produce identical output. There is no evidence in the data for making one value larger or smaller than another, so they are all the same. The quantified value will, however, tend to be smaller than the observed values for the same protein, especially so if the observed values are reasonably large. The quantified value will also tend to decrease as the number of entirely missing samples increases.

Equal values do not cause any problems for downstream analyses for two main reasons. First, the uncertainty with which the protein values are estimated is quantified into standard errors stored in the dpcQuant() output object, and these standard errors are propagated to downstream analyses by plotMDSUsingSEs() and dpcDE(). Protein quants based on entirely missing values get large standard errors and are appropriately downweighted in subsequent analyses. Second, dpcDE() keeps track of which protein quants are based on entirely missing values (n.observations==0) and ensures that these values, if they cluster in a group, are not used to compute residual standard deviations used by limma in the linear model fitting. This ensures that the repeated values for a particular protein do not bias the variance estimation or the DE analysis.

9 How much memory and time does limpa use?

limpa has very modest memory usage, and limpa analyses can be run on a laptop, even for 100s or 1000s of samples. For example, the DIA-NN case study https://smythlab.github.io/limpa/HYE100-DIANN.html takes less than 15 seconds from start to finish and has a peak memory usage of less than 60MB on a Windows 11 laptop.

With 100s of samples, the dpcQuant() step may take an hour or two on a laptop, but other steps remain quite fast. With 1000s of samples, we recommend an alternative pipeline using impByExpTilt().

10 Does limpa assume that missing values are MAR or MNAR or both?

limpa assumes that all missing values are MNAR, for the reasons given in Li & Smyth (2023). The graduated probabilities given by the DPC mean that there is no need to make any false dichotomies about types of missing values.

It is important to note though that the common usage of the terms MAR, MCAR and MNAR in the proteomics literature does not agree with the original definitions of these terms in the statistics literature. In statistics, MCAR means that all missing values occur independently with the same probability. MAR means that one can put the missing values into known groups and the missing values are MCAR within each group. For example, missing values would be MAR if the probability of missingness differed between peptides, but missing values were MCAR for each peptide.

In statistics, the missing values are MNAR if there is any dependence, no matter how slight, on the underlying intensity. Any intensity dependence translates to MNAR. In the proteomics literature, however, it seems that “MAR” is used whenever there is any random component at all in the missingness, i.e., if missingness is not deterministic, and “MNAR” seems to be used mainly to refer to left-censoring situations. The statistical definitions are important because they correspond to ignorability, whereas the proteomics usages do not. The statistics, MAR or MCAR imply that the missing values can be ignored, whereas ignorability seems to have been not well understood in the proteomics literature.

Another confusion is that, in statistics, the terms MAR, MCAR, and MNAR do not correspond to physical mechanisms but only to probability distributions. For example, if there are both intensity-dependent and random physical mechanisms at play, but you don’t know for sure which mechanism caused which missing value then, in statistical terminology, the missing values are all MNAR, because the same unconditional probability distribution (the DPC) applies to all the values. The existence of a random but unpredictable missing value mechanism will tend to flatten the trend in the DPC but will not remove it.

In statistics, one cannot put the missing values into distinct classes (MAR vs MNAR) by examining the pattern of missing values for each peptide, because missing values will always be randomly distributed between samples if the peptide is non-DE, even if the missing values are MNAR.

In statistics, MAR does not mean that the missing values should be imputed by an imputation algorithm based on a MAR model. It rather means that there is no need for imputation at all. The missing values can just be ignored.

The limpa function fitZTLogit() allows one to check whether there are any missing value mechanisms at play that are not intensity dependent, the presence of which would change the asymptotic limit of the DPC. So far, we have not found good evidence for non-intensity dependent missing value mechanisms, although more checking will be done as we get more experience with single-cell proteomics.

11 Is limpa compatible with limma? Which limma functions can be used with limpa?

dpcQuant() produces an EList object, but with extra components storing standard errors and observation counts that are not part of the standard limma pipeline. limma functions that fit linear models (such as lmFit, arrayWeights, or vooma) should not be used on the limpa object because they will not recognize or use the standard errors and observation counts. limma’s plotMDS function is also replaced in limpa by plotMDSUsingSEs.

After running limpa’s dpcDE function, the resulting fitted model object is compatible with downstream limma functions. Any limma functions that work on fitted model objects can then be used on the dpcDE output, including contrasts.fit, eBayes, topTable, treat, decideTests, plotMD, diffSplice, goana, kegga, fitted and residuals.

In summary, output from dpcQuant() or dpcQuantByRows() is not compatible with limma functions, but output from dpcDE() is.

12 More documentation

More documentation and example analyses are available from https://github.com/SmythLab/limpa/.