Yes, limpa can operate on a matrix of log-intensities where each row corresponds to a protein modification and each column is a sample. For PTM analysis, each row is a unique feature to be retained for downstream analysis, so one runs dpcQuantByRow() instead of dpcQuant(). Otherwise, the limpa pipeline remains much the same.
limpa can be used in conjunction with limma’s diffSplice() function to detect changes in the relative expression of different peptides for the same protein (differential peptide usage). An example analysis is shown here: https://smythlab.github.io/limpa/DiffPeptideUsage.html
No, the main limpa pipeline is not a type of imputation. limpa produces a complete log-intensity matrix without NAs, but it does so by estimating protein expression parameters (log2-intensities) rather than by imputation. limpa does not treat the estimated expression parameters as if they were regular observations.
limpa avoids several problems that are endemic to imputation methods. First, it doesn’t increase the error rate by over-stating the amount of information available in the observations. Second, it evaluates DE for each protein based on the data (the observed intensities and the pattern of missing values) for that particular protein. It does not infer that a protein might be up or down regulated simply because other proteins in the same dataset are up or down regulated (as is done by imputation methods based on MAR models, see https://smythlab.github.io/limpa/HYE100-Unbalanced.html). Third, it does not insert random numbers, so results are completely reproducible and do not depend on an arbitrary random seed.
The limpa package does include a novel imputation function, however, called imputeByExpTilt(), which is used to get starting values when estimating the limpa model hyperparameters.
No, limpa’s DE analysis cannot be reproduced using just the matrix of the protein quantifications. limpa also uses the standard errors and the observations count associated with each protein quant to model the variances and to do an optimally weighted linear model analysis. You can’t get the same benefit by simply exporting the matrix of protein quantification to an ordinary limma analysis, say.
limpa produces protein quantifications and also undertakes a specialized DE analysis. The DE analysis cannot be reproduced using the protein quantifications alone, but the protein quantifications are nevertheless intended for summary plotting purposes and for export to other downstream analyses that require complete data, like clustering or heatmaps or network analysis. The limma protein quants are intended to replace maxLFQ and are intended to make imputation unnecessary. Using the limpa protein quants without the standard errors is not optimal but, even so, we think the limpa quants are nevertheless a better export product than other options.
The situation is similar for other specialized DE software tools for both RNA-seq and mass spec. In the RNA-seq world, limma-voom, edgeR and DESeq2 all produce expression summaries for each gene in each sample, but one cannot reproduce their DE analyses using the expression summaries. In each case, the expression summaries are used for graphical and export purposes. In each case, the expression summaries are complete and include non-NA values even for genes that are entirely undetected in a sample or a group.
In the mass spectrometry world, limpa, MSqRob and MSstats all produce protein quantifications from the precursor or fragment data but, in each case, the DE analysis done by those tools cannot be reproduced using the protein quants alone. (The quants from MSqRob and MSstats will contain some NAs though unless used in conjunction with imputation.)
It is important for limpa to produce results even for proteins that are entirely missing in a group. The fact that a protein is entirely undetected in an experimental condition is the strongest possible evidence that the protein is down-regulated in that group. Non detection does not mean that the protein is necessarily absent, but does provide very strong evidence that the expression level is low. Putting protein quants for non-detected proteins to NA would make it impossible to detect DE for those proteins that actually have the largest logFCs.
limpa produces large standard errors for proteins that are completely missing. limpa will generate significant DE when a protein is missing in one group but highly expressed in another. The results will not be significant however if the protein is missing in one group and lowly expressed in the other, where high and low expression is defined relative to the DPC. This seems to us to be correct behavior.
If a protein is entirely missing in several samples (i.e., all the precursors for that protein are missing in those samples), then it will be quantified by dpcQuant() to exactly the same value for all of those samples. This is because identical input should produce identical output. There is no evidence in the data for making one value larger or smaller than another, so they are all the same. The quantified value will, however, tend to be smaller than the observed values for the same protein, especially so if the observed values are reasonably large. The quantified value will also tend to decrease as the number of entirely missing samples increases.
Equal values do not cause any problems for downstream analyses for two main reasons. First, the uncertainty with which the protein values are estimated is quantified into standard errors stored in the dpcQuant() output object, and these standard errors are propagated to downstream analyses by plotMDSUsingSEs() and dpcDE(). Protein quants based on entirely missing values get large standard errors and are appropriately downweighted in subsequent analyses. Second, dpcDE() keeps track of which protein quants are based on entirely missing values (n.observations==0) and ensures that these values, if they cluster in a group, are not used to compute residual standard deviations used by limma in the linear model fitting. This ensures that the repeated values for a particular protein do not bias the variance estimation or the DE analysis.
limpa analyses can be run on a laptop, even for 100s or 1000s of samples. With 100s of samples, the dpcQuant() step may take an hour or two on a laptop, but other steps are quite fast.
With 1000s of samples, we recommend an alternative pipeline using impByExpTilt().