This case study analyses the HYE100 hybrid proteome samples from Navarro et al (2016). Two samples, A and B, were prepared by mixing human, yeast and E. coli tryptic peptides in specified proportions. The human peptides was obtained from HeLa cells, which is a cervical cancer cell line. Sample A was composed of 67% human, 30% yeast, and 3% E. coli, while sample B was composed of 67% human, 3% yeast, and 30% E. coli. The human proteins should not be differentially expressed between A and B, while yeast proteins should be 10-fold up-regulated in A, and E. coli proteins should be 10-fold up-regulated in B. Three technical replicates were prepared of A and B, to make a two group experiment with \(n=3\) samples in each group.
Demichev et al (2020) re-processed the mass spectrometry data using Spectronaut, and the Spectronaut report file with the precursor quantifications can be downloaded from https://doi.org/10.17605/OSF.IO/6G3UX.
We use this data for two purposes. First, we use it as an example analysis using Spectronaut data. Second, we explore the data a little more to illustrate the advantages and limitations of a mixed-species dataset such as this.
This vignette was made with limpa version 1.3.7 and limma version 3.67.0.
The precursor ion intensities from Spectronaut are in the csv file Fig2.spectronaut.csv provided by Demichev et al (2020).
The file is similar to a standard report file output by Spectronaut, but with only six of the original columns retained.
We can import the report file into limpa using readSpectronaut:
> library(limpa)
Loading required package: limma
> y.prec <- readSpectronaut("spectronaut_report.tsv",
+ sample.column = "R.FileName",
+ precursor.column = c("EG.ModifiedPeptide","FG.Charge"),
+ intensity.column = "FG.Quantity",
+ q.columns = "EG.Qvalue",
+ annotation.columns = "PG.ProteinAccessions")
Imputation column EG.IsImputed not found.
Filtered 100185 q-values above q.cutoffs.
Filtered 19 values below lower intensity limit.
> dim(y.prec)
[1] 40827 6
This produces a dataset with 40827 rows (precursors) and 6 columns (samples).
Note that two columns corresponding to peptide and charge are specified by precursor.column to identify each precursor ion.
limpa appends the two columns together to construct a unique identifier for each precursor.
The samples are in two groups.
> Replicate <- factor(c(3,3,2,2,1,1))
> Group <- factor(c("B","A","B","A","B","A"))
> colnames(y.prec) <- paste(Group, Replicate, sep=".")
Add a species annotation column:
> PG <- y.prec$genes$PG.ProteinAccessions
> n <- nchar(PG)
> y.prec$genes$Species <- substring(PG,n-4L,n)
> table(y.prec$genes$Species)
ECOLI HUMAN YEAS8
11846 16681 12300
We find that a detection probability slope of about 0.7 is generally appropriate for current mass spectrometry technology and DIA-NN quantification,
and indeed the detection probability slope is estimated here to be about 0.7.
We estimate DPC here from the complete-normal model using dpcCN because it is robust to very large fold-changes in the data.
> dpcest <- dpcCN(y.prec)
> dpcest$dpc
beta0 beta1
-3.4012 0.7282
> plotDPC(dpcest)
Quantify protein expression:
> y <- dpcQuant(y.prec, dpc=dpcest, protein.id="PG.ProteinAccessions")
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 1000 Precursors: 8758 (21%)
Proteins: 2000 Precursors: 16910 (41%)
Proteins: 3000 Precursors: 23805 (58%)
Proteins: 4000 Precursors: 28594 (70%)
Proteins: 5000 Precursors: 35438 (86%)
Proteins: 6000 Precursors: 39468 (96%)
Proteins: 6507 Precursors: 40827 (100%)
> dim(y)
[1] 6507 6
At this stage, it is usual to filter out proteins that are insufficiently expressed. We will keep protein in the DE anaysis that are detected, at least once, in at least three samples:
> keep <- which(rowSums(y$other$n.observations > 0) >= 3)
> length(keep)
[1] 6080
> yfilt <- y[keep,]
The A and B samples separate very clearly, showing that there is considerable DE in the data:
> plotMDSUsingSEs(yfilt)
Now we form the design matrix and proceed to differential expression analysis. First, limpa identifies a clear variance trend that will be used in the empirical Bayes modelling:
> design <- model.matrix(~Group)
> fit <- dpcDE(yfilt, design, plot=TRUE)
Differential expression analysis identifies Yeast and E. coli samples as highly signficant:
> fit <- eBayes(fit)
> summary(dt <- decideTests(fit[,2]))
GroupB
Down 1797
NotSig 2795
Up 1488
> topTable(fit, coef=2, n=10)
Species NPrec PropObs logFC AveExpr t P.Value adj.P.Val B
1/sp|P0A910|OMPA_ECOLI ECOLI 25 0.8667 3.232 8.643 125.70 2.082e-17 1.266e-13 30.67
1/tr|C8ZD01|C8ZD01_YEAS8 YEAS8 2 1.0000 -3.652 9.676 -102.44 1.627e-16 4.945e-13 28.58
2/tr|C8ZDM8|C8ZDM8_YEAS8/tr|C8ZGS8|C8ZGS8_YEAS8 YEAS8 2 1.0000 -3.440 9.826 -94.13 3.807e-16 4.984e-13 27.79
1/tr|C8Z4U5|C8Z4U5_YEAS8 YEAS8 9 0.8889 -3.549 7.750 -93.32 4.154e-16 4.984e-13 27.63
2/sp|P33898|G3P3_ECOLI/sp|P0A9B2|G3P1_ECOLI ECOLI 1 1.0000 3.514 12.891 92.11 4.734e-16 4.984e-13 27.58
1/sp|P69908|DCEA_ECOLI ECOLI 2 1.0000 3.656 10.362 91.27 5.190e-16 4.984e-13 27.50
1/tr|C8ZDU8|C8ZDU8_YEAS8 YEAS8 9 0.8704 -3.571 7.984 -90.36 5.739e-16 4.984e-13 27.31
1/sp|P63284|CLPB_ECOLI ECOLI 62 0.8495 3.279 6.780 86.65 8.743e-16 6.002e-13 26.97
1/sp|P0A853|TNAA_ECOLI ECOLI 58 0.8075 3.129 7.934 86.51 8.884e-16 6.002e-13 26.96
1/tr|C8Z3S1|C8Z3S1_YEAS8 YEAS8 8 0.8542 -3.578 8.256 -83.53 1.264e-15 7.133e-13 26.60
A mean-difference plot shows that limpa recovers correct fold-changes for most protein, even single-precursor proteins and those at very low intensities:
> plotMD(fit, coef=2, status=fit$genes$Species, main="B vs A")
> abline(h=log2(10), lty=2)
> abline(h=0, lty=2)
> abline(h=-log2(10), lty=2)
The plot becomes clearer if we restrict to proteins with two or more precursors:
> i <- which(fit$genes$NPrec>1)
> plotMD(fit[i,], coef=2, status=fit$genes$Species[i], main="B vs A")
> abline(h=log2(10),lty=2)
> abline(h=0, lty=2)
> abline(h=-log2(10),lty=2)
One of the advantages of limpa is its ability to assess DE even for proteins that are entirely missing in one of the groups. We can compute the number of non-missing observations for each protein in each group:
> NObs.A <- rowSums(yfilt$other$n.observations[,Group=="A"])
> NObs.B <- rowSums(yfilt$other$n.observations[,Group=="B"])
For this data, 331 of the proteins that are significantly up are missing entirely in the A samples, and 486 of the proteins that are significantly down are missing entirely in the B samples:
> table(DE=dt, Nobs.A=pmin(NObs.A,5))
Nobs.A
DE 0 1 2 3 4 5
-1 0 3 4 260 40 1490
0 119 67 87 899 85 1538
1 330 164 84 192 52 666
> table(DE=dt, Nobs.B=pmin(NObs.B,5))
Nobs.B
DE 0 1 2 3 4 5
-1 486 192 99 260 71 689
0 193 105 100 822 59 1516
1 0 6 12 181 20 1269
We can confirm that these significant proteins are correct discoveries:
> table(fit$genes$Species[dt== 1 & NObs.A==0])
ECOLI
330
> table(fit$genes$Species[dt== -1 & NObs.B==0])
HUMAN YEAS8
1 485
We see the 331 up proteins are 100% correct discoveries and the 486 down proteins contain just one false discovery.
As as example, we pick out most significant of the latter proteins:
> PValue <- fit$p.value[,2]
> PGB <- names(which.min(PValue[NObs.B==0]))
Plot the protein quantifications with standard errors:
> plotProtein(y, PGB, main=PGB)
Now plot the corresponding precursor log-intensities.
This yeast protein has 15 precursors, all of which are missing in all the B samples.
In plot, closed circles indicate observations and open circles indicate missing values.
For the purposes of the plot, the missing values are plotted 0.5 units below the minimum observation for the same precursor:
> plotPeptides(y.prec[y.prec$genes$PG.ProteinAccessions==PGB,],
+ main=paste(PGB,"precursors"))
We did not do any extra normalization on top of that already done by Spectronaut. The human proteins seem equally distributed between samples, so no extra normalization seems necessary:
> isH <- which(y$genes$Species == "HUMAN")
> boxplot(as.data.frame(y$E[isH,]),
+ ylab="log2-Intensity", main="Human proteins")
As expected, the yeast proteins are lower in B samples:
> isY <- which(y$genes$Species == "YEAS8")
> boxplot(as.data.frame(y$E[isY,]),
+ ylab="log2-Intensity", main="Yeast proteins")
while the E. coli proteins are lower in A samples:
> isE <- which(y$genes$Species == "ECOLI")
> boxplot(as.data.frame(y$E[isE,]),
+ ylab="log2-Intensity", main="E coli proteins")
Now we explore the pattern of missing values at the precursor level:
> NMissing.A <- rowSums(is.na(y.prec$E[,Group=="A"]))
> NMissing.B <- rowSums(is.na(y.prec$E[,Group=="B"]))
> table(NMissing.A, y.prec$genes$Species)
NMissing.A ECOLI HUMAN YEAS8
0 2317 14943 9429
1 821 762 1445
2 1582 622 1313
3 7126 354 113
> table(NMissing.B, y.prec$genes$Species)
NMissing.B ECOLI HUMAN YEAS8
0 9575 14951 1509
1 1173 823 659
2 1019 588 1425
3 79 319 8707
We see that human precursors have similar numbers of missing values in A and B samples, while E. coli precursors have far more missing values in A and yeast precursors have far more missing values in B samples. This confirms that the missing values are intensity-dependent, because the missingness pattern correlates with global changes in precursor intensities.
This dataset was created by Navarro et al (2016) and Demichev et al (2020) in order to benchmark peptide quantification algorithms. The dataset is, however, different from real biological datasets in a number of ways that make it unsuitable for benchmarking differential expression statistical methods. One issue that is that the DE proteins and precursors have systematically lower intensities (and consequently more missing values) than the non-DE proteins and precursors. Another issue is that log-fold-changes are very large, while the variation between replicates is very small and not representative of biological variation. Another issue is that the dataset contains false positives, whereby some of the human proteins are clearly differentially expressed, although the experimental design should have ensured that they were not. A striking example of a DE human protein is albumin, for which both precursors are clearly down in the B samples:
> i <- grep("ALBU_HUMAN",y.prec$genes$PG.ProteinAccessions)
> plotPeptides(y.prec[i,], main="ALBU_HUMAN precursors")
The fact that human albumin appears DE is an artefact of the dataset rather than an error of the statistical testing method.
Thanks to Julia Marchingo for suggesting a fix for reading the Spectronaut file.
More documentation and example analyses are available from https://github.com/SmythLab/limpa/.
Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods 17(1), 41-44.
Navarro P, Kuharev J, Gillet LC, Bernhardt OM, MacLean B, Rost HL, Tate SA, Tsou CC, Reiter L, Distler U, Rosenberger G, Perez-Riverol Y, Nesvizhskii AI, Aebersold R, Tenzer S (2016). A multicenter study benchmarks software tools for label-free proteome quantification. Nature Biotechnology, 34(11), 1130-1136.