Protocol
Published: 20 May 2024

An MSstats workflow for detecting differentially abundant proteins in large-scale data-independent acquisition mass spectrometry experiments with FragPipe processing

Nature Protocols volume 19, pages 2915–2938 (2024)

Abstract

Technological advances in mass spectrometry and proteomics have made it possible to perform larger-scale and more-complex experiments. The volume and complexity of the resulting data create major challenges for downstream analysis. In particular, next-generation data-independent acquisition (DIA) experiments enable wider proteome coverage than more traditional targeted approaches but require computational workflows that can manage much larger datasets and identify peptide sequences from complex and overlapping spectral features. Data-processing tools such as FragPipe, DIA-NN and Spectronaut have undergone substantial improvements to process spectral features in a reasonable time. Statistical analysis tools are needed to draw meaningful comparisons between experimental samples, but these tools were also originally designed with smaller datasets in mind. This protocol describes an updated version of MSstats that has been adapted to be compatible with large-scale DIA experiments. A very large DIA experiment, processed with FragPipe, is used as an example to demonstrate different MSstats workflows. The choice of workflow depends on the user’s computational resources. For datasets that are too large to fit into a standard computer’s memory, we demonstrate the use of MSstatsBig, a companion R package to MSstats. The protocol also highlights key decisions that have a major effect on both the results and the processing time of the analysis. The MSstats processing can be expected to take 1–3 h depending on the usage of MSstatsBig. The protocol can be run in the point-and-click graphical user interface MSstatsShiny or implemented with minimal coding expertise in R.

Key points

Technological advances in bottom-up mass spectrometry-based proteomics have resulted in a substantial increase in the volume and complexity of the resulting data, and for comparative studies, large numbers of samples are required to get statistically meaningful results.
MSstats can be used to perform statistical analysis of the data after the peptides and proteins have been identified and quantified. MSstatsBig is a variant specifically designed to manage very large datasets.

**Fig. 1: Overview of the protocol workflow.**

**Fig. 2: Overview of the plots available in MSstats after data pre-processing and summarization.**

**Fig. 3: Overview of modeling plots available in MSstats.**

**Fig. 4: Example sample size calculation plot produced by MSstats.**

Data availability

The dataset used in this protocol is freely available at https://pdc.cancer.gov/pdc/study/PDC000200 and on MassIVE (https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=6847574c13964a1f9482ee4d71f33eb1). The quantification results from FragPipe and the MSstats-processed data at each step are available in the MassIVE.quant Reanalysis RMSV000000696.1. Source data are provided with this paper.

Code availability

All analysis scripts to recreate Procedure 2 can be found in the same MassIVE.quant Reanalysis RMSV000000696.1.

Acknowledgements

We thank J. Carver for his help in setting up the MassIVE container that allowed us to share the datasets and analysis code for this paper. This work was supported by awards NSF-BIO/DBI-1759736 (to O.V.), NSF-BIO/DBI-1950412 (to O.V.) and NIH-NLM-1R01LM013115 (to O.V.), the Chan-Zuckerberg Foundation (to O.V.) and National Institutes of Health grants R01-GM-094231 and U24-CA271037 (to A.I.N.). M.S. was partially financially supported by the National Science Centre, Poland, grant Preludium 2020/37/N/ST6/04070.

Author information

Authors and Affiliations

Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
Devon Kohler & Olga Vitek
Barnett Institute for Chemical and Biological Analysis, Northeastern University, Boston, MA, USA
Devon Kohler & Olga Vitek
University of Wrocław, Wrocław, Poland
Mateusz Staniak
Department of Pathology, University of Michigan, Ann Arbor, MI, USA
Fengchao Yu & Alexey I. Nesvizhskii
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
Alexey I. Nesvizhskii

Authors

Devon Kohler
Mateusz Staniak
Fengchao Yu
Alexey I. Nesvizhskii
Olga Vitek

Contributions

D.K. analyzed the data in MSstats and wrote the relevant MSstats sections of the manuscript. D.K. and O.V. wrote the introduction for the manuscript. M.S. implemented the methods in MSstatsBig. F.Y. analyzed the data by using FragPipe and wrote the relevant FragPipe sections of the paper. F.Y. and A.I.N. determined the experimental dataset for the manuscript. A.I.N., O.V. and D.K. conceptually developed and scoped the manuscript. All authors provided feedback and edited the manuscript.

Corresponding author

Correspondence to Olga Vitek.

Ethics declarations

Competing interests

A.I.N. and F.Y. receive royalties from the University of Michigan for the sale of MSFragger and IonQuant software licenses to commercial entities. All license transactions are managed by the University of Michigan Innovation Partnerships office, and all proceeds are subject to the university technology transfer policy. The other authors declare no competing interests.

Peer review

Peer review information

Nature Protocols thanks Chu Wang, Witold Wolski and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Methods 1 and 2, Figs. 1–9 and Tables 1 and 2

Source data

Source Data Fig. 2

Statistical source data for Fig. 2b–d

Source Data Fig. 3

Statistical source data for Fig. 3a–c

Source Data Fig. 4

Statistical source data for Fig. 4

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kohler, D., Staniak, M., Yu, F. et al. An MSstats workflow for detecting differentially abundant proteins in large-scale data-independent acquisition mass spectrometry experiments with FragPipe processing. Nat Protoc 19, 2915–2938 (2024). https://doi.org/10.1038/s41596-024-01000-3

Received: 28 September 2023
Accepted: 11 March 2024
Published: 20 May 2024
Issue Date: October 2024
DOI: https://doi.org/10.1038/s41596-024-01000-3
