MSGF+
The MSGF+ Proteomics Workflow in OmicsPipelines is designed specifically for processing Tandem Mass Tag (TMT)-based experiments. TMT is a multiplexed quantitative proteomics method that allows simultaneous measurement of protein abundance across multiple samples. This workflow currently supports TMT-11, TMT-16, and TMT-18 experiments and quantifies global protein abundance as well as three common post-translational modifications: phosphorylation (ph), ubiquitination (ub), and acetylation (ac). Using the MSGF+ search engine, the pipeline identifies peptides and extracts reporter ion intensities, providing robust quantification for high-throughput proteomics analyses.
Workflow Steps:
-
Sequence Database Preparation: Prepare and index the protein sequence database using MSGF+. This step generates the necessary indexed files for the subsequent search tasks.
-
MASIC Processing (TMT-specific): Extract reporter ion peaks from MS2 spectra using MASIC, generating files that include reporter ion intensities and associated statistics.
-
MSConvert: Convert raw mass spectrometry files to mzML format using MSConvert, ensuring compatibility with downstream search tools.
-
MSGF+ Tryptic Search: Perform a full tryptic search with MSGF+ on the mzML files to generate initial peptide identifications.
-
MSConvert MzRefiner: Recalibrate the mzML files using mass error histograms, refining m/z values to improve search accuracy.
-
PPM Error Characterization: Generate mass error histograms and plots using PPMErrorCharter to assess the quality of the recalibration.
-
MSGF+ Identification Search: Conduct a partial tryptic search with MSGF+ on the recalibrated mzML files to produce final peptide identification results.
-
MzID to TSV Conversion: Convert the final mzID output to a TSV file for easier downstream processing and review.
-
Peptide Hit Results Processing (PHRP): Process the TSV file to generate detailed peptide identification results, including mapping peptides to their corresponding proteins.
-
Optional PTM Analysis (AScore): If applicable, perform PTM analysis with AScore to localize modifications on peptides.
-
PlexedPiper Integration (TMT-specific): For TMT experiments, combine outputs from MASIC, PHRP, and AScore to produce final quantification files, including reporter ion intensity and ratio files.
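The ordering of the steps above can be sketched as a small dependency graph. This is only an illustration: the step identifiers are hypothetical, and the dependencies are inferred from the step descriptions rather than taken from the pipeline's actual workflow definition. AScore is optional in the real workflow; it is included here for completeness. The sketch assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps whose outputs it appears to consume.
# ASSUMPTION: these edges are inferred from the step descriptions only.
STEP_DEPENDENCIES = {
    "sequence_db_preparation": set(),
    "masic": set(),                      # reads the raw files directly
    "msconvert": set(),                  # raw -> mzML
    "msgf_tryptic_search": {"sequence_db_preparation", "msconvert"},
    "msconvert_mzrefiner": {"msgf_tryptic_search"},
    "ppm_error_charter": {"msconvert_mzrefiner"},
    "msgf_identification_search": {"sequence_db_preparation", "msconvert_mzrefiner"},
    "mzid_to_tsv": {"msgf_identification_search"},
    "phrp": {"mzid_to_tsv"},
    "ascore": {"phrp"},                  # optional PTM localization
    "plexedpiper": {"masic", "phrp", "ascore"},
}


def execution_order(dependencies):
    """Return one valid execution order (dependencies always come first)."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, step in enumerate(execution_order(STEP_DEPENDENCIES), start=1):
        print(f"{position:2d}. {step}")
```

Steps with no dependency on each other (for example MASIC and MSConvert) may appear in either order, which reflects that they could in principle run in parallel.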
How to run the MSGF+ pipeline
This section provides a detailed guide on running the MSGF+ pipeline within OmicsPipelines for TMT-based proteomics experiments. It covers both the required workflow inputs and the optional advanced customization of each pipeline step. The workflow inputs section explains the essential information needed to run the pipeline, such as the location of raw files, study design, sequence database, and experiment details. This ensures that your data is correctly located and the analysis is properly configured for your specific TMT experiment, which covers global protein abundance and post-translational modifications such as phosphorylation, ubiquitination, and acetylation.
The optional customization section is intended for advanced users who wish to fine-tune each step of the workflow. For each individual process you can adjust key options, including parameter files, CPU allocation, memory, disk space, and Docker container selections. Each step of the pipeline is powered by an open-source tool with its own methods and configuration options. These settings are entirely optional: the default parameters are optimized for general use, so you can proceed with them if you are not comfortable making modifications.
Below, you will find a detailed breakdown of the workflow inputs followed by a comprehensive list of customizable options for each key step of the pipeline.
Workflow Inputs
Before running the pipeline, you must provide essential information about your experiment. These inputs ensure that OmicsPipelines can locate your raw data, understand your experimental design, and correctly configure the analysis parameters. The required details include the workflow name, locations of your raw files, study design, and sequence database, as well as specifics about your proteomics experiment, quantification method, species, and desired output naming conventions. Let's go one by one:
-
Workflow Name (name): Provide a descriptive name for your workflow. This name helps you easily identify and manage your workflow among multiple runs within OmicsPipelines.
-
Folder Containing Raw Files (folder_raw): Specify a valid S3 or GCS path (e.g., s3://your-bucket/path or gs://your-bucket/path) where your raw mass spectrometry files are stored. This folder must contain the raw files to be processed by the proteomics pipeline.
-
Proteomics Experiment: Select the type of proteomics experiment you want to post-process. Options include global protein abundance as well as analyses focused on phosphorylation, ubiquitination, and acetylation.
-
Quantification Method: Select the quantification method for your experiment. For TMT-based experiments, choose the TMT option that matches your experiment (e.g., TMT-11) so that the pipeline applies the appropriate settings for accurate TMT quantification.
-
Folder for Proteomics Study Design Files (study_design_location): Provide a valid S3 or GCS path where your study design files are stored. These files typically contain essential sample metadata and experimental design details for downstream analysis.
-
Sequence Database Location (fasta_sequence_db): Specify a valid S3 or GCS path for your protein sequence database in FASTA format. You can upload your file here, and the uploader will create an indexed database from your file.
-
Sequence Database Origin: Select the origin of your sequence database. For instance, choose RefSeq if your database is based on the NCBI Reference Sequence Database. This choice helps configure the pipeline parameters for optimal peptide identification.
-
Species: Provide the scientific name of the species from which your samples are derived (e.g., Homo sapiens). This information ensures that species-specific parameters are applied during the analysis.
-
Results File Name Prefix: Enter a prefix for naming your output files. The final output files will typically end in _ratio.txt (for quantification ratios) or _RII-peptides.txt (for reporter ion intensities), which helps organize and identify your results for further analysis.
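As a rough illustration of the inputs listed above, the sketch below collects them in one record and runs a few sanity checks before submission. Only `name`, `folder_raw`, `study_design_location`, and `fasta_sequence_db` are identifiers taken from this guide; every other field name, the experiment labels (in particular `"global"`), and the idea that the prefix is joined directly to the output suffixes are assumptions made for the example, not the OmicsPipelines API.

```python
from dataclasses import dataclass

CLOUD_PREFIXES = ("s3://", "gs://")
# "ph", "ub", "ac" follow the abbreviations used in this guide;
# "global" is a hypothetical label for global protein abundance.
EXPERIMENTS = {"global", "ph", "ub", "ac"}
QUANT_METHODS = {"TMT-11", "TMT-16", "TMT-18"}


@dataclass
class WorkflowInputs:
    name: str
    folder_raw: str
    study_design_location: str
    fasta_sequence_db: str
    experiment: str              # hypothetical field name
    quantification_method: str   # hypothetical field name
    database_origin: str         # hypothetical field name, e.g. "RefSeq"
    species: str                 # e.g. "Homo sapiens"
    results_prefix: str          # hypothetical field name


def validate(inputs):
    """Return a list of problems; an empty list means the inputs look sane."""
    problems = []
    for field_name in ("folder_raw", "study_design_location", "fasta_sequence_db"):
        value = getattr(inputs, field_name)
        if not value.startswith(CLOUD_PREFIXES):
            problems.append(f"{field_name} must start with s3:// or gs://, got {value!r}")
    if inputs.experiment not in EXPERIMENTS:
        problems.append(f"unknown experiment type: {inputs.experiment!r}")
    if inputs.quantification_method not in QUANT_METHODS:
        problems.append(f"unsupported quantification method: {inputs.quantification_method!r}")
    if not inputs.results_prefix:
        problems.append("results_prefix must not be empty")
    return problems


def expected_outputs(prefix):
    """Output names, ASSUMING the prefix is joined directly to each suffix."""
    return [f"{prefix}_ratio.txt", f"{prefix}_RII-peptides.txt"]
```

Returning a list of problems rather than raising on the first one lets a user fix every issue in a single pass before launching a run.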
Additional Configuration Options
In this section, you can customize every individual step of the proteomics workflow. This advanced configuration is entirely optional and is intended for users who are familiar with the underlying methods and open-source tools that power each step. Every component, from the MSGF+ searches to MSConvert and MASIC processing, has its own set of configurable parameters. You can adjust settings such as CPU allocation, memory, disk space, and Docker container selections, and update parameter files if necessary.
Each step in the workflow employs a different method, and all steps are open source. You can always refer to the documentation and source references for detailed information on how each tool operates. If you are confident in your understanding and need to fine-tune the pipeline to better match your experimental data or computing environment, feel free to adjust these options. Otherwise, the default settings are optimized for general use, and you can safely proceed without any modifications.
For each step, you can adjust several key options, including:
-
Parameter Files: Specify the file (stored on S3 or GCS) that contains tool-specific settings. These files define the algorithm parameters used during the analysis.
-
CPU Allocation: Set the number of CPU cores to dedicate to each step. Increasing CPU resources can speed up processing for computationally intensive tasks.
-
Memory Allocation: Define the amount of RAM (in GB) to assign. Adequate memory is essential to handle large datasets and ensure smooth execution.
-
Disk Space: Allocate the disk space (in GB) required for temporary files and outputs produced during each step.
-
Viewing/Editing Default Parameters: For advanced users, options are available to view and edit the default parameter settings (e.g., “View [Step Name] Parameters (edit at your own risk)”).
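The per-step options listed above can be pictured as a small immutable record with basic validation. This is a sketch under stated assumptions: the class name, field names, and the numeric values in the usage example are all hypothetical and do not reflect the pipeline's real defaults.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class StepOptions:
    """Hypothetical container for one step's configurable options."""
    cpus: int
    memory_gb: int
    disk_gb: int
    parameter_file: Optional[str] = None  # S3 or GCS path, if the step uses one

    def __post_init__(self):
        # Resource values must be positive whole numbers.
        for field_name in ("cpus", "memory_gb", "disk_gb"):
            value = getattr(self, field_name)
            if value <= 0:
                raise ValueError(f"{field_name} must be positive, got {value}")
        # Parameter files are expected to live on S3 or GCS.
        if self.parameter_file is not None and not self.parameter_file.startswith(
            ("s3://", "gs://")
        ):
            raise ValueError("parameter_file must be an s3:// or gs:// path")


if __name__ == "__main__":
    # Illustrative numbers only; not the pipeline's defaults.
    defaults = StepOptions(cpus=4, memory_gb=16, disk_gb=100)
    # Override a single value while keeping the rest of the defaults.
    tuned = replace(defaults, cpus=8)
    print(tuned)
```

Because the record is frozen, an override always produces a new validated copy, so the defaults cannot be modified by accident.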
Below is a brief overview of the customizable options for each key step:
-
MS-GF+ Partial Tryptic Search:
- Parameter File: Specifies the MSGF+ identification settings (e.g., Msgf Identification Parameter).
- Resources: Customize the number of CPUs, memory (in GB), and disk space (in GB).
- Additional Option: View or edit the default MSGF+ Partial Tryptic Search parameters.
-
MSGF+ Full Tryptic Search:
- Parameter File: Uses the Msgf Tryptic Mzrefinery Parameter file for full tryptic search settings.
- Resources: Allocate CPUs, memory, and disk space.
- Additional Option: View the full tryptic search parameters for reference or modification.
-
PPMErrorCharter:
- Resources: Set the number of CPUs, memory (in GB), and disk space (in GB) to generate mass error histograms and plots.
-
mzID to TSV Converter:
- Resources: Define CPUs, memory, and disk space to convert mzID output into a more manageable TSV format.
-
MASIC Processing (TMT-specific):
- Parameter File: Provide the MASIC parameter file (e.g., TMT18_10ppm_ReporterTol0.003Da_2021-01-24.xml).
- Resources: Allocate the necessary CPUs, memory, and disk space.
-
MSConvert:
- Resources: Specify the number of CPUs, memory, and disk space required for converting raw files to mzML format.
-
MSConvert (MZRefiner filter):
- Resources: Similar to MSConvert, customize the CPUs, memory, and disk space to recalibrate mzML files using MZRefiner.
-
PeptideHitResultsProcessor (PHRP):
- Parameter Files: Input the PHRP parameter files (e.g., Phrp Parameter M, Phrp Parameter N, and Phrp Parameter T).
- Resources: Adjust CPUs, memory, and disk space.
- Additional Settings: Set the PHRP Synprob and Synpvalue values as needed.
-
MSGF+ Process Sequence DB:
- Resources: Allocate CPUs, memory, and disk space for processing and indexing the protein sequence database.
-
PlexedPiper Integration (TMT-specific):
- Settings: Configure options specific to TMT-based experiments, including the type of proteomics experiment (e.g., phosphorylation, ubiquitination, acetylation, or global protein abundance), refine prior settings, and PR Ratio file location.
- Resources: Define CPUs, memory, and disk space.
-
AScore (Optional PTM Analysis):
- Parameter File: Provide the AScore parameter file (e.g., AScore_CID_0.5Da_ETD_0.5Da_HCD_0.05Da.xml).
- Resources: Set CPUs, memory, and disk space.
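The overview above shows that every step accepts resource settings, while only some steps accept parameter files or extra settings. One way to picture this is a lookup table that rejects options a step does not expose. All step and option identifiers below are hypothetical names chosen for the example (including the PlexedPiper keys, which simply mirror the settings named in the list); they are not the identifiers OmicsPipelines uses.

```python
# Options that every step exposes.
RESOURCE_KEYS = {"cpus", "memory_gb", "disk_gb"}

# Extra options per step, following the overview above (hypothetical names).
STEP_EXTRA_KEYS = {
    "msgf_partial_tryptic_search": {"parameter_file"},
    "msgf_full_tryptic_search": {"parameter_file"},
    "ppm_error_charter": set(),
    "mzid_to_tsv": set(),
    "masic": {"parameter_file"},
    "msconvert": set(),
    "msconvert_mzrefiner": set(),
    "phrp": {"parameter_file_m", "parameter_file_n", "parameter_file_t",
             "synprob", "synpvalue"},
    "msgf_process_sequence_db": set(),
    "plexedpiper": {"experiment", "refine_prior", "pr_ratio_location"},
    "ascore": {"parameter_file"},
}


def check_overrides(overrides):
    """Return a list of problems for options a step does not expose."""
    problems = []
    for step, options in overrides.items():
        if step not in STEP_EXTRA_KEYS:
            problems.append(f"unknown step: {step}")
            continue
        allowed = RESOURCE_KEYS | STEP_EXTRA_KEYS[step]
        for key in sorted(set(options) - allowed):
            problems.append(f"{step}: option {key!r} is not available for this step")
    return problems


if __name__ == "__main__":
    overrides = {
        "masic": {"cpus": 8, "parameter_file": "s3://your-bucket/path/masic.xml"},
        "msconvert": {"parameter_file": "s3://your-bucket/path/unused.xml"},
    }
    for problem in check_overrides(overrides):
        print(problem)
```

In this example the MASIC override passes, while the MSConvert override is flagged because that step only exposes resource settings.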