Compute peptide and protein abundances from annotated feature/consensus maps or from identification results.

potential predecessor tools	→ ProteinQuantifier →	potential successor tools
IDMapper		external tools, e.g. for statistical analysis
FeatureLinkerUnlabeled (or another feature grouping tool)

Reference:
Weisser et al.: An automated pipeline for high-throughput label-free quantitative proteomics (J. Proteome Res., 2013, PMID: 23391308).

Input: featureXML or consensusXML

Quantification is based on the intensity values of the features in the input files. Feature intensities are first accumulated to peptide abundances, according to the peptide identifications annotated to the features/feature groups. Then, abundances of the peptides of a protein are averaged to compute the protein abundance.

The peptide-to-protein step uses the N (e.g. 3) most abundant proteotypic peptides per protein to compute the protein abundances. This is a generalized version of the "top 3 approach" (but only for relative quantification) described in:
Silva et al.: Absolute quantification of proteins by LCMS^E: a virtue of parallel MS acquisition (Mol. Cell. Proteomics, 2006, PMID: 16219938).
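
The peptide-to-protein step described above can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name and the example values are hypothetical, the parameter names mirror the top, average and include_all options, and the weighted mean is assumed to weight each abundance by itself (intensity weighting).

```python
from statistics import mean, median


def protein_abundance(peptide_abundances, top=3, average="median", include_all=False):
    """Combine the abundances of one protein's proteotypic peptides.

    Returns None when the protein has fewer than `top` peptides and
    `include_all` is not set (the protein is then not quantified).
    """
    values = sorted(peptide_abundances, reverse=True)  # most abundant first
    if not values:
        return None
    if top > 0:
        if len(values) < top and not include_all:
            return None
        values = values[:top]
    if average == "median":
        return median(values)
    if average == "mean":
        return mean(values)
    if average == "weighted_mean":
        # Assumption: each abundance is weighted by itself (intensity weighting).
        return sum(v * v for v in values) / sum(values)
    if average == "sum":
        return sum(values)
    raise ValueError(f"unknown averaging method: {average}")


# Hypothetical peptide abundances for a single protein.
peptides = [4.0e6, 1.0e6, 2.0e6, 0.5e6]
print(protein_abundance(peptides))                         # median of the top 3
print(protein_abundance(peptides, top=0, average="sum"))   # sum over all peptides
```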

Only features/feature groups with unambiguous peptide annotation are used for peptide quantification. It is possible to resolve ambiguities before applying ProteinQuantifier using one of several equivalent mechanisms in OpenMS: IDConflictResolver, ConsensusID (algorithm best), or FileFilter (option id:keep_best_score_id).

Similarly, only proteotypic peptides (i.e. those matching to exactly one protein) are used for protein quantification by default. Peptide/protein IDs from multiple identification runs can be handled, but will not be differentiated (i.e. protein accessions for a peptide will be accumulated over all identification runs). See section "Optional input: Protein inference/grouping results" below for exceptions to this.

Peptides with the same sequence, but with different modifications are quantified separately on the peptide level, but treated as one peptide for the protein quantification (i.e. the contributions of differently-modified variants of the same peptide are accumulated).
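
As a rough sketch of this rule, the snippet below keeps differently modified variants separate on the peptide level and accumulates them per unmodified sequence before protein quantification. The data layout, sequences, modification labels and function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical peptide-level results:
# (unmodified sequence, modification label) -> abundance
peptide_level = {
    ("DFPIANGER", ""): 3.0e5,
    ("DFPIANGER", "Deamidated (N)"): 1.0e5,
    ("LVNELTEFAK", ""): 8.0e5,
}


def collapse_modified_variants(peptide_level):
    """Accumulate the abundances of all variants sharing one sequence."""
    totals = defaultdict(float)
    for (sequence, _modification), abundance in peptide_level.items():
        totals[sequence] += abundance
    return dict(totals)


for sequence, abundance in collapse_modified_variants(peptide_level).items():
    print(sequence, abundance)
```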

Input: idXML

Quantification based on identification results uses spectral counting, i.e. the abundance of each peptide is the number of times that peptide was identified from an MS2 spectrum (considering only the best hit per spectrum). Different identification runs in the input are treated as different samples; this makes it possible to quantify several related samples at once by merging the corresponding idXML files with IDMerger. For input with a single run, the output format and applicable parameters are the same as for featureXML; for input with multiple runs, they are the same as for consensusXML.

The notes above regarding quantification on the protein level and the treatment of modifications also apply to idXML input. In particular, this means that the settings 'top 0' and 'average sum' should be used to get the "classical" spectral counting quantification on the protein level (where all identifications of all peptides of a protein are summed up).
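
A minimal sketch of spectral counting as described above, assuming the best hit per spectrum has already been extracted. The tuple layout, peptide sequences and accessions are hypothetical; the protein-level count corresponds to using all proteotypic peptides with summation.

```python
from collections import Counter

# Hypothetical best hits, one per identified MS2 spectrum:
# (identification run, peptide sequence, protein accessions)
best_hits = [
    ("run1", "PEPTIDEA", ("P1",)),
    ("run1", "PEPTIDEA", ("P1",)),
    ("run1", "PEPTIDEB", ("P1",)),
    ("run1", "PEPTIDEC", ("P1", "P2")),  # shared peptide, not proteotypic
    ("run2", "PEPTIDEA", ("P1",)),
]


def spectral_counts(best_hits):
    """Count identifications per (run, peptide) and per (run, protein)."""
    peptide_counts = Counter()
    protein_counts = Counter()
    for run, peptide, accessions in best_hits:
        peptide_counts[(run, peptide)] += 1
        if len(accessions) == 1:  # only proteotypic peptides count for proteins
            protein_counts[(run, accessions[0])] += 1
    return peptide_counts, protein_counts


peptide_counts, protein_counts = spectral_counts(best_hits)
print(dict(peptide_counts))
print(dict(protein_counts))
```

Each run is counted separately, matching the treatment of identification runs as different samples.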

Optional input: Protein inference/grouping results

By default, only proteotypic peptides (i.e. those matching to exactly one protein) are used for protein quantification. However, this limitation can be overcome: protein inference results for the whole sample set can be supplied with the protein_groups option (or included in a featureXML input). In that case, the peptide-to-protein references from that file are used (rather than those from the file given as in), and groups of indistinguishable proteins will be quantified. Each reported protein quantity then refers to the total for the respective group.

In order for everything to work correctly, it is important that the protein inference results come from the same identifications that were used to annotate the quantitative data. To use inference results from ProteinProphet, convert the protXML to idXML using IDFileConverter. To use results from Fido, simply run FidoAdapter.

More information below the parameter specification.

Note: Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

ProteinQuantifier -- Compute peptide and protein abundances
Full documentation: http://www.openms.de/documentation/TOPP_ProteinQuantifier.html
Version: 2.5.0-nightly-2020-03-06 Mar  7 2020, 01:22:16, Revision: 84b1398
To cite OpenMS:
  Rost HL, Sachsenberg T, Aiche S, Bielow C et al.. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Meth. 2016; 13, 9: 741-748. doi:10.1038/nmeth.3959.

Usage:
  ProteinQuantifier <options>

Options (mandatory options marked with '*'):
  -in <file>*                        Input file (valid formats: 'featureXML', 'consensusXML', 'idXML')
  -protein_groups <file>             Protein inference results for the identification runs that were used to
                                     annotate the input (e.g. from ProteinProphet via IDFileConverter or Fido
                                     via FidoAdapter).
                                     Information about indistinguishable proteins will be used for protein
                                     quantification. (valid formats: 'idXML')
  -design <file>                     Input file containing the experimental design (valid formats: 'tsv')
  -out <file>                        Output file for protein abundances (valid formats: 'csv')
  -peptide_out <file>                Output file for peptide abundances (valid formats: 'csv')
  -mztab <file>                      Output file (mzTab) (valid formats: 'mzTab')
                                     
  -top <number>                      Calculate protein abundance from this number of proteotypic peptides 
                                     (most abundant first; '0' for all) (default: '3' min: '0')
  -average <choice>                  Averaging method used to compute protein abundances from peptide
                                     abundances (default: 'median' valid: 'median', 'mean', 'weighted_mean',
                                     'sum')
  -include_all                       Include results for proteins with fewer proteotypic peptides than
                                     indicated by 'top' (no effect if 'top' is 0 or 1)
  -best_charge_and_fraction          Distinguish between fraction and charge states of a peptide. For
                                     peptides, abundances will be reported separately for each fraction and
                                     charge; for proteins, abundances will be computed based only on the most
                                     prevalent charge observed of each peptide (over all fractions).
                                     By default, abundances are summed over all charge states.

Additional options for consensus maps (and identification results comprising multiple runs):
  -consensus:normalize               Scale peptide abundances so that medians of all samples are equal
  -consensus:fix_peptides            Use the same peptides for protein quantification across all samples.
                                     With 'top 0', all peptides that occur in every sample are considered.
                                     Otherwise ('top N'), the N peptides that occur in the most samples
                                     (independently of each other) are selected, breaking ties by total
                                     abundance (there is no guarantee that the best co-occurring peptides
                                     are chosen!).

  -greedy_group_resolution <choice>  Pre-process identifications with greedy resolution of shared peptides 
                                     based on the protein group probabilities. (Only works with an idXML file
                                     given as protein_groups parameter). (default: 'false' valid: 'true',
                                     'false')
  -ratios                            Add the log2 ratios of the abundance values to the output. Format:
                                     log_2(x_0/x_0) <sep> log_2(x_1/x_0) <sep> log_2(x_2/x_0) ...
  -ratiosSILAC                       Add the log2 ratios for a triple SILAC experiment to the output. Only
                                     applicable to consensus maps of exactly three sub-maps. Format:
                                     log_2(heavy/light) <sep> log_2(heavy/middle) <sep> log_2(middle/light)

Output formatting options:
  -format:separator <sep>            Character(s) used to separate fields; by default, the 'tab' character 
                                     is used
  -format:quoting <method>           Method for quoting of strings: 'none' for no quoting, 'double' for
                                     quoting with doubling of embedded quotes,
                                     'escape' for quoting with backslash-escaping of embedded quotes
                                     (default: 'double' valid: 'none', 'double', 'escape')
  -format:replacement <x>            If 'quoting' is 'none', used to replace occurrences of the separator in 
                                     strings before writing (default: '_')

                                     
Common TOPP options:
  -ini <file>                        Use the given TOPP INI file
  -threads <n>                       Sets the number of threads allowed to be used by the TOPP tool (default:
                                     '1')
  -write_ini <file>                  Writes the default configuration file
  --help                             Shows options
  --helphelp                         Shows all options (including advanced)
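
The options listed above can be combined into a command line. The following sketch only assembles and prints such a call from a script; the helper name and the file names are hypothetical, and only flags shown in the help text above are used. It assumes the ProteinQuantifier executable is on the PATH if the command is actually run.

```python
import shlex


def build_command(in_file, out_file, peptide_out=None, top=3,
                  average="median", include_all=False, normalize=False):
    """Assemble an argument list for ProteinQuantifier (hypothetical helper)."""
    cmd = ["ProteinQuantifier", "-in", in_file, "-out", out_file,
           "-top", str(top), "-average", average]
    if peptide_out:
        cmd += ["-peptide_out", peptide_out]
    if include_all:
        cmd.append("-include_all")
    if normalize:
        cmd.append("-consensus:normalize")
    return cmd


# Hypothetical file names; use all peptides and sum them, with normalization.
cmd = build_command("linked.consensusXML", "proteins.csv",
                    peptide_out="peptides.csv", top=0, average="sum",
                    normalize=True)
print(shlex.join(cmd))
# To execute the tool: import subprocess; subprocess.run(cmd, check=True)
```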

INI file documentation of this tool:

+ProteinQuantifier: Compute peptide and protein abundances

version (default: 2.5.0-nightly-2020-03-06): Version of the tool that generated this parameters file.

++1: Instance '1' section for 'ProteinQuantifier'

in: Input file (input file; valid formats: *.featureXML, *.consensusXML, *.idXML)
protein_groups: Protein inference results for the identification runs that were used to annotate the input (e.g. from ProteinProphet via IDFileConverter or Fido via FidoAdapter). Information about indistinguishable proteins will be used for protein quantification. (input file; valid formats: *.idXML)
design: Input file containing the experimental design (input file; valid formats: *.tsv)
out: Output file for protein abundances (output file; valid formats: *.csv)
peptide_out: Output file for peptide abundances (output file; valid formats: *.csv)
mztab: Output file (mzTab) (output file; valid formats: *.mzTab)
top (default: 3): Calculate protein abundance from this number of proteotypic peptides (most abundant first; '0' for all) (valid range: 0:∞)
average (default: median): Averaging method used to compute protein abundances from peptide abundances (valid: median, mean, weighted_mean, sum)
include_all (default: false): Include results for proteins with fewer proteotypic peptides than indicated by 'top' (no effect if 'top' is 0 or 1) (valid: true, false)
best_charge_and_fraction (default: false): Distinguish between fraction and charge states of a peptide. For peptides, abundances will be reported separately for each fraction and charge; for proteins, abundances will be computed based only on the most prevalent charge observed of each peptide (over all fractions). By default, abundances are summed over all charge states. (valid: true, false)
greedy_group_resolution (default: false): Pre-process identifications with greedy resolution of shared peptides based on the protein group probabilities. (Only works with an idXML file given as protein_groups parameter). (valid: true, false)
ratios (default: false): Add the log2 ratios of the abundance values to the output. Format: log_2(x_0/x_0) log_2(x_1/x_0) log_2(x_2/x_0) ... (valid: true, false)
ratiosSILAC (default: false): Add the log2 ratios for a triple SILAC experiment to the output. Only applicable to consensus maps of exactly three sub-maps. Format: log_2(heavy/light) log_2(heavy/middle) log_2(middle/light) (valid: true, false)
log: Name of log file (created only when specified)
debug (default: 0): Sets the debug level
threads (default: 1): Sets the number of threads allowed to be used by the TOPP tool
no_progress (default: false): Disables progress logging to command line (valid: true, false)
force (default: false): Overwrite tool specific checks. (valid: true, false)
test (default: false): Enables the test mode (needed for internal use only) (valid: true, false)

+++consensus: Additional options for consensus maps (and identification results comprising multiple runs)

normalize (default: false): Scale peptide abundances so that medians of all samples are equal (valid: true, false)
fix_peptides (default: false): Use the same peptides for protein quantification across all samples. With 'top 0', all peptides that occur in every sample are considered. Otherwise ('top N'), the N peptides that occur in the most samples (independently of each other) are selected, breaking ties by total abundance (there is no guarantee that the best co-occurring peptides are chosen!). (valid: true, false)

+++format: Output formatting options

separator: Character(s) used to separate fields; by default, the 'tab' character is used
quoting (default: double): Method for quoting of strings: 'none' for no quoting, 'double' for quoting with doubling of embedded quotes, 'escape' for quoting with backslash-escaping of embedded quotes (valid: none, double, escape)
replacement (default: _): If 'quoting' is 'none', used to replace occurrences of the separator in strings before writing

Output format

The output files produced by this tool have a table format, with columns as described below:

Protein output (one protein/set of indistinguishable proteins per line):

protein: Protein accession(s) (as in the annotations in the input file; separated by "/" if more than one).
n_proteins: Number of indistinguishable proteins quantified (usually "1").
protein_score: Protein score, e.g. ProteinProphet probability (if available).
n_peptides: Number of proteotypic peptides observed for this protein (or group of indistinguishable proteins) across all samples. Note that not all of these peptides necessarily contribute to the protein abundance (depending on parameter top).
abundance: Computed protein abundance. For consensusXML input, there will be one column per sample ("abundance_1", "abundance_2", etc.).

Peptide output (one peptide or - if best_charge_and_fraction is set - one charge state and fraction of a peptide per line):

peptide: Peptide sequence. Only peptides that occur in unambiguous annotations of features are reported.
protein: Protein accession(s) for the peptide (separated by "/" if more than one).
n_proteins: Number of proteins this peptide maps to. (Same as the number of accessions in the previous column.)
charge: Charge state quantified in this line. "0" (for "all charges") unless best_charge_and_fraction was set.
abundance: Computed abundance for this peptide. If the charge in the preceding column is 0, this is the total abundance of the peptide over all charge states; otherwise, it is only the abundance observed for the indicated charge (in this case, there may be more than one line for the peptide sequence). Again, for consensusXML input, there will be one column per sample ("abundance_1", "abundance_2", etc.). Also for consensusXML, the reported values are already normalized if consensus:normalize was set.
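
To illustrate what "already normalized" means for the abundance columns, the sketch below scales each sample so that all sample medians become equal, as consensus:normalize is described to do. Which sample serves as the reference is not stated here; scaling to the largest median is an assumption made for this example, and the function name and data are hypothetical.

```python
from statistics import median


def normalize_medians(samples):
    """Scale peptide abundances so that all sample medians are equal.

    samples: dict mapping sample name -> dict of peptide -> abundance
             (peptides missing from a sample are simply omitted).
    """
    medians = {name: median(values.values())
               for name, values in samples.items() if values}
    # Assumption: the sample with the largest median is the reference.
    reference = max(medians.values())
    return {
        name: {pep: ab * reference / medians[name] for pep, ab in values.items()}
        for name, values in samples.items() if values
    }


# Hypothetical peptide abundances in two samples.
samples = {
    "abundance_1": {"A": 10.0, "B": 20.0, "C": 30.0},
    "abundance_2": {"A": 5.0, "B": 10.0, "C": 15.0},
}
normalized = normalize_medians(samples)
for name, values in normalized.items():
    print(name, values, "median:", median(values.values()))
```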

Protein quantification examples

While quantification on the peptide level is fairly straightforward, a number of options influence quantification on the protein level, especially for consensusXML input. The three parameters top, include_all and consensus:fix_peptides determine which peptides are used to quantify proteins in different samples.

As an example, consider a protein with four proteotypic peptides. Each peptide is detected in a subset of three samples, as indicated in the table below. The peptides are ranked by abundance (1: highest, 4: lowest; assuming for simplicity that the order is the same in all samples).

	sample 1	sample 2	sample 3
peptide 1	X		X
peptide 2	X	X
peptide 3	X	X	X
peptide 4	X	X

Different parameter combinations lead to different quantification scenarios, as shown here:

In the parameter columns, "*" means the parameter has no effect in this case. In the sample columns (peptides used for quantification), "(...)" means the protein is not quantified in that sample, for the reason given.

top	include_all	consensus:fix_peptides	sample 1	sample 2	sample 3	explanation
0	*	no	1, 2, 3, 4	2, 3, 4	1, 3	all peptides
1	*	no	1	2	1	single most abundant peptide
2	*	no	1, 2	2, 3	1, 3	two most abundant peptides
3	no	no	1, 2, 3	2, 3, 4	(too few peptides)	three most abundant peptides
3	yes	no	1, 2, 3	2, 3, 4	1, 3	three or fewer most abundant peptides
4	no	*	1, 2, 3, 4	(too few peptides)	(too few peptides)	four most abundant peptides
4	yes	*	1, 2, 3, 4	2, 3, 4	1, 3	four or fewer most abundant peptides
0	*	yes	3	3	3	all peptides present in every sample
1	*	yes	3	3	3	single peptide present in most samples
2	no	yes	1, 3	(peptide 1 missing)	1, 3	two peptides present in most samples
2	yes	yes	1, 3	3	1, 3	two or fewer peptides present in most samples
3	no	yes	1, 2, 3	(peptide 1 missing)	(peptide 2 missing)	three peptides present in most samples
3	yes	yes	1, 2, 3	2, 3	1, 3	three or fewer peptides present in most samples
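
The scenarios in the table can be reproduced with a short sketch of the selection rules as they are described here. It is an illustration only, not the tool's code: the function and variable names are hypothetical, and peptide numbers double as abundance ranks (1 = most abundant), as assumed in the example.

```python
# Detection pattern from the example: peptide -> samples in which it was found.
DETECTED = {1: {1, 3}, 2: {1, 2}, 3: {1, 2, 3}, 4: {1, 2}}
SAMPLES = (1, 2, 3)


def peptides_used(top, include_all=False, fix_peptides=False):
    """Return {sample: peptide list, or None if the protein is not quantified}."""
    candidates = None
    if fix_peptides:
        if top == 0:
            # All peptides that occur in every sample.
            candidates = [p for p in sorted(DETECTED)
                          if DETECTED[p] == set(SAMPLES)]
        else:
            # Peptides in the most samples first; ties broken by abundance rank.
            ranked = sorted(DETECTED, key=lambda p: (-len(DETECTED[p]), p))
            candidates = ranked[:top]
    result = {}
    for sample in SAMPLES:
        present = [p for p in sorted(DETECTED) if sample in DETECTED[p]]
        if candidates is not None:
            chosen = [p for p in present if p in candidates]
        else:
            chosen = present if top == 0 else present[:top]
        # Fewer peptides than requested: only reported with include_all.
        if len(chosen) < top and not include_all:
            chosen = None
        result[sample] = chosen
    return result


print(peptides_used(3))
print(peptides_used(2, fix_peptides=True))
print(peptides_used(3, include_all=True, fix_peptides=True))
```

A value of None corresponds to the "(too few peptides)" and "(peptide N missing)" entries in the table.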

Further considerations for parameter selection

With best_charge_and_fraction and average, there is a trade-off between comparability of protein abundances within a sample and of abundances for the same protein across different samples.
Setting best_charge_and_fraction may increase reproducibility between samples, but will distort the proportions of protein abundances within a sample. The reason is that ionization properties vary between peptides, but should remain constant across samples. Filtering by charge state can help to reduce the impact of feature detection differences between samples.
For average, there is a qualitative difference between mean/weighted_mean/median and sum in the effect that missing peptide abundances have (only if include_all is set or top is 0): mean, weighted_mean (the intensity-weighted mean) and median ignore missing cases, averaging only present values. If low-abundant peptides are not detected in some samples, the computed protein abundances for those samples may thus be too optimistic. sum implicitly treats missing values as zero, so this problem does not occur and comparability across samples is ensured. However, with sum the total number of peptides ("summands") available for a protein may affect the abundances computed for it (depending on top), so abundances within a sample may no longer be proportional to each other.
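
The effect of missing values can be seen in a small numeric sketch. The abundances are hypothetical; None marks a peptide that was not detected, and only mean and sum are compared.

```python
from statistics import mean

# Hypothetical abundances of one protein's three peptides in two samples.
sample_a = [9.0, 6.0, 3.0]
sample_b = [9.0, 6.0, None]  # the least abundant peptide was not detected


def summarize(values, method):
    """Average only the present values ('mean') or add them up ('sum')."""
    present = [v for v in values if v is not None]
    return mean(present) if method == "mean" else sum(present)


for method in ("mean", "sum"):
    print(method, summarize(sample_a, method), summarize(sample_b, method))
```

With mean, the sample lacking its weakest peptide gets the higher protein abundance, although the detected peptides are identical in both samples; with sum, the missing peptide counts as zero and the result is lower.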