Estimates the probability that a peptide hit is incorrectly assigned.

potential predecessor tools	→ IDPosteriorErrorProbability →	potential successor tools
MascotAdapter (or other ID engines)	→ IDPosteriorErrorProbability →	ConsensusID

Experimental classes: This tool has not been tested thoroughly and might not behave as expected!

By default an estimation is performed using the (inverse) Gumbel distribution for incorrectly assigned sequences and a Gaussian distribution for correctly assigned sequences. The probabilities are calculated by using Bayes' law, similar to PeptideProphet. Alternatively, a second Gaussian distribution can be used for incorrectly assigned sequences. At the moment, IDPosteriorErrorProbability is able to handle X! Tandem, Mascot, MyriMatch and OMSSA scores.
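The Bayes step described above can be illustrated with a minimal Python sketch. It is not the OpenMS implementation: the function names, the Gumbel parameterization (standard location/scale form), and all numeric parameters are assumptions chosen for illustration, and the model parameters are taken as given rather than fitted.

```python
import math


def gumbel_pdf(x, loc, scale):
    """Gumbel density in location/scale form (assumed parameterization)."""
    z = (x - loc) / scale
    if z < -700.0:  # exp(-z) would overflow; the density is effectively zero here
        return 0.0
    return math.exp(-(z + math.exp(-z))) / scale


def gauss_pdf(x, mean, sd):
    """Gaussian density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_error_probability(score, prior_incorrect, incorrect, correct):
    """P(incorrect | score) via Bayes' law for a two-component mixture.

    incorrect = (loc, scale) of the Gumbel component,
    correct   = (mean, sd) of the Gaussian component.
    """
    neg = prior_incorrect * gumbel_pdf(score, *incorrect)
    pos = (1.0 - prior_incorrect) * gauss_pdf(score, *correct)
    total = neg + pos
    if total == 0.0:  # both densities underflowed; fall back to the prior
        return prior_incorrect
    return neg / total


# Hypothetical model parameters, for illustration only.
for s in (1.0, 3.0, 5.0):
    pep = posterior_error_probability(s, 0.7, (1.0, 0.5), (4.0, 1.0))
    print(f"score={s:.1f}  PEP={pep:.4f}  1-PEP={1.0 - pep:.4f}")
```

The last column corresponds to what the 'prob_correct' flag reports, namely '1 - ErrorProbabilities'.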

No target/decoy information needs to be provided, since the model fits are done on the mixed distribution.

To validate the computed probabilities, an optional plot output can be generated. Two parameters control the plot: 'number_of_bins' (in the fit_algorithm subsection) and 'out_plot'. The scores are plotted in the form of bins. Each bin represents a set of scores in a range of '(highest_score - smallest_score) / number_of_bins' (if all scores have positive values). The midpoint of the bin is the mean of the scores it represents. The parameter 'out_plot' should be used to give the plot a unique name. Two files are created: one with the binned scores and one with all steps of the estimation.

If 'top_hits_only' is set, only the top hits of each peptide identification are used for the estimation process. Additionally, if 'top_hits_only' is set, target/decoy information is available, and a FalseDiscoveryRate run was performed previously, an additional plot is generated with target and decoy bins ('out_plot' must not be empty). A peptide hit is assumed to be a target if its q-value is smaller than 'fdr_for_targets_smaller'.

The plots are saved as a Gnuplot file. An attempt is made to call Gnuplot, which will create a PDF file containing all steps of the estimation. If this fails, the user has to run Gnuplot manually, or adjust the PATH environment variable so that Gnuplot can be found and retry.
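The binning rule can be sketched in Python as follows. This is only an illustration of the bin width '(highest_score - smallest_score) / number_of_bins' and of using the mean of the contained scores as the bin's representative value; the function name and the handling of empty bins and of the maximum score are assumptions, not taken from the tool.

```python
def bin_scores(scores, number_of_bins):
    """Group scores into equal-width bins.

    Returns a list of (mean_of_scores_in_bin, count) pairs; empty bins are skipped.
    """
    lowest, highest = min(scores), max(scores)
    width = (highest - lowest) / number_of_bins
    if width == 0.0:  # all scores identical: a single bin holds everything
        return [(float(lowest), len(scores))]

    sums = [0.0] * number_of_bins
    counts = [0] * number_of_bins
    for s in scores:
        # The maximum score would land in bin 'number_of_bins'; clamp it into the last bin.
        idx = min(int((s - lowest) / width), number_of_bins - 1)
        sums[idx] += s
        counts[idx] += 1

    return [(sums[i] / counts[i], counts[i])
            for i in range(number_of_bins) if counts[i] > 0]


print(bin_scores([0.0, 1.0, 2.0, 3.0, 4.0], 2))
```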

Note: Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

IDPosteriorErrorProbability -- Estimates probabilities for incorrectly assigned peptide sequences and a set
of search engine scores using a mixture model.
Full documentation: http://www.openms.de/doxygen/release/3.2.0/html/TOPP_IDPosteriorErrorProbability.html
Version: 3.2.0 Nov 26 2024, 13:16:38, Revision: 962e60f
To cite OpenMS:
+ Pfeuffer, J., Bielow, C., Wein, S. et al.. OpenMS 3 enables reproducible analysis of large-scale mass
spectrometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

Usage:
IDPosteriorErrorProbability <options>

This tool has algorithm parameters that are not shown here! Please check the ini file for a detailed
description or use the --helphelp option

Options (mandatory options marked with '*'):
-in <file>* Input file (valid formats: 'idXML')
-out <file>* Output file (valid formats: 'idXML')
-out_plot <file> Txt file (if gnuplot is available, a corresponding PDF will be created as well.) (valid
formats: 'txt')
-split_charge The search engine scores are split by charge if this flag is set. Thus, for each charge
state a new model will be computed.
-top_hits_only If set only the top hits of every PeptideIdentification will be used
-ignore_bad_data If set errors will be written but ignored. Useful for pipelines with many datasets where
only a few are bad, but the pipeline should run through.
-prob_correct If set scores will be calculated as '1 - ErrorProbabilities' and can be interpreted as
probabilities for correct identifications.

Common TOPP options:
-ini <file> Use the given TOPP INI file
-threads <n> Sets the number of threads allowed to be used by the TOPP tool (default: '1')
-write_ini <file> Writes the default configuration file
--help Shows options
--helphelp Shows all options (including advanced)

The following configuration subsections are valid:
- fit_algorithm Algorithm parameter subsection

You can write an example INI file using the '-write_ini' option.
Documentation of subsection parameters can be found in the doxygen documentation or the INIFileEditor.
For more information, please consult the online documentation for this tool:
- http://www.openms.de/doxygen/release/3.2.0/html/TOPP_IDPosteriorErrorProbability.html

INI file documentation of this tool:

+IDPosteriorErrorProbability: Estimates probabilities for incorrectly assigned peptide sequences and a set of search engine scores using a mixture model.

version (default: 3.2.0): Version of the tool that generated this parameters file.

++1: Instance '1' section for 'IDPosteriorErrorProbability'

in: Input file (tag: input file; formats: *.idXML)

out: Output file (tag: output file; formats: *.idXML)

out_plot: Txt file (if gnuplot is available, a corresponding PDF will be created as well) (tag: output file; formats: *.txt)

split_charge (default: false): The search engine scores are split by charge if this flag is set. Thus, for each charge state a new model will be computed. Valid values: true, false

top_hits_only (default: false): If set, only the top hits of every PeptideIdentification will be used. Valid values: true, false

fdr_for_targets_smaller (default: 0.05): Only used when top_hits_only is set. Additionally, target/decoy information should be available. The score_type must be q-value from a previous False Discovery Rate run.

ignore_bad_data (default: false): If set, errors will be written but ignored. Useful for pipelines with many datasets where only a few are bad, but the pipeline should run through. Valid values: true, false

prob_correct (default: false): If set, scores will be calculated as '1 - ErrorProbabilities' and can be interpreted as probabilities for correct identifications. Valid values: true, false

log: Name of log file (created only when specified)

debug (default: 0): Sets the debug level

threads (default: 1): Sets the number of threads allowed to be used by the TOPP tool

no_progress (default: false): Disables progress logging to command line. Valid values: true, false

force (default: false): Overrides tool-specific checks. Valid values: true, false

test (default: false): Enables the test mode (needed for internal use only). Valid values: true, false

+++fit_algorithm: Algorithm parameter subsection

number_of_bins (default: 100): Number of bins used for visualization. Only needed if each iteration step of the EM-Algorithm will be visualized

incorrectly_assigned (default: Gumbel): for 'Gumbel', the Gumbel distribution is used to plot incorrectly assigned sequences. For 'Gauss', the Gauss distribution is used. Valid values: Gumbel, Gauss

max_nr_iterations (default: 1000): Bounds the number of iterations for the EM algorithm when convergence is slow.

neg_log_delta (default: 6): The negative logarithm of the convergence threshold for the likelihood increase.

outlier_handling (default: ignore_iqr_outliers): What to do with outliers:
- ignore_iqr_outliers: ignore outliers outside of 3*IQR from Q1/Q3 for fitting
- set_iqr_to_closest_valid: set IQR-based outliers to the last valid value for fitting
- ignore_extreme_percentiles: ignore everything outside 99th and 1st percentile (also removes equal values like potential censored max values in XTandem)
- none: do nothing
Valid values: ignore_iqr_outliers, set_iqr_to_closest_valid, ignore_extreme_percentiles, none
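
The two IQR-based 'outlier_handling' modes can be sketched in Python as shown here. This is an illustration under assumptions rather than the tool's code: the function names mirror the option names but are hypothetical, the quartile convention is whatever `statistics.quantiles` uses by default, and "last valid value" is interpreted as the closest score that is still inside the 3*IQR fences.

```python
import statistics


def iqr_fences(scores, factor=3.0):
    """Lower and upper fence: Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartile convention is an assumption
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def ignore_iqr_outliers(scores, factor=3.0):
    """Drop every score outside the fences."""
    lower, upper = iqr_fences(scores, factor)
    return [s for s in scores if lower <= s <= upper]


def set_iqr_to_closest_valid(scores, factor=3.0):
    """Replace every outlier by the nearest score that lies inside the fences."""
    valid = ignore_iqr_outliers(scores, factor)
    lowest_valid, highest_valid = min(valid), max(valid)
    return [min(max(s, lowest_valid), highest_valid) for s in scores]


scores = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(ignore_iqr_outliers(scores))
print(set_iqr_to_closest_valid(scores))
```

The first mode shortens the data set used for fitting, while the second keeps its size and only pulls extreme values inward.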

For the parameters of the algorithm section, see the algorithm's documentation:
fit_algorithm
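
How 'max_nr_iterations' and 'neg_log_delta' bound an EM fit can be sketched with a small Python example. It fits two Gaussians (analogous to setting 'incorrectly_assigned' to 'Gauss') and is not the OpenMS implementation: the function name, the start values, the numeric floors, and the reading of 'neg_log_delta' as a base-10 exponent are all assumptions.

```python
import math
import statistics


def gauss_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(scores, max_nr_iterations=1000, neg_log_delta=6):
    """EM for a two-Gaussian mixture; stops when the log-likelihood gain drops below delta."""
    delta = 10.0 ** (-neg_log_delta)  # assumption: base-10 logarithm
    ordered = sorted(scores)
    half = len(ordered) // 2
    # Crude start values (hypothetical): lower half vs. upper half of the scores.
    mean_neg = statistics.mean(ordered[:half])
    mean_pos = statistics.mean(ordered[half:])
    sd_neg = sd_pos = max(statistics.pstdev(scores), 1e-6)
    prior_neg = 0.5
    previous_ll = -math.inf
    iterations = 0

    for iterations in range(1, max_nr_iterations + 1):
        # E-step: responsibility of the "incorrect" component for each score.
        resp_neg = []
        log_likelihood = 0.0
        for x in scores:
            neg = prior_neg * gauss_pdf(x, mean_neg, sd_neg)
            pos = (1.0 - prior_neg) * gauss_pdf(x, mean_pos, sd_pos)
            total = max(neg + pos, 1e-300)  # floor avoids log(0)
            log_likelihood += math.log(total)
            resp_neg.append(neg / total)

        # Convergence check on the likelihood increase.
        if log_likelihood - previous_ll < delta:
            break
        previous_ll = log_likelihood

        # M-step: re-estimate priors, means and standard deviations.
        w_neg = max(sum(resp_neg), 1e-12)
        w_pos = max(len(scores) - w_neg, 1e-12)
        mean_neg = sum(r * x for r, x in zip(resp_neg, scores)) / w_neg
        mean_pos = sum((1.0 - r) * x for r, x in zip(resp_neg, scores)) / w_pos
        var_neg = sum(r * (x - mean_neg) ** 2 for r, x in zip(resp_neg, scores)) / w_neg
        var_pos = sum((1.0 - r) * (x - mean_pos) ** 2 for r, x in zip(resp_neg, scores)) / w_pos
        sd_neg = max(math.sqrt(var_neg), 1e-6)
        sd_pos = max(math.sqrt(var_pos), 1e-6)
        prior_neg = w_neg / len(scores)

    return {"mean_neg": mean_neg, "sd_neg": sd_neg, "mean_pos": mean_pos,
            "sd_pos": sd_pos, "prior_neg": prior_neg, "iterations": iterations}


# Made-up scores with two well-separated groups.
demo = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.95, 1.05, 4.8, 4.9, 5.0, 5.1, 5.2]
print(fit_two_gaussians(demo))
```

A larger 'neg_log_delta' means a smaller threshold and therefore more iterations before the loop stops, up to 'max_nr_iterations'.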