OpenMS
PeptideIndexer

Refreshes the protein references for all peptide hits from an idXML file and adds target/decoy information.

pot. predecessor tools → PeptideIndexer → pot. successor tools
  pot. predecessor tools: IDFilter or any protein/peptide processing tool
  pot. successor tools: FalseDiscoveryRate

PeptideIndexer refreshes target/decoy information and mapping of peptides to proteins. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, peptides hitting both target and decoy proteins are counted as target hits.)

PeptideIndexer allows for ambiguous amino acids (B|J|Z|X) in the protein database and peptide sequence.
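The effect of ambiguous amino acids on matching can be illustrated with a small sketch. It is not PeptideIndexer's actual search algorithm; it is a naive scan that assumes the usual meaning of the ambiguity codes (B = D or N, J = I or L, Z = E or Q, X = any residue) and counts every aligned position that involves such a code against a budget comparable to `aaa_max`. The function name `find_matches` is hypothetical.

```python
# Ambiguity codes and the residues they may stand for (X is handled separately).
EXPANSION = {"B": set("DN"), "J": set("IL"), "Z": set("EQ")}
AMBIGUOUS = set("BJZX")


def compatible(a, b):
    """True if residues a and b could denote the same amino acid."""
    if a == "X" or b == "X":
        return True  # X matches any residue
    return bool(EXPANSION.get(a, {a}) & EXPANSION.get(b, {b}))


def find_matches(peptide, protein, aaa_max=3):
    """Return start positions where peptide fits protein using at most
    aaa_max positions that involve an ambiguous code (naive scan)."""
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        used = 0
        for p, q in zip(peptide, protein[start:start + len(peptide)]):
            if p in AMBIGUOUS or q in AMBIGUOUS:
                used += 1
                if used > aaa_max or not compatible(p, q):
                    break
            elif p != q:
                break
        else:
            hits.append(start)  # inner loop finished without a break
    return hits
```

For instance, `find_matches("DEK", "AABEKAA")` accepts position 2 because 'B' may stand for 'D', while a window containing two 'X' residues is rejected when `aaa_max=1`.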

Enzyme cutting rules and partial specificity are derived from the input idXML automatically by default, or can be specified explicitly by the user (parameters enzyme:name and enzyme:specificity).
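How the specificity levels filter a peptide's position in a protein can be sketched as follows. This is an illustration only, using the trypsin rule quoted in the parameter description (cleaves after K or R, but not before P). It assumes that protein termini count as valid cleavage sites and ignores N-terminal clipping; the function names are hypothetical and not part of OpenMS.

```python
def is_tryptic_site(protein, pos):
    """True if a cut between protein[pos-1] and protein[pos] is allowed.

    Assumption: the protein termini (pos 0 and len(protein)) always count.
    """
    if pos == 0 or pos == len(protein):
        return True
    return protein[pos - 1] in "KR" and protein[pos] != "P"


def passes_specificity(protein, start, length, specificity="full"):
    """Check the cleavage sites flanking protein[start:start+length]."""
    n_ok = is_tryptic_site(protein, start)           # site before the peptide
    c_ok = is_tryptic_site(protein, start + length)  # site after the peptide
    if specificity == "full":
        return n_ok and c_ok
    if specificity == "semi":
        return n_ok or c_ok
    if specificity == "none":
        return True
    raise ValueError(f"unknown specificity: {specificity!r}")
```

With the protein `MKAAARPEPTIDEKGG`, the peptide `AAAR` (start 2) fails 'full' because its C-terminal site is followed by proline, but passes 'semi' because its N-terminal site follows a lysine.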

All peptide and protein hits are annotated with target/decoy information, using the meta value 'target_decoy'. For proteins the possible values are "target" and "decoy", depending on whether the protein accession contains the decoy pattern (parameter decoy_string) as a prefix or suffix (see parameter decoy_string_position). Resulting protein hits appear in the order of the FASTA file, except for orphaned proteins, which will appear first with an empty 'target_decoy' metavalue. Duplicate protein accessions and sequences will not raise a warning, but create multiple hits (PeptideIndexer reads the FASTA file piecewise for efficiency reasons, and thus might not see all accessions and sequences at once).
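The protein-level rule can be sketched in a few lines. The decoy string `DECOY_` and the accessions below are made-up examples, and `classify_protein` is a hypothetical helper, not an OpenMS function; the automatic detection used when decoy_string is empty is not modelled.

```python
def classify_protein(accession, decoy_string="DECOY_", position="prefix"):
    """Return 'decoy' if the accession carries the decoy string at the
    configured position, else 'target'."""
    if position == "prefix":
        is_decoy = accession.startswith(decoy_string)
    elif position == "suffix":
        is_decoy = accession.endswith(decoy_string)
    else:
        raise ValueError("position must be 'prefix' or 'suffix'")
    return "decoy" if is_decoy else "target"
```

For example, `classify_protein("DECOY_P12345")` yields "decoy", and `classify_protein("P12345_rev", "_rev", "suffix")` does as well.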

Peptide hits are annotated with metavalue 'protein_references', and if matched to at least one protein also with metavalue 'target_decoy'. The possible values for 'target_decoy' in peptides are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. If the peptide is unmatched the metavalue is missing.
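The peptide-level annotation, together with the FDR counting rule mentioned earlier, can be summarised in a short sketch. The helper names are hypothetical; the input is assumed to be the list of 'target_decoy' values of all proteins the peptide was matched to.

```python
def peptide_target_decoy(protein_labels):
    """Derive the peptide's 'target_decoy' value from its matched proteins.

    Returns None for an unmatched peptide (the metavalue is then missing).
    """
    labels = set(protein_labels)
    unknown = labels - {"target", "decoy"}
    if unknown:
        raise ValueError(f"unexpected protein labels: {sorted(unknown)}")
    if not labels:
        return None
    if labels == {"target"}:
        return "target"
    if labels == {"decoy"}:
        return "decoy"
    return "target+decoy"


def counts_as_target_for_fdr(value):
    """Peptides hitting both target and decoy proteins count as targets."""
    return value in ("target", "target+decoy")
```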

Runtime: PeptideIndexer is usually very fast (loading and storing the data takes the most time) and search speed can be further improved (linearly) by using more threads. Avoid allowing too many (>=4) ambiguous amino acids (parameter aaa_max) if your database contains long stretches of 'X', because the search space grows exponentially with the number of ambiguous positions.
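A back-of-the-envelope calculation shows why stretches of 'X' are costly. It assumes that each 'X' may stand for any of the 20 standard amino acids; the actual work done by PeptideIndexer depends on its search strategy and is not modelled here.

```python
STANDARD_RESIDUES = 20  # assumption: 'X' may be any standard amino acid


def concrete_sequences(num_x):
    """Number of concrete sequences represented by num_x 'X' positions."""
    return STANDARD_RESIDUES ** num_x


for k in (1, 2, 3, 4):
    print(f"{k} ambiguous position(s): up to {concrete_sequences(k)} sequences")
```

Going from three to four allowed 'X' positions raises the bound from 8000 to 160000 candidate sequences per window.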

PeptideIndexer supports relative database filenames, which (when not found in the current working directory) are looked up in the directories specified by OpenMS.ini:id_db_dir. The database is by default derived from the input idXML's metainformation ('auto' setting), but can be specified explicitly.
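The lookup order for relative database filenames can be sketched as below. This is an assumption-laden illustration, not the OpenMS implementation: `resolve_database` is a hypothetical helper, and `search_dirs` stands in for the directories configured under OpenMS.ini:id_db_dir.

```python
from pathlib import Path


def resolve_database(filename, search_dirs, cwd=None):
    """Return the first existing path for filename, or None.

    Relative names are tried in the working directory first, then in each
    directory of search_dirs, in the given order.
    """
    path = Path(filename)
    if path.is_absolute():
        return path if path.is_file() else None
    base = Path(cwd) if cwd is not None else Path.cwd()
    for directory in [base, *map(Path, search_dirs)]:
        candidate = directory / path
        if candidate.is_file():
            return candidate
    return None
```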

Note
Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

PeptideIndexer -- Refreshes the protein references for all peptide hits.
Full documentation: http://www.openms.de/doxygen/release/3.3.0/html/TOPP_PeptideIndexer.html
Version: 3.3.0 Dec 21 2024, 15:25:20, Revision: 35c5e65
To cite OpenMS:
 + Pfeuffer, J., Bielow, C., Wein, S. et al.. OpenMS 3 enables reproducible analysis of large-scale mass spec
   trometry data. Nat Methods (2024). doi:10.1038/s41592-024-02197-7.

Usage:
  PeptideIndexer <options>

Options (mandatory options marked with '*'):
  -in <file>*                             Input idXML file containing the identifications. (valid formats: 
                                          'idXML')
  -fasta <file>                           Input sequence database in FASTA format. Leave empty for using the 
                                          same DB as used for the input idXML (this might fail). Non-existing
                                           relative filenames are looked up via 'OpenMS.ini:id_db_dir' (valid
                                           formats: 'fasta')
  -out <file>*                            Output idXML file. (valid formats: 'idXML')
  -decoy_string <text>                    String that was appended (or prefixed - see 'decoy_string_position'
                                           flag below) to the accessions in the protein database to indicate 
                                          decoy proteins. If empty (default), it's determined automatically 
                                          (checking for common terms, both as prefix and suffix).
  -decoy_string_position <choice>         Is the 'decoy_string' prepended (prefix) or appended (suffix) to 
                                          the protein accession? (ignored if decoy_string is empty) (default:
                                           'prefix') (valid: 'prefix', 'suffix')
  -missing_decoy_action <choice>          Action to take if NO peptide was assigned to a decoy protein (which
                                           indicates wrong database or decoy string): 'error' (exit with erro
                                          r, no output), 'warn' (exit with success, warning message), 'silent
                                          ' (no action is taken, not even a warning) (default: 'error') (vali
                                          d: 'error', 'warn', 'silent')
  -write_protein_sequence                 If set, the protein sequences are stored as well.
  -write_protein_description              If set, the protein description is stored as well.
  -keep_unreferenced_proteins             If set, protein hits which are not referenced by any peptide are 
                                          kept.
  -unmatched_action <choice>              If peptide sequences cannot be matched to any protein: 1) raise an 
                                          error; 2) warn (unmatched PepHits will miss target/decoy annotation
                                           with downstream problems); 3) remove the hit. (default: 'error') 
                                          (valid: 'error', 'warn', 'remove')
  -aaa_max <number>                       Maximal number of ambiguous amino acids (AAAs) allowed when matchin
                                          g to a protein database with AAAs. AAAs are 'B', 'J', 'Z' and 'X'. 
                                          (default: '3') (min: '0' max: '10')
  -mismatches_max <number>                Maximal number of mismatched (mm) amino acids allowed when matching
                                           to a protein database. The required runtime is exponential in the 
                                          number of mm's; apply with care. MM's are allowed in addition to 
                                          AAA's. (default: '0') (min: '0' max: '10')
  -IL_equivalent                          Treat the isobaric amino acids isoleucine ('I') and leucine ('L') 
                                          as equivalent (indistinguishable). Also occurrences of 'J' will be 
                                          treated as 'I' thus avoiding ambiguous matching.
  -allow_nterm_protein_cleavage <choice>  Allow the protein N-terminus amino acid to clip. (default: 'true') 
                                          (valid: 'true', 'false')

enzyme:
  -enzyme:name <choice>                   Enzyme which determines valid cleavage sites - e.g. trypsin cleaves
                                           after lysine (K) or arginine (R), but not before proline (P). Defa
                                          ult: deduce from input (default: 'auto') (valid: 'auto', 'proline-e
                                          ndopeptidase/HKR', 'Glu-C+P', 'Lys-C', 'Lys-N', 'Trypsin', 'Arg-C',
                                           'Asp-N_ambic', 'Chymotrypsin', 'Chymotrypsin/P', 'CNBr', 'Formic_a
                                          cid', 'Arg-C/P', 'Asp-N', 'Asp-N/B', 'unspecific cleavage', 'Lys-C/
                                          P', 'PepsinA', 'TrypChymo', 'Trypsin/P', 'V8-DE', 'V8-E', 'leukocyt
                                          e elastase', 'proline endopeptidase', 'glutamyl endopeptidase', 
                                          'Alpha-lytic protease', '2-iodobenzoate', 'iodosobenzoate', 'staphy
                                          lococcal protease/D', 'PepsinA + P', 'cyanogen-bromide', 'Clostripa
                                          in/P', 'elastase-trypsin-chymotrypsin', 'no cleavage')
  -enzyme:specificity <choice>            Specificity of the enzyme. Default: deduce from input.
                                            'full': both internal cleavage sites must match.
                                            'semi': one of two internal cleavage sites must match.
                                            'none': allow all peptide hits no matter their context (enzyme 
                                          is irrelevant). (default: 'auto') (valid: 'auto', 'full', 'semi', 
                                          'none')

                                          
Common TOPP options:
  -ini <file>                             Use the given TOPP INI file
  -threads <n>                            Sets the number of threads allowed to be used by the TOPP tool (def
                                          ault: '1')
  -write_ini <file>                       Writes the default configuration file
  --help                                  Shows options
  --helphelp                              Shows all options (including advanced)
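
When PeptideIndexer is called from a script, the options listed above can be assembled programmatically. The following is a minimal sketch that only uses flags shown in the help text; the helper name and the file names in the usage note are made up, and the sketch builds the command without executing it.

```python
import shlex


def build_peptide_indexer_command(in_file, out_file, fasta=None,
                                  decoy_string=None,
                                  decoy_string_position="prefix",
                                  enzyme_specificity=None, threads=1):
    """Assemble an argument list for a PeptideIndexer call."""
    cmd = ["PeptideIndexer", "-in", in_file, "-out", out_file]
    if fasta:
        cmd += ["-fasta", fasta]
    if decoy_string:
        # The position is ignored by the tool when decoy_string is empty.
        cmd += ["-decoy_string", decoy_string,
                "-decoy_string_position", decoy_string_position]
    if enzyme_specificity:
        cmd += ["-enzyme:specificity", enzyme_specificity]
    if threads != 1:
        cmd += ["-threads", str(threads)]
    return cmd


if __name__ == "__main__":
    command = build_peptide_indexer_command(
        "ids.idXML", "ids_indexed.idXML", fasta="db.fasta",
        decoy_string="DECOY_", threads=4)
    print(shlex.join(command))
    # To run it: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.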

INI file documentation of this tool:

Parameters (mandatory parameters marked with '*'):

PeptideIndexer: Refreshes the protein references for all peptide hits.
  version: Version of the tool that generated this parameters file. (default: '3.3.0')
  1: Instance '1' section for 'PeptideIndexer'
    in*: Input idXML file containing the identifications. (input file) (valid formats: '*.idXML')
    fasta: Input sequence database in FASTA format. Leave empty for using the same DB as used for the input idXML (this might fail). Non-existing relative filenames are looked up via 'OpenMS.ini:id_db_dir' (input file) (valid formats: '*.fasta')
    out*: Output idXML file. (output file) (valid formats: '*.idXML')
    decoy_string: String that was appended (or prefixed - see 'decoy_string_position' flag below) to the accessions in the protein database to indicate decoy proteins. If empty (default), it's determined automatically (checking for common terms, both as prefix and suffix).
    decoy_string_position: Is the 'decoy_string' prepended (prefix) or appended (suffix) to the protein accession? (ignored if decoy_string is empty) (default: 'prefix') (valid: 'prefix', 'suffix')
    missing_decoy_action: Action to take if NO peptide was assigned to a decoy protein (which indicates wrong database or decoy string): 'error' (exit with error, no output), 'warn' (exit with success, warning message), 'silent' (no action is taken, not even a warning) (default: 'error') (valid: 'error', 'warn', 'silent')
    write_protein_sequence: If set, the protein sequences are stored as well. (default: 'false') (valid: 'true', 'false')
    write_protein_description: If set, the protein description is stored as well. (default: 'false') (valid: 'true', 'false')
    keep_unreferenced_proteins: If set, protein hits which are not referenced by any peptide are kept. (default: 'false') (valid: 'true', 'false')
    unmatched_action: If peptide sequences cannot be matched to any protein: 1) raise an error; 2) warn (unmatched PepHits will miss target/decoy annotation with downstream problems); 3) remove the hit. (default: 'error') (valid: 'error', 'warn', 'remove')
    aaa_max: Maximal number of ambiguous amino acids (AAAs) allowed when matching to a protein database with AAAs. AAAs are 'B', 'J', 'Z' and 'X'. (default: '3') (min: '0' max: '10')
    mismatches_max: Maximal number of mismatched (mm) amino acids allowed when matching to a protein database. The required runtime is exponential in the number of mm's; apply with care. MM's are allowed in addition to AAA's. (default: '0') (min: '0' max: '10')
    IL_equivalent: Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equivalent (indistinguishable). Also occurrences of 'J' will be treated as 'I' thus avoiding ambiguous matching. (default: 'false') (valid: 'true', 'false')
    allow_nterm_protein_cleavage: Allow the protein N-terminus amino acid to clip. (default: 'true') (valid: 'true', 'false')
    log: Name of log file (created only when specified)
    debug: Sets the debug level (default: '0')
    threads: Sets the number of threads allowed to be used by the TOPP tool (default: '1')
    no_progress: Disables progress logging to command line (default: 'false') (valid: 'true', 'false')
    force: Overrides tool-specific checks (default: 'false') (valid: 'true', 'false')
    test: Enables the test mode (needed for internal use only) (default: 'false') (valid: 'true', 'false')
    enzyme:
      name: Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after lysine (K) or arginine (R), but not before proline (P). Default: deduce from input (default: 'auto') (valid: 'auto', 'proline-endopeptidase/HKR', 'Glu-C+P', 'Lys-C', 'Lys-N', 'Trypsin', 'Arg-C', 'Asp-N_ambic', 'Chymotrypsin', 'Chymotrypsin/P', 'CNBr', 'Formic_acid', 'Arg-C/P', 'Asp-N', 'Asp-N/B', 'unspecific cleavage', 'Lys-C/P', 'PepsinA', 'TrypChymo', 'Trypsin/P', 'V8-DE', 'V8-E', 'leukocyte elastase', 'proline endopeptidase', 'glutamyl endopeptidase', 'Alpha-lytic protease', '2-iodobenzoate', 'iodosobenzoate', 'staphylococcal protease/D', 'PepsinA + P', 'cyanogen-bromide', 'Clostripain/P', 'elastase-trypsin-chymotrypsin', 'no cleavage')
      specificity: Specificity of the enzyme. Default: deduce from input. (default: 'auto') (valid: 'auto', 'full', 'semi', 'none')
        'full': both internal cleavage sites must match.
        'semi': one of two internal cleavage sites must match.
        'none': allow all peptide hits no matter their context (enzyme is irrelevant).