Refreshes the protein references for all peptide hits from an idXML file and adds target/decoy information.

potential predecessor tools → PeptideIndexer → potential successor tools
IDFilter or any protein/peptide processing tool → PeptideIndexer → FalseDiscoveryRate

PeptideIndexer refreshes target/decoy information and mapping of peptides to proteins. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, "target+decoy" peptide hits count as target hits.)

PeptideIndexer allows for ambiguous amino acids (B|J|Z|X) in the protein database, but not in the peptide sequences. For the latter only I/L can be treated as equivalent (see 'IL_equivalent' flag), but 'J' is not allowed.
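
The matching rule described above can be illustrated with a minimal sketch. It is not the OpenMS implementation: the function name `matches` and the per-window comparison are assumptions made for illustration, and the residue sets for B, Z and J are the standard IUPAC meanings. Only the ideas of a budget of ambiguous positions (`aaa_max`) and optional I/L equivalence come from the text.

```python
# Residues each ambiguous database code may stand for (standard IUPAC meanings).
AMBIGUOUS = {
    "B": set("DN"),  # aspartate or asparagine
    "Z": set("EQ"),  # glutamate or glutamine
    "J": set("IL"),  # isoleucine or leucine
    "X": None,       # any residue
}


def matches(peptide, window, aaa_max=3, il_equivalent=False):
    """Return True if `peptide` matches the equally long database `window`.

    Hypothetical helper: compares one peptide against one database window.
    """
    if any(residue in AMBIGUOUS for residue in peptide):
        raise ValueError("ambiguous amino acids are not allowed in peptides")
    if len(peptide) != len(window):
        return False
    used = 0  # number of ambiguous database positions consumed so far
    for p, d in zip(peptide, window):
        if il_equivalent:
            # Collapse I/L in the peptide and I/L/J in the database onto 'I'.
            p = "I" if p in "IL" else p
            d = "I" if d in "ILJ" else d
        if p == d:
            continue
        if d not in AMBIGUOUS:
            return False  # plain mismatch
        allowed = AMBIGUOUS[d]
        if allowed is not None and p not in allowed:
            return False  # ambiguous code cannot stand for this residue
        used += 1
        if used > aaa_max:
            return False
    return True
```

In this sketch a database 'J' consumes one ambiguous position unless `il_equivalent` is set, in which case it is compared as 'I' and costs nothing.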

Enzyme cutting rules and partial specificity can be specified (derived from input idXML automatically by default).
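
As an illustration of how the specificity levels ('full', 'semi', 'none') could be evaluated, here is a small sketch for trypsin (cleaves after K or R, but not before P). The helper names are hypothetical, and treating the protein termini as valid cleavage sites is an assumption of this sketch, not a statement about the OpenMS implementation.

```python
def is_tryptic_site(protein, i):
    """True if the bond between protein[i-1] and protein[i] is a trypsin site.

    Assumption of this sketch: the protein N- and C-terminus count as valid.
    """
    if i == 0 or i == len(protein):
        return True
    return protein[i - 1] in "KR" and protein[i] != "P"


def hit_allowed(protein, start, end, specificity):
    """Check a peptide occupying protein[start:end] against a specificity."""
    n_term_ok = is_tryptic_site(protein, start)
    c_term_ok = is_tryptic_site(protein, end)
    if specificity == "full":
        return n_term_ok and c_term_ok
    if specificity == "semi":
        return n_term_ok or c_term_ok
    if specificity == "none":
        return True  # enzyme is irrelevant
    raise ValueError(f"unknown specificity: {specificity!r}")


protein = "AAKEPTIDERGGG"
print(hit_allowed(protein, 3, 10, "full"))  # peptide EPTIDER
print(hit_allowed(protein, 4, 10, "semi"))  # peptide PTIDER
```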

Resulting protein hits appear in the order of the FASTA file, except for orphaned proteins, which will appear first with an empty target_decoy metavalue. Duplicate protein accessions & sequences will not raise a warning, but create multiple hits (PeptideIndexer scans over the FASTA file once for efficiency reasons, and thus might not see all accessions & sequences at once).

All peptide and protein hits are annotated with target/decoy information, using the meta value "target_decoy". For proteins the possible values are "target" and "decoy", depending on whether the protein accession contains the decoy pattern (parameter decoy_string) as a prefix or suffix (see parameter decoy_string_position).
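
The protein-level rule can be sketched as follows. The function name is hypothetical and the decoy string "DECOY_" used in the example is only an assumed convention; the sketch also requires an explicit decoy string and does not reproduce the tool's automatic detection.

```python
def protein_target_decoy(accession, decoy_string, position="prefix"):
    """Return "decoy" if the accession carries the decoy string, else "target"."""
    if not decoy_string:
        raise ValueError("this sketch needs an explicit decoy string")
    if position == "prefix":
        is_decoy = accession.startswith(decoy_string)
    elif position == "suffix":
        is_decoy = accession.endswith(decoy_string)
    else:
        raise ValueError(f"unknown position: {position!r}")
    return "decoy" if is_decoy else "target"


print(protein_target_decoy("DECOY_sp|P12345", "DECOY_"))
print(protein_target_decoy("sp|P12345", "DECOY_"))
```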

Peptide hits are annotated with metavalue 'protein_references', and if matched to at least one protein also with metavalue 'target_decoy'. The possible values for 'target_decoy' are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. The metavalue is not present if the peptide is unmatched.
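
A minimal sketch of the peptide-level rule, assuming the labels of all matched proteins are already known. Both function names are hypothetical; `None` stands in for "metavalue not present". The second helper restates the rule that "target+decoy" hits count as targets for FDR calculations.

```python
def peptide_target_decoy(protein_labels):
    """Combine the labels of all matched proteins into one peptide label."""
    labels = set(protein_labels)
    if not labels <= {"target", "decoy"}:
        raise ValueError(f"unexpected protein labels: {sorted(labels)}")
    if not labels:
        return None  # unmatched peptide: no 'target_decoy' metavalue
    if labels == {"target"}:
        return "target"
    if labels == {"decoy"}:
        return "decoy"
    return "target+decoy"


def counts_as_target(peptide_label):
    """For FDR purposes, "target+decoy" hits are counted as target hits."""
    return peptide_label in ("target", "target+decoy")


print(peptide_target_decoy(["target", "decoy", "target"]))
```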

Runtime: PeptideIndexer is usually very fast (loading and storing the data takes the most time), and search speed can be further improved (linearly) by using more threads. Avoid allowing too many (>=4) ambiguous amino acids if your database contains long stretches of 'X' (exponential search space).
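
To give a feeling for the exponential growth, the following sketch counts how many concrete sequences a stretch of k consecutive 'X' residues can stand for, assuming 20 standard amino acids. This is a rough illustration of the search space only; it says nothing about how the tool actually enumerates or prunes candidates.

```python
STANDARD_AMINO_ACIDS = 20  # assumption: only the 20 standard residues


def x_stretch_expansions(k):
    """Number of concrete sequences represented by k consecutive 'X'."""
    return STANDARD_AMINO_ACIDS ** k


for k in range(1, 6):
    print(k, x_stretch_expansions(k))
```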

PeptideIndexer supports relative database filenames, which (when not found in the current working directory) are looked up in the directories specified by OpenMS.ini:id_db_dir (see TOPP for Advanced Users). The database is by default derived from the input idXML's metainformation ('auto' setting), but can be specified explicitly.

Further details can be found in the underlying OpenMS::PeptideIndexing implementation.

Note: Currently mzIdentML (mzid) is not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The command line parameters of this tool are:

PeptideIndexer -- Refreshes the protein references for all peptide hits.
Full documentation: http://www.openms.de/documentation/TOPP_PeptideIndexer.html
Version: 2.6.0 Sep 30 2020, 12:54:34, Revision: c26f752
To cite OpenMS:
  Rost HL, Sachsenberg T, Aiche S, Bielow C et al.. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Meth. 2016; 13, 9: 741-748. doi:10.1038/nmeth.3959.

Usage:
  PeptideIndexer <options>

Options (mandatory options marked with '*'):
  -in <file>*                      Input idXML file containing the identifications. (valid formats: 'idXML')
  -fasta <file>                    Input sequence database in FASTA format. Leave empty for using the same 
                                   DB as used for the input idXML (this might fail). Non-existing relative
                                   filenames are looked up via 'OpenMS.ini:id_db_dir' (valid formats: 'fasta'
                                   )
  -out <file>*                     Output idXML file. (valid formats: 'idXML')
  -decoy_string <text>             String that was appended (or prefixed - see 'decoy_string_position' flag 
                                   below) to the accessions in the protein database to indicate decoy protein
                                   s. If empty (default), it's determined automatically (checking for common
                                   terms, both as prefix and suffix).
  -decoy_string_position <choice>  Is the 'decoy_string' prepended (prefix) or appended (suffix) to the prote
                                   in accession? (ignored if decoy_string is empty) (default: 'prefix' valid:
                                   'prefix', 'suffix')
  -missing_decoy_action <choice>   Action to take if NO peptide was assigned to a decoy protein (which indica
                                   tes wrong database or decoy string): 'error' (exit with error, no output),
                                   'warn' (exit with success, warning message), 'silent' (no action is taken
                                   , not even a warning) (default: 'error' valid: 'error', 'warn', 'silent')
  -write_protein_sequence          If set, the protein sequences are stored as well.
  -write_protein_description       If set, the protein description is stored as well.
  -keep_unreferenced_proteins      If set, protein hits which are not referenced by any peptide are kept.
  -unmatched_action <choice>       If peptide sequences cannot be matched to any protein: 1) raise an error; 
                                   2) warn (unmatched PepHits will miss target/decoy annotation with downstre
                                   am problems); 3) remove the hit. (default: 'error' valid: 'error', 'warn',
                                   'remove')
  -aaa_max <number>                Maximal number of ambiguous amino acids (AAAs) allowed when matching to a 
                                   protein database with AAAs. AAAs are B, J, Z and X! (default: '3' min:
                                   '0' max: '10')
  -mismatches_max <number>         Maximal number of mismatched (mm) amino acids allowed when matching to a 
                                   protein database. The required runtime is exponential in the number of
                                   mm's; apply with care. MM's are allowed in addition to AAA's. (default:
                                   '0' min: '0' max: '10')
  -IL_equivalent                   Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equiv
                                   alent (indistinguishable). Also occurences of 'J' will be treated as 'I'
                                   thus avoiding ambiguous matching.

enzyme:
  -enzyme:name <choice>            Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after 
                                   lysine (K) or arginine (R), but not before proline (P). Default: deduce
                                   from input (default: 'auto' valid: 'auto', 'Asp-N', 'Asp-N/B', 'Asp-N_ambi
                                   c', 'Chymotrypsin', 'Chymotrypsin/P', 'CNBr', 'Formic_acid', 'Lys-C', 'Lys
                                   -N', 'Lys-C/P', 'PepsinA', 'TrypChymo', 'V8-DE', 'Trypsin/P', 'V8-E', 'Alp
                                   ha-lytic protease', 'leukocyte elastase', 'proline endopeptidase', 'iodoso
                                   benzoate', 'glutamyl endopeptidase', '2-iodobenzoate', 'staphylococcal
                                   protease/D', 'proline-endopeptidase/HKR', 'Glu-C+P', 'PepsinA + P', 'cyano
                                   gen-bromide', 'Clostripain/P', 'Arg-C/P', 'Trypsin', 'Arg-C', 'elastase-tr
                                   ypsin-chymotrypsin', 'no cleavage', 'unspecific cleavage')
  -enzyme:specificity <choice>     Specificity of the enzyme. Default: deduce from input.
                                   'full': both internal cleavage sites must match.
                                   'semi': one of two internal cleavage sites must match.
                                   'none': allow all peptide hits no matter their context (enzyme is irrel
                                   evant). (default: 'auto' valid: 'auto', 'full', 'semi', 'none')

                                   
Common TOPP options:
  -ini <file>                      Use the given TOPP INI file
  -threads <n>                     Sets the number of threads allowed to be used by the TOPP tool (default: 
                                   '1')
  -write_ini <file>                Writes the default configuration file
  --help                           Shows options
  --helphelp                       Shows all options (including advanced)
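
A typical invocation can be assembled from the options listed above. The sketch below only builds the argument list; the file names and the decoy string "DECOY_" are hypothetical placeholders, and the helper name is not part of OpenMS. The resulting list could be passed to `subprocess.run(cmd, check=True)` on a system where PeptideIndexer is installed.

```python
def build_peptide_indexer_command(in_idxml, out_idxml, fasta=None,
                                  decoy_string=None,
                                  decoy_string_position="prefix",
                                  threads=1):
    """Build an argument list using flags from the help text above."""
    cmd = ["PeptideIndexer", "-in", in_idxml, "-out", out_idxml]
    if fasta is not None:
        cmd += ["-fasta", fasta]
    if decoy_string:
        # The position is ignored by the tool when decoy_string is empty.
        cmd += ["-decoy_string", decoy_string,
                "-decoy_string_position", decoy_string_position]
    if threads != 1:
        cmd += ["-threads", str(threads)]
    return cmd


cmd = build_peptide_indexer_command(
    "ids.idXML", "ids_indexed.idXML",   # hypothetical file names
    fasta="db_with_decoys.fasta",       # hypothetical database
    decoy_string="DECOY_",              # assumed decoy convention
    threads=4,
)
print(" ".join(cmd))
```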

INI file documentation of this tool:


+PeptideIndexer: Refreshes the protein references for all peptide hits.

version (default: '2.6.0'): Version of the tool that generated this parameters file.

++1: Instance '1' section for 'PeptideIndexer'

in (required; input file, *.idXML): Input idXML file containing the identifications.

fasta (input file, *.fasta): Input sequence database in FASTA format. Leave empty for using the same DB as used for the input idXML (this might fail). Non-existing relative filenames are looked up via 'OpenMS.ini:id_db_dir'.

out (required; output file, *.idXML): Output idXML file.

decoy_string (default: empty): String that was appended (or prefixed - see 'decoy_string_position' flag below) to the accessions in the protein database to indicate decoy proteins. If empty (default), it's determined automatically (checking for common terms, both as prefix and suffix).

decoy_string_position (default: 'prefix'; valid: 'prefix', 'suffix'): Is the 'decoy_string' prepended (prefix) or appended (suffix) to the protein accession? (ignored if decoy_string is empty)

missing_decoy_action (default: 'error'; valid: 'error', 'warn', 'silent'): Action to take if NO peptide was assigned to a decoy protein (which indicates wrong database or decoy string): 'error' (exit with error, no output), 'warn' (exit with success, warning message), 'silent' (no action is taken, not even a warning).

write_protein_sequence (default: 'false'; valid: 'true', 'false'): If set, the protein sequences are stored as well.

write_protein_description (default: 'false'; valid: 'true', 'false'): If set, the protein description is stored as well.

keep_unreferenced_proteins (default: 'false'; valid: 'true', 'false'): If set, protein hits which are not referenced by any peptide are kept.

unmatched_action (default: 'error'; valid: 'error', 'warn', 'remove'): If peptide sequences cannot be matched to any protein: 1) raise an error; 2) warn (unmatched PepHits will miss target/decoy annotation with downstream problems); 3) remove the hit.

aaa_max (default: '3'; min: '0', max: '10'): Maximal number of ambiguous amino acids (AAAs) allowed when matching to a protein database with AAAs. AAAs are B, J, Z and X!

mismatches_max (default: '0'; min: '0', max: '10'): Maximal number of mismatched (mm) amino acids allowed when matching to a protein database. The required runtime is exponential in the number of mm's; apply with care. MM's are allowed in addition to AAA's.

IL_equivalent (default: 'false'; valid: 'true', 'false'): Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equivalent (indistinguishable). Also occurrences of 'J' will be treated as 'I' thus avoiding ambiguous matching.

log (default: empty): Name of log file (created only when specified).

debug (default: '0'): Sets the debug level.

threads (default: '1'): Sets the number of threads allowed to be used by the TOPP tool.

no_progress (default: 'false'; valid: 'true', 'false'): Disables progress logging to command line.

force (default: 'false'; valid: 'true', 'false'): Overrides tool-specific checks.

test (default: 'false'; valid: 'true', 'false'): Enables the test mode (needed for internal use only).

+++enzyme

name (default: 'auto'): Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after lysine (K) or arginine (R), but not before proline (P). Default: deduce from input.
Valid: auto, Asp-N, Asp-N/B, Asp-N_ambic, Chymotrypsin, Chymotrypsin/P, CNBr, Formic_acid, Lys-C, Lys-N, Lys-C/P, PepsinA, TrypChymo, V8-DE, Trypsin/P, V8-E, Alpha-lytic protease, leukocyte elastase, proline endopeptidase, iodosobenzoate, glutamyl endopeptidase, 2-iodobenzoate, staphylococcal protease/D, proline-endopeptidase/HKR, Glu-C+P, PepsinA + P, cyanogen-bromide, Clostripain/P, Arg-C/P, Trypsin, Arg-C, elastase-trypsin-chymotrypsin, no cleavage, unspecific cleavage

specificity (default: 'auto'; valid: 'auto', 'full', 'semi', 'none'): Specificity of the enzyme. Default: deduce from input.
'full': both internal cleavage sites must match.
'semi': one of two internal cleavage sites must match.
'none': allow all peptide hits no matter their context (enzyme is irrelevant).