PeptideIndexer

Refreshes the protein references for all peptide hits from an idXML file and adds target/decoy information.

potential predecessor tools → PeptideIndexer → potential successor tools
IDFilter or any protein/peptide processing tool → PeptideIndexer → FalseDiscoveryRate

A detailed description of the parameters and functionality is given in PeptideIndexing.

All peptide and protein hits are annotated with target/decoy information, using the meta value "target_decoy". For proteins, the possible values are "target" and "decoy", depending on whether the protein accession contains the decoy pattern (parameter decoy_string) as a prefix or suffix (see parameter decoy_string_position). For peptides, the possible values are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, "target+decoy" peptide hits count as target hits.)
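The annotation rule described above can be illustrated with a minimal Python sketch. It is not the OpenMS implementation; the function names (protein_label, peptide_label, counts_as_target_for_fdr) are hypothetical, and only the labels and the decoy_string / decoy_string_position semantics are taken from this page.

```python
from typing import Iterable


def protein_label(accession: str,
                  decoy_string: str = "DECOY_",
                  position: str = "prefix") -> str:
    """Return 'decoy' if the accession carries the decoy pattern, else 'target'."""
    if position == "prefix":
        is_decoy = accession.startswith(decoy_string)
    elif position == "suffix":
        is_decoy = accession.endswith(decoy_string)
    else:
        raise ValueError("position must be 'prefix' or 'suffix'")
    return "decoy" if is_decoy else "target"


def peptide_label(accessions: Iterable[str],
                  decoy_string: str = "DECOY_",
                  position: str = "prefix") -> str:
    """Combine the labels of all proteins that a peptide sequence maps to."""
    labels = {protein_label(a, decoy_string, position) for a in accessions}
    if not labels:
        raise ValueError("peptide does not map to any protein")
    # Reverse sorting puts 'target' before 'decoy', giving 'target+decoy'.
    return "+".join(sorted(labels, reverse=True))


def counts_as_target_for_fdr(label: str) -> bool:
    """'target+decoy' peptide hits count as target hits in FDR calculations."""
    return label in ("target", "target+decoy")


print(peptide_label(["sp|P12345", "DECOY_sp|Q99999"]))
```

A peptide that maps to both a target and a decoy accession receives the combined label and is still treated as a target hit by the last helper.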

PeptideIndexer supports relative database filenames, which (when not found in the current working directory) are looked up in the directories specified by OpenMS.ini:id_db_dir (see TOPP for Advanced Users).
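The lookup order for relative database filenames could be sketched as follows. This is only an illustration of the behaviour described above, assuming the search directories are passed in as a plain list; resolve_database is a hypothetical helper, and the sketch does not read OpenMS.ini itself.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_database(filename: str,
                     id_db_dirs: Iterable[str],
                     cwd: Optional[str] = None) -> Path:
    """Find a FASTA file: working directory first, then each search directory."""
    candidate = Path(filename)
    base = Path(cwd) if cwd is not None else Path.cwd()

    if candidate.is_absolute():
        # Absolute paths are used as given; no directory search applies.
        if candidate.is_file():
            return candidate
        raise FileNotFoundError(filename)

    local = base / candidate
    if local.is_file():
        return local

    # Fall back to the configured database directories, in order.
    for directory in id_db_dirs:
        path = Path(directory) / candidate
        if path.is_file():
            return path

    raise FileNotFoundError(filename)
```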

Note: mzIdentML (mzid) is currently not directly supported as an input/output format of this tool. Convert mzid files to/from idXML using IDFileConverter if necessary.

The tool and its parameters are described in detail in the PeptideIndexing documentation. Read and understand it before using the tool to avoid suboptimal results or long runtimes.
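As an illustration, a typical invocation can be assembled from the options documented on this page. The sketch below only builds and prints the command; the file names are placeholders, build_peptide_indexer_command is a hypothetical helper, and it assumes the PeptideIndexer executable is on the PATH when the command is actually run.

```python
import shlex
from typing import List


def build_peptide_indexer_command(in_idxml: str,
                                  fasta: str,
                                  out_idxml: str,
                                  decoy_string: str = "DECOY_",
                                  decoy_string_position: str = "prefix",
                                  allow_unmatched: bool = False) -> List[str]:
    """Assemble the argument list for a PeptideIndexer call."""
    cmd = [
        "PeptideIndexer",
        "-in", in_idxml,
        "-fasta", fasta,
        "-out", out_idxml,
        "-decoy_string", decoy_string,
        "-decoy_string_position", decoy_string_position,
    ]
    if allow_unmatched:
        # Plain flag without a value: tolerate peptides that match no protein.
        cmd.append("-allow_unmatched")
    return cmd


# Placeholder file names, for illustration only.
cmd = build_peptide_indexer_command("ids.idXML", "db_with_decoys.fasta",
                                    "ids_indexed.idXML")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.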

The command line parameters of this tool are:

PeptideIndexer -- Refreshes the protein references for all peptide hits.
Version: 2.3.0 Jan  9 2018, 17:46:23, Revision: 38ae115

Usage:
  PeptideIndexer <options>

Options (mandatory options marked with '*'):
  -in <file>*                      Input idXML file containing the identifications. (valid formats: 'idXML')
  -fasta <file>*                   Input sequence database in FASTA format. Non-existing relative filenames 
                                   are looked up via 'OpenMS.ini:id_db_dir' (valid formats: 'fasta')
  -out <file>*                     Output idXML file. (valid formats: 'idXML')
  -decoy_string <text>             String that was appended (or prefixed - see 'decoy_string_position' flag 
                                   below) to the accessions in the protein database to indicate decoy protein
                                   s. (default: 'DECOY_')
  -decoy_string_position <choice>  Should the 'decoy_string' be prepended (prefix) or appended (suffix) to 
                                   the protein accession? (default: 'prefix' valid: 'prefix', 'suffix')
  -missing_decoy_action <choice>   Action to take if NO peptide was assigned to a decoy protein (which indica
                                   tes wrong database or decoy string): 'error' (exit with error, no output),
                                   'warn' (exit with success, warning message) (default: 'error' valid: 'err
                                   or', 'warn')
  -write_protein_sequence          If set, the protein sequences are stored as well.
  -write_protein_description       If set, the protein description is stored as well.
  -keep_unreferenced_proteins      If set, protein hits which are not referenced by any peptide are kept.
  -allow_unmatched                 If set, unmatched peptide sequences are allowed. By default (i.e. if this 
                                   flag is not set) the program terminates with an error on unmatched peptide
                                   s.
  -full_tolerant_search            If set, all peptide sequences are matched using tolerant search. Thus pote
                                   ntially more proteins (containing ambiguous amino acids) are associated.
                                   This is much slower!
  -aaa_max <number>                [tolerant search only] Maximal number of ambiguous amino acids (AAAs) allo
                                   wed when matching to a protein database with AAAs. AAAs are 'B', 'Z' and
                                   'X' (default: '4' min: '0')
  -mismatches_max <number>         [tolerant search only] Maximal number of real mismatches (will be used 
                                   after checking for ambiguous AA's (see 'aaa_max' option). In general this
                                   param should only be changed if you want to look for other potential origi
                                   ns of a peptide which might have unknown SNPs or the like. (default: '0'
                                   min: '0')
  -IL_equivalent                   Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equiv
                                   alent (indistinguishable)
  -filter_aaa_proteins             In the tolerant search for matches to proteins with ambiguous amino acids 
                                   (AAAs), rebuild the search database to only consider proteins with AAAs.
                                   This may save time if most proteins don't contain AAAs and if there is a
                                   significant number of peptides that enter the tolerant search.
  -log <text>                      Name of log file (created only when specified)
  -debug <number>                  Sets the debug level (default: '0')

enzyme:
  -enzyme:name <choice>            Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after 
                                   lysine (K) or arginine (R), but not before proline (P). (default: 'Trypsin
                                   ' valid: 'Lys-C/P', 'Lys-N', 'leukocyte elastase', 'proline endopeptidase'
                                   , 'Trypsin/P', 'V8-DE', 'V8-E', 'Alpha-lytic protease', 'Lys-C', 'Asp-N',
                                   'Asp-N_ambic', 'Trypsin', 'glutamyl endopeptidase', '2-iodobenzoate', 'Try
                                   pChymo', 'Asp-N/B', 'unspecific cleavage', 'Chymotrypsin', 'PepsinA', 'Arg
                                   -C', 'CNBr', 'Formic_acid', 'Chymotrypsin/P', 'no cleavage', 'Arg-C/P')
  -enzyme:specificity <choice>     Specificity of the enzyme.
                                   'full': both internal cleavage sites must match.
                                   'semi': one of two internal cleavage sites must match.
                                   'none': allow all peptide hits no matter their context. Therefore, the
                                   enzyme chosen does not play a role here (default: 'full' valid: 'full',
                                   'semi', 'none')

                                   
Common TOPP options:
  -ini <file>                      Use the given TOPP INI file
  -threads <n>                     Sets the number of threads allowed to be used by the TOPP tool (default: 
                                   '1')
  -write_ini <file>                Writes the default configuration file
  --help                           Shows options
  --helphelp                       Shows all options (including advanced)
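
The interplay of the aaa_max, mismatches_max and IL_equivalent options can be pictured with a simplified matcher. This is a naive sketch for illustration only and not how OpenMS searches the database; it assumes that 'B' stands for D or N, 'Z' for E or Q, and 'X' for any residue, and the function names are hypothetical.

```python
from typing import List

# Residues that an ambiguous database symbol can stand for ('X' matches anything).
AMBIGUOUS = {"B": {"D", "N"}, "Z": {"E", "Q"}}


def _normalize(seq: str, il_equivalent: bool) -> str:
    """Optionally collapse isoleucine onto leucine."""
    return seq.replace("I", "L") if il_equivalent else seq


def matches_at(peptide: str, protein: str, start: int,
               aaa_max: int = 4, mismatches_max: int = 0,
               il_equivalent: bool = False) -> bool:
    """Check whether the peptide fits the protein at the given offset."""
    window = protein[start:start + len(peptide)]
    if len(window) < len(peptide):
        return False
    pep = _normalize(peptide, il_equivalent)
    win = _normalize(window, il_equivalent)
    aaa = mismatches = 0
    for p, q in zip(pep, win):
        if p == q:
            continue
        if q == "X" or p in AMBIGUOUS.get(q, ()):
            aaa += 1          # explained by an ambiguous amino acid
        else:
            mismatches += 1   # a real mismatch
        if aaa > aaa_max or mismatches > mismatches_max:
            return False
    return True


def find_matches(peptide: str, protein: str, **options) -> List[int]:
    """Return all start offsets at which the peptide matches the protein."""
    last = len(protein) - len(peptide) + 1
    return [i for i in range(last) if matches_at(peptide, protein, i, **options)]
```

With the defaults, a peptide containing N matches a database entry that has B at that position, while setting aaa_max to 0 rejects the same hit.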

INI file documentation of this tool:

+PeptideIndexer: Refreshes the protein references for all peptide hits.
  version: Version of the tool that generated this parameters file. (default: '2.3.0')
  ++1: Instance '1' section for 'PeptideIndexer'
    in: Input idXML file containing the identifications. (input file, valid formats: '*.idXML')
    fasta: Input sequence database in FASTA format. Non-existing relative filenames are looked up via 'OpenMS.ini:id_db_dir' (input file, valid formats: '*.fasta')
    out: Output idXML file. (output file, valid formats: '*.idXML')
    decoy_string: String that was appended (or prefixed - see 'decoy_string_position' flag below) to the accessions in the protein database to indicate decoy proteins. (default: 'DECOY_')
    decoy_string_position: Should the 'decoy_string' be prepended (prefix) or appended (suffix) to the protein accession? (default: 'prefix' valid: 'prefix', 'suffix')
    missing_decoy_action: Action to take if NO peptide was assigned to a decoy protein (which indicates wrong database or decoy string): 'error' (exit with error, no output), 'warn' (exit with success, warning message) (default: 'error' valid: 'error', 'warn')
    write_protein_sequence: If set, the protein sequences are stored as well. (default: 'false' valid: 'true', 'false')
    write_protein_description: If set, the protein description is stored as well. (default: 'false' valid: 'true', 'false')
    keep_unreferenced_proteins: If set, protein hits which are not referenced by any peptide are kept. (default: 'false' valid: 'true', 'false')
    allow_unmatched: If set, unmatched peptide sequences are allowed. By default (i.e. if this flag is not set) the program terminates with an error on unmatched peptides. (default: 'false' valid: 'true', 'false')
    full_tolerant_search: If set, all peptide sequences are matched using tolerant search. Thus potentially more proteins (containing ambiguous amino acids) are associated. This is much slower! (default: 'false' valid: 'true', 'false')
    aaa_max: [tolerant search only] Maximal number of ambiguous amino acids (AAAs) allowed when matching to a protein database with AAAs. AAAs are 'B', 'Z' and 'X' (default: '4' range: 0:∞)
    mismatches_max: [tolerant search only] Maximal number of real mismatches (will be used after checking for ambiguous AA's (see 'aaa_max' option). In general this param should only be changed if you want to look for other potential origins of a peptide which might have unknown SNPs or the like. (default: '0' range: 0:∞)
    IL_equivalent: Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equivalent (indistinguishable) (default: 'false' valid: 'true', 'false')
    filter_aaa_proteins: In the tolerant search for matches to proteins with ambiguous amino acids (AAAs), rebuild the search database to only consider proteins with AAAs. This may save time if most proteins don't contain AAAs and if there is a significant number of peptides that enter the tolerant search. (default: 'false' valid: 'true', 'false')
    log: Name of log file (created only when specified)
    debug: Sets the debug level (default: '0')
    threads: Sets the number of threads allowed to be used by the TOPP tool (default: '1')
    no_progress: Disables progress logging to command line (default: 'false' valid: 'true', 'false')
    force: Overwrite tool specific checks. (default: 'false' valid: 'true', 'false')
    test: Enables the test mode (needed for internal use only) (default: 'false' valid: 'true', 'false')
    +++enzyme
      name: Enzyme which determines valid cleavage sites - e.g. trypsin cleaves after lysine (K) or arginine (R), but not before proline (P). (default: 'Trypsin' valid: 'Lys-C/P', 'Lys-N', 'leukocyte elastase', 'proline endopeptidase', 'Trypsin/P', 'V8-DE', 'V8-E', 'Alpha-lytic protease', 'Lys-C', 'Asp-N', 'Asp-N_ambic', 'Trypsin', 'glutamyl endopeptidase', '2-iodobenzoate', 'TrypChymo', 'Asp-N/B', 'unspecific cleavage', 'Chymotrypsin', 'PepsinA', 'Arg-C', 'CNBr', 'Formic_acid', 'Chymotrypsin/P', 'no cleavage', 'Arg-C/P')
      specificity: Specificity of the enzyme.
        'full': both internal cleavage sites must match.
        'semi': one of two internal cleavage sites must match.
        'none': allow all peptide hits no matter their context. Therefore, the enzyme chosen does not play a role here
        (default: 'full' valid: 'full', 'semi', 'none')
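
The three specificity levels can be illustrated for trypsin with a short sketch. It is a simplified model and not the OpenMS implementation: it assumes that protein termini always count as valid cleavage sites, it hard-codes the trypsin rule quoted above (cleave after K or R, but not before P), and is_valid_hit is a hypothetical name.

```python
def _is_tryptic_site(protein: str, pos: int) -> bool:
    """True if a cleavage between protein[pos - 1] and protein[pos] is allowed."""
    if pos == 0 or pos == len(protein):
        return True  # protein termini are treated as valid sites (assumption)
    return protein[pos - 1] in "KR" and protein[pos] != "P"


def is_valid_hit(protein: str, start: int, length: int,
                 specificity: str = "full") -> bool:
    """Check a peptide occurrence against the requested enzyme specificity."""
    n_term_ok = _is_tryptic_site(protein, start)
    c_term_ok = _is_tryptic_site(protein, start + length)
    if specificity == "full":
        return n_term_ok and c_term_ok
    if specificity == "semi":
        return n_term_ok or c_term_ok
    if specificity == "none":
        return True
    raise ValueError("specificity must be 'full', 'semi' or 'none'")


protein = "MAAKTIDERGGGKPLLR"
print(is_valid_hit(protein, 4, 5, "full"))   # TIDER, flanked by K and R
print(is_valid_hit(protein, 9, 4, "full"))   # GGGK, followed by proline
```

In this model the peptide GGGK is rejected under 'full' because the lysine is followed by proline, but it is accepted under 'semi' since its N-terminal side follows an arginine.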