PeptideIndexing Class Reference

Refreshes the protein references for all peptide hits in a vector of PeptideIdentifications and adds target/decoy information.

#include <OpenMS/ANALYSIS/ID/PeptideIndexing.h>

Inheritance diagram for PeptideIndexing:
DefaultParamHandler

Public Types

enum  ExitCodes {
  EXECUTION_OK, DATABASE_EMPTY, PEPTIDE_IDS_EMPTY, DATABASE_CONTAINS_MULTIPLES,
  ILLEGAL_PARAMETERS, UNEXPECTED_RESULT
}
 Exit codes.
 

Public Member Functions

 PeptideIndexing ()
 Default constructor.
 
virtual ~PeptideIndexing ()
 Default destructor.
 
ExitCodes run (std::vector< FASTAFile::FASTAEntry > &proteins, std::vector< ProteinIdentification > &prot_ids, std::vector< PeptideIdentification > &pep_ids)
 Main method of PeptideIndexing.
 
- Public Member Functions inherited from DefaultParamHandler
 DefaultParamHandler (const String &name)
 Constructor with name that is displayed in error messages.
 
 DefaultParamHandler (const DefaultParamHandler &rhs)
 Copy constructor.
 
virtual ~DefaultParamHandler ()
 Destructor.
 
virtual DefaultParamHandler & operator= (const DefaultParamHandler &rhs)
 Assignment operator.
 
virtual bool operator== (const DefaultParamHandler &rhs) const
 Equality operator.
 
void setParameters (const Param &param)
 Sets the parameters.
 
const Param & getParameters () const
 Non-mutable access to the parameters.
 
const Param & getDefaults () const
 Non-mutable access to the default parameters.
 
const String & getName () const
 Non-mutable access to the name.
 
void setName (const String &name)
 Mutable access to the name.
 
const std::vector< String > & getSubsections () const
 Non-mutable access to the registered subsections.
 

Protected Member Functions

virtual void updateMembers_ ()
 This method is used to update extra member variables at the end of the setParameters() method.
 
void writeLog_ (const String &text) const
 
void writeDebug_ (const String &text, const Size min_level) const
 
- Protected Member Functions inherited from DefaultParamHandler
void defaultsToParam_ ()
 Updates the parameters after the defaults have been set in the constructor.
 

Protected Attributes

String log_file_
 Output stream for log/debug info.
 
std::ofstream log_
 
bool debug_
 Debug flag.
 
String decoy_string_
 
bool prefix_
 
String missing_decoy_action_
 
String enzyme_name_
 
String enzyme_specificity_
 
bool write_protein_sequence_
 
bool write_protein_description_
 
bool keep_unreferenced_proteins_
 
bool allow_unmatched_
 
bool full_tolerant_search_
 
bool IL_equivalent_
 
Size aaa_max_
 
UInt mismatches_max_
 
bool filter_aaa_proteins_
 
- Protected Attributes inherited from DefaultParamHandler
Param param_
 Container for current parameters.
 
Param defaults_
 Container for default parameters. This member should be filled in the constructor of derived classes!
 
std::vector< String > subsections_
 Container for registered subsections. This member should be filled in the constructor of derived classes!
 
String error_name_
 Name that is displayed in error messages during the parameter checking.
 
bool check_defaults_
 If this member is set to false, no parameter checking is done.
 
bool warn_empty_defaults_
 If this member is set to false, no warning is emitted when defaults are empty.
 

Detailed Description

Refreshes the protein references for all peptide hits in a vector of PeptideIdentifications and adds target/decoy information.

All peptide and protein hits are annotated with target/decoy information, using the meta value "target_decoy". For proteins, the possible values are "target" and "decoy": a protein is labeled "decoy" if its accession contains the decoy pattern (parameter decoy_string), which is expected either as a prefix or as a suffix, depending on the parameter prefix. For peptides, the possible values are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, "target+decoy" peptide hits count as target hits.)

Note
Make sure that the protein names in your database contain a correctly formatted decoy string. This can be ensured by using DecoyDatabase. If the decoy identifier is not recognized, all proteins will be assumed to stem from the target part of the query.
E.g., "sw|P33354_REV|YEHR_ECOLI Uncharacterized lipop..." is invalid, since the tool has no knowledge of how SwissProt entries are built up. A correct identifier could be "rev_sw|P33354|YEHR_ECOLI Uncharacterized li ..." or "sw|P33354|YEHR_ECOLI_rev Uncharacterized li", depending on whether you are using prefix annotation or not.
This tool will also report some helpful target/decoy statistics when it is done.
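
The annotation rules above can be illustrated with a minimal standalone sketch. It is not the OpenMS implementation: the helper names isDecoyAccession and peptideTargetDecoy are hypothetical, and the sketch assumes a plain, case-sensitive comparison of the decoy string against the start or end of the accession.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of OpenMS): true if the accession carries the
// decoy string at the position selected by 'prefix' (start if true, end if false).
bool isDecoyAccession(const std::string& accession,
                      const std::string& decoy_string,
                      bool prefix)
{
  assert(!decoy_string.empty());
  if (accession.size() < decoy_string.size()) return false;
  const std::size_t offset = prefix ? 0 : accession.size() - decoy_string.size();
  return accession.compare(offset, decoy_string.size(), decoy_string) == 0;
}

// Hypothetical helper: derives the peptide-level "target_decoy" value from the
// accessions of all proteins the peptide was found in.
std::string peptideTargetDecoy(const std::vector<std::string>& accessions,
                               const std::string& decoy_string,
                               bool prefix)
{
  bool in_target = false;
  bool in_decoy = false;
  for (const std::string& acc : accessions)
  {
    if (isDecoyAccession(acc, decoy_string, prefix)) in_decoy = true;
    else in_target = true;
  }
  if (in_target && in_decoy) return "target+decoy";
  if (in_decoy) return "decoy";
  if (in_target) return "target";
  return ""; // no matching protein; how to handle this is left open in this sketch
}
```

Note that the invalid identifier from the note, with "_REV" in the middle of the accession, is classified as target by either setting, which is the failure mode the note warns about.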

By default this tool will fail if an unmatched peptide occurs, i.e. if the database does not contain the corresponding protein. You can force it to return successfully in this case by using the flag allow_unmatched.

Some search engines (such as Mascot) will replace ambiguous amino acids ('B', 'Z', and 'X') in the protein database with unambiguous amino acids in the reported peptides, e.g. exchange 'X' with 'H'. This will cause such peptides to not be found by exactly matching their sequences to the database. However, we can recover these cases by using tolerant search (done automatically).

Two search modes are available: exact and tolerant.

The exact mode is much faster (about 10 times) and consumes less memory (about 2.5 times), but might fail to report a few protein hits with ambiguous amino acids for some peptides. Usually these proteins are putative, however. The exact mode also supports usage of multiple threads (threads option) to speed up computation even further, at the cost of some memory. If tolerant searching needs to be done for unassigned peptides, the latter will consume the major share of the runtime. Independent of whether exact or tolerant search is used, we require ambiguous amino acids in peptide sequences to match exactly in the protein DB (i.e. 'X' in a peptide only matches 'X' in the database).
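
The asymmetry described above (an ambiguous residue in the protein may stand in for a concrete residue in the peptide, but an ambiguous residue in the peptide only matches itself) can be sketched as follows. This is a naive scan for illustration, not the algorithm used by OpenMS. The function names are hypothetical, the expansion of 'B' (D or N), 'Z' (E or Q) and 'X' (any residue) follows the standard one-letter code, and reading aaa_max as "maximum number of ambiguous protein residues per match" is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: does protein residue p accept peptide residue q?
// 'used_ambiguity' reports whether an ambiguous protein residue was needed.
bool residueMatches(char p, char q, bool& used_ambiguity)
{
  used_ambiguity = false;
  if (p == q) return true; // exact match, including 'X' against 'X'
  // Ambiguous residues in the peptide must match exactly, so they fail here.
  if (q == 'B' || q == 'Z' || q == 'X') return false;
  const bool covered = (p == 'X') ||
                       (p == 'B' && (q == 'D' || q == 'N')) ||
                       (p == 'Z' && (q == 'E' || q == 'Q'));
  used_ambiguity = covered;
  return covered;
}

// Hypothetical helper: first position at which the peptide fits the protein
// using at most aaa_max ambiguous protein residues, or std::string::npos.
std::size_t findTolerant(const std::string& protein,
                         const std::string& peptide,
                         std::size_t aaa_max)
{
  if (peptide.empty() || peptide.size() > protein.size()) return std::string::npos;
  for (std::size_t start = 0; start + peptide.size() <= protein.size(); ++start)
  {
    std::size_t ambiguous = 0;
    bool ok = true;
    for (std::size_t i = 0; i < peptide.size() && ok; ++i)
    {
      bool used = false;
      ok = residueMatches(protein[start + i], peptide[i], used);
      if (ok && used && ++ambiguous > aaa_max) ok = false;
    }
    if (ok) return start;
  }
  return std::string::npos;
}
```

With aaa_max set to 0 the scan degenerates to an exact substring search, which mirrors the relation between the two modes.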

Leucine/Isoleucine: Further complications can arise due to the presence of the isobaric amino acids isoleucine ('I') and leucine ('L') in protein sequences. Since the two have the exact same chemical composition and mass, they generally cannot be distinguished by mass spectrometry. If a peptide containing 'I' was reported as a match for a spectrum, a peptide containing 'L' instead would be an equally good match (and vice versa). To account for this inherent ambiguity, setting the flag IL_equivalent causes 'I' and 'L' to be considered as indistinguishable.
For example, if the sequence "PEPTIDE" (matching "Protein1") was identified as a search hit, but the database additionally contained "PEPTLDE" (matching "Protein2"), running PeptideIndexer with the IL_equivalent option would report both "Protein1" and "Protein2" as accessions for "PEPTIDE". (This is independent of the error-tolerant search controlled by full_tolerant_search and aaa_max.)
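
The "PEPTIDE"/"PEPTLDE" example can be reproduced with a small standalone sketch. It assumes that I/L equivalence can be modeled by mapping every 'I' to 'L' in both peptide and protein before comparison; the helper names are hypothetical and the real implementation may work differently.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: replaces every 'I' by 'L' so both compare equal.
std::string collapseIL(std::string seq)
{
  std::replace(seq.begin(), seq.end(), 'I', 'L');
  return seq;
}

// Hypothetical helper: accessions of all proteins (accession, sequence)
// that contain the peptide, optionally treating 'I' and 'L' as identical.
std::vector<std::string> matchingAccessions(
    const std::string& peptide,
    const std::vector<std::pair<std::string, std::string>>& proteins,
    bool il_equivalent)
{
  std::vector<std::string> hits;
  const std::string query = il_equivalent ? collapseIL(peptide) : peptide;
  for (const auto& protein : proteins)
  {
    const std::string target = il_equivalent ? collapseIL(protein.second) : protein.second;
    if (target.find(query) != std::string::npos) hits.push_back(protein.first);
  }
  return hits;
}
```

Given one protein containing "PEPTIDE" and another containing "PEPTLDE", the query "PEPTIDE" yields only the first accession without the flag and both accessions with it.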

Enzyme specificity: Once a peptide sequence is found in a protein sequence, this does not imply that the hit is valid! This is where enzyme specificity comes into play. By default, we demand that the peptide is fully tryptic (i.e. the enzyme parameter is set to "trypsin" and specificity is "full"). So unless the peptide coincides with C- and/or N-terminus of the protein, the peptide's cleavage pattern should fulfill the trypsin cleavage rule [KR][^P]. We make one exception for peptides starting at the second amino acid of a protein if the first amino acid of that protein is methionine (M), which is usually cleaved off in vivo. For example, the two peptides AAAR and MAAAR would both match a protein starting with MAAAR. You can relax the requirements further by choosing semi-tryptic (only one of two "internal" termini must match requirements) or none (essentially allowing all hits, no matter their context).
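
The trypsin check described above can be sketched as a standalone function. The names Specificity, isTrypticSite and isValidHit are hypothetical and not part of OpenMS. The sketch assumes that a terminus coinciding with a protein terminus counts as fulfilled, and it reads "semi" as "at least one of the two termini is fulfilled", which is one possible interpretation of the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Specificity { Full, Semi, None };

// True if trypsin would cleave between protein[pos - 1] and protein[pos],
// i.e. the rule [KR][^P] holds at that boundary.
bool isTrypticSite(const std::string& protein, std::size_t pos)
{
  assert(pos > 0 && pos < protein.size());
  const char before = protein[pos - 1];
  return (before == 'K' || before == 'R') && protein[pos] != 'P';
}

// Hypothetical helper: is the peptide protein[start, start + length) a valid
// hit under the given specificity?
bool isValidHit(const std::string& protein, std::size_t start,
                std::size_t length, Specificity spec)
{
  assert(length > 0 && start + length <= protein.size());
  if (spec == Specificity::None) return true;
  const std::size_t end = start + length;
  // N-terminus: protein start, position 2 after an initial Met, or a cleavage site.
  const bool n_ok = start == 0 ||
                    (start == 1 && protein[0] == 'M') ||
                    isTrypticSite(protein, start);
  // C-terminus: protein end or a cleavage site.
  const bool c_ok = end == protein.size() || isTrypticSite(protein, end);
  return spec == Specificity::Full ? (n_ok && c_ok) : (n_ok || c_ok);
}
```

For a protein starting with MAAAR, both AAAR (start 1) and MAAAR (start 0) pass the full check, matching the methionine exception above.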

Member Enumeration Documentation

◆ ExitCodes

enum ExitCodes

Exit codes.

Enumerator
EXECUTION_OK 
DATABASE_EMPTY 
PEPTIDE_IDS_EMPTY 
DATABASE_CONTAINS_MULTIPLES 
ILLEGAL_PARAMETERS 
UNEXPECTED_RESULT 
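
Callers of run() typically want to report the returned code in a readable form. A minimal sketch of that, using a standalone mirror of the enumeration so that it compiles without OpenMS; the helper exitCodeName is hypothetical, and in real code PeptideIndexing::ExitCodes would be used instead of the mirror.

```cpp
#include <cassert>
#include <string>

// Standalone mirror of PeptideIndexing::ExitCodes (same enumerators, same order).
enum ExitCodes
{
  EXECUTION_OK, DATABASE_EMPTY, PEPTIDE_IDS_EMPTY, DATABASE_CONTAINS_MULTIPLES,
  ILLEGAL_PARAMETERS, UNEXPECTED_RESULT
};

// Hypothetical helper: enumerator name for logging or error messages.
std::string exitCodeName(ExitCodes code)
{
  switch (code)
  {
    case EXECUTION_OK: return "EXECUTION_OK";
    case DATABASE_EMPTY: return "DATABASE_EMPTY";
    case PEPTIDE_IDS_EMPTY: return "PEPTIDE_IDS_EMPTY";
    case DATABASE_CONTAINS_MULTIPLES: return "DATABASE_CONTAINS_MULTIPLES";
    case ILLEGAL_PARAMETERS: return "ILLEGAL_PARAMETERS";
    case UNEXPECTED_RESULT: return "UNEXPECTED_RESULT";
  }
  return "UNKNOWN"; // value outside the enumeration
}
```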

Constructor & Destructor Documentation

◆ PeptideIndexing()

Default constructor.

◆ ~PeptideIndexing()

virtual ~PeptideIndexing ( )
virtual

Default destructor.

Member Function Documentation

◆ run()

ExitCodes run ( std::vector< FASTAFile::FASTAEntry > &  proteins,
std::vector< ProteinIdentification > &  prot_ids,
std::vector< PeptideIdentification > &  pep_ids 
)

◆ updateMembers_()

virtual void updateMembers_ ( )
protected virtual

This method is used to update extra member variables at the end of the setParameters() method.

Also call it at the end of the derived classes' copy constructor and assignment operator.

The default implementation is empty.

Reimplemented from DefaultParamHandler.

◆ writeDebug_()

void writeDebug_ ( const String &  text,
const Size  min_level 
) const
protected

◆ writeLog_()

void writeLog_ ( const String &  text ) const
protected

Member Data Documentation

◆ aaa_max_

Size aaa_max_
protected

◆ allow_unmatched_

bool allow_unmatched_
protected

◆ debug_

bool debug_
protected

Debug flag.

◆ decoy_string_

String decoy_string_
protected

◆ enzyme_name_

String enzyme_name_
protected

◆ enzyme_specificity_

String enzyme_specificity_
protected

◆ filter_aaa_proteins_

bool filter_aaa_proteins_
protected

◆ full_tolerant_search_

bool full_tolerant_search_
protected

◆ IL_equivalent_

bool IL_equivalent_
protected

◆ keep_unreferenced_proteins_

bool keep_unreferenced_proteins_
protected

◆ log_

std::ofstream log_
mutable protected

◆ log_file_

String log_file_
protected

Output stream for log/debug info.

◆ mismatches_max_

UInt mismatches_max_
protected

◆ missing_decoy_action_

String missing_decoy_action_
protected

◆ prefix_

bool prefix_
protected

◆ write_protein_description_

bool write_protein_description_
protected

◆ write_protein_sequence_

bool write_protein_sequence_
protected

OpenMS / TOPP release 2.3.0 Documentation generated on Tue Jan 9 2018 18:22:11 using doxygen 1.8.13