Refreshes the protein references for all peptide hits in a vector of PeptideIdentifications and adds target/decoy information.

#include <OpenMS/ANALYSIS/ID/PeptideIndexing.h>

Inherits DefaultParamHandler.

Public Types
enum	ExitCodes { EXECUTION_OK, DATABASE_EMPTY, PEPTIDE_IDS_EMPTY, DATABASE_CONTAINS_MULTIPLES, ILLEGAL_PARAMETERS, UNEXPECTED_RESULT }
	Exit codes.

Public Member Functions
	PeptideIndexing ()
	Default constructor.

virtual	~PeptideIndexing ()
	Default destructor.

ExitCodes	run (std::vector< FASTAFile::FASTAEntry > &proteins, std::vector< ProteinIdentification > &prot_ids, std::vector< PeptideIdentification > &pep_ids)
	Main method of PeptideIndexing.

Public Member Functions inherited from DefaultParamHandler
	DefaultParamHandler (const String &name)
	Constructor with name that is displayed in error messages.

	DefaultParamHandler (const DefaultParamHandler &rhs)
	Copy constructor.

virtual	~DefaultParamHandler ()
	Destructor.

virtual DefaultParamHandler &	operator= (const DefaultParamHandler &rhs)
	Assignment operator.

virtual bool	operator== (const DefaultParamHandler &rhs) const
	Equality operator.

void	setParameters (const Param &param)
	Sets the parameters.

const Param &	getParameters () const
	Non-mutable access to the parameters.

const Param &	getDefaults () const
	Non-mutable access to the default parameters.

const String &	getName () const
	Non-mutable access to the name.

void	setName (const String &name)
	Mutable access to the name.

const std::vector< String > &	getSubsections () const
	Non-mutable access to the registered subsections.

Protected Member Functions
virtual void	updateMembers_ ()
	This method is used to update extra member variables at the end of the setParameters() method.

void	writeLog_ (const String &text) const

void	writeDebug_ (const String &text, const Size min_level) const

Protected Member Functions inherited from DefaultParamHandler
void	defaultsToParam_ ()
	Updates the parameters after the defaults have been set in the constructor.

Protected Attributes
String	log_file_
	Output stream for log/debug info.

std::ofstream	log_

bool	debug_
	Debug flag.

String	decoy_string_

bool	prefix_

String	missing_decoy_action_

String	enzyme_name_

String	enzyme_specificity_

bool	write_protein_sequence_

bool	write_protein_description_

bool	keep_unreferenced_proteins_

bool	allow_unmatched_

bool	full_tolerant_search_

bool	IL_equivalent_

Size	aaa_max_

UInt	mismatches_max_

bool	filter_aaa_proteins_

Protected Attributes inherited from DefaultParamHandler
Param	param_
	Container for current parameters.

Param	defaults_
	Container for default parameters. This member should be filled in the constructor of derived classes!

std::vector< String >	subsections_
	Container for registered subsections. This member should be filled in the constructor of derived classes!

String	error_name_
	Name that is displayed in error messages during the parameter checking.

bool	check_defaults_
	If this member is set to false, no parameter checking is done.

bool	warn_empty_defaults_
	If this member is set to false, no warning is emitted when defaults are empty.

Detailed Description

Refreshes the protein references for all peptide hits in a vector of PeptideIdentifications and adds target/decoy information.

All peptide and protein hits are annotated with target/decoy information, using the meta value "target_decoy". For proteins, the possible values are "target" and "decoy": a protein is a decoy if its accession contains the decoy pattern (parameter decoy_string) as a prefix or as a suffix, depending on the parameter prefix, and a target otherwise. For peptides, the possible values are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, "target+decoy" peptide hits count as target hits.)

Note: Make sure that the protein names in your database contain a correctly formatted decoy string. This can be ensured by using DecoyDatabase. If the decoy identifier is not recognized, all proteins are assumed to stem from the target part of the query.
E.g., "sw|P33354_REV|YEHR_ECOLI Uncharacterized lipop..." is invalid, since the tool has no knowledge of how SwissProt entries are built up. A correct identifier could be "rev_sw|P33354|YEHR_ECOLI Uncharacterized li ..." or "sw|P33354|YEHR_ECOLI_rev Uncharacterized li", depending on whether you are using prefix annotation or not.
This tool also reports some helpful target/decoy statistics when it is done.
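The annotation rule described above can be illustrated with a minimal, self-contained sketch. It is not the OpenMS implementation: the helpers isDecoyAccession and peptideTargetDecoy are hypothetical names introduced only for this example, and the sketch assumes the decoy pattern must sit at the very start or very end of the accession.

```cpp
#include <string>

// Hypothetical helper (not part of OpenMS): returns true if `accession`
// carries `decoy_string` as a prefix (prefix == true) or as a suffix
// (prefix == false). Assumption: the match is case-sensitive.
bool isDecoyAccession(const std::string& accession,
                      const std::string& decoy_string,
                      bool prefix)
{
  if (decoy_string.empty() || accession.size() < decoy_string.size())
  {
    return false;
  }
  const std::string::size_type pos =
    prefix ? 0 : accession.size() - decoy_string.size();
  return accession.compare(pos, decoy_string.size(), decoy_string) == 0;
}

// Hypothetical helper: derives the peptide-level "target_decoy" value from
// the kinds of proteins the peptide was found in. An unmatched peptide
// yields an empty string in this sketch.
std::string peptideTargetDecoy(bool in_target, bool in_decoy)
{
  if (in_target && in_decoy) return "target+decoy";
  if (in_decoy) return "decoy";
  if (in_target) return "target";
  return "";
}
```

With the decoy string "rev_" and prefix annotation, "rev_sw|P33354|YEHR_ECOLI" is classified as a decoy, whereas "sw|P33354_REV|YEHR_ECOLI" is not recognized and is therefore treated as a target.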

By default this tool will fail if an unmatched peptide occurs, i.e. if the database does not contain the corresponding protein. You can force it to return successfully in this case by using the flag allow_unmatched.

Some search engines (such as Mascot) will replace ambiguous amino acids ('B', 'Z', and 'X') in the protein database with unambiguous amino acids in the reported peptides, e.g. exchange 'X' with 'H'. This will cause such peptides to not be found by exactly matching their sequences to the database. However, we can recover these cases by using tolerant search (done automatically).

Two search modes are available:

exact: [default mode] Peptide sequences must match the protein database exactly. If at least one protein hit is found, no tolerant search is used for this peptide. If no protein can be found for a peptide, tolerant matching is used for it automatically.
tolerant: Allows ambiguous amino acids in the protein sequence, e.g., 'M' in a peptide will match 'X' in a protein. This mode might yield more protein hits for some peptides (those that contain ambiguous amino acids). Tolerant search also allows for real sequence mismatches (see 'mismatches_max'), in case you want to find related proteins which might be the origin of a peptide if it had a SNP, for example. The runtime increase is moderate when allowing a single mismatch, but rises drastically for two or more.

The exact mode is much faster (about 10 times) and consumes less memory (about 2.5 times), but might fail to report a few protein hits with ambiguous amino acids for some peptides. Usually these proteins are putative, however. The exact mode also supports usage of multiple threads (threads option) to speed up computation even further, at the cost of some memory. If tolerant searching needs to be done for unassigned peptides, the latter will consume the major share of the runtime. Independent of whether exact or tolerant search is used, we require ambiguous amino acids in peptide sequences to match exactly in the protein DB (i.e. 'X' in a peptide only matches 'X' in the database).
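The matching rules of the tolerant mode can be sketched with a naive scan. This is only an illustration under stated assumptions, not the algorithm used by PeptideIndexing: the functions ambiguousMatch and tolerantFind are hypothetical, 'B' is taken to stand for D/N and 'Z' for E/Q, and the two limits mirror the roles of aaa_max and mismatches_max.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: does an ambiguous residue in the PROTEIN cover the
// given peptide residue? Ambiguous residues in the peptide are never
// expanded, so 'X' in a peptide can only match 'X' in the database.
bool ambiguousMatch(char protein_aa, char peptide_aa)
{
  if (peptide_aa == 'B' || peptide_aa == 'Z' || peptide_aa == 'X')
  {
    return false;
  }
  switch (protein_aa)
  {
    case 'X': return true;                                    // any residue
    case 'B': return peptide_aa == 'D' || peptide_aa == 'N';  // Asp or Asn
    case 'Z': return peptide_aa == 'E' || peptide_aa == 'Q';  // Glu or Gln
    default:  return false;
  }
}

// Hypothetical helper: true if `peptide` occurs somewhere in `protein`
// using at most `aaa_max` ambiguous-residue matches and at most
// `mismatches_max` real mismatches. Quadratic scan, for clarity only.
bool tolerantFind(const std::string& protein, const std::string& peptide,
                  std::size_t aaa_max, std::size_t mismatches_max)
{
  if (peptide.empty() || peptide.size() > protein.size()) return false;
  for (std::size_t start = 0; start + peptide.size() <= protein.size(); ++start)
  {
    std::size_t aaa = 0;
    std::size_t mismatches = 0;
    bool ok = true;
    for (std::size_t i = 0; i < peptide.size() && ok; ++i)
    {
      const char p = protein[start + i];
      const char q = peptide[i];
      if (p == q) continue;  // identical residues always match
      if (ambiguousMatch(p, q)) ok = (++aaa <= aaa_max);
      else ok = (++mismatches <= mismatches_max);
    }
    if (ok) return true;
  }
  return false;
}
```

Calling tolerantFind with both limits set to zero reduces to an exact substring search, which corresponds to the exact mode in this simplified picture.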

Leucine/Isoleucine: Further complications can arise due to the presence of the isobaric amino acids isoleucine ('I') and leucine ('L') in protein sequences. Since the two have the exact same chemical composition and mass, they generally cannot be distinguished by mass spectrometry. If a peptide containing 'I' was reported as a match for a spectrum, a peptide containing 'L' instead would be an equally good match (and vice versa). To account for this inherent ambiguity, setting the flag IL_equivalent causes 'I' and 'L' to be considered as indistinguishable.
For example, if the sequence "PEPTIDE" (matching "Protein1") was identified as a search hit, but the database additionally contained "PEPTLDE" (matching "Protein2"), running PeptideIndexer with the IL_equivalent option would report both "Protein1" and "Protein2" as accessions for "PEPTIDE". (This is independent of the error-tolerant search controlled by full_tolerant_search and aaa_max.)
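One simple way to picture the effect of IL_equivalent is to map both residues to the same letter before comparing sequences. The sketch below does exactly that; it is an assumption about how such equivalence could be realized, not a description of the OpenMS code, and normalizeIL and matchesIL are hypothetical names.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: replaces every 'I' by 'L' so that the two isobaric
// residues compare equal afterwards.
std::string normalizeIL(std::string seq)
{
  std::replace(seq.begin(), seq.end(), 'I', 'L');
  return seq;
}

// Hypothetical helper: exact substring search with I/L treated as
// indistinguishable.
bool matchesIL(const std::string& protein, const std::string& peptide)
{
  return normalizeIL(protein).find(normalizeIL(peptide)) != std::string::npos;
}
```

In this sketch, the peptide "PEPTIDE" matches a protein containing "PEPTLDE" as well as one containing "PEPTIDE", so both accessions would be reported.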

Enzyme specificity: Once a peptide sequence is found in a protein sequence, this does not imply that the hit is valid! This is where enzyme specificity comes into play. By default, we demand that the peptide is fully tryptic (i.e. the enzyme parameter is set to "trypsin" and specificity is "full"). So unless the peptide coincides with C- and/or N-terminus of the protein, the peptide's cleavage pattern should fulfill the trypsin cleavage rule [KR][^P]. We make one exception for peptides starting at the second amino acid of a protein if the first amino acid of that protein is methionine (M), which is usually cleaved off in vivo. For example, the two peptides AAAR and MAAAR would both match a protein starting with MAAAR. You can relax the requirements further by choosing semi-tryptic (only one of two "internal" termini must match requirements) or none (essentially allowing all hits, no matter their context).
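The context check described above can be sketched as follows. This is a simplified illustration for trypsin only, assuming the rule [KR][^P] and the N-terminal methionine exception exactly as stated in the text; the type Specificity and the functions isTrypticSite and isValidHit are hypothetical and do not correspond to OpenMS classes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical enum mirroring the three specificity levels in the text.
enum class Specificity { Full, Semi, None };

// Hypothetical helper: is the boundary between protein[pos - 1] and
// protein[pos] a valid trypsin cleavage site? Protein termini count as valid.
bool isTrypticSite(const std::string& protein, std::size_t pos)
{
  if (pos == 0 || pos >= protein.size()) return true;
  const char before = protein[pos - 1];
  return (before == 'K' || before == 'R') && protein[pos] != 'P';
}

// Hypothetical helper: decides whether a peptide found at [pos, pos + len)
// in `protein` is an acceptable hit under the requested specificity.
bool isValidHit(const std::string& protein, std::size_t pos, std::size_t len,
                Specificity spec)
{
  if (len == 0 || pos + len > protein.size()) return false;
  if (spec == Specificity::None) return true;
  // Exception: peptide starts right after an initial methionine.
  const bool n_ok = isTrypticSite(protein, pos) ||
                    (pos == 1 && protein[0] == 'M');
  const bool c_ok = isTrypticSite(protein, pos + len);
  return spec == Specificity::Full ? (n_ok && c_ok) : (n_ok || c_ok);
}
```

For a protein starting with MAAAR followed by a residue other than proline, both MAAAR (position 0) and AAAR (position 1) pass the fully tryptic check, matching the example in the text.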

Member Enumeration Documentation

◆ ExitCodes

enum ExitCodes

Exit codes.

Enumerator
EXECUTION_OK
DATABASE_EMPTY
PEPTIDE_IDS_EMPTY
DATABASE_CONTAINS_MULTIPLES
ILLEGAL_PARAMETERS
UNEXPECTED_RESULT

Constructor & Destructor Documentation

◆ PeptideIndexing()

PeptideIndexing ( )

Default constructor.

◆ ~PeptideIndexing()

virtual ~PeptideIndexing ( )

virtual

Default destructor.

Member Function Documentation

◆ run()

ExitCodes run ( std::vector< FASTAFile::FASTAEntry > & proteins, std::vector< ProteinIdentification > & prot_ids, std::vector< PeptideIdentification > & pep_ids )

Main method of PeptideIndexing.

Referenced by TOPPOpenPepXLLF::main_(), TOPPOpenPepXL::main_(), and RNPxlSearch::main_().

◆ updateMembers_()

virtual void updateMembers_ ( )

protected virtual

This method is used to update extra member variables at the end of the setParameters() method.

Also call it at the end of the derived classes' copy constructor and assignment operator.

The default implementation is empty.

Reimplemented from DefaultParamHandler.

◆ writeDebug_()

void writeDebug_ ( const String & text, const Size min_level ) const

protected

◆ writeLog_()

void writeLog_ ( const String & text ) const

protected

Member Data Documentation

◆ aaa_max_

Size aaa_max_

protected

◆ allow_unmatched_

bool allow_unmatched_

protected

◆ debug_

bool debug_

protected

Debug flag.

◆ decoy_string_

String decoy_string_

protected

◆ enzyme_name_

String enzyme_name_

protected

◆ enzyme_specificity_

String enzyme_specificity_

protected

◆ filter_aaa_proteins_

bool filter_aaa_proteins_

protected

◆ full_tolerant_search_

bool full_tolerant_search_

protected

◆ IL_equivalent_

bool IL_equivalent_

protected

◆ keep_unreferenced_proteins_

bool keep_unreferenced_proteins_

protected

◆ log_

std::ofstream log_

mutable protected

◆ log_file_

String log_file_

protected

Output stream for log/debug info.

◆ mismatches_max_

UInt mismatches_max_

protected

◆ missing_decoy_action_

String missing_decoy_action_

protected

◆ prefix_

bool prefix_

protected

◆ write_protein_description_

bool write_protein_description_

protected

◆ write_protein_sequence_

bool write_protein_sequence_

protected
