Class for the enzymatic digestion of sequences.

#include <OpenMS/CHEMISTRY/EnzymaticDigestion.h>

Public Types
enum	Specificity { SPEC_NONE = 0, SPEC_SEMI = 1, SPEC_FULL = 2, SPEC_UNKNOWN = 3, SPEC_NOCTERM = 8, SPEC_NONTERM = 9, SIZE_OF_SPECIFICITY = 10 }
	When querying for valid digestion products, this determines whether the specificity of the two peptide ends is taken into account.

Public Member Functions
	EnzymaticDigestion ()
	Default constructor.

	EnzymaticDigestion (const EnzymaticDigestion &rhs)
	Copy constructor.

EnzymaticDigestion &	operator= (const EnzymaticDigestion &rhs)
	Assignment operator.

virtual	~EnzymaticDigestion ()
	Destructor.

Size	getMissedCleavages () const
	Returns the number of missed cleavages for the digestion.

void	setMissedCleavages (Size missed_cleavages)
	Sets the number of missed cleavages for the digestion (default is 0). This setting is ignored when the log model is used.

String	getEnzymeName () const
	Returns the name of the enzyme used for the digestion.

virtual void	setEnzyme (const DigestionEnzyme *enzyme)
	Sets the enzyme for the digestion.

Specificity	getSpecificity () const
	Returns the specificity for the digestion.

void	setSpecificity (Specificity spec)
	Sets the specificity for the digestion (default is SPEC_FULL).

Size	digestUnmodified (const StringView &sequence, std::vector< StringView > &output, Size min_length=1, Size max_length=0) const
	Performs the enzymatic digestion of an unmodified sequence.

Size	digestUnmodified (const StringView &sequence, std::vector< std::pair< Size, Size >> &output, Size min_length=1, Size max_length=0) const
	Performs the enzymatic digestion of an unmodified sequence.

bool	isValidProduct (const String &protein, int pep_pos, int pep_length, bool ignore_missed_cleavages=true) const
	Is the peptide fragment starting at position `pep_pos` with length `pep_length` within the sequence `protein` generated by the current enzyme?

Size	countInternalCleavageSites (const String &sequence) const
	Counts the number of internal cleavage sites (missed cleavages) in a protein sequence.

bool	filterByMissedCleavages (const String &sequence, const std::function< bool(const Int)> &filter) const
	Filter based on the number of missed cleavages.

Static Public Member Functions
static Specificity	getSpecificityByName (const String &name)

Static Public Attributes
static const std::string	NamesOfSpecificity [SIZE_OF_SPECIFICITY]
	Names of the Specificity.

static const std::string	NoCleavage
	Name for no cleavage.

static const std::string	UnspecificCleavage
	Name for unspecific cleavage.

Protected Member Functions
bool	isValidProduct_ (const String &sequence, int pos, int length, bool ignore_missed_cleavages, bool allow_nterm_protein_cleavage, bool allow_random_asp_pro_cleavage) const
	Shared implementation that also supports the functionality of ProteaseDigestion, which is deeply woven into this function. To avoid code duplication, the logic is kept here and called by wrappers. Do not duplicate the code merely for the sake of semantics (unless a clean separation can be found). Note: the overhead of allow_nterm_protein_cleavage and allow_random_asp_pro_cleavage is marginal; most of the runtime is spent in tokenize_().

std::vector< int >	tokenize_ (const String &sequence, int start=0, int end=-1) const
	Digests the sequence using the enzyme's regular expression.

Size	digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< StringView > &output, Size min_length=0, Size max_length=-1) const
	Helper function for digestUnmodified().

Size	digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< std::pair< Size, Size >> &output, Size min_length=0, Size max_length=-1) const

Size	countMissedCleavages_ (const std::vector< int > &cleavage_positions, Size seq_start, Size seq_end) const
	Counts the number of missed cleavages in a sequence fragment.

Protected Attributes
Size	missed_cleavages_
	Number of missed cleavages.

const DigestionEnzyme *	enzyme_
	Used enzyme.

std::unique_ptr< boost::regex >	re_
	Regex for tokenizing (huge speedup by making this a member instead of stack object in tokenize_()).

Specificity	specificity_
	Specificity of the enzyme.

Detailed Description

Class for the enzymatic digestion of sequences.

Digestion can be performed using simple regular expressions, e.g. [KR] | [^P] for trypsin. Missed cleavages can also be modeled, i.e. adjacent peptides that stay joined because the enzyme malfunctioned or could not access the site. If n missed cleavages are given, all possible resulting peptides (cleaved and uncleaved) with up to n missed cleavages are returned; no random selection of just n specific missed cleavage sites is performed.
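
To illustrate the "all peptides with up to n missed cleavages" behaviour described above, here is a minimal standalone sketch. It does not use OpenMS: the functions `fragmentStarts` and `digestWithMissedCleavages` are hypothetical, and the enzyme's regular expression is replaced by a hard-coded trypsin-like rule (cleave after K or R unless the next residue is P).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of OpenMS): start indices of the fragments
// produced by a trypsin-like rule. Index 0 is always included.
std::vector<std::size_t> fragmentStarts(const std::string& seq)
{
  std::vector<std::size_t> starts{0};
  for (std::size_t i = 0; i + 1 < seq.size(); ++i)
  {
    const bool cleavable = (seq[i] == 'K' || seq[i] == 'R');
    if (cleavable && seq[i + 1] != 'P') starts.push_back(i + 1);
  }
  return starts;
}

// Enumerates every peptide that spans at most `max_missed` uncleaved sites.
std::vector<std::string> digestWithMissedCleavages(const std::string& seq,
                                                   std::size_t max_missed)
{
  std::vector<std::string> peptides;
  if (seq.empty()) return peptides;
  const std::vector<std::size_t> starts = fragmentStarts(seq);
  for (std::size_t i = 0; i < starts.size(); ++i)
  {
    for (std::size_t mc = 0; mc <= max_missed && i + mc < starts.size(); ++mc)
    {
      // The peptide ends where the fragment after the last joined one begins.
      const std::size_t end =
          (i + mc + 1 < starts.size()) ? starts[i + mc + 1] : seq.size();
      assert(end > starts[i]); // fragments are never empty
      peptides.push_back(seq.substr(starts[i], end - starts[i]));
    }
  }
  return peptides;
}
```

With the sequence "AKRPGKD" and one allowed missed cleavage, the sketch yields AK, AKRPGK, RPGK, RPGKD and D: every fully cleaved fragment plus every pair of adjacent fragments.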

See also: ProteaseDigestion for functionality specific to protein digestion.

Member Enumeration Documentation

◆ Specificity

enum Specificity

When querying for valid digestion products, this determines whether the specificity of the two peptide ends is taken into account.

Enumerator
SPEC_NONE	no requirements on start / end
SPEC_SEMI	semi specific, i.e., one of the two cleavage sites must fulfill requirements
SPEC_FULL	fully enzyme specific, e.g., tryptic (ends with KR, AA-before is KR), or peptide is at protein terminal ends
SPEC_UNKNOWN
SPEC_NOCTERM	no requirements on CTerm (currently not supported in the class)
SPEC_NONTERM	no requirements on NTerm (currently not supported in the class)
SIZE_OF_SPECIFICITY

Constructor & Destructor Documentation

◆ EnzymaticDigestion() [1/2]

EnzymaticDigestion ( )

Default constructor.

◆ EnzymaticDigestion() [2/2]

EnzymaticDigestion ( const EnzymaticDigestion & rhs )

Copy constructor.

◆ ~EnzymaticDigestion()

virtual ~EnzymaticDigestion ( )

virtual

Destructor.

Member Function Documentation

◆ countInternalCleavageSites()

Size countInternalCleavageSites ( const String & sequence ) const

Counts the number of internal cleavage sites (missed cleavages) in a protein sequence.

Parameters

sequence Sequence

Returns: Number of internal cleavage sites (= missed cleavages in the sequence)

◆ countMissedCleavages_()

Size countMissedCleavages_	(	const std::vector< int > &	cleavage_positions,
		Size	seq_start,
		Size	seq_end
	)		const

protected

Counts the number of missed cleavages in a sequence fragment.

Parameters

cleavage_positions	Positions of cleavage in protein as obtained from tokenize_()
seq_start	Index into sequence
seq_end	Past-the-end index into sequence

Returns: number of missed cleavages of peptide
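
The counting step can be sketched as follows. This is a standalone illustration, not the OpenMS source: the function name `countMissedCleavagesSketch` is hypothetical, and it assumes that a cleavage position counts as missed only when it lies strictly inside the fragment, i.e. not on either boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts cleavage positions that fall strictly between seq_start and seq_end
// (seq_end is a past-the-end index). Assumption: boundary positions are the
// peptide's own termini and therefore are not missed cleavages.
std::size_t countMissedCleavagesSketch(const std::vector<int>& cleavage_positions,
                                       std::size_t seq_start,
                                       std::size_t seq_end)
{
  assert(seq_start <= seq_end);
  const long long lo = static_cast<long long>(seq_start);
  const long long hi = static_cast<long long>(seq_end);
  return static_cast<std::size_t>(
      std::count_if(cleavage_positions.begin(), cleavage_positions.end(),
                    [lo, hi](int pos) { return lo < pos && pos < hi; }));
}
```

For cleavage positions {0, 2, 6}, the fragment [0, 6) contains one missed cleavage (position 2), while [0, 2) contains none.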

◆ digestAfterTokenize_() [1/2]

Size digestAfterTokenize_	(	const std::vector< int > &	fragment_positions,
		const StringView &	sequence,
		std::vector< std::pair< Size, Size >> &	output,
		Size	min_length = `0`,
		Size	max_length = `-1`
	)		const

protected

◆ digestAfterTokenize_() [2/2]

Size digestAfterTokenize_	(	const std::vector< int > &	fragment_positions,
		const StringView &	sequence,
		std::vector< StringView > &	output,
		Size	min_length = `0`,
		Size	max_length = `-1`
	)		const

protected

Helper function for digestUnmodified()

This function implements digestUnmodified() starting from the result of tokenize_(). The separation enables derived classes to modify the result of tokenize_() during the in-silico digestion.

Returns: number of digestion products NOT matching the length restrictions

◆ digestUnmodified() [1/2]

Size digestUnmodified	(	const StringView &	sequence,
		std::vector< std::pair< Size, Size >> &	output,
		Size	min_length = `1`,
		Size	max_length = `0`
	)		const

Performs the enzymatic digestion of an unmodified sequence.

Because only positions into the original string are returned, this is very fast and, unlike the StringView output version of this function, the output is independent of the original sequence. It can be used for matching products, e.g. to determine missing ones.

Todo: could be set of pairs.

Parameters

sequence	Sequence to digest
output	Digestion products as vector of pairs of start and end positions
min_length	Minimal length of reported products
max_length	Maximal length of reported products (0 = no restriction)

Returns: Number of discarded digestion products (which are not matching length restrictions)

◆ digestUnmodified() [2/2]

Size digestUnmodified	(	const StringView &	sequence,
		std::vector< StringView > &	output,
		Size	min_length = `1`,
		Size	max_length = `0`
	)		const

Performs the enzymatic digestion of an unmodified sequence.

By returning only references into the original string this is very fast.

Parameters

sequence	Sequence to digest
output	Digestion products
min_length	Minimal length of reported products
max_length	Maximal length of reported products (0 = no restriction)

Returns: Number of discarded digestion products (which are not matching length restrictions)
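
The length parameters and the return value can be illustrated with a small standalone sketch. The function `keepByLength` is hypothetical and operates on already digested products; it only mirrors the documented conventions (a `max_length` of 0 means no upper bound, and the return value is the number of discarded products). Clearing `output` first is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Keeps products whose length lies in [min_length, max_length] and returns
// how many were discarded. max_length == 0 disables the upper bound.
std::size_t keepByLength(const std::vector<std::string>& products,
                         std::vector<std::string>& output,
                         std::size_t min_length = 1,
                         std::size_t max_length = 0)
{
  assert(max_length == 0 || min_length <= max_length);
  output.clear();
  std::size_t discarded = 0;
  for (const std::string& p : products)
  {
    const bool too_short = p.size() < min_length;
    const bool too_long = (max_length != 0) && (p.size() > max_length);
    if (too_short || too_long)
    {
      ++discarded;
    }
    else
    {
      output.push_back(p);
    }
  }
  return discarded;
}
```

Callers that need to know whether the length window hid any products can check the returned count instead of comparing input and output sizes.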

◆ filterByMissedCleavages()

bool filterByMissedCleavages	(	const String &	sequence,
		const std::function< bool(const Int)> &	filter
	)		const

Filter based on the number of missed cleavages.

Parameters

sequence	Unmodified (!) amino acid sequence to check.
filter	A predicate that takes as parameter the number of missed cleavages in the sequence and returns true if the sequence should be filtered out.

Returns: Whether the sequence should be filtered out.
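
The predicate convention (return true to filter the sequence out) can be sketched without OpenMS as follows. Both function names are hypothetical, and `countInternalSitesSketch` stands in for countInternalCleavageSites() using a hard-coded trypsin-like rule rather than the configured enzyme.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Stand-in for countInternalCleavageSites(): counts sites after K or R that
// are not followed by P. A site after the last residue is not internal.
int countInternalSitesSketch(const std::string& seq)
{
  int sites = 0;
  for (std::size_t i = 0; i + 1 < seq.size(); ++i)
  {
    if ((seq[i] == 'K' || seq[i] == 'R') && seq[i + 1] != 'P') ++sites;
  }
  return sites;
}

// Passes the missed-cleavage count to the predicate; true means "filter out".
bool filterByMissedCleavagesSketch(const std::string& seq,
                                   const std::function<bool(int)>& filter)
{
  assert(static_cast<bool>(filter)); // the predicate must be callable
  return filter(countInternalSitesSketch(seq));
}
```

A typical predicate is a lambda such as `[](int mc) { return mc > 1; }`, which rejects every sequence with more than one missed cleavage.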

Referenced by IDFilter::PeptideDigestionFilter::operator()().

◆ getEnzymeName()

String getEnzymeName ( ) const

Returns the name of the enzyme used for the digestion.

◆ getMissedCleavages()

Size getMissedCleavages ( ) const

Returns the number of missed cleavages for the digestion.

◆ getSpecificity()

Specificity getSpecificity ( ) const

Returns the specificity for the digestion.

◆ getSpecificityByName()

static Specificity getSpecificityByName ( const String & name )

static

Converts a specificity name to the corresponding enum value. Returns SPEC_UNKNOWN if the name is not valid.

◆ isValidProduct()

bool isValidProduct	(	const String &	protein,
		int	pep_pos,
		int	pep_length,
		bool	ignore_missed_cleavages = `true`
	)		const

Is the peptide fragment starting at position pep_pos with length pep_length within the sequence protein generated by the current enzyme?

Checks if peptide is a valid digestion product of the enzyme, taking into account specificity and the MC flag provided here.

Parameters

protein	Protein sequence
pep_pos	Starting index of potential peptide
pep_length	Length of potential peptide
ignore_missed_cleavages	Do not compare the missed cleavages (MCs) of the potential peptide to the maximum allowed number of MCs

Returns: True if peptide has correct n/c terminals (according to enzyme, specificity and missed cleavages)
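
How the specificity levels affect this check can be sketched in isolation. The code below is not the OpenMS implementation: `Spec`, `cleavesBetween` and `isValidProductSketch` are hypothetical names, the enzyme is a hard-coded trypsin-like rule, missed cleavages are ignored, and treating out-of-range fragments as invalid is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Spec { None, Semi, Full };

// True if the rule cleaves between s[i - 1] and s[i]; requires 0 < i < size.
bool cleavesBetween(const std::string& s, std::size_t i)
{
  assert(i > 0 && i < s.size());
  return (s[i - 1] == 'K' || s[i - 1] == 'R') && s[i] != 'P';
}

bool isValidProductSketch(const std::string& protein, std::size_t pos,
                          std::size_t length, Spec spec)
{
  // Reject empty or out-of-range fragments.
  if (length == 0 || pos >= protein.size() || length > protein.size() - pos)
  {
    return false;
  }
  if (spec == Spec::None) return true;
  const std::size_t end = pos + length;
  // A protein terminus always counts as a valid peptide end.
  const bool n_ok = (pos == 0) || cleavesBetween(protein, pos);
  const bool c_ok = (end == protein.size()) || cleavesBetween(protein, end);
  return (spec == Spec::Full) ? (n_ok && c_ok) : (n_ok || c_ok);
}
```

In "AKRPGKD", the fragment PGK (position 3, length 3) has a valid C-terminal end but not a valid N-terminal one, so it passes with semi specificity and fails with full specificity.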

◆ isValidProduct_()

bool isValidProduct_	(	const String &	sequence,
		int	pos,
		int	length,
		bool	ignore_missed_cleavages,
		bool	allow_nterm_protein_cleavage,
		bool	allow_random_asp_pro_cleavage
	)		const

protected

Shared implementation that also supports the functionality of ProteaseDigestion, which is deeply woven into this function. To avoid code duplication, the logic is kept here and called by wrappers. Do not duplicate the code merely for the sake of semantics (unless a clean separation can be found). Note: the overhead of allow_nterm_protein_cleavage and allow_random_asp_pro_cleavage is marginal; most of the runtime is spent in tokenize_().

◆ operator=()

EnzymaticDigestion& operator= ( const EnzymaticDigestion & rhs )

Assignment operator.

◆ setEnzyme()

virtual void setEnzyme ( const DigestionEnzyme * enzyme )

virtual

Sets the enzyme for the digestion.

Reimplemented in RNaseDigestion.

◆ setMissedCleavages()

void setMissedCleavages ( Size missed_cleavages )

Sets the number of missed cleavages for the digestion (default is 0). This setting is ignored when log model is used.

Referenced by NucleicAcidSearchEngine::main_().

◆ setSpecificity()

void setSpecificity ( Specificity spec )

Sets the specificity for the digestion (default is SPEC_FULL).

◆ tokenize_()

std::vector<int> tokenize_	(	const String &	sequence,
		int	start = `0`,
		int	end = `-1`
	)		const

protected

Digests the sequence using the enzyme's regular expression.

The resulting split positions include start as first position, but not end. If start is negative, it is reset to zero. If end is negative or beyond sequence's size(), it is set to size(). All returned positions are relative to the full sequence.

Returned positions include start and any positions between start and end matching the regex.

Parameters

sequence	...
start	Start digestion after this point
end	Past-the-end index into `sequence`

Returns: Cleavage positions (this includes start, but not end)
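
The handling of `start` and `end` described above can be sketched as follows. This is a standalone approximation: `tokenizeSketch` is a hypothetical name, a hard-coded trypsin-like rule replaces the enzyme's regular expression, and returning an empty vector for an empty range is a choice made for the sketch rather than documented behaviour.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns cleavage positions in [start, end), always beginning with `start`.
// Positions are relative to the full sequence.
std::vector<int> tokenizeSketch(const std::string& seq, int start = 0, int end = -1)
{
  const int size = static_cast<int>(seq.size());
  if (start < 0) start = 0;                  // negative start is reset to zero
  if (end < 0 || end > size) end = size;     // invalid end becomes size()
  assert(0 <= start && end <= size);

  std::vector<int> positions;
  if (start >= end) return positions;        // sketch choice for empty ranges
  positions.push_back(start);
  for (int i = start + 1; i < end; ++i)
  {
    // Cleave between seq[i - 1] and seq[i]: after K or R, unless before P.
    if ((seq[i - 1] == 'K' || seq[i - 1] == 'R') && seq[i] != 'P')
    {
      positions.push_back(i);
    }
  }
  return positions;
}
```

For "AKRPGKD" the default call returns {0, 2, 6}; restricting the range to [0, 6) drops position 6 because `end` itself is never reported.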

Member Data Documentation

◆ enzyme_

const DigestionEnzyme* enzyme_

protected

Used enzyme.

◆ missed_cleavages_

Size missed_cleavages_

protected

Number of missed cleavages.

◆ NamesOfSpecificity

const std::string NamesOfSpecificity[SIZE_OF_SPECIFICITY]

static

Names of the Specificity.

◆ NoCleavage

const std::string NoCleavage

static

Name for no cleavage.

◆ re_

std::unique_ptr<boost::regex> re_

protected

Regex for tokenizing (huge speedup by making this a member instead of stack object in tokenize_())

◆ specificity_

Specificity specificity_

protected

Specificity of the enzyme.

◆ UnspecificCleavage

const std::string UnspecificCleavage

static

Name for unspecific cleavage.
