OpenMS
EnzymaticDigestion Class Reference

Class for the enzymatic digestion of sequences.

#include <OpenMS/CHEMISTRY/EnzymaticDigestion.h>

Public Types

enum  Specificity {
  SPEC_NONE = 0 , SPEC_SEMI = 1 , SPEC_FULL = 2 , SPEC_UNKNOWN = 3 ,
  SPEC_NOCTERM = 8 , SPEC_NONTERM = 9 , SIZE_OF_SPECIFICITY = 10
}
 When querying for valid digestion products, this determines whether the specificity of the two peptide ends is considered important.
 

Public Member Functions

 EnzymaticDigestion ()
 Default constructor.
 
 EnzymaticDigestion (const EnzymaticDigestion &rhs)
 Copy constructor.
 
EnzymaticDigestion & operator= (const EnzymaticDigestion &rhs)
 Assignment operator.
 
virtual ~EnzymaticDigestion ()
 Destructor.
 
Size getMissedCleavages () const
 Returns the number of missed cleavages for the digestion.
 
void setMissedCleavages (Size missed_cleavages)
 Sets the number of missed cleavages for the digestion (default is 0). This setting is ignored when the log model is used.
 
String getEnzymeName () const
 Returns the enzyme for the digestion.
 
virtual void setEnzyme (const DigestionEnzyme *enzyme)
 Sets the enzyme for the digestion.
 
Specificity getSpecificity () const
 Returns the specificity for the digestion.
 
void setSpecificity (Specificity spec)
 Sets the specificity for the digestion (default is SPEC_FULL).
 
Size digestUnmodified (const StringView &sequence, std::vector< StringView > &output, Size min_length=1, Size max_length=0) const
 Performs the enzymatic digestion of an unmodified sequence.
 
Size digestUnmodified (const StringView &sequence, std::vector< std::pair< Size, Size >> &output, Size min_length=1, Size max_length=0) const
 Performs the enzymatic digestion of an unmodified sequence.
 
bool isValidProduct (const String &protein, int pep_pos, int pep_length, bool ignore_missed_cleavages=true) const
 Is the peptide fragment starting at position pep_pos with length pep_length within the sequence protein generated by the current enzyme?
 
Size countInternalCleavageSites (const String &sequence) const
 Counts the number of internal cleavage sites (missed cleavages) in a protein sequence.
 
bool filterByMissedCleavages (const String &sequence, const std::function< bool(const Int)> &filter) const
 Filter based on the number of missed cleavages.
 

Static Public Member Functions

static Specificity getSpecificityByName (const String &name)
 

Static Public Attributes

static const std::string NamesOfSpecificity [SIZE_OF_SPECIFICITY]
 Names of the Specificity.
 
static const std::string NoCleavage
 Name for no cleavage.
 
static const std::string UnspecificCleavage
 Name for unspecific cleavage.
 

Protected Member Functions

bool isValidProduct_ (const String &sequence, int pos, int length, bool ignore_missed_cleavages, bool allow_nterm_protein_cleavage, bool allow_random_asp_pro_cleavage) const
 Also supports functionality for ProteaseDigestion (which is deeply woven into the function). To avoid code duplication, the implementation is stored here and called by wrappers. Do not duplicate the code just for the sake of semantics (unless we can come up with a clean separation). Note: the overhead of allow_nterm_protein_cleavage and allow_random_asp_pro_cleavage is marginal; the main runtime is spent in tokenize_().
 
std::vector< int > tokenize_ (const String &sequence, int start=0, int end=-1) const
 Digests the sequence using the enzyme's regular expression.
 
Size digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< StringView > &output, Size min_length=0, Size max_length=-1) const
 Helper function for digestUnmodified().
 
Size digestAfterTokenize_ (const std::vector< int > &fragment_positions, const StringView &sequence, std::vector< std::pair< Size, Size >> &output, Size min_length=0, Size max_length=-1) const
 
Size countMissedCleavages_ (const std::vector< int > &cleavage_positions, Size seq_start, Size seq_end) const
 Counts the number of missed cleavages in a sequence fragment.
 

Protected Attributes

Size missed_cleavages_
 Number of missed cleavages.
 
const DigestionEnzyme * enzyme_
 Used enzyme.
 
std::unique_ptr< boost::regex > re_
 Regex for tokenizing (huge speedup by making this a member instead of stack object in tokenize_()).
 
Specificity specificity_
 Specificity of enzyme.
 

Detailed Description

Class for the enzymatic digestion of sequences.

Digestion can be performed using simple regular expressions, e.g. [KR] | [^P] for trypsin (cleave after K or R, unless followed by P). Missed cleavages can also be modeled, i.e., adjacent peptides that stay joined because the enzyme malfunctioned or could not access the site. If n missed cleavages are given, all possible resulting peptides (cleaved and uncleaved) with up to n missed cleavages are returned. No random selection of just n specific missed cleavage sites is performed.
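
The "up to n missed cleavages" behavior can be illustrated without OpenMS. The following is a minimal standalone sketch, not the library's implementation: fragmentStarts and digestWithMissedCleavages are hypothetical names, and the trypsin-like rule is hard-coded instead of being taken from the enzyme's regular expression.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of OpenMS): start positions of all fragments
// for a trypsin-like rule, i.e. cleave after K or R unless the next residue is P.
std::vector<std::size_t> fragmentStarts(const std::string& seq)
{
  std::vector<std::size_t> starts{0};
  for (std::size_t i = 0; i + 1 < seq.size(); ++i)
  {
    const bool cleavable = (seq[i] == 'K' || seq[i] == 'R');
    if (cleavable && seq[i + 1] != 'P') starts.push_back(i + 1);
  }
  return starts;
}

// Hypothetical helper: returns every peptide that spans at most max_mc
// uncleaved sites, so cleaved and uncleaved variants are both reported.
std::vector<std::string> digestWithMissedCleavages(const std::string& seq, std::size_t max_mc)
{
  std::vector<std::string> peptides;
  if (seq.empty()) return peptides;
  const std::vector<std::size_t> starts = fragmentStarts(seq);
  for (std::size_t i = 0; i < starts.size(); ++i)
  {
    // mc = number of cleavage sites skipped when extending fragment i
    for (std::size_t mc = 0; mc <= max_mc && i + mc < starts.size(); ++mc)
    {
      const std::size_t end = (i + mc + 1 < starts.size()) ? starts[i + mc + 1] : seq.size();
      peptides.push_back(seq.substr(starts[i], end - starts[i]));
    }
  }
  return peptides;
}
```

For the sequence ACKDERPFKG the fragments without missed cleavages are ACK, DERPFK and G; allowing one missed cleavage additionally yields ACKDERPFK and DERPFKG.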

See also
ProteaseDigestion for functionality specific to protein digestion.

Member Enumeration Documentation

◆ Specificity

When querying for valid digestion products, this determines whether the specificity of the two peptide ends is considered important.

Enumerator
SPEC_NONE 

no requirements on start / end

SPEC_SEMI 

semi specific, i.e., one of the two cleavage sites must fulfill requirements

SPEC_FULL 

fully enzyme-specific, e.g., tryptic (the peptide ends with K or R and the amino acid before it is K or R), or the peptide lies at a protein terminus

SPEC_UNKNOWN 
SPEC_NOCTERM 

no requirements on CTerm (currently not supported in the class)

SPEC_NONTERM 

no requirements on NTerm (currently not supported in the class)

SIZE_OF_SPECIFICITY 

Constructor & Destructor Documentation

◆ EnzymaticDigestion() [1/2]

Default constructor.

◆ EnzymaticDigestion() [2/2]

Copy constructor.

◆ ~EnzymaticDigestion()

virtual ~EnzymaticDigestion ( )
virtual

Destructor.

Member Function Documentation

◆ countInternalCleavageSites()

Size countInternalCleavageSites ( const String &  sequence ) const

Counts the number of internal cleavage sites (missed cleavages) in a protein sequence.

Parameters
sequence: Sequence
Returns
Number of internal cleavage sites (= missed cleavages in the sequence)

◆ countMissedCleavages_()

Size countMissedCleavages_ ( const std::vector< int > &  cleavage_positions,
Size  seq_start,
Size  seq_end 
) const
protected

Counts the number of missed cleavages in a sequence fragment.

Parameters
cleavage_positions: Positions of cleavage in protein as obtained from tokenize_()
seq_start: Index into sequence
seq_end: Past-the-end index into sequence
Returns
Number of missed cleavages of the peptide
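
As an illustration of the counting step, the following standalone sketch counts the cleavage positions that fall strictly inside a fragment. It is not the OpenMS implementation: countMissedInFragment is a hypothetical name, and treating positions equal to seq_start or seq_end as fragment boundaries (not missed cleavages) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts cleavage positions p with seq_start < p < seq_end.
// Assumption: a position equal to either boundary is where the fragment was cut,
// so it does not count as a missed cleavage.
std::size_t countMissedInFragment(const std::vector<int>& cleavage_positions,
                                  std::size_t seq_start, std::size_t seq_end)
{
  const auto inside = [=](int p) {
    return p >= 0 && static_cast<std::size_t>(p) > seq_start
                  && static_cast<std::size_t>(p) < seq_end;
  };
  return static_cast<std::size_t>(
      std::count_if(cleavage_positions.begin(), cleavage_positions.end(), inside));
}
```

With cleavage positions {0, 3, 9}, the fragment [0, 9) contains one missed cleavage (position 3), while the fragment [3, 9) contains none.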

◆ digestAfterTokenize_() [1/2]

Size digestAfterTokenize_ ( const std::vector< int > &  fragment_positions,
const StringView &  sequence,
std::vector< std::pair< Size, Size >> &  output,
Size  min_length = 0,
Size  max_length = -1 
) const
protected

◆ digestAfterTokenize_() [2/2]

Size digestAfterTokenize_ ( const std::vector< int > &  fragment_positions,
const StringView &  sequence,
std::vector< StringView > &  output,
Size  min_length = 0,
Size  max_length = -1 
) const
protected

Helper function for digestUnmodified()

This function implements digestUnmodified() starting from the result of tokenize_(). The separation enables derived classes to modify the result of tokenize_() during the in-silico digestion.

Returns
number of digestion products NOT matching the length restrictions

◆ digestUnmodified() [1/2]

Size digestUnmodified ( const StringView &  sequence,
std::vector< std::pair< Size, Size >> &  output,
Size  min_length = 1,
Size  max_length = 0 
) const

Performs the enzymatic digestion of an unmodified sequence.

Because only positions into the original string are returned, this is very fast. Unlike the StringView output version of this function, the result is independent of the original sequence. It can be used for matching products, e.g. to determine missing ones.

Todo:
could be set of pairs.
Parameters
sequence: Sequence to digest
output: Digestion products as vector of pairs of start and end positions
min_length: Minimal length of reported products
max_length: Maximal length of reported products (0 = no restriction)
Returns
Number of discarded digestion products (which are not matching length restrictions)
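
The length filtering and the "number of discarded products" return value can be sketched as follows. This is a standalone illustration under stated assumptions, not the library code: fragmentsToPairs is a hypothetical name, the end position is taken to be past-the-end, and clearing the output first is a choice made for this sketch.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: turns fragment start positions into (start, end) pairs,
// keeps those within the length limits and returns how many were dropped.
// Assumptions: end is past-the-end; max_length == 0 means "no upper limit".
std::size_t fragmentsToPairs(const std::vector<std::size_t>& starts, std::size_t seq_size,
                             std::vector<std::pair<std::size_t, std::size_t>>& output,
                             std::size_t min_length = 1, std::size_t max_length = 0)
{
  output.clear();  // sketch choice: start from an empty result
  std::size_t discarded = 0;
  for (std::size_t i = 0; i < starts.size(); ++i)
  {
    const std::size_t begin = starts[i];
    const std::size_t end = (i + 1 < starts.size()) ? starts[i + 1] : seq_size;
    const std::size_t len = end - begin;
    if (len < min_length || (max_length != 0 && len > max_length))
    {
      ++discarded;  // product violates the length restrictions
    }
    else
    {
      output.emplace_back(begin, end);
    }
  }
  return discarded;
}
```

Storing index pairs instead of string views means the result stays usable even after the digested string has gone out of scope.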

◆ digestUnmodified() [2/2]

Size digestUnmodified ( const StringView &  sequence,
std::vector< StringView > &  output,
Size  min_length = 1,
Size  max_length = 0 
) const

Performs the enzymatic digestion of an unmodified sequence.

By returning only references into the original string this is very fast.

Parameters
sequence: Sequence to digest
output: Digestion products
min_length: Minimal length of reported products
max_length: Maximal length of reported products (0 = no restriction)
Returns
Number of discarded digestion products (which are not matching length restrictions)

◆ filterByMissedCleavages()

bool filterByMissedCleavages ( const String &  sequence,
const std::function< bool(const Int)> &  filter 
) const

Filter based on the number of missed cleavages.

Parameters
sequence: Unmodified (!) amino acid sequence to check.
filter: A predicate that takes as parameter the number of missed cleavages in the sequence and returns true if the sequence should be filtered out.
Returns
Whether the sequence should be filtered out.
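
The predicate-based interface can be illustrated with a standalone sketch. filterByMissedCleavagesSketch is a hypothetical stand-in for the member function, and the hard-coded trypsin-like rule (cleave after K or R unless followed by P) is an assumption made so the example runs without OpenMS.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the member function: counts the internal cleavage
// sites of a trypsin-like enzyme and hands that count to the caller's predicate.
bool filterByMissedCleavagesSketch(const std::string& seq,
                                   const std::function<bool(int)>& filter)
{
  int missed = 0;
  for (std::size_t i = 0; i + 1 < seq.size(); ++i)
  {
    if ((seq[i] == 'K' || seq[i] == 'R') && seq[i + 1] != 'P') ++missed;
  }
  return filter(missed);  // true means "filter this sequence out"
}
```

A caller that wants to drop peptides with more than one missed cleavage would pass a predicate such as `[](int mc) { return mc > 1; }`.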

Referenced by IDFilter::PeptideDigestionFilter::operator()().

◆ getEnzymeName()

String getEnzymeName ( ) const

Returns the enzyme for the digestion.

◆ getMissedCleavages()

Size getMissedCleavages ( ) const

Returns the number of missed cleavages for the digestion.

◆ getSpecificity()

Specificity getSpecificity ( ) const

Returns the specificity for the digestion.

◆ getSpecificityByName()

static Specificity getSpecificityByName ( const String &  name )
static

Converts a specificity name to the corresponding enum value. Returns SPEC_UNKNOWN if the name is not valid.
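
A name lookup of this kind can be sketched as a linear search over a name table indexed by enum value. The sketch below is standalone and illustrative only: the strings in kNames are assumed placeholders and are not taken from NamesOfSpecificity, and specificityByName is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <string>

enum Specificity
{
  SPEC_NONE = 0, SPEC_SEMI = 1, SPEC_FULL = 2, SPEC_UNKNOWN = 3,
  SPEC_NOCTERM = 8, SPEC_NONTERM = 9, SIZE_OF_SPECIFICITY = 10
};

// Assumed names for illustration only; the real table may use other strings.
// Unused enum slots (4 to 7) are filled with "unknown".
const std::array<std::string, SIZE_OF_SPECIFICITY> kNames = {
    "none", "semi", "full", "unknown", "unknown",
    "unknown", "unknown", "unknown", "no-cterm", "no-nterm"};

// Hypothetical helper: returns the first enum value whose name matches,
// or SPEC_UNKNOWN if no entry matches.
Specificity specificityByName(const std::string& name)
{
  for (std::size_t i = 0; i < kNames.size(); ++i)
  {
    if (kNames[i] == name) return static_cast<Specificity>(i);
  }
  return SPEC_UNKNOWN;
}
```

Because the search returns the first match, the filler entries resolve to SPEC_UNKNOWN, the same value used for names that are not found.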

◆ isValidProduct()

bool isValidProduct ( const String &  protein,
int  pep_pos,
int  pep_length,
bool  ignore_missed_cleavages = true 
) const

Is the peptide fragment starting at position pep_pos with length pep_length within the sequence protein generated by the current enzyme?

Checks whether the peptide is a valid digestion product of the enzyme, taking into account the specificity and the missed-cleavage (MC) flag provided here.

Parameters
protein: Protein sequence
pep_pos: Starting index of potential peptide
pep_length: Length of potential peptide
ignore_missed_cleavages: Do not compare the number of missed cleavages (MCs) of the potential peptide to the maximum allowed MCs
Returns
True if the peptide has correct N/C termini (according to enzyme, specificity and missed cleavages)
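
How the specificity setting affects such a check can be sketched as follows. This is a simplified standalone illustration, not the OpenMS implementation: Spec, isCutSite and isValidProductSketch are hypothetical names, the trypsin-like rule is hard-coded, and missed cleavages are not checked at all (comparable to ignore_missed_cleavages = true).

```cpp
#include <string>

enum class Spec { None, Semi, Full };

// Hypothetical helper: true if a trypsin-like enzyme could cut between
// protein[pos - 1] and protein[pos]. Both protein termini count as valid ends.
bool isCutSite(const std::string& protein, int pos)
{
  const int n = static_cast<int>(protein.size());
  if (pos <= 0 || pos >= n) return pos == 0 || pos == n;
  const char before = protein[pos - 1];
  return (before == 'K' || before == 'R') && protein[pos] != 'P';
}

// Hypothetical stand-in: checks only the two peptide ends against the specificity.
bool isValidProductSketch(const std::string& protein, int pep_pos, int pep_length, Spec spec)
{
  const int n = static_cast<int>(protein.size());
  if (pep_pos < 0 || pep_length <= 0 || pep_pos + pep_length > n) return false;
  if (spec == Spec::None) return true;  // no requirements on start / end
  const bool n_ok = isCutSite(protein, pep_pos);
  const bool c_ok = isCutSite(protein, pep_pos + pep_length);
  // Full: both ends must be cleavage sites; Semi: one of them is enough.
  return spec == Spec::Full ? (n_ok && c_ok) : (n_ok || c_ok);
}
```

In the protein ACKDERPFKG, the peptide DERPFK (position 3, length 6) passes the fully specific check, whereas ERPFK (position 4, length 5) passes only the semi-specific one because its N-terminal end does not follow K or R.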

◆ isValidProduct_()

bool isValidProduct_ ( const String &  sequence,
int  pos,
int  length,
bool  ignore_missed_cleavages,
bool  allow_nterm_protein_cleavage,
bool  allow_random_asp_pro_cleavage 
) const
protected

Also supports functionality for ProteaseDigestion (which is deeply woven into the function). To avoid code duplication, the implementation is stored here and called by wrappers. Do not duplicate the code just for the sake of semantics (unless we can come up with a clean separation). Note: the overhead of allow_nterm_protein_cleavage and allow_random_asp_pro_cleavage is marginal; the main runtime is spent in tokenize_().

◆ operator=()

EnzymaticDigestion& operator= ( const EnzymaticDigestion &  rhs )

Assignment operator.

◆ setEnzyme()

virtual void setEnzyme ( const DigestionEnzyme *  enzyme )
virtual

Sets the enzyme for the digestion.

Reimplemented in RNaseDigestion.

◆ setMissedCleavages()

void setMissedCleavages ( Size  missed_cleavages)

Sets the number of missed cleavages for the digestion (default is 0). This setting is ignored when the log model is used.

Referenced by NucleicAcidSearchEngine::main_().

◆ setSpecificity()

void setSpecificity ( Specificity  spec)

Sets the specificity for the digestion (default is SPEC_FULL).

◆ tokenize_()

std::vector<int> tokenize_ ( const String &  sequence,
int  start = 0,
int  end = -1 
) const
protected

Digests the sequence using the enzyme's regular expression.

The resulting split positions include start as first position, but not end. If start is negative, it is reset to zero. If end is negative or beyond sequence's size(), it is set to size(). All returned positions are relative to the full sequence.

Returned positions include start and any positions between start and end matching the regex.

Parameters
sequence: ...
start: Start digestion after this point
end: Past-the-end index into sequence
Returns
Cleavage positions (this includes start, but not end)
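
The position conventions described here (clamping of start and end, start included, end excluded, positions relative to the full sequence) can be sketched with the standard library. The class itself holds a boost::regex member; this standalone sketch swaps in std::regex with a hard-coded trypsin-like pattern, tokenizeSketch is a hypothetical name, and returning an empty vector for an empty range is an assumption.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in: cleavage positions for "cut after K or R unless
// followed by P", expressed as a match on [KR] with a negative lookahead.
std::vector<int> tokenizeSketch(const std::string& sequence, int start = 0, int end = -1)
{
  const int size = static_cast<int>(sequence.size());
  if (start < 0) start = 0;               // negative start is reset to zero
  if (end < 0 || end > size) end = size;  // invalid end is set to size()
  std::vector<int> positions;
  if (start >= end) return positions;     // sketch choice for empty ranges
  positions.push_back(start);             // start is always the first position

  // Built once and reused across calls, similar in spirit to keeping the regex as a member.
  static const std::regex site("[KR](?!P)");
  const auto first = sequence.begin() + start;
  const auto last = sequence.begin() + end;
  for (std::sregex_iterator it(first, last, site), stop; it != stop; ++it)
  {
    // The cut lies directly after the matched residue, relative to the full sequence.
    const int cut = start + static_cast<int>(it->position()) + 1;
    if (cut < end) positions.push_back(cut);  // end itself is never reported
  }
  return positions;
}
```

For ACKDERPFKG the sketch yields the positions 0, 3 and 9; restricting the range to start = 3 and end = 9 yields only position 3, since the cut after the final K coincides with end.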

Member Data Documentation

◆ enzyme_

const DigestionEnzyme* enzyme_
protected

Used enzyme.

◆ missed_cleavages_

Size missed_cleavages_
protected

Number of missed cleavages.

◆ NamesOfSpecificity

const std::string NamesOfSpecificity[SIZE_OF_SPECIFICITY]
static

Names of the Specificity.

◆ NoCleavage

const std::string NoCleavage
static

Name for no cleavage.

◆ re_

std::unique_ptr<boost::regex> re_
protected

Regex for tokenizing (huge speedup by making this a member instead of stack object in tokenize_())

◆ specificity_

Specificity specificity_
protected

specificity of enzyme

◆ UnspecificCleavage

const std::string UnspecificCleavage
static

Name for unspecific cleavage.