Refreshes the protein references for all peptide hits in a vector of PeptideIdentifications and adds target/decoy information.
All peptide and protein hits are annotated with target/decoy information, using the meta value "target_decoy". For proteins the possible values are "target" and "decoy", depending on whether the protein accession contains the decoy pattern (parameter decoy_string) as a prefix or suffix (see parameter prefix). For peptides, the possible values are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, "target+decoy" peptide hits count as target hits.)
- Note
- Make sure that your protein names in the database contain a correctly formatted decoy string. This can be ensured by using DecoyDatabase. If the decoy identifier is not recognized successfully all proteins will be assumed to stem from the target-part of the query.
E.g., "sw|P33354_DECOY|YEHR_ECOLI Uncharacterized lipop..." is invalid, since the tool has no knowledge of how SwissProt entries are built up. A correct identifier could be "DECOY_sw|P33354|YEHR_ECOLI Uncharacterized li ..." or "sw|P33354|YEHR_ECOLI_DECOY Uncharacterized li", depending on whether you are using prefix or suffix annotation.
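The annotation rule described above can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions, not the OpenMS implementation: the helper names classifyProtein and classifyPeptide are hypothetical, the accession is assumed to be the part of the FASTA header before the first space, and an empty string stands in for "no annotation".

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of OpenMS): label an accession "decoy" if it
// starts (prefix == true) or ends (prefix == false) with decoy_string,
// otherwise "target".
std::string classifyProtein(const std::string& accession,
                            const std::string& decoy_string,
                            bool prefix)
{
  if (decoy_string.empty() || accession.size() < decoy_string.size())
  {
    return "target";
  }
  const std::size_t offset = prefix ? 0 : accession.size() - decoy_string.size();
  const bool is_decoy =
      accession.compare(offset, decoy_string.size(), decoy_string) == 0;
  return is_decoy ? "decoy" : "target";
}

// Hypothetical helper: combine the labels of all proteins a peptide maps to.
// Returns "" (assumption: no annotation) if the peptide matched no protein.
std::string classifyPeptide(const std::vector<std::string>& protein_labels)
{
  bool has_target = false;
  bool has_decoy = false;
  for (const std::string& label : protein_labels)
  {
    if (label == "decoy") has_decoy = true;
    else has_target = true;
  }
  if (has_target && has_decoy) return "target+decoy";
  if (has_decoy) return "decoy";
  if (has_target) return "target";
  return "";
}
```

With this sketch, the accession "sw|P33354_DECOY|YEHR_ECOLI" is labeled "target" in both prefix and suffix mode, which mirrors why such an identifier is considered invalid for decoy detection.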
Some helpful target/decoy statistics will be reported when done.
By default this tool will fail if an unmatched peptide occurs, i.e. if the database does not contain the corresponding protein. You can force it to return successfully in this case by using the flag allow_unmatched.
Search engines (such as Mascot) will replace ambiguous amino acids ('B', 'J', 'Z' and 'X') in the protein database with unambiguous amino acids in the reported peptides, e.g. exchange 'X' with 'H'. As a result, such peptides cannot be found by exactly matching their sequences to the protein database. However, we can recover these cases by using tolerant search for ambiguous amino acids in the protein sequence. This is done by default with up to four ambiguous amino acids per peptide hit. If you only want exact matches, set aaa_max to zero (but expect that unmatched peptides might occur)!
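To illustrate what tolerant matching means, here is a minimal standalone sketch. It uses a naive scan over all start positions rather than the multi-pattern search the tool itself relies on, and the function names matchCost and foundInProtein are hypothetical. The ambiguity codes follow the standard one-letter conventions: 'B' stands for D or N, 'Z' for E or Q, 'J' for I or L, and 'X' for any residue.

```cpp
#include <cstddef>
#include <string>

// Returns 0 for an exact residue match, 1 if the match relies on an ambiguity
// code in the database residue, and -1 for a mismatch.
int matchCost(char db, char pep)
{
  if (db == pep) return 0;
  switch (db)
  {
    case 'B': return (pep == 'D' || pep == 'N') ? 1 : -1;
    case 'Z': return (pep == 'E' || pep == 'Q') ? 1 : -1;
    case 'J': return (pep == 'I' || pep == 'L') ? 1 : -1;
    case 'X': return 1;
    default:  return -1;
  }
}

// Naive scan (illustration only): true if `peptide` occurs somewhere in
// `protein` while resolving at most `aaa_max` ambiguous database residues.
bool foundInProtein(const std::string& protein,
                    const std::string& peptide,
                    std::size_t aaa_max)
{
  if (peptide.empty() || peptide.size() > protein.size()) return false;
  for (std::size_t start = 0; start + peptide.size() <= protein.size(); ++start)
  {
    std::size_t ambiguous = 0;
    bool ok = true;
    for (std::size_t i = 0; i < peptide.size() && ok; ++i)
    {
      const int cost = matchCost(protein[start + i], peptide[i]);
      if (cost < 0) ok = false;
      else ambiguous += static_cast<std::size_t>(cost);
      if (ambiguous > aaa_max) ok = false;
    }
    if (ok) return true;
  }
  return false;
}
```

Calling foundInProtein("MKXAPEPTIDER", "HAPEPTIDER", 4) succeeds because 'X' can stand for 'H', whereas the same call with a limit of 0 fails, which corresponds to setting aaa_max to zero.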
Leucine/Isoleucine: Further complications can arise due to the presence of the isobaric amino acids isoleucine ('I') and leucine ('L') in protein sequences. Since the two have the exact same chemical composition and mass, they generally cannot be distinguished by mass spectrometry. If a peptide containing 'I' was reported as a match for a spectrum, a peptide containing 'L' instead would be an equally good match (and vice versa). To account for this inherent ambiguity, setting the flag IL_equivalent causes 'I' and 'L' to be considered as indistinguishable.
For example, if the sequence "PEPTIDE" (matching "Protein1") was identified as a search hit, but the database additionally contained "PEPTLDE" (matching "Protein2"), running PeptideIndexer with the IL_equivalent option would report both "Protein1" and "Protein2" as accessions for "PEPTIDE". (This is independent of ambiguous matching via aaa_max.) Additionally, setting this flag will convert all 'J's in any protein sequence to 'I'. This way, no tolerant search is required for 'J' (but is still possible for all the other ambiguous amino acids). If write_protein_sequences is requested and IL_equivalent is set as well, both the I/L-version and unmodified protein sequences need to be stored internally. This requires some extra memory, roughly equivalent to the size of the FASTA database file itself.
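One simple way to picture I/L equivalence is to normalize both sequences before comparing them. The sketch below assumes that 'L' is mapped to 'I' (the direction of the mapping is an assumption made for illustration) and that 'J' is converted to 'I' in protein sequences only; the function names are hypothetical and not taken from OpenMS.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: treat 'L' as 'I' in a peptide sequence.
std::string normalizePeptideIL(std::string seq)
{
  std::replace(seq.begin(), seq.end(), 'L', 'I');
  return seq;
}

// Hypothetical helper: treat 'L' and the ambiguity code 'J' as 'I' in a
// protein sequence, so 'J' no longer needs tolerant search.
std::string normalizeProteinIL(std::string seq)
{
  std::replace(seq.begin(), seq.end(), 'L', 'I');
  std::replace(seq.begin(), seq.end(), 'J', 'I');
  return seq;
}

// True if the peptide occurs in the protein when I and L are indistinguishable.
bool matchesWithILEquivalence(const std::string& protein,
                              const std::string& peptide)
{
  return normalizeProteinIL(protein).find(normalizePeptideIL(peptide)) !=
         std::string::npos;
}
```

Under this sketch, "PEPTIDE" matches a protein containing "PEPTLDE" as well as one containing "PEPTIDE". Keeping the original protein string alongside its normalized copy is what leads to the extra memory mentioned above when sequences are also written out.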
Enzyme specificity: Once a peptide sequence is found in a protein sequence, this does not imply that the hit is valid! This is where enzyme specificity comes into play. By default, we demand that the peptide is fully tryptic (i.e. the enzyme parameter is set to "trypsin" and specificity is "full"). So unless the peptide coincides with the C- and/or N-terminus of the protein, the peptide's cleavage pattern should fulfill the trypsin cleavage rule [KR][^P].
We make two exceptions to the specificity constraints: 1) peptides starting at the second or third position of a protein are still considered N-terminally specific, since the preceding residues can be cleaved off in vivo; X!Tandem reports these peptides. For example, the two peptides ABAR and LABAR would both match a protein starting with MLABAR. 2) Adventitious cleavage at Asp|Pro (Aspartate/D | Proline/P) is allowed for all enzymes (as supported by X!Tandem), i.e. it counts as a proper cleavage site (see http://www.thegpm.org/tandem/release.html).
You can relax the requirements further by choosing semi-tryptic (only one of two "internal" termini must match requirements) or none (essentially allowing all hits, no matter their context). These settings should not be used (due to high risk of reporting false positives), unless the search engine was instructed to search peptides in the same way.
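The specificity rules for trypsin can be written down as a small standalone check. This is a sketch under assumptions, not the tool's code: the enum Specificity and the functions isCleavageSite and isValidHit are hypothetical names, positions are 0-based, and only the trypsin rule [KR][^P] plus the two exceptions described above are modeled.

```cpp
#include <cstddef>
#include <string>

enum class Specificity { Full, Semi, None };

// True if cutting between protein[pos - 1] and protein[pos] is acceptable.
// Protein termini (pos == 0 or pos == protein.size()) always qualify.
bool isCleavageSite(const std::string& protein, std::size_t pos)
{
  if (pos == 0 || pos >= protein.size()) return true;
  const char before = protein[pos - 1];
  const char after = protein[pos];
  if ((before == 'K' || before == 'R') && after != 'P') return true; // [KR][^P]
  if (before == 'D' && after == 'P') return true; // adventitious Asp|Pro
  return false;
}

// True if the peptide protein[start, start + length) is an acceptable hit
// under the requested specificity.
bool isValidHit(const std::string& protein, std::size_t start,
                std::size_t length, Specificity spec)
{
  if (length == 0 || start + length > protein.size()) return false;
  if (spec == Specificity::None) return true;
  // Peptides starting at the 2nd or 3rd residue (index 1 or 2) still count
  // as N-terminally specific.
  const bool n_ok = start <= 2 || isCleavageSite(protein, start);
  const bool c_ok = isCleavageSite(protein, start + length);
  return spec == Specificity::Full ? (n_ok && c_ok) : (n_ok || c_ok);
}
```

For a protein starting with MLABAR, both LABAR (start index 1) and ABAR (start index 2) pass the full-specificity check in this sketch, provided the residue after the final R is not P. A peptide preceded by K but itself starting with P fails under Full, passes under Semi if its C-terminus is a valid site, and always passes under None.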
The FASTA file should not contain duplicate protein accessions (since accessions are not validated) if a correct unique-matching annotation is important (target/decoy annotation is still correct).
Threading: This tool supports multiple threads (threads option) to speed up computation, at the cost of a little extra memory.
Re-index peptide identifications honoring enzyme cutting rules, ambiguous amino acids and target/decoy hits.
Template parameter 'T' can be either TFI_File or TFI_Vector. If the data is already available, use TFI_Vector and pass the vector. If the data is still in a FASTA file and it is not needed afterwards for additional processing, use TFI_File and pass the filename.
PeptideIndexer refreshes target/decoy information and mapping of peptides to proteins. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, "target+decoy" peptide hits count as target hits.)
PeptideIndexer allows for ambiguous amino acids (B|J|Z|X) in the protein database, but not in the peptide sequences. For the latter only I/L can be treated as equivalent (see 'IL_equivalent' flag), but 'J' is not allowed.
Enzyme cutting rules and partial specificity can be specified.
Resulting protein hits appear in the order of the FASTA file, except for orphaned proteins, which will appear first with an empty target_decoy metavalue. Duplicate protein accessions & sequences will not raise a warning, but create multiple hits (PeptideIndexer scans over the FASTA file once for efficiency reasons, and thus might not see all accessions & sequences at once).
All peptide and protein hits are annotated with target/decoy information, using the meta value "target_decoy". For proteins the possible values are "target" and "decoy", depending on whether the protein accession contains the decoy pattern (parameter decoy_string) as a prefix or suffix (see parameter prefix).
Peptide hits are annotated with metavalue 'protein_references', and if matched to at least one protein also with metavalue 'target_decoy'. The possible values for 'target_decoy' are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. The metavalue is not present if the peptide is unmatched.
Runtime: PeptideIndexer is usually very fast (loading and storing the data takes the most time), and search speed can be further improved (linearly) by using more threads. Avoid allowing too many (>=4) ambiguous amino acids if your database contains long stretches of 'X', since the search space grows exponentially.
- Parameters
-
proteins | A list of proteins, either read piecewise from a FASTA file or given as an existing vector of FASTAEntries. |
prot_ids | Resulting protein identifications associated with pep_ids (will be re-written completely) |
pep_ids | Peptide identifications which should be searched within proteins and then linked to prot_ids |
- Returns
- Exit status codes.
References SysInfo::MemUsage::after(), DEBUG_ONLY, SysInfo::MemUsage::delta(), FASTAFile::FASTAEntry::description, PeptideIndexing::FoundProteinFunctor::filter_passed, PeptideIndexing::FoundProteinFunctor::filter_rejected, StopWatch::getClockTime(), EnzymaticDigestion::getEnzymeName(), EnzymaticDigestion::getSpecificity(), EnzymaticDigestion::getSpecificityByName(), Map< Key, T >::has(), String::has(), String::hasPrefix(), String::hasSuffix(), AhoCorasickAmbiguous::initPattern(), seqan::isAmbiguous(), LOG_ERROR, LOG_INFO, LOG_WARN, PeptideIndexing::FoundProteinFunctor::merge(), PeptideIndexing::FoundProteinFunctor::pep_to_prot, String::remove(), StopWatch::reset(), FASTAFile::FASTAEntry::sequence, ProteinHit::setAccession(), ProteinHit::setDescription(), ProteaseDigestion::setEnzyme(), MetaInfoInterface::setMetaValue(), ProteinHit::setSequence(), EnzymaticDigestion::setSpecificity(), EnzymaticDigestion::SPEC_NONE, EnzymaticDigestion::SPEC_SEMI, StopWatch::start(), StopWatch::stop(), String::substitute(), String::substr(), and StopWatch::toString().