OpenMS
This class holds the functionality of calculating the database suitability. More...
#include <OpenMS/QC/DBSuitability.h>
Classes | |
struct | SuitabilityData |
struct to store results More... | |
Public Member Functions | |
DBSuitability () | |
~DBSuitability () override=default | |
Destructor. More... | |
void | compute (std::vector< PeptideIdentification > &&pep_ids, const MSExperiment &exp, const std::vector< FASTAFile::FASTAEntry > &original_fasta, const std::vector< FASTAFile::FASTAEntry > &novo_fasta, const ProteinIdentification::SearchParameters &search_params) |
Computes suitability of a database used to search a mzML. More... | |
const std::vector< SuitabilityData > & | getResults () const |
Returns results calculated by this metric. More... | |
Public Member Functions inherited from DefaultParamHandler | |
DefaultParamHandler (const String &name) | |
Constructor with name that is displayed in error messages. More... | |
DefaultParamHandler (const DefaultParamHandler &rhs) | |
Copy constructor. More... | |
virtual | ~DefaultParamHandler () |
Destructor. More... | |
DefaultParamHandler & | operator= (const DefaultParamHandler &rhs) |
Assignment operator. More... | |
virtual bool | operator== (const DefaultParamHandler &rhs) const |
Equality operator. More... | |
void | setParameters (const Param &param) |
Sets the parameters. More... | |
const Param & | getParameters () const |
Non-mutable access to the parameters. More... | |
const Param & | getDefaults () const |
Non-mutable access to the default parameters. More... | |
const String & | getName () const |
Non-mutable access to the name. More... | |
void | setName (const String &name) |
Mutable access to the name. More... | |
const std::vector< String > & | getSubsections () const |
Non-mutable access to the registered subsections. More... | |
Private Member Functions | |
double | getDecoyDiff_ (const PeptideIdentification &pep_id) const |
Calculates the xcorr difference between the top two hits marked as decoy. More... | |
double | getDecoyCutOff_ (const std::vector< PeptideIdentification > &pep_ids, double reranking_cutoff_percentile) const |
Calculates a xcorr cut-off based on decoy hits. More... | |
bool | isNovoHit_ (const PeptideHit &hit) const |
Tests if a PeptideHit is considered a deNovo hit. More... | |
bool | checkScoreBetterThanThreshold_ (const PeptideHit &hit, double threshold, bool higher_score_better) const |
Tests if a PeptideHit has a score better than the given threshold. More... | |
std::pair< String, Param > | extractSearchAdapterInfoFromMetaValues_ (const ProteinIdentification::SearchParameters &meta_values) const |
Looks through meta values of SearchParameters to find out which search adapter was used. More... | |
void | writeIniFile_ (const Param &parameters, const String &filename) const |
Writes parameters into a given file. More... | |
std::vector< PeptideIdentification > | runIdentificationSearch_ (const MSExperiment &exp, const std::vector< FASTAFile::FASTAEntry > &fasta_data, const String &adapter_name, Param &parameters) const |
Executes the workflow from search adapter, followed by PeptideIndexer and finishes with FDR. More... | |
std::vector< FASTAFile::FASTAEntry > | getSubsampledFasta_ (const std::vector< FASTAFile::FASTAEntry > &fasta_data, double subsampling_rate) const |
Creates a subsampled fasta with the given subsampling rate. More... | |
void | calculateSuitability_ (const std::vector< PeptideIdentification > &pep_ids, SuitabilityData &data) const |
Calculates all suitability data from a combined deNovo+database search. More... | |
void | appendDecoys_ (std::vector< FASTAFile::FASTAEntry > &fasta) const |
Calculates and appends decoys to a given vector of FASTAEntry. More... | |
double | extractScore_ (const PeptideHit &pep_hit) const |
Returns the cross correlation score normalized by MW (if existing), else if the 'force' flag is set the current main score is returned. More... | |
double | calculateCorrectionFactor_ (const SuitabilityData &data, const SuitabilityData &data_sampled, double sampling_rate) const |
Calculates the correction factor from two suitability calculations. More... | |
UInt | numberOfUniqueProteins_ (const std::vector< PeptideIdentification > &peps, UInt number_of_hits=1) const |
Determines the number of unique proteins found in the protein accessions of PeptideIdentifications. More... | |
Size | getIndexWithMedianNovoHits_ (const std::vector< SuitabilityData > &data) const |
Finds the SuitabilityData object with the median number of de novo hits. More... | |
double | getScoreMatchingFDR_ (const std::vector< PeptideIdentification > &pep_ids, double FDR, const String &score_name, bool higher_score_better) const |
Extracts the worst score that still passes a FDR (q-value) threshold. More... | |
Private Attributes | |
std::vector< SuitabilityData > | results_ |
result vector More... | |
const boost::regex | decoy_pattern_ |
pattern for finding a decoy string More... | |
Friends | |
class | DBSuitability_friend |
To test private member functions. More... | |
Additional Inherited Members | |
Static Public Member Functions inherited from DefaultParamHandler | |
static void | writeParametersToMetaValues (const Param &write_this, MetaInfoInterface &write_here, const String &key_prefix="") |
Writes all parameters to meta values. More... | |
Protected Member Functions inherited from DefaultParamHandler | |
virtual void | updateMembers_ () |
This method is used to update extra member variables at the end of the setParameters() method. More... | |
void | defaultsToParam_ () |
Updates the parameters after the defaults have been set in the constructor. More... | |
Protected Attributes inherited from DefaultParamHandler | |
Param | param_ |
Container for current parameters. More... | |
Param | defaults_ |
Container for default parameters. This member should be filled in the constructor of derived classes! More... | |
std::vector< String > | subsections_ |
Container for registered subsections. This member should be filled in the constructor of derived classes! More... | |
String | error_name_ |
Name that is displayed in error messages during the parameter checking. More... | |
bool | check_defaults_ |
If this member is set to false, no checking of parameters is done. More... | |
bool | warn_empty_defaults_ |
If this member is set to false, no warning is emitted when defaults are empty. More... | |
This class holds the functionality of calculating the database suitability.
To calculate the suitability of a database for an identification search of a specific mzML, it is vital to perform a combined deNovo+database identification search. This means that the database should be extended with an additional entry derived from concatenated deNovo sequences from said mzML. Currently only Comet search is supported.
This class will calculate q-values by itself and will throw an error if any q-value calculation was done beforehand.
The algorithm parameters can be set using setParameters().
The compute function can be called multiple times. The result of each call is stored internally in a vector, so old results are not overwritten by a new call. This vector can be retrieved using getResults().
This class serves as the library representation of DatabaseSuitability.
DBSuitability | ( | ) |
Constructor.
Settings are initialized with their default values: no_rerank = false, reranking_cutoff_percentile = 1, FDR = 0.01
|
override default |
Destructor.
|
private |
Calculates and appends decoys to a given vector of FASTAEntry.
Each sequence is digested with Trypsin. The resulting peptides are reversed and appended to one another. This results in the decoy sequences. The identifier is given a 'DECOY_' prefix.
fasta | reference to fasta vector where the decoys are needed |
Referenced by DBSuitability_friend::appendDecoys().
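The decoy construction described above can be illustrated with a small, library-independent sketch. It is not the OpenMS implementation: `FastaEntry` is a hypothetical stand-in for FASTAFile::FASTAEntry, and the digestion uses the simplified rule "cleave after K or R unless followed by P" with no missed cleavages, which is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for FASTAFile::FASTAEntry (only the fields used here).
struct FastaEntry
{
  std::string identifier;
  std::string sequence;
};

// Simplified tryptic digestion (assumption): cleave after K or R unless the
// next residue is P. Missed cleavages are not modelled.
std::vector<std::string> digestTrypsin(const std::string& seq)
{
  std::vector<std::string> peptides;
  std::string current;
  for (std::size_t i = 0; i < seq.size(); ++i)
  {
    current += seq[i];
    const bool cleavage_site = (seq[i] == 'K' || seq[i] == 'R');
    const bool blocked = (i + 1 < seq.size() && seq[i + 1] == 'P');
    if (cleavage_site && !blocked)
    {
      peptides.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) peptides.push_back(current);
  return peptides;
}

// Appends one decoy per target entry: every tryptic peptide is reversed and
// the reversed peptides are concatenated in their original order.
void appendDecoys(std::vector<FastaEntry>& fasta)
{
  const std::size_t n_targets = fasta.size();
  fasta.reserve(2 * n_targets);
  for (std::size_t i = 0; i < n_targets; ++i)
  {
    std::string decoy_seq;
    for (std::string peptide : digestTrypsin(fasta[i].sequence))
    {
      std::reverse(peptide.begin(), peptide.end());
      decoy_seq += peptide;
    }
    fasta.push_back({"DECOY_" + fasta[i].identifier, decoy_seq});
  }
}
```

With this sketch, the sequence AKRPGK is digested into AK and RPGK, so the decoy sequence becomes KAKGPR.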
|
private |
Calculates the correction factor from two suitability calculations.
Two suitability calculations need to be done for this: one with the original data and one with data from a search with a sampled database. The numbers of db hits and deNovo hits behave linearly with respect to the sampling rate. The two searches can then be used to calculate the corresponding linear functions. The factor is calculated as the negative ratio of the db slope and the deNovo slope.
data | suitability data from the original search |
data_sampled | suitability data from the sampled search |
sampling_rate | the sampling rate used for sampled db [0,1) |
Referenced by DBSuitability_friend::calculateCorrectionFactor().
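As a worked illustration of the slope ratio, the following sketch fits both lines through the two available points, (1, full search) and (sampling_rate, sampled search). The struct `HitCounts` and the function name are hypothetical and only mirror the description above; they are not the OpenMS types.

```cpp
#include <stdexcept>

// Hypothetical, reduced stand-in for SuitabilityData: only the two hit counts.
struct HitCounts
{
  double top_db;    // number of top-scoring database hits
  double top_novo;  // number of top-scoring deNovo hits
};

// Both counts are assumed to be linear in the sampling rate, so each line is
// fixed by two points: (1.0, full) and (sampling_rate, sampled).
double correctionFactor(const HitCounts& full, const HitCounts& sampled, double sampling_rate)
{
  if (sampling_rate < 0.0 || sampling_rate >= 1.0)
  {
    throw std::invalid_argument("sampling_rate must be in [0,1)");
  }
  const double db_slope = (sampled.top_db - full.top_db) / (sampling_rate - 1.0);
  const double novo_slope = (sampled.top_novo - full.top_novo) / (sampling_rate - 1.0);
  if (novo_slope == 0.0)
  {
    throw std::domain_error("deNovo hit count did not change; slope is zero");
  }
  // Negative ratio of the db slope and the deNovo slope.
  return -db_slope / novo_slope;
}
```

For example, 800 db hits and 200 deNovo hits in the full search versus 400 and 400 at a sampling rate of 0.5 give a db slope of 800, a deNovo slope of -400 and therefore a factor of 2.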
|
private |
Calculates all suitability data from a combined deNovo+database search.
Counts top database and top deNovo hits.
Calculates a decoy score cut-off to compare high-scoring deNovo hits with lower-scoring database hits. If the score difference is smaller than the cut-off, the database hit is counted and the deNovo hit is ignored.
Suitability is calculated: # database hits / # all hits
pep_ids | peptide identifications coming from the combined search, each peptide identification should be sorted |
data | SuitabilityData object where the result should be written into |
MissingInformation | if no target/decoy annotation is found on pep_ids |
MissingInformation | if no xcorr is found; this happens when an adapter other than CometAdapter was used |
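The counting and re-ranking rule can be sketched without any OpenMS types. In the sketch below each spectrum is reduced to a hypothetical summary record (`SpectrumTopHits`), and the decoy cut-off is passed in as a plain number; how the real class extracts these values from PeptideIdentification objects is not shown here.

```cpp
#include <vector>

// Hypothetical summary of one spectrum from the combined search.
struct SpectrumTopHits
{
  bool top_is_novo;   // true if the best-scoring hit is a deNovo hit
  bool has_db_hit;    // true if a database hit exists below the deNovo hit
  double score_diff;  // score(top deNovo hit) - score(best database hit)
};

struct SuitabilityCounts
{
  unsigned top_db = 0;
  unsigned top_novo = 0;
  double suitability = 0.0;
};

SuitabilityCounts calculateSuitability(const std::vector<SpectrumTopHits>& spectra, double decoy_cut_off)
{
  SuitabilityCounts result;
  for (const SpectrumTopHits& s : spectra)
  {
    if (!s.top_is_novo)
    {
      ++result.top_db;  // database hit is already on top
    }
    else if (s.has_db_hit && s.score_diff < decoy_cut_off)
    {
      ++result.top_db;  // re-ranked: deNovo hit is only marginally better
    }
    else
    {
      ++result.top_novo;
    }
  }
  const unsigned all_hits = result.top_db + result.top_novo;
  if (all_hits > 0)
  {
    result.suitability = static_cast<double>(result.top_db) / all_hits;
  }
  return result;
}
```

Setting the cut-off to 0 in this sketch corresponds to switching re-ranking off, since no deNovo hit is then replaced by a database hit.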
|
private |
Tests if a PeptideHit has a score better than the given threshold.
hit | PepHit in question |
threshold | threshold to check against |
higher_score_better | true/false depending on whether a higher or a lower score is better |
void compute (std::vector< PeptideIdentification > &&pep_ids, const MSExperiment &exp, const std::vector< FASTAFile::FASTAEntry > &original_fasta, const std::vector< FASTAFile::FASTAEntry > &novo_fasta, const ProteinIdentification::SearchParameters &search_params)
Computes suitability of a database used to search a mzML.
Top deNovo and top database hits from a combined deNovo+database search are counted. The ratio of db hits to all hits yields the suitability. To re-rank cases where a de novo peptide scores just higher than the database peptide, a decoy cut-off is calculated. This functionality can be turned off. Doing so will result in an underestimated suitability, but it can solve problems like different search engines or too few decoy hits.
Parameters can be set using the functionality of DefaultParamHandler. Parameters are:
no_rerank - re-ranking can be turned off with this (will be set automatically if no cross correlation score is found)
reranking_cutoff_percentile - percentile that determines which cut-off will be returned
FDR - q-value that should be filtered for. Preliminary tests have shown that database suitability is rather stable across common FDR thresholds from 0 - 5 %.
keep_search_files - temporary files created for and by the internal ID search are kept
disable_correction - disables corrected suitability calculations
force - forces re-ranking to be done even without a cross correlation score, in which case the default main score is used
An attempt is then made to correct the calculated suitability. For this, a correction factor for the number of found top deNovo hits is calculated. This is done by performing an additional combined identification search with a smaller sample of the database. It was observed that the numbers of top deNovo and db hits behave linearly with respect to the sampling ratio of the database. This can be used to extrapolate the number of database hits that would be needed to get a suitability of 1. This number, in combination with the maximum number of deNovo hits (found with an identification search where only deNovo is used as a database), can be used to calculate a correction factor like this:
#database hits for suitability of 1 / #maximum deNovo hits
This formula can be simplified in a way that the maximum number of deNovo hits isn't needed:
- (database hits slope) / (deNovo hits slope)
Correcting the number of found top deNovo hits with this factor makes them more comparable to the top database hits. This in turn results in a more linear behaviour of the suitability with respect to the sampling ratio. The corrected suitability reflects what sampling ratio your database represents with regard to the theoretical 'perfect' database. In other words: your database needs to be (1 - corrected suitability) bigger to get a suitability of 1.
Both the original suitability as well as the corrected one are reported in the result.
Since q-values need to be calculated, the identifications are taken by copy. Since decoys need to be calculated for the FASTA input, those are taken by copy as well.
Result is appended to the result member. This allows for multiple usage.
pep_ids | vector containing pepIDs with target/decoy annotation coming from a deNovo+database identification search without FDR (Comet is recommended; to use other search engines either disable re-ranking or set the '-force' flag). The vector is modified internally and is thus copied. |
exp | MSExperiment that was searched to produce the identifications given in pep_ids |
original_fasta | FASTAEntries of the database used for the ID search (without decoys) |
novo_fasta | FASTAEntry derived from deNovo peptides |
search_params | SearchParameters object containing information which adapter was used with which settings for the identification search that resulted in pep_ids |
MissingInformation | if no target/decoy annotation is found on pep_ids |
MissingInformation | if no xcorr is found; this happens when an adapter other than CometAdapter was used |
Precondition | if a q-value is found in pep_ids |
|
private |
Returns the cross correlation score normalized by MW (if existing), else if the 'force' flag is set the current main score is returned.
pep_hit | PeptideHit of which the score is needed |
MissingInformation | if no xcorr is found and 'force' flag isn't set |
|
private |
Looks through meta values of SearchParameters to find out which search adapter was used.
Checks for the following adapters: CometAdapter, MSGFPlusAdapter, MSFraggerAdapter, MyriMatchAdapter, OMSSAAdapter and XTandemAdapter
meta_values | SearchParameters object, since the adapters write their parameters here |
MissingInformation | if none of the adapters above is found in the meta values |
|
private |
Calculates a xcorr cut-off based on decoy hits.
Decoy differences of all N pepIDs are calculated. The ((1 - reranking_cutoff_percentile) * N)-th highest difference is returned. It is assumed that this difference accounts for 'reranking_cutoff_percentile' of the re-ranking cases.
pep_ids | vector containing the pepIDs |
reranking_cutoff_percentile | percentile that determines which cut-off will be returned |
IllegalArgument | if reranking_cutoff_percentile isn't in range [0,1] |
IllegalArgument | if reranking_cutoff_percentile is too low for a decoy cut-off to be calculated |
MissingInformation | if no more than 20 % of the peptide IDs have two decoys in their top ten peptide hits |
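The percentile selection can be sketched as follows. This is an illustration only: the rounding of the rank, the use of the usable differences (rather than all N pepIDs) as the base of the rank, and the standard exception types are assumptions, and DBL_MAX is treated as the marker for identifications without two decoy hits.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// decoy_diffs holds one value per peptide identification; DBL_MAX marks an
// identification without two decoy hits.
double decoyCutOff(const std::vector<double>& decoy_diffs, double percentile)
{
  if (percentile < 0.0 || percentile > 1.0)
  {
    throw std::invalid_argument("percentile must be in [0,1]");
  }
  std::vector<double> usable;
  for (double d : decoy_diffs)
  {
    if (d < DBL_MAX) usable.push_back(d);
  }
  // More than 20 % of the identifications must provide a decoy difference.
  if (decoy_diffs.empty() || static_cast<double>(usable.size()) / decoy_diffs.size() <= 0.2)
  {
    throw std::runtime_error("too few identifications with two decoy hits");
  }
  // Zero-based rank counted from the largest difference (rounding is an assumption).
  const std::size_t rank = static_cast<std::size_t>(std::lround((1.0 - percentile) * usable.size()));
  if (rank >= usable.size())
  {
    throw std::invalid_argument("percentile is too low for a decoy cut-off");
  }
  std::sort(usable.begin(), usable.end(), std::greater<double>());
  return usable[rank];
}
```

With a percentile of 1 (the default named in the constructor documentation) the sketch returns the largest decoy difference.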
|
private |
Calculates the xcorr difference between the top two hits marked as decoy.
Searches for the top two decoy hits and returns their score difference. By default the xcorr from Comet is used. If no xcorr can be found and the 'force' flag is set, the main score from the peptide hit is used; otherwise an error is thrown.
If there aren't two decoys, DBL_MAX is returned.
pep_id | pepID from where the decoy difference will be calculated |
MissingInformation | if no target/decoy annotation is found |
MissingInformation | if no xcorr is found |
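A reduced sketch of the decoy difference is given below. `Hit` is a hypothetical record holding only a score and a decoy flag, and the hits are assumed to be sorted from best to worst score; the exception handling for missing annotations is left out.

```cpp
#include <cfloat>
#include <cmath>
#include <vector>

// Hypothetical reduced peptide hit.
struct Hit
{
  double score;
  bool is_decoy;
};

// Returns the score difference between the two best decoy hits, or DBL_MAX if
// fewer than two decoys are present. Hits are assumed to be sorted best first.
double decoyDiff(const std::vector<Hit>& hits)
{
  bool found_first = false;
  double first_score = 0.0;
  for (const Hit& h : hits)
  {
    if (!h.is_decoy) continue;
    if (!found_first)
    {
      first_score = h.score;
      found_first = true;
    }
    else
    {
      return std::abs(first_score - h.score);
    }
  }
  return DBL_MAX;
}
```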
|
private |
Finds the SuitabilityData object with the median number of de novo hits.
If the median isn't distinct (e.g. two entries could be considered median) the upper one is chosen.
data | vector of SuitabilityData objects |
Referenced by DBSuitability_friend::getIndexWithMedianNovoHits().
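The "upper median" rule can be shown with a short sketch that works on plain hit counts instead of SuitabilityData objects (an assumption made to keep the example self-contained). Indices are sorted by their count and the element at position size / 2 is taken, which selects the upper candidate for an even number of entries.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Returns the index of the entry with the median number of de novo hits.
// For an even number of entries the upper of the two candidates is chosen.
std::size_t indexWithMedianNovoHits(const std::vector<unsigned>& novo_hits)
{
  if (novo_hits.empty())
  {
    throw std::invalid_argument("no suitability data given");
  }
  std::vector<std::size_t> order(novo_hits.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&novo_hits](std::size_t a, std::size_t b) { return novo_hits[a] < novo_hits[b]; });
  return order[order.size() / 2];
}
```

For the counts 30, 10, 40, 20 the two median candidates are 20 and 30; the sketch returns index 0, the position of 30.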
const std::vector<SuitabilityData>& getResults | ( | ) | const |
Returns results calculated by this metric.
The returned vector contains one SuitabilityData object for each time compute was called. Each of these objects contains the suitability information that was extracted from the identifications used for the corresponding call of compute.
|
private |
Extracts the worst score that still passes a FDR (q-value) threshold.
This can be used to 'convert' a FDR threshold to a threshold for the desired score (score and FDR need to be dependent)
pep_ids | vector of PeptideIdentifications |
FDR | FDR threshold, hits with a worse q-value score aren't looked at |
score_name | name of the score to search for. The score name doesn't need to be the exact metavalue name, but a metavalue key should contain it, e.g. "e-value" for the metavalue "e-value_score" |
higher_score_better | true/false depending on whether a higher or lower score (score_name) is better |
IllegalArgument | if score_name isn't found in the metavalues |
Precondition | if main score of pep_ids isn't 'q-value' |
Referenced by DBSuitability_friend::getScoreMatchingFDR().
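The conversion from an FDR threshold to a score threshold can be sketched as a single pass over the hits. `ScoredHit` is a hypothetical record pairing a q-value with the score of interest; the lookup of the score by metavalue name is not modelled.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical reduced hit: q-value plus the score to be thresholded.
struct ScoredHit
{
  double q_value;
  double score;
};

// Returns the worst score among all hits whose q-value passes the threshold.
double scoreMatchingFDR(const std::vector<ScoredHit>& hits, double fdr, bool higher_score_better)
{
  bool found = false;
  double worst = 0.0;
  for (const ScoredHit& h : hits)
  {
    if (h.q_value > fdr) continue;  // hit does not pass the FDR threshold
    const bool is_worse = higher_score_better ? (h.score < worst) : (h.score > worst);
    if (!found || is_worse)
    {
      worst = h.score;
      found = true;
    }
  }
  if (!found)
  {
    throw std::runtime_error("no hit passes the FDR threshold");
  }
  return worst;
}
```

As the description notes, this is only meaningful when score and q-value are dependent, so that a worse score implies a worse q-value.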
|
private |
Creates a subsampled fasta with the given subsampling rate.
The subsampling is based on the number of amino acids and not on the number of fasta entries.
fasta_data | fasta of which the subsampling should be done |
subsampling_rate | subsampling rate to be used [0,1] |
IllegalArgument | if subsampling rate is not between 0 and 1 |
Referenced by DBSuitability_friend::getSubsampledFasta().
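One way to subsample by amino acid count rather than by entry count is sketched below: the entries are shuffled and then added until the requested fraction of all amino acids has been reached. `FastaEntry`, the fixed seed and the use of `std::mt19937` are assumptions for illustration and do not describe the random source used by OpenMS.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for FASTAFile::FASTAEntry.
struct FastaEntry
{
  std::string identifier;
  std::string sequence;
};

std::size_t countAminoAcids(const std::vector<FastaEntry>& fasta)
{
  std::size_t total = 0;
  for (const FastaEntry& e : fasta) total += e.sequence.size();
  return total;
}

// Takes the input by value so the caller's vector is not reordered.
std::vector<FastaEntry> subsampleFasta(std::vector<FastaEntry> fasta, double rate, unsigned seed = 42)
{
  if (rate < 0.0 || rate > 1.0)
  {
    throw std::invalid_argument("subsampling rate must be between 0 and 1");
  }
  const double target = rate * static_cast<double>(countAminoAcids(fasta));
  std::mt19937 rng(seed);
  std::shuffle(fasta.begin(), fasta.end(), rng);

  std::vector<FastaEntry> sample;
  std::size_t written = 0;
  for (const FastaEntry& e : fasta)
  {
    if (static_cast<double>(written) >= target) break;
    written += e.sequence.size();
    sample.push_back(e);
  }
  return sample;
}
```

Because whole entries are added, the sample can overshoot the target by at most the length of the last entry taken.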
|
private |
Tests if a PeptideHit is considered a deNovo hit.
To test this the function looks into the protein accessions. If only the deNovo protein is found, 'true' is returned. If at least one database protein is found, 'false' is returned.
This function also uses boost::regex_search to make sure the deNovo accession doesn't contain a decoy string. This is needed for 'target+decoy' hits.
hit | PepHit in question |
|
private |
Determines the number of unique proteins found in the protein accessions of PeptideIdentifications.
peps | vector of PeptideIdentifications |
number_of_hits | the number of hits to search in (if this is bigger than the actual number of hits all hits are looked at) |
MissingInformation | if no target/decoy annotation is found on peps |
Referenced by DBSuitability_friend::numberOfUniqueProteins().
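Counting unique proteins over the top hits can be sketched with a set of accession strings. The type aliases below are hypothetical simplifications (a hit is just its list of accessions), and any special handling of decoy hits or missing target/decoy annotation is deliberately left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Hit = std::vector<std::string>;     // protein accessions of one peptide hit
using Identification = std::vector<Hit>;  // hits of one spectrum, best first

// Counts distinct accessions among the first number_of_hits hits of each
// identification. If fewer hits exist, all of them are looked at.
std::size_t numberOfUniqueProteins(const std::vector<Identification>& ids, std::size_t number_of_hits = 1)
{
  std::set<std::string> unique_accessions;
  for (const Identification& id : ids)
  {
    const std::size_t n = std::min(number_of_hits, id.size());
    for (std::size_t i = 0; i < n; ++i)
    {
      unique_accessions.insert(id[i].begin(), id[i].end());
    }
  }
  return unique_accessions.size();
}
```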
|
private |
Executes the workflow from search adapter, followed by PeptideIndexer and finishes with FDR.
Which adapter should run with which parameters can be controlled. Make sure the search adapter you wish to use is built on your system and the executable is on your PATH variable.
Indexing and FDR are always done the same way.
The inputs are stored in temporary files to execute the Adapter. (MSExperiment -> .mzML, vector<FASTAEntry> -> .fasta, Param -> .INI)
exp | MSExperiment that will be searched |
fasta_data | represents the database that should be used to search |
adapter_name | name of the adapter to search with |
parameters | parameters for the adapter |
MissingInformation | if no adapter name is given |
InvalidParameter | if an unsupported adapter name is given |
InternalToolError | if any error occurs while running the adapter |
InternalToolError | if any error occurs while running PeptideIndexer functionalities |
InvalidParameter | if the needed FDR parameters are not found |
Writes parameters into a given file.
parameters | parameters to write |
filename | name of the file where the parameters should be written to |
UnableToCreateFile | if filename isn't writable |
|
friend |
To test private member functions.
|
private |
pattern for finding a decoy string
|
private |
result vector