OpenMS
Fragment-index-based peptide database search algorithm (experimental).
#include <OpenMS/ANALYSIS/ID/ProSEAlgorithm.h>
Classes | |
| struct | AnnotatedHit_ |
| Slimmer structure as storing all scored candidates in PeptideHit objects takes too much space. | |
| struct | CalibrationResult_ |
| Result of a calibration pass. | |
| struct | DecoyStrategy_ |
| Resolved decoy handling for one concrete input database. | |
| struct | MultiFileSearchResult |
| Multi-file search result bundle. | |
| struct | RunStatistics |
| Per-run identification statistics for the end-of-search report. | |
| struct | SearchContext |
| Prepared per-database state shared across multiple spectrum files. | |
| struct | SearchResult |
| Comprehensive search result including modification analysis. | |
| struct | SharedSearchStats |
| Configuration, database and fragment-index facts shared across all input files of one ProSE invocation. | |
Public Types | |
| enum class | ExitCodes { EXECUTION_OK , INPUT_FILE_EMPTY , UNEXPECTED_RESULT , UNKNOWN_ERROR , ILLEGAL_PARAMETERS } |
| Exit codes. | |
Public Types inherited from ProgressLogger | |
| enum | LogType { CMD , GUI , NONE } |
| Possible log types. | |
Public Member Functions | |
| ProSEAlgorithm () | |
| ExitCodes | search (const std::string &in_spectra, const std::string &in_db, std::vector< ProteinIdentification > &prot_ids, PeptideIdentificationList &pep_ids) const |
| Search spectra in a spectrum file (mzML or Bruker .d) against a protein database using an FI-backed workflow. | |
| SearchResult | searchWithModificationAnalysis (const std::string &in_spectra, const std::string &in_db, const std::string &output_base_name="") const |
| Search with comprehensive results including modification analysis tables. | |
| ExitCodes | search (PeakMap &spectra, const std::vector< FASTAFile::FASTAEntry > &fasta_db, std::vector< ProteinIdentification > &prot_ids, PeptideIdentificationList &pep_ids) const |
| In-memory search: search spectra against a protein database without file I/O. | |
| SearchContext | prepareContext (const std::vector< FASTAFile::FASTAEntry > &fasta_db) const |
| Build a SearchContext (decoy-augmented database + FragmentIndex) for reuse. | |
| ExitCodes | search (PeakMap &spectra, SearchContext &ctx, std::vector< ProteinIdentification > &prot_ids, PeptideIdentificationList &pep_ids) const |
| In-memory search using a pre-built SearchContext. | |
| SearchResult | searchWithModificationAnalysis (PeakMap &spectra, const std::vector< FASTAFile::FASTAEntry > &fasta_db, const std::string &output_base_name="") const |
| In-memory search with modification analysis: no file I/O required. | |
| MultiFileSearchResult | searchWithModificationAnalysis (const std::vector< std::string > &in_spectra_files, const std::vector< FASTAFile::FASTAEntry > &fasta_db, const std::vector< std::string > &output_base_names={}, const std::string &aggregate_base_name="") const |
| Multi-file search with modification analysis (in-memory FASTA). | |
| MultiFileSearchResult | searchWithModificationAnalysis (const std::vector< std::string > &in_spectra_files, const std::string &in_db, const std::vector< std::string > &output_base_names={}, const std::string &aggregate_base_name="") const |
| Multi-file search with modification analysis (FASTA file path). | |
Public Member Functions inherited from DefaultParamHandler | |
| DefaultParamHandler (const std::string &name) | |
| Constructor with name that is displayed in error messages. | |
| DefaultParamHandler (const DefaultParamHandler &rhs) | |
| Copy constructor. | |
| virtual | ~DefaultParamHandler () |
| Destructor. | |
| DefaultParamHandler & | operator= (const DefaultParamHandler &rhs) |
| Assignment operator. | |
| virtual bool | operator== (const DefaultParamHandler &rhs) const |
| Equality operator. | |
| void | setParameters (const Param &param) |
| Sets the parameters. | |
| const Param & | getParameters () const |
| Non-mutable access to the parameters. | |
| const Param & | getDefaults () const |
| Non-mutable access to the default parameters. | |
| const std::string & | getName () const |
| Non-mutable access to the name. | |
| void | setName (const std::string &name) |
| Mutable access to the name. | |
| const std::vector< std::string > & | getSubsections () const |
| Non-mutable access to the registered subsections. | |
Public Member Functions inherited from ProgressLogger | |
| ProgressLogger () | |
| Constructor. | |
| virtual | ~ProgressLogger () |
| Destructor. | |
| ProgressLogger (const ProgressLogger &other) | |
| Copy constructor. | |
| ProgressLogger & | operator= (const ProgressLogger &other) |
| Assignment Operator. | |
| void | setLogType (LogType type) const |
| Sets the progress log that should be used. The default type is NONE! | |
| LogType | getLogType () const |
| Returns the type of progress log being used. | |
| void | setLogger (ProgressLoggerImpl *logger) |
| Sets the logger to be used for progress logging. | |
| void | startProgress (SignedSize begin, SignedSize end, const std::string &label) const |
| Initializes the progress display. | |
| void | setProgress (SignedSize value) const |
| Sets the current progress. | |
| void | endProgress (UInt64 bytes_processed=0) const |
| void | nextProgress () const |
| increment progress by 1 (according to range begin-end) | |
Static Public Member Functions | |
| static void | applyCompleteSetProteinFDR (std::vector< ProteinIdentification > &protein_ids, PeptideIdentificationList &peptide_ids, const std::string &decoy_string, bool decoy_is_prefix, double protein_fdr) |
| Finalize protein-level FDR on a COMPLETE protein set (a single input file, or a merged cross-file aggregate). | |
| static void | updateFinalStats (RunStatistics &stats, const PeptideIdentificationList &peptide_ids, const std::string &enzyme, bool fdr_applied) |
| static void | renderRunSummary (const RunStatistics &stats, const SharedSearchStats &shared, const OpenSearchModificationAnalysis::OpenSearchAnalysisResult &mod_analysis, bool is_open_search, std::ostream &os) |
| static void | renderModificationSummary (const OpenSearchModificationAnalysis::OpenSearchAnalysisResult &mod_analysis, std::ostream &os) |
| static std::string | renderRunSummaryYaml (const MultiFileSearchResult &mfres, const std::vector< std::pair< std::string, std::vector< std::string > > > &manifest, Size files_failed, Size files_total) |
Static Public Member Functions inherited from DefaultParamHandler | |
| static void | writeParametersToMetaValues (const Param &write_this, MetaInfoInterface &write_here, const std::string &key_prefix="") |
| Writes all parameters to meta values. | |
Protected Types | |
| enum class | DecoyMode_ { AUTO , GENERATE , IGNORE } |
| How decoys are obtained/recognised for a search (parameter "decoys"). | |
Protected Member Functions | |
| void | updateMembers_ () override |
| This method is used to update extra member variables at the end of the setParameters() method. | |
| DecoyStrategy_ | resolveDecoyStrategy_ (const std::vector< FASTAFile::FASTAEntry > &db) const |
| Decide how to obtain/recognise decoys for db. | |
| Param | fragmentIndexParameters_ () const |
| ProSE parameters made safe to hand to a FragmentIndex. | |
| std::vector< FASTAFile::FASTAEntry > | buildDecoyAugmentedDB_ (const std::vector< FASTAFile::FASTAEntry > &fasta_db, const DecoyStrategy_ &strategy) const |
| Build the searched database according to strategy. | |
| std::vector< FASTAFile::FASTAEntry > | buildCalibrationSample_ (const std::vector< FASTAFile::FASTAEntry > &full_db) const |
| Build a strided protein sample for chunked calibration. | |
| ExitCodes | searchChunked_ (PeakMap &spectra, std::vector< FASTAFile::FASTAEntry > &full_db, const DecoyStrategy_ &strategy, std::vector< ProteinIdentification > &protein_ids, PeptideIdentificationList &peptide_ids) const |
| Chunked database search implementation. | |
| void | scoreSpectraAgainstIndex_ (const PeakMap &spectra, FragmentIndex &fi, const std::vector< FASTAFile::FASTAEntry > &db, const TheoreticalSpectrumGenerator &spectrum_generator, double effective_fragment_tol, bool fragment_mass_tolerance_unit_ppm, bool open_search_mode, std::vector< std::vector< AnnotatedHit_ > > &annotated_hits, const std::string &progress_label) const |
| Score all spectra against one FragmentIndex. | |
| void | postProcessHits_ (const PeakMap &exp, std::vector< std::vector< ProSEAlgorithm::AnnotatedHit_ > > &annotated_hits, std::vector< ProteinIdentification > &protein_ids, PeptideIdentificationList &peptide_ids, Size top_hits, const StringList &modifications_fixed, const StringList &modifications_variable, Int peptide_missed_cleavages, double precursor_mass_tolerance, double fragment_mass_tolerance, const std::string &precursor_mass_tolerance_unit_ppm, const std::string &fragment_mass_tolerance_unit_ppm, const Int precursor_min_charge, const Int precursor_max_charge, const std::string &enzyme, const std::string &database_name) const |
| Filter and annotate search results. | |
| double | computeModMatchTolerance_ () const |
| CalibrationResult_ | runCalibrationPass_ (PeakMap &spectra, FragmentIndex &fragment_index, const std::vector< FASTAFile::FASTAEntry > &db) const |
| Run a fast calibration pass on a subset of spectra to estimate mass accuracy. | |
| void | collectRunStatistics_ (const PeakMap &spectra, const std::vector< ProteinIdentification > &protein_ids, const PeptideIdentificationList &peptide_ids, RunStatistics &stats) const |
Protected Member Functions inherited from DefaultParamHandler | |
| void | defaultsToParam_ () |
| Updates the parameters after the defaults have been set in the constructor. | |
Static Protected Member Functions | |
| static void | preprocessSpectra_ (PeakMap &exp, double fragment_mass_tolerance, bool fragment_mass_tolerance_unit_ppm, bool deisotope_requested, Size peaks_keep_n, Int peaks_window_top) |
| filter, deisotope, decharge spectra | |
| static bool | accessionHasDecoyMarker_ (const std::string &accession, const std::string &marker, bool is_prefix) |
| static void | capturePreFdrStats_ (const PeptideIdentificationList &peptide_ids, RunStatistics &stats) |
| static double | maxRetainedScore_ (const PeptideIdentificationList &peptide_ids) |
Protected Attributes | |
| double | precursor_mass_tolerance_lower_ {10.0} |
| positive magnitude (default matches the param) | |
| double | precursor_mass_tolerance_upper_ {10.0} |
| positive magnitude (default matches the param) | |
| std::string | precursor_mass_tolerance_unit_ {"ppm"} |
| Size | precursor_min_charge_ |
| Size | precursor_max_charge_ |
| IntList | precursor_isotopes_ |
| double | fragment_mass_tolerance_ |
| std::string | fragment_mass_tolerance_unit_ |
| bool | deisotope_requested_ {true} |
| Size | peaks_keep_n_ {0} |
| NLargest cap on MS2 peaks before scoring; 0 = resolution-aware auto (peaks:keep_n) | |
| Int | peaks_window_top_ {20} |
| WindowMower peaks-per-100Da before scoring (peaks:window_top) | |
| StringList | modifications_fixed_ |
| StringList | modifications_variable_ |
| Size | modifications_max_variable_mods_per_peptide_ |
| std::string | enzyme_ |
| DecoyMode_ | decoy_mode_ {DecoyMode_::AUTO} |
| std::string | decoy_prefix_ |
| double | fdr_psm_ {0.0} |
| double | fdr_protein_ {0.0} |
| StringList | annotate_psm_ |
| Size | peptide_min_size_ |
| Size | peptide_max_size_ |
| Size | peptide_missed_cleavages_ |
| EnzymaticDigestion::Specificity | peptide_enzyme_specificity_ {EnzymaticDigestion::SPEC_FULL} |
| std::string | peptide_motif_ |
| Size | report_top_hits_ |
| bool | add_a_ions_ {false} |
| bool | add_b_ions_ {true} |
| bool | add_c_ions_ {false} |
| bool | add_x_ions_ {false} |
| bool | add_y_ions_ {true} |
| bool | add_z_ions_ {false} |
| Size | database_chunk_size_ {0} |
| 0 = disabled; >0 = chunk DB into groups of this many proteins | |
| bool | calibration_enabled_ {false} |
| double | calibration_subset_ratio_ {0.1} |
| Size | calibration_min_psms_ {50} |
| CalibrationResult_ | last_calibration_result_ |
| RunStatistics | last_run_stats_ |
| double | last_mod_match_tolerance_used_ {-1.0} |
Protected Attributes inherited from DefaultParamHandler | |
| Param | param_ |
| Container for current parameters. | |
| Param | defaults_ |
| Container for default parameters. This member should be filled in the constructor of derived classes! | |
| std::vector< std::string > | subsections_ |
| Container for registered subsections. This member should be filled in the constructor of derived classes! | |
| std::string | error_name_ |
| Name that is displayed in error messages during the parameter checking. | |
| bool | check_defaults_ |
| If this member is set to false, no checking of the parameters is done. | |
| bool | warn_empty_defaults_ |
| If this member is set to false, no warning is emitted when defaults are empty. | |
Protected Attributes inherited from ProgressLogger | |
| LogType | type_ |
| time_t | last_invoke_ |
| ProgressLoggerImpl * | current_logger_ |
Private Member Functions | |
| bool | isOpenSearchMode_ () const |
| Helper function to determine if open search should be used based on tolerance. | |
Additional Inherited Members | |
Static Protected Attributes inherited from ProgressLogger | |
| static int | recursion_depth_ |
Fragment-index-based peptide database search algorithm (experimental).
Provides a self-contained search engine that matches MS/MS spectra against a protein database using an FI (Fragment Index). Typical usage:
Notes:
| struct OpenMS::ProSEAlgorithm::CalibrationResult_ |
Result of a calibration pass.
Holds the estimated precursor and fragment tolerances computed from confident PSMs during the calibration pass. When success is false, the tolerance values are undefined and should not be used.
| struct OpenMS::ProSEAlgorithm::DecoyStrategy_ |
Resolved decoy handling for one concrete input database.
Produced by resolveDecoyStrategy_() and consumed by buildDecoyAugmentedDB_() and the downstream PeptideIndexing / FDR steps, so the same decoys that are searched are also the ones scored.
| Class Members | ||
|---|---|---|
| string | decoy_string | effective marker for PeptideIndexing + protein FDR |
| bool | generate {false} | reverse target proteins to synthesise decoys |
| bool | have_decoys {false} | searched DB will contain decoys (FDR possible) |
| bool | is_prefix {true} | position of decoy_string |
| bool | strip_existing {false} | drop pre-existing decoy entries before searching |
| bool | strip_is_prefix {true} | position of strip_string |
| string | strip_string | marker of pre-existing decoys to strip |
| struct OpenMS::ProSEAlgorithm::MultiFileSearchResult |
Multi-file search result bundle.
Returned by the file-list searchWithModificationAnalysis() overloads. Holds one SearchResult per input file (in per_file, in input order) and a single aggregate result whose peptide_ids are the concatenation of all per-file PSMs and whose modification_analysis is computed once on the pooled set of PSMs.
Special cases for aggregate:
For a single input file, aggregate is left almost-empty (only is_open_search and exit_code are set), because a single-file pooled aggregate would just duplicate per_file[0] and re-run modification analysis on the same PSMs. Callers should use per_file[0] for the result in this case.
aggregate.exit_code is set to the first non-OK per-file exit code (so callers can inspect it without walking the per_file vector).
The aggregate's protein_ids template is taken from the first successful per-file result (search parameters are identical across files by construction), with the primary MS run path overwritten to list every input file.
| Class Members | ||
|---|---|---|
| SearchResult | aggregate | |
| bool | decoy_is_prefix = true | Position of decoy_string (true = prefix, false = suffix). |
| string | decoy_string | Effective decoy marker resolved from the shared database, for a caller-side merged-PSM protein-FDR step (e.g. the ProSE TOPP tool's -out_merged path). Empty when the search was target-only (decoys=ignore). |
| bool | have_decoys = false | True when the searched databases contained decoys (FDR possible). |
| vector< SearchResult > | per_file | |
| SharedSearchStats | shared | Configuration / database / fragment-index facts shared across all input files (the index is built once and reused), for the end-of-search report. |
| struct OpenMS::ProSEAlgorithm::RunStatistics |
Per-run identification statistics for the end-of-search report.
Populated by collectRunStatistics_() once a single spectrum file has been searched (post-FDR), plus a few fields captured at well-defined points during search() (target/decoy counts pre-FDR, achieved q-value, timing). All counts refer to one input file. Cross-file/shared facts (database, fragment index, configuration) live in SharedSearchStats instead.
| Class Members | ||
|---|---|---|
| double | achieved_psm_fdr = -1.0 | max retained q-value after FDR (<0 = n/a) |
| map< Int, Size > | charge_histogram | precursor charge -> PSM count |
| Size | decoy_psms = 0 | decoy PSMs in the final IDs (after FDR, if applied) |
| bool | fdr_applied = false | true if PSM-level FDR filtering ran |
| double | frag_err_mad = 0.0 | |
| double | frag_err_median = 0.0 | |
| double | frag_err_recommended = 0.0 | |
| bool | frag_tol_valid = false | true if fragment-error estimate present |
| double | hyperscore_max = 0.0 | |
| double | hyperscore_median = 0.0 | |
| double | hyperscore_min = 0.0 | |
| string | input_file | spectrum file this run searched (basename or path) |
| Size | matched_spectra = 0 | spectra with >=1 retained PSM in the final IDs (after FDR, if applied) |
| map< Size, Size > | missed_cleavage_histogram | missed cleavages -> PSM count |
| Size | ms2_spectra = 0 | number of MS2 spectra in the input |
| double | prec_err_mad = 0.0 | |
| double | prec_err_median = 0.0 | |
| double | prec_err_recommended = 0.0 | |
| bool | prec_tol_valid = false | true if precursor-error estimate present |
| bool | score_stats_valid = false | true if hyperscore_* below are meaningful |
| double | seconds_calibration = 0.0 | calibration pass wall time (0 if disabled) |
| double | seconds_fdr = 0.0 | FDR filtering wall time (0 if not applied) |
| double | seconds_search = 0.0 | scoring + post-processing wall time |
| Size | target_psms = 0 | target PSMs in the final IDs (after FDR, if applied) |
| Size | unique_peptides = 0 | distinct peptide sequences among top hits |
| Size | unique_proteins = 0 | distinct protein accessions among top hits |
| struct OpenMS::ProSEAlgorithm::SearchContext |
Prepared per-database state shared across multiple spectrum files.
Holds the (decoy-augmented) protein database and the built FragmentIndex so that searching N spectrum files against the same FASTA pays the index build cost only once. Construct via prepareContext() and pass to the context-taking search() overload.
| Class Members | ||
|---|---|---|
| vector< FASTAEntry > | db | |
| bool | decoy_is_prefix = true | Position of decoy_string (true = prefix, false = suffix). |
| string | decoy_string | Effective decoy marker carried by |
| FragmentIndex | fragment_index | |
| bool | have_decoys = false | True when |
| bool | release_fragment_index_after_scoring = false | When true, the context-taking search() overload will release |
| struct OpenMS::ProSEAlgorithm::SearchResult |
Comprehensive search result including modification analysis.
This structure contains all outputs from an open search including:
| Class Members | ||
|---|---|---|
| ExitCodes | exit_code = ExitCodes::EXECUTION_OK | |
| bool | is_open_search = false | |
| OpenSearchAnalysisResult | modification_analysis | |
| PeptideIdentificationList | peptide_ids | |
| vector< ProteinIdentification > | protein_ids | |
| RunStatistics | stats | |
| struct OpenMS::ProSEAlgorithm::SharedSearchStats |
Configuration, database and fragment-index facts shared across all input files of one ProSE invocation.
Computed once (the fragment index is built once and reused), so these costs/counts must NOT be summed per file. Populated by the multi-file searchWithModificationAnalysis() overloads.
| Class Members | ||
|---|---|---|
| bool | calibration_enabled = false | |
| bool | chunked = false | |
| string | database_file | FASTA path (empty for in-memory db) |
| Size | db_decoy_proteins = 0 | decoy entries in the searched (augmented) db |
| Size | db_target_proteins = 0 | target entries in the searched (augmented) db |
| string | decoy_mode | "generated", "external" or "none (target-only)" |
| string | enzyme | |
| vector< string > | fixed_mods | |
| double | fragment_tol = 0.0 | |
| string | fragment_tol_unit | |
| Size | indexed_fragments = 0 | theoretical fragments in the index (summed over chunks) |
| Size | indexed_peptides = 0 | peptides in the fragment index (summed over chunks) |
| vector< string > | ion_series | |
| Int | max_charge = 0 | |
| Int | min_charge = 0 | |
| Size | missed_cleavages = 0 | |
| bool | open_search = false | |
| double | precursor_tol_lower = 0.0 | |
| string | precursor_tol_unit | |
| double | precursor_tol_upper = 0.0 | |
| double | protein_fdr_threshold = 0.0 | |
| double | psm_fdr_threshold = 0.0 | |
| double | seconds_index_build = 0.0 | decoy generation + fragment index build wall time |
| double | seconds_total = 0.0 | whole-search wall time (set by the caller) |
| bool | snes_mode = false | |
| vector< string > | variable_mods | |
| enum class DecoyMode_ | strong, protected |
| enum class ExitCodes | strong |
| ProSEAlgorithm | ( | ) |
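| static bool accessionHasDecoyMarker_ (const std::string &accession, const std::string &marker, bool is_prefix) | static, protected |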
Helper: does accession carry the decoy marker at the given position? Empty marker → false. Pure std::string (no String dependency).
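A minimal sketch of such a prefix/suffix check, assuming an exact, case-sensitive comparison. The free function name `hasDecoyMarker` is hypothetical and is not the OpenMS implementation:

```cpp
#include <string>

// Hypothetical stand-in for the helper described above (not the OpenMS code).
// Returns true if `accession` starts with `marker` (is_prefix == true)
// or ends with `marker` (is_prefix == false).
bool hasDecoyMarker(const std::string& accession, const std::string& marker, bool is_prefix)
{
  // An empty marker never matches; a marker longer than the accession cannot match.
  if (marker.empty() || marker.size() > accession.size()) return false;
  const std::string::size_type pos = is_prefix ? 0 : accession.size() - marker.size();
  return accession.compare(pos, marker.size(), marker) == 0;
}
```

Only `std::string` is used, mirroring the "no String dependency" remark above.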
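| static void applyCompleteSetProteinFDR (std::vector< ProteinIdentification > &protein_ids, PeptideIdentificationList &peptide_ids, const std::string &decoy_string, bool decoy_is_prefix, double protein_fdr) | static |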
Finalize protein-level FDR on a COMPLETE protein set (a single input file, or a merged cross-file aggregate).
Runs protein inference (BasicProteinInferenceAlgorithm), picked-protein FDR (Savitski et al. 2015), threshold filtering at protein_fdr, then removes decoys and repairs indistinguishable-protein / protein-group / peptide-evidence references so the result stores as schema-valid, decoy-free idXML.
Shared by the file-based single-file search() above and the ProSE TOPP tool's single-file finalization, so the exact IDFilter sequence and ordering live in one place (they previously existed as two copies that could drift).
Precondition: protein_ids is non-empty and the set is statistically complete — picked-protein FDR does not compose across runs. The caller gates on "decoys present" and "FDR requested"; this helper additionally skips (with a warning) when no decoy proteins survive inference, to avoid target-only q-values.
| [in,out] | protein_ids | Protein identifications (operates on protein_ids[0]). |
| [in,out] | peptide_ids | PSMs feeding inference; decoy PSMs are removed. |
| [in] | decoy_string | Accession marker identifying decoy proteins. |
| [in] | decoy_is_prefix | Whether decoy_string is a prefix (true) or suffix (false). |
| [in] | protein_fdr | Picked-protein q-value threshold (expected > 0). |
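| std::vector< FASTAFile::FASTAEntry > buildCalibrationSample_ (const std::vector< FASTAFile::FASTAEntry > &full_db) const | protected |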
Build a strided protein sample for chunked calibration.
Sample size is tied to database_chunk_size_ so the calibration FI never exceeds the user's declared memory budget. Crucial for immunopeptidomics (non-specific digestion) where a fixed-size 5000-protein sample would generate tens of GB of fragment index. Strided rather than first-N so small chunk_size values don't starve the calibration pool (#9182).
Note: very small chunk_size may still produce a calibration pool below calibration_min_psms_; calibration then no-ops silently (follow-up issue: pool PSMs across first-K chunks of the main search loop).
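The difference between a strided and a first-N sample can be sketched as follows. This is an illustration only: the function name `stridedSample` and the index formula `k * size / sample_size` are assumptions, not the stride rule used by ProSE.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: pick `sample_size` entries spread evenly over `db`
// instead of taking the first `sample_size` entries.
template <typename Entry>
std::vector<Entry> stridedSample(const std::vector<Entry>& db, std::size_t sample_size)
{
  if (db.empty() || sample_size == 0) return {};
  if (sample_size >= db.size()) return db; // sample would cover everything anyway

  std::vector<Entry> sample;
  sample.reserve(sample_size);
  for (std::size_t k = 0; k < sample_size; ++k)
  {
    // Evenly spaced index in [0, db.size()); assumed formula for illustration.
    sample.push_back(db[k * db.size() / sample_size]);
  }
  return sample;
}
```

Because the picks span the whole database, a small sample is less likely to be dominated by whatever proteins happen to come first in the FASTA file.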
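| std::vector< FASTAFile::FASTAEntry > buildDecoyAugmentedDB_ (const std::vector< FASTAFile::FASTAEntry > &fasta_db, const DecoyStrategy_ &strategy) const | protected |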
Build the searched database according to strategy.
Optionally strips pre-existing decoys and/or generates fresh decoys by reversing the target proteins. Produces FASTA entries only — does not construct a FragmentIndex.
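Decoy generation by sequence reversal can be sketched as below. The `Entry` struct is a hypothetical stand-in for a FASTA entry, and appending all decoys after the targets is an assumption made for the example; the stripping step is omitted.

```cpp
#include <string>
#include <vector>

// Hypothetical minimal FASTA entry (not FASTAFile::FASTAEntry).
struct Entry
{
  std::string identifier;
  std::string sequence;
};

// Returns the targets followed by one reversed-sequence decoy per target.
// Each decoy identifier is the target identifier with `decoy_prefix` prepended.
std::vector<Entry> appendReversedDecoys(const std::vector<Entry>& targets, const std::string& decoy_prefix)
{
  std::vector<Entry> db(targets);
  db.reserve(targets.size() * 2);
  for (const Entry& t : targets)
  {
    db.push_back({decoy_prefix + t.identifier,
                  std::string(t.sequence.rbegin(), t.sequence.rend())});
  }
  return db;
}
```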
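| static void capturePreFdrStats_ (const PeptideIdentificationList &peptide_ids, RunStatistics &stats) | static, protected |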
Helper: capture statistics that must be read BEFORE FDR filtering, namely target/decoy PSM counts (via the "target_decoy" meta set by PeptideIndexing; "target+decoy" counts as target, matching OpenMS FDR semantics) and the HyperScore distribution (FDR overwrites each hit's score with its q-value). Fills stats.target_psms, stats.decoy_psms, stats.hyperscore_* and stats.score_stats_valid.
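A sketch of the counting rule, assuming one top hit per spectrum represented by a hypothetical `TopHit` struct (the real helper reads PeptideIdentification objects and their "target_decoy" meta value). The field names in `PreFdrStats` are illustrative, not those of RunStatistics.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flattened view of one top-scoring PSM.
struct TopHit
{
  std::string target_decoy; // "target", "decoy" or "target+decoy"
  double score;             // search score before FDR overwrites it
};

struct PreFdrStats
{
  std::size_t target_psms = 0;
  std::size_t decoy_psms = 0;
  double score_min = 0.0, score_median = 0.0, score_max = 0.0;
  bool score_stats_valid = false;
};

PreFdrStats capturePreFdr(const std::vector<TopHit>& hits)
{
  PreFdrStats s;
  std::vector<double> scores;
  for (const TopHit& h : hits)
  {
    if (h.target_decoy == "decoy") ++s.decoy_psms;
    else if (h.target_decoy == "target" || h.target_decoy == "target+decoy") ++s.target_psms; // shared peptides count as target
    scores.push_back(h.score);
  }
  if (!scores.empty())
  {
    std::sort(scores.begin(), scores.end());
    const std::size_t n = scores.size();
    s.score_min = scores.front();
    s.score_max = scores.back();
    s.score_median = (n % 2 == 1) ? scores[n / 2] : 0.5 * (scores[n / 2 - 1] + scores[n / 2]);
    s.score_stats_valid = true;
  }
  return s;
}
```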
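| void collectRunStatistics_ (const PeakMap &spectra, const std::vector< ProteinIdentification > &protein_ids, const PeptideIdentificationList &peptide_ids, RunStatistics &stats) const | protected |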
Helper: fill a RunStatistics with identification counts, distributions and per-run tolerance estimation for one searched file (post-FDR). Silent; rendering is done separately (ProSE TOPP tool or renderRunSummary()). Fields filled at other points (target/decoy counts, achieved FDR, timing) are left untouched.
| double computeModMatchTolerance_ () const | inline, protected |
Scalar tolerance passed to OpenSearchModificationAnalysis under asymmetric bounds. Uses the tighter of the two positive magnitudes — semantically correct for UniMod Δmass matching precision. OpenSearchModificationAnalysis internally clamps this at MAX_MOD_MAPPING_TOL_ = 0.02 Da; see spec §7 for rationale.
Zero on one side is a legal one-sided window (e.g., [0, 500] Da = "search only positive mass shifts"). In that case std::min() would collapse to 0, passing a useless zero tolerance into the mod analyzer — masked in ppm mode by the internal clamp, but genuinely broken in Da mode. Fall back to the non-zero side so the mod-matching precision reflects the configured tolerance.
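The selection rule described above (tighter side, with a fallback when one side is zero) can be sketched as a free function. The name `modMatchTolerance` and its two-argument form are assumptions; the real member reads the tolerance attributes of the class, and the 0.02 Da clamp inside OpenSearchModificationAnalysis is not reproduced here.

```cpp
#include <algorithm>

// Hypothetical sketch: both arguments are positive magnitudes of the
// lower and upper precursor tolerance bound.
double modMatchTolerance(double lower_magnitude, double upper_magnitude)
{
  // A zero side denotes a one-sided window; use the other side so the
  // result does not collapse to a useless zero tolerance.
  if (lower_magnitude <= 0.0) return upper_magnitude;
  if (upper_magnitude <= 0.0) return lower_magnitude;
  return std::min(lower_magnitude, upper_magnitude);
}
```

For the one-sided window [0, 500] Da mentioned above, this yields 500 rather than 0.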
| Param fragmentIndexParameters_ () const | protected |
ProSE parameters made safe to hand to a FragmentIndex.
FragmentIndex declares its own (unused) boolean "decoys" flag with restrictions {true,false}. ProSE's "decoys" parameter is the richer auto/generate/ignore enum and manages decoys at the database level, so forwarding it verbatim would trip FragmentIndex's validation. This returns getParameters() with "decoys" overridden to "false".
| bool isOpenSearchMode_ () const | inline, private |
Helper function to determine if open search should be used based on tolerance.
| static double maxRetainedScore_ (const PeptideIdentificationList &peptide_ids) | static, protected |
Helper: maximum top-hit score among retained PSMs. After PSM-FDR the score is the q-value, so this is the achieved FDR of the filtered set (-1.0 if empty).
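As an illustration, with each spectrum's retained hits reduced to a vector of scores (best hit first), the computation might look like this. The function name and the nested-vector input are assumptions standing in for PeptideIdentificationList.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: psm_scores[i] holds the scores of the hits retained
// for spectrum i, best hit first. After PSM-FDR these scores are q-values.
double maxTopHitScore(const std::vector<std::vector<double>>& psm_scores)
{
  bool any = false;
  double best = -1.0;
  for (const std::vector<double>& hits : psm_scores)
  {
    if (hits.empty()) continue; // spectrum without a retained hit
    best = any ? std::max(best, hits.front()) : hits.front();
    any = true;
  }
  return any ? best : -1.0; // -1.0 signals "no retained PSMs"
}
```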
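| void postProcessHits_ (const PeakMap &exp, std::vector< std::vector< ProSEAlgorithm::AnnotatedHit_ > > &annotated_hits, std::vector< ProteinIdentification > &protein_ids, PeptideIdentificationList &peptide_ids, Size top_hits, const StringList &modifications_fixed, const StringList &modifications_variable, Int peptide_missed_cleavages, double precursor_mass_tolerance, double fragment_mass_tolerance, const std::string &precursor_mass_tolerance_unit_ppm, const std::string &fragment_mass_tolerance_unit_ppm, const Int precursor_min_charge, const Int precursor_max_charge, const std::string &enzyme, const std::string &database_name) const | protected |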
Filter and annotate search results.
Trims per-spectrum candidate hits to the top N and converts them into PeptideIdentification objects, adding requested PSM annotations and populating protein-level search metadata.
| [in] | exp | Input MS experiment providing spectra/metadata for annotation. |
| [in,out] | annotated_hits | Per-spectrum candidate hits (trimmed to top_hits in-place). |
| [out] | protein_ids | Output container for protein-level identification and search metadata. |
| [out] | peptide_ids | Output container for spectrum-level peptide identifications (PSMs). |
| [in] | top_hits | Number of top-scoring hits to retain per spectrum (report_top_hits_). |
| [in] | modifications_fixed | Fixed modifications (by name) used during the search. |
| [in] | modifications_variable | Variable modifications (by name) used during the search. |
| [in] | peptide_missed_cleavages | Allowed missed cleavages in digestion. |
| [in] | precursor_mass_tolerance | Precursor mass tolerance value. |
| [in] | fragment_mass_tolerance | Fragment mass tolerance value. |
| [in] | precursor_mass_tolerance_unit_ppm | Precursor tolerance unit ("true"->ppm, "false"->Da). |
| [in] | fragment_mass_tolerance_unit_ppm | Fragment tolerance unit ("true"->ppm, "false"->Da). |
| [in] | precursor_min_charge | Minimum precursor charge considered. |
| [in] | precursor_max_charge | Maximum precursor charge considered. |
| [in] | enzyme | Digestion enzyme name. |
| [in] | database_name | Database file name used for the search (stored in protein_ids). |
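The in-place trimming of per-spectrum candidates to the top N can be sketched as follows. `Hit` is a hypothetical, much reduced stand-in for AnnotatedHit_, and "higher score is better" is an assumption of the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical reduced hit record (not ProSEAlgorithm::AnnotatedHit_).
struct Hit
{
  std::size_t peptide_index;
  double score;
};

// Keep only the `top_hits` best-scoring hits per spectrum, sorted best first.
void keepTopHits(std::vector<std::vector<Hit>>& hits_per_spectrum, std::size_t top_hits)
{
  for (std::vector<Hit>& hits : hits_per_spectrum)
  {
    const std::size_t n = std::min(top_hits, hits.size());
    // Only the first n positions need to be ordered.
    std::partial_sort(hits.begin(), hits.begin() + static_cast<std::ptrdiff_t>(n), hits.end(),
                      [](const Hit& a, const Hit& b) { return a.score > b.score; });
    hits.resize(n);
  }
}
```

`std::partial_sort` avoids fully sorting long candidate lists when only a few hits per spectrum are reported.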
| SearchContext prepareContext | ( | const std::vector< FASTAFile::FASTAEntry > & | fasta_db | ) | const |
Build a SearchContext (decoy-augmented database + FragmentIndex) for reuse.
Performs the database preparation and FragmentIndex construction steps so that subsequent calls to search(spectra, ctx, ...) can reuse the same index across many spectrum files. If decoy generation is enabled (parameter "decoys"), decoys are generated and shuffled into the returned context's db member exactly once here.
| [in] | fasta_db | Protein sequence database as FASTA entries. |
Thread-safety: the returned context's FragmentIndex is read-only during subsequent search() calls; concurrent search() calls reading the same SearchContext are safe (per FragmentIndex query thread-safety contract). Do not call prepareContext() concurrently on the same algorithm instance.
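| static void preprocessSpectra_ (PeakMap &exp, double fragment_mass_tolerance, bool fragment_mass_tolerance_unit_ppm, bool deisotope_requested, Size peaks_keep_n, Int peaks_window_top) | static, protected |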
filter, deisotope, decharge spectra
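| static void renderModificationSummary (const OpenSearchModificationAnalysis::OpenSearchAnalysisResult &mod_analysis, std::ostream &os) | static |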
Render the modification-discovery section (top PTMs, unknown delta masses) to os. Shared by the per-file and aggregate report blocks.
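| static void renderRunSummary (const RunStatistics &stats, const SharedSearchStats &shared, const OpenSearchModificationAnalysis::OpenSearchAnalysisResult &mod_analysis, bool is_open_search, std::ostream &os) | static |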
Render a human-readable single-run summary block (configuration recap, database/index stats, identification results, tolerance estimate, timing and — for open searches — modification discovery) to os. shared carries the configuration/db/index context (built once).
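| static std::string renderRunSummaryYaml (const MultiFileSearchResult &mfres, const std::vector< std::pair< std::string, std::vector< std::string > > > &manifest, Size files_failed, Size files_total) | static |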
Serialize the complete end-of-search report to a machine-readable YAML string: shared configuration/database/index facts, per-file identification statistics, the output-file manifest (label -> written paths) and failed/total file counts. Hand-rolled (no YAML library dependency); every string scalar is double-quoted so values containing ':' (e.g. Windows paths) or other YAML metacharacters round-trip safely, and non-finite numbers are emitted as null.
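The two scalar rules mentioned above (double-quoted strings, `null` for non-finite numbers) can be sketched with two small helpers. The function names and the exact set of escaped characters are assumptions for illustration, not the escaping performed by renderRunSummaryYaml().

```cpp
#include <cmath>
#include <sstream>
#include <string>

// Hypothetical helper: emit a YAML double-quoted scalar.
// Backslash, double quote, newline and tab are escaped (assumed set).
std::string yamlQuoted(const std::string& value)
{
  std::string out = "\"";
  for (char c : value)
  {
    switch (c)
    {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\t': out += "\\t";  break;
      default:   out += c;      break;
    }
  }
  out += '"';
  return out;
}

// Hypothetical helper: NaN and +/-infinity have no portable plain YAML
// number form, so they are written as null.
std::string yamlNumber(double value)
{
  if (!std::isfinite(value)) return "null";
  std::ostringstream os;
  os << value;
  return os.str();
}
```

Quoting every string means a value such as a Windows path with ':' and backslashes is never misread as a mapping key or an escape sequence.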
| DecoyStrategy_ resolveDecoyStrategy_ (const std::vector< FASTAFile::FASTAEntry > &db) const | protected |
Decide how to obtain/recognise decoys for db.
Detects pre-existing decoys via the common-marker heuristic (OpenMS::DecoyHelper), falling back to the configured decoy_prefix for custom markers, then maps the "decoys" mode (auto/generate/ignore) onto a concrete DecoyStrategy_.
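| CalibrationResult_ runCalibrationPass_ (PeakMap &spectra, FragmentIndex &fragment_index, const std::vector< FASTAFile::FASTAEntry > &db) const | protected |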
Run a fast calibration pass on a subset of spectra to estimate mass accuracy.
Scores a TIC-ranked subset of spectra against the fragment index, collects precursor and fragment mass errors from high-confidence PSMs, and returns calibrated tolerances using median + 3*MAD estimation.
| [in] | spectra | Preprocessed MS/MS spectra (subset is selected internally by TIC). |
| [in,out] | fragment_index | Pre-built fragment index for candidate lookup. |
| [in] | db | Protein database (for sequence reconstruction of candidates). |
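The "median + 3*MAD" estimate can be sketched as below, assuming the input is a non-empty list of absolute mass errors (for example in ppm) from confident PSMs and that the MAD is used unscaled. The function names are hypothetical; how ProSE selects PSMs and signs the errors is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty list (takes a copy because it sorts).
double medianOf(std::vector<double> values)
{
  assert(!values.empty());
  std::sort(values.begin(), values.end());
  const std::size_t n = values.size();
  return (n % 2 == 1) ? values[n / 2] : 0.5 * (values[n / 2 - 1] + values[n / 2]);
}

// Hypothetical sketch: tolerance = median + 3 * MAD, where MAD is the
// median absolute deviation from the median (no normal-consistency scaling).
double toleranceFromErrors(const std::vector<double>& abs_errors)
{
  const double med = medianOf(abs_errors);
  std::vector<double> deviations;
  deviations.reserve(abs_errors.size());
  for (double e : abs_errors) deviations.push_back(std::fabs(e - med));
  return med + 3.0 * medianOf(deviations);
}
```

For the errors 1, 2, 3, 4, 5 the median is 3 and the MAD is 1, giving a tolerance of 6. Median and MAD are used instead of mean and standard deviation because a few wrong PSMs with large errors barely move them.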
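| void scoreSpectraAgainstIndex_ (const PeakMap &spectra, FragmentIndex &fi, const std::vector< FASTAFile::FASTAEntry > &db, const TheoreticalSpectrumGenerator &spectrum_generator, double effective_fragment_tol, bool fragment_mass_tolerance_unit_ppm, bool open_search_mode, std::vector< std::vector< AnnotatedHit_ > > &annotated_hits, const std::string &progress_label) const | protected |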
Score all spectra against one FragmentIndex.
Shared by the non-chunked and chunked search paths. Appends per-scan AnnotatedHit_ entries to annotated_hits; does not prune or sort. Expects a pre-built FragmentIndex whose parameters already reflect any calibrated tolerances the caller wants to apply.
| ExitCodes search | ( | const std::string & | in_spectra, |
| const std::string & | in_db, | ||
| std::vector< ProteinIdentification > & | prot_ids, | ||
| PeptideIdentificationList & | pep_ids | ||
| ) | const |
Search spectra in a spectrum file (mzML or Bruker .d) against a protein database using an FI-backed workflow.
Populates protein and peptide identifications, including search meta data, PSM hits, and search engine annotations. Parameters are taken from this instance (DefaultParamHandler).
| [in] | in_spectra | Input path to the spectra file (mzML or Bruker .d) containing MS/MS spectra to search. |
| [in] | in_db | Input path to the protein sequence database in FASTA format. |
| [out] | prot_ids | Output container receiving search meta data and protein-level information. |
| [out] | pep_ids | Output container receiving spectrum-level peptide identifications (PSMs). |
Side effects:
Errors:
| ExitCodes search | ( | PeakMap & | spectra, |
| const std::vector< FASTAFile::FASTAEntry > & | fasta_db, | ||
| std::vector< ProteinIdentification > & | prot_ids, | ||
| PeptideIdentificationList & | pep_ids | ||
| ) | const |
In-memory search: search spectra against a protein database without file I/O.
Same as the file-based search() but takes pre-loaded spectra and FASTA entries directly. Spectra are preprocessed in-place (filtered, deisotoped, normalized).
| [in,out] | spectra | MS/MS spectra to search (preprocessed in-place). |
| [in] | fasta_db | Protein sequence database as FASTA entries. |
| [out] | prot_ids | Output protein-level identifications. |
| [out] | pep_ids | Output spectrum-level peptide identifications (PSMs). |
Internally this is a thin wrapper around prepareContext() + the context-taking search() overload, so the FragmentIndex is rebuilt on every call. For repeated searches against the same database, prefer calling prepareContext() once and reusing the resulting SearchContext.
| ExitCodes search | ( | PeakMap & | spectra, |
| SearchContext & | ctx, | ||
| std::vector< ProteinIdentification > & | prot_ids, | ||
| PeptideIdentificationList & | pep_ids | ||
| ) | const |
In-memory search using a pre-built SearchContext.
Searches spectra against the database and FragmentIndex held in ctx. The fragment index build cost (decoy generation, peptide/fragment generation, sorting, bucketing) is paid by prepareContext() and is not repeated here, making this overload the right choice when searching many spectrum files against the same database.
| [in,out] | spectra | MS/MS spectra to search (preprocessed in-place). |
| [in,out] | ctx | Pre-built SearchContext from prepareContext(). Taken by non-const reference because the underlying FragmentIndex query API is non-const, even though the index content is not modified during the search; the db member is also handed non-const to the downstream PeptideIndexing step (which requires a non-const reference). |
| [out] | prot_ids | Output protein-level identifications. |
| [out] | pep_ids | Output spectrum-level peptide identifications (PSMs). |
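The build-once, search-many pattern described for this overload can be sketched with stand-in types. Everything below is hypothetical: `ToyContext`, `prepareToyContext` and `searchToy` are not OpenMS names, and a sorted vector of masses merely stands in for the FragmentIndex. The sketch only illustrates why paying the build cost once is worthwhile when many spectrum sets are searched against the same database.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for SearchContext: holds the prepared "index".
struct ToyContext
{
  std::vector<double> sorted_masses;  // stands in for the fragment index
};

// Plays the role of prepareContext(): the one-time cost (here, a sort).
ToyContext prepareToyContext(std::vector<double> masses)
{
  std::sort(masses.begin(), masses.end());
  return ToyContext{std::move(masses)};
}

// Plays the role of the context-taking search(): no rebuild, only lookups.
// Counts query masses that have an index entry within +/- tol.
std::size_t searchToy(const std::vector<double>& queries, const ToyContext& ctx, double tol)
{
  std::size_t hits = 0;
  for (double q : queries)
  {
    auto it = std::lower_bound(ctx.sorted_masses.begin(), ctx.sorted_masses.end(), q - tol);
    if (it != ctx.sorted_masses.end() && *it <= q + tol) ++hits;
  }
  return hits;
}
```

A caller would invoke `prepareToyContext` once and then call `searchToy` once per spectrum set. Unlike the real overload, the toy can take its context by const reference because its lookup is const.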
|
protected |
Chunked database search implementation.
Splits full_db into chunks of database_chunk_size_ proteins, builds a FragmentIndex per chunk, scores all spectra against each chunk, and accumulates hits before a single post-processing pass. full_db is the already-resolved database (decoys generated/stripped per strategy); strategy carries the effective decoy marker for PeptideIndexing and FDR. full_db is taken by non-const reference because PeptideIndexing::run() mutates it.
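The splitting step can be illustrated with a small helper that computes chunk boundaries. This is a minimal sketch under stated assumptions, not the ProSE implementation: `chunkRanges` is a hypothetical name, and treating a chunk size of 0 as "one chunk covering everything" mirrors the documented "0 = disabled" convention only loosely, since the real code takes the non-chunked path in that case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index ranges [begin, end) over a database of n_proteins entries.
// chunk_size == 0 is treated as "chunking disabled": one range covers everything.
// Illustrative helper only; not part of the OpenMS API.
std::vector<std::pair<std::size_t, std::size_t>>
chunkRanges(std::size_t n_proteins, std::size_t chunk_size)
{
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  if (n_proteins == 0) return ranges;
  if (chunk_size == 0) chunk_size = n_proteins;

  for (std::size_t begin = 0; begin < n_proteins;)
  {
    const std::size_t end = begin + std::min(chunk_size, n_proteins - begin);
    ranges.emplace_back(begin, end);
    begin = end;
  }
  // The last chunk must end exactly at the database size.
  assert(ranges.back().second == n_proteins);
  return ranges;
}
```

Each range would then get its own index build and scoring pass, with hits from all ranges pooled before post-processing.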
| SearchResult searchWithModificationAnalysis | ( | const std::string & | in_spectra, |
| const std::string & | in_db, | ||
| const std::string & | output_base_name = "" | ||
| ) | const |
Search with comprehensive results including modification analysis tables.
This method performs a peptide database search and additionally returns structured modification analysis results for open search mode. This is the recommended method for modification discovery workflows.
The method automatically:
| [in] | in_spectra | Input path to the spectra file (mzML or Bruker .d) containing MS/MS spectra. |
| [in] | in_db | Input path to the protein sequence database in FASTA format. |
| [in] | output_base_name | Optional base name for output files (TSV tables). |
Example usage:
| MultiFileSearchResult searchWithModificationAnalysis | ( | const std::vector< std::string > & | in_spectra_files, |
| const std::string & | in_db, | ||
| const std::vector< std::string > & | output_base_names = {}, | ||
| const std::string & | aggregate_base_name = "" | ||
| ) | const |
Multi-file search with modification analysis (FASTA file path).
Convenience overload that loads the FASTA database from in_db and delegates to the in-memory multi-file overload. The database file path is recorded in each per-file ProteinIdentification's SearchParameters (and on the aggregate result).
| MultiFileSearchResult searchWithModificationAnalysis | ( | const std::vector< std::string > & | in_spectra_files, |
| const std::vector< FASTAFile::FASTAEntry > & | fasta_db, | ||
| const std::vector< std::string > & | output_base_names = {}, | ||
| const std::string & | aggregate_base_name = "" | ||
| ) | const |
Multi-file search with modification analysis (in-memory FASTA).
Builds a single SearchContext (decoy generation + FragmentIndex) from fasta_db and reuses it across all input spectrum files. Each input file produces its own SearchResult including a per-file modification analysis (TSV written if a non-empty per-file base name is provided). An additional aggregate SearchResult is computed by pooling all per-file peptide identifications and running modification analysis once on the pooled set.
| [in] | in_spectra_files | Spectrum file paths (mzML or Bruker .d). |
| [in] | fasta_db | Protein sequence database as FASTA entries. |
| [in] | output_base_names | Optional per-file base names for modification-analysis TSV outputs. Must be empty or have the same length as in_spectra_files. Empty entries skip TSV writing for that file. |
| [in] | aggregate_base_name | Optional base name for the aggregate modification-analysis TSV output. Empty disables aggregate TSV writing (the aggregate analysis is still computed). |
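The length contract on output_base_names in the table above can be sketched as a small validation helper. The names here are hypothetical, and `std::invalid_argument` is an assumption: this page does not show which exception type ProSE actually raises.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Returns one base name per input file; an empty name means "skip TSV writing".
// Throws if the list is non-empty but its length differs from the input list.
// Illustrative only: the exception type is an assumption, not the OpenMS one.
std::vector<std::string> resolveBaseNames(const std::vector<std::string>& in_spectra_files,
                                          const std::vector<std::string>& output_base_names)
{
  if (!output_base_names.empty() && output_base_names.size() != in_spectra_files.size())
  {
    throw std::invalid_argument("output_base_names must be empty or match in_spectra_files in length");
  }
  // An empty list means no per-file TSV output: expand to one empty name per file.
  std::vector<std::string> resolved = output_base_names.empty()
    ? std::vector<std::string>(in_spectra_files.size())
    : output_base_names;
  assert(resolved.size() == in_spectra_files.size());
  return resolved;
}
```

Normalizing to one entry per file up front lets the per-file loop test a single condition (empty name or not) instead of re-checking the list size on every iteration.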
Errors:
output_base_names is non-empty and its size differs from in_spectra_files.
| SearchResult searchWithModificationAnalysis | ( | PeakMap & | spectra, |
| const std::vector< FASTAFile::FASTAEntry > & | fasta_db, | ||
| const std::string & | output_base_name = "" | ||
| ) | const |
In-memory search with modification analysis: no file I/O required.
Same as the file-based searchWithModificationAnalysis() but takes pre-loaded data.
| [in,out] | spectra | MS/MS spectra (preprocessed in-place). |
| [in] | fasta_db | Protein sequence database as FASTA entries. |
| [in] | output_base_name | Optional base name for TSV output files. |
|
static |
Recomputes result-level statistics from a final, post-processed PSM list: matched count, unique peptide and protein counts, charge and missed-cleavage histograms, target/decoy counts, and achieved PSM FDR. Use this when the identifications are mutated after search() returns (for example, by the ProSE TOPP tool's Percolator rescoring plus deferred FDR), so that the report reflects what the user actually receives. It does not touch ms2_spectra, the HyperScore distribution (captured pre-FDR during the search), or any timing field. fdr_applied records whether a PSM-level FDR filter was applied; when it is true, achieved_psm_fdr is set to the maximum retained q-value.
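A reduced version of this recomputation, covering only the target/decoy counts and the achieved FDR, might look as follows. `ToyPsm`, `ToyStats` and `recomputeToyStats` are hypothetical stand-ins, and the -1.0 value used when no FDR filter was applied is an assumption made for this sketch; only the field names fdr_applied and achieved_psm_fdr come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal PSM record: decoy flag plus q-value.
struct ToyPsm
{
  bool is_decoy;
  double q_value;
};

// Hypothetical subset of the result-level statistics.
struct ToyStats
{
  std::size_t targets = 0;
  std::size_t decoys = 0;
  bool fdr_applied = false;
  double achieved_psm_fdr = -1.0;  // assumed placeholder: "not available"
};

// Recompute counts from the final PSM list. When an FDR filter was applied,
// the achieved FDR is the largest q-value that survived the filter.
ToyStats recomputeToyStats(const std::vector<ToyPsm>& final_psms, bool fdr_applied)
{
  ToyStats s;
  s.fdr_applied = fdr_applied;
  for (const ToyPsm& p : final_psms)
  {
    if (p.is_decoy) ++s.decoys; else ++s.targets;
    if (fdr_applied) s.achieved_psm_fdr = std::max(s.achieved_psm_fdr, p.q_value);
  }
  assert(s.targets + s.decoys == final_psms.size());
  return s;
}
```

Because the function reads only the list it is given, it can be called after any external rescoring step without re-running the search.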
|
override protected virtual |
This method is used to update extra member variables at the end of the setParameters() method.
Also call it at the end of the derived classes' copy constructor and assignment operator.
The default implementation is empty.
Reimplemented from DefaultParamHandler.
|
protected |
0 = disabled; >0 = chunk DB into groups of this many proteins
|
protected |
Resolved MS2 deisotoping request (param fragment:deisotope != "false"). preprocessSpectra_ still gates on Deisotoper::isToleranceSupported() so that the deisotoper is never called with an out-of-range tolerance (it would throw, which terminates the program inside the OpenMP region). See OpenMS#9619.
|
mutable protected |
Most recent calibration result (valid after any search that invoked runCalibrationPass_). Stored for test observability and diagnostics. Marked mutable because it is pure diagnostic/telemetry state that doesn't affect the logical const-ness of search().
|
mutable protected |
Scalar tolerance passed to OpenSearchModificationAnalysis on the most recent search() call. Stored for test observability: the calibration writeback restores the tolerance members on exit (to avoid per-file state leaks in the multi-file wrapper), so tests that want to verify that the mod analyzer received the calibrated value rather than the user-configured one cannot simply read the members after the search; they need to see what was actually passed to the analyzer. Default -1.0 (sentinel: no search has run yet).
|
mutable protected |
Per-run statistics of the most recent search(spectra, ctx, ...) call. Bridges the const search() (which returns ExitCodes, not a SearchResult) to its callers, which copy this into SearchResult::stats. Reset at the start of each search() call. mutable for the same reason as above: pure diagnostic state, orthogonal to logical const-ness.
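The use of mutable for diagnostic state can be shown in isolation. The class below is a hypothetical toy, not ProSEAlgorithm: it only demonstrates how a const member function can reset and fill a telemetry member that callers read afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy class (not an OpenMS type) with a logically const search().
class ToySearcher
{
public:
  struct Stats
  {
    std::size_t n_spectra = 0;
  };

  // Const: the outcome depends only on the input; last_stats_ is telemetry.
  bool search(const std::vector<double>& spectra) const
  {
    last_stats_ = Stats{};                   // reset at the start of each call
    last_stats_.n_spectra = spectra.size();  // record what this run saw
    return !spectra.empty();
  }

  // Callers copy the statistics out after search() returns.
  const Stats& lastStats() const { return last_stats_; }

private:
  mutable Stats last_stats_;  // diagnostic only; writable from const methods
};
```

One consequence of this design is that concurrent calls to a const method on the same object are no longer automatically safe, because each call writes to the shared mutable member.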
|
protected |
NLargest cap on MS2 peaks before scoring; 0 = resolution-aware auto (peaks:keep_n)
|
protected |
WindowMower peaks-per-100Da before scoring (peaks:window_top)
|
mutable protected |
positive magnitude (default matches the param)