OpenMS
Export PSM (Peptide Spectrum Match) data to Apache Arrow format following the QPX PSM schema.
#include <OpenMS/FORMAT/QPXFile.h>
Static Public Member Functions
| static std::shared_ptr< arrow::Table > | exportToArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false) |
| Export PSMs to Arrow table using PSMSchema for lossless round-trips. | |
| static std::shared_ptr< arrow::Table > | exportPSMsToQPXArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false) |
| Export PSMs to QPX Parquet eXchange format Arrow table (QPXPSMSchema). | |
| static bool | exportToParquet (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, const std::string &filename, bool export_all_psms=false, const ParquetWriteConfig &config=ParquetWriteConfig{}) |
| Export PSM data to Parquet file. | |
| static bool | exportToParquet (const std::shared_ptr< arrow::Table > &table, const std::string &filename, const ParquetWriteConfig &config=ParquetWriteConfig{}) |
| Write a pre-built QPX PSM Arrow table to a Parquet file. | |
| static bool | exportToParquetStreaming (const std::vector< ProteinIdentification > &protein_identifications, const std::vector< const PeptideIdentification * > &peptide_identification_ptrs, const std::string &filename, bool export_all_psms=false, size_t batch_size=1000000, const ParquetWriteConfig &config=ParquetWriteConfig{}, int n_threads=1) |
| Stream PSMs to a QPX Parquet file in row-batches to cap peak memory. | |
| static bool | importFromArrow (const std::shared_ptr< arrow::Table > &table, std::vector< ProteinIdentification > &protein_identifications, PeptideIdentificationList &peptide_identifications) |
| Import PSMs from a PSMSchema Arrow table. | |
This class provides static methods to export PeptideIdentification/ProteinIdentification data to Apache Arrow tables and Parquet files, and to import PSMs back from an Arrow table. Two schemas are involved: PSMSchema, used for lossless round-trips, and QPXPSMSchema, which follows the QPX (Quantitative Proteomics Exchange) PSM format.
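static std::shared_ptr< arrow::Table > exportPSMsToQPXArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false)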
Export PSMs to QPX Parquet eXchange format Arrow table (QPXPSMSchema).
Unlike exportToArrow(), which produces a PSMSchema table for lossless round-trips, this method produces a QPXPSMSchema table optimized for cross-tool exchange (quantms format).
| protein_identifications | Protein identifications (for file name lookup) |
| peptide_identifications | Peptide identifications to export |
| export_all_psms | If true, export all PSM hits; if false, only best hit per spectrum |
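The effect of export_all_psms on the number of exported rows can be illustrated with a minimal sketch. The `Hit` and `Spectrum` structs and the `countExportedRows` helper are hypothetical stand-ins, not OpenMS types, and the sketch assumes a spectrum without hits contributes no row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the OpenMS classes.
struct Hit { double score; };
struct Spectrum { std::vector<Hit> hits; };

// Count the rows an export would emit under the documented flag semantics:
// every hit when export_all_psms is true, otherwise one (best) hit per spectrum.
// Assumption: a spectrum with no hits contributes no row.
std::size_t countExportedRows(const std::vector<Spectrum>& spectra, bool export_all_psms)
{
  std::size_t rows = 0;
  for (const Spectrum& s : spectra)
  {
    if (s.hits.empty()) continue;                 // nothing to export for this spectrum
    rows += export_all_psms ? s.hits.size() : 1;  // all hits, or only the best one
  }
  return rows;
}
```

With export_all_psms enabled, the row count can exceed the number of peptide identifications, because one identification may carry several hits.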
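static std::shared_ptr< arrow::Table > exportToArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false)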
Export PSMs to Arrow table using PSMSchema for lossless round-trips.
Produces a table with PSMSchema columns (score, score_type, rank, etc.) suitable for FeatureMapArrowIO and ConsensusMapArrowIO round-trips. For QPX exchange format output, use exportPSMsToQPXArrow() instead.
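static bool exportToParquet (const std::shared_ptr< arrow::Table > &table, const std::string &filename, const ParquetWriteConfig &config=ParquetWriteConfig{})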
Write a pre-built QPX PSM Arrow table to a Parquet file.
The table is expected to follow QPXPSMSchema (e.g., as returned by exportPSMsToQPXArrow()). QPX file metadata (qpx_version, file_type="psm", UUID, creation_date) is attached before writing. Use this overload when the caller has already built the table (e.g., for merged output), so that it is not rebuilt.
| [in] | table | QPX PSM Arrow table (must not be null) |
| [in] | filename | Output file path |
| [in] | config | Parquet writing options |
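static bool exportToParquet (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, const std::string &filename, bool export_all_psms=false, const ParquetWriteConfig &config=ParquetWriteConfig{})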
Export PSM data to Parquet file.
| [in] | protein_identifications | Vector of protein identifications |
| [in] | peptide_identifications | List of peptide identifications |
| [in] | filename | Output file path |
| [in] | export_all_psms | If true, export all hits per spectrum (default: false, only best hit) |
| [in] | config | Parquet writing options |
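static bool exportToParquetStreaming (const std::vector< ProteinIdentification > &protein_identifications, const std::vector< const PeptideIdentification * > &peptide_identification_ptrs, const std::string &filename, bool export_all_psms=false, size_t batch_size=1000000, const ParquetWriteConfig &config=ParquetWriteConfig{}, int n_threads=1)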
Stream PSMs to a QPX Parquet file in row-batches to cap peak memory.
Builds and flushes the QPXPSMSchema table in batches through a persistent Parquet writer instead of materializing the entire table in memory. Intended for very large inputs (e.g. millions of PSMs), where the one-shot exportToParquet would spike memory by holding all columns for all rows at once. Output is equivalent to exportToParquet (same schema and QPX metadata), but written as multiple Parquet row groups.
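The batching idea can be sketched without Arrow or Parquet. In the sketch below, `RecordingSink` is a hypothetical stand-in for the persistent writer and `streamInBatches` is an illustrative helper, not part of OpenMS. For simplicity it assumes one input element yields exactly one row, which the real export does not guarantee when export_all_psms is set.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a persistent Parquet writer: it only records
// how many rows each flush delivered.
struct RecordingSink
{
  std::vector<std::size_t> flushed_batches;
  void write(const std::vector<int>& rows) { flushed_batches.push_back(rows.size()); }
};

// Stream `input` to `sink` batch by batch and return the largest number of
// rows that was held in memory at any one time.
std::size_t streamInBatches(const std::vector<int>& input, std::size_t batch_size, RecordingSink& sink)
{
  if (batch_size == 0) batch_size = 1000000;  // a batch size of 0 falls back to the default
  std::size_t peak = 0;
  for (std::size_t begin = 0; begin < input.size(); begin += batch_size)
  {
    const std::size_t end = std::min(input.size(), begin + batch_size);
    // Materialize only the current batch instead of the whole table.
    std::vector<int> batch(input.begin() + static_cast<std::ptrdiff_t>(begin),
                           input.begin() + static_cast<std::ptrdiff_t>(end));
    peak = std::max(peak, batch.size());
    sink.write(batch);  // flush; the batch is released at the end of this iteration
  }
  return peak;
}
```

Peak memory in this sketch is bounded by the batch size rather than by the input size, which is the property the streaming export is described as providing.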
| [in] | protein_identifications | Protein identifications (for run-file-name lookup) |
| [in] | peptide_identification_ptrs | NON-OWNING pointers to the PSMs to export. The caller guarantees they outlive the call and are non-null. |
| [in] | filename | Output file path |
| [in] | export_all_psms | If true, export all hits per spectrum (default: best hit only) |
| [in] | batch_size | Number of PeptideIdentifications materialized into one in-memory Arrow table at a time. This is the peak-memory knob ONLY; it does not determine row-group count (with export_all_psms a single PeptideIdentification can emit several rows). 0 is treated as the default. |
| [in] | config | Parquet writing options. config.row_group_size is the maximum number of rows per Parquet row group (the WriteTable chunk size). |
| [in] | n_threads | OpenMP threads used to build each batch's partitions in parallel. 1 = serial (default, preserves prior behaviour); 0 = auto (all available cores, i.e. omp_get_max_threads(), which honours the OMP_NUM_THREADS environment variable); N = fixed count. The per-row build dominates export cost, so parallelism here is the main speedup. Output is identical in row content and order regardless of n_threads (contiguous partitions are written in index order). The Parquet write itself stays serial. Without OpenMP support the export always runs serially. |
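Why the output order does not depend on the thread count can be shown with a small sketch: the input is split into contiguous partitions, each partition is built independently, and the results are concatenated serially in partition index order. The helpers `contiguousPartitions` and `buildRows` and the `n_partitions` parameter are hypothetical and only illustrate the principle; the actual partitioning inside OpenMS may differ. If the code is compiled without OpenMP, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Split [0, n) into `parts` contiguous half-open ranges of near-equal size.
std::vector<std::pair<std::size_t, std::size_t>> contiguousPartitions(std::size_t n, std::size_t parts)
{
  if (parts == 0) parts = 1;
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  const std::size_t base = n / parts;
  const std::size_t extra = n % parts;  // the first `extra` ranges get one more element
  std::size_t begin = 0;
  for (std::size_t p = 0; p < parts; ++p)
  {
    const std::size_t len = base + (p < extra ? 1 : 0);
    ranges.emplace_back(begin, begin + len);
    begin += len;
  }
  return ranges;
}

// Build one row per id, partition by partition, then concatenate the
// partitions in index order so the result matches a serial build.
std::vector<std::string> buildRows(const std::vector<int>& ids, int n_partitions)
{
  const std::size_t parts = n_partitions < 1 ? 1 : static_cast<std::size_t>(n_partitions);
  const auto ranges = contiguousPartitions(ids.size(), parts);
  std::vector<std::vector<std::string>> partial(ranges.size());

  // Each iteration writes only to its own partition, so iterations are independent.
  #pragma omp parallel for
  for (int p = 0; p < static_cast<int>(ranges.size()); ++p)
  {
    const std::size_t idx = static_cast<std::size_t>(p);
    for (std::size_t i = ranges[idx].first; i < ranges[idx].second; ++i)
    {
      partial[idx].push_back("row_" + std::to_string(ids[i]));
    }
  }

  std::vector<std::string> rows;
  for (const auto& part : partial)  // serial concatenation in partition index order
  {
    rows.insert(rows.end(), part.begin(), part.end());
  }
  return rows;
}
```

Because the partitions are contiguous and joined in index order, `buildRows(ids, 1)` and `buildRows(ids, k)` return the same sequence for any positive `k`.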
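static bool importFromArrow (const std::shared_ptr< arrow::Table > &table, std::vector< ProteinIdentification > &protein_identifications, PeptideIdentificationList &peptide_identifications)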
Import PSMs from a PSMSchema Arrow table.
Reads PSMSchema-conformant rows and appends PeptideIdentifications to peptide_identifications. Each row's run_identifier value links the PSM to the ProteinIdentification in protein_identifications that has the same run identifier. If no match exists, a new ProteinIdentification shell is appended.
| [in] | table | PSMSchema Arrow table (must not be null) |
| [in,out] | protein_identifications | Existing protein identifications (used for higher_score_better lookup; new shells appended for unknown run_identifiers) |
| [in,out] | peptide_identifications | Peptide identifications appended to (caller may pass an empty or pre-populated list) |
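The run lookup described above amounts to a find-or-append step. The following is a minimal sketch under stated assumptions: `Run` is a hypothetical stand-in for ProteinIdentification, `findOrAppendRun` is an illustrative helper rather than an OpenMS function, and a newly appended shell is assumed to carry only the identifier with default values elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a protein identification run; NOT the OpenMS class.
struct Run
{
  std::string identifier;
  bool higher_score_better = true;
};

// Return the index of the run whose identifier matches `run_identifier`.
// If none matches, append an otherwise empty shell and return its index.
std::size_t findOrAppendRun(std::vector<Run>& runs, const std::string& run_identifier)
{
  for (std::size_t i = 0; i < runs.size(); ++i)
  {
    if (runs[i].identifier == run_identifier) return i;  // existing run: reuse it
  }
  Run shell;  // assumption: a shell holds only the identifier
  shell.identifier = run_identifier;
  runs.push_back(shell);
  return runs.size() - 1;
}
```

A linear search keeps the sketch short; with many distinct runs, a map from identifier to index would avoid rescanning the vector for every row.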