OpenMS
QPXFile Class Reference

Export PSM (Peptide Spectrum Match) data to Apache Arrow format following QPX PSM schema.

#include <OpenMS/FORMAT/QPXFile.h>

Static Public Member Functions

static std::shared_ptr< arrow::Table > exportToArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false)
 Export PSMs to Arrow table using PSMSchema for lossless round-trips.
 
static std::shared_ptr< arrow::Table > exportPSMsToQPXArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false)
 Export PSMs to QPX Parquet eXchange format Arrow table (QPXPSMSchema).
 
static bool exportToParquet (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, const std::string &filename, bool export_all_psms=false, const ParquetWriteConfig &config=ParquetWriteConfig{})
 Export PSM data to Parquet file.
 
static bool exportToParquet (const std::shared_ptr< arrow::Table > &table, const std::string &filename, const ParquetWriteConfig &config=ParquetWriteConfig{})
 Write a pre-built QPX PSM Arrow table to a Parquet file.
 
static bool exportToParquetStreaming (const std::vector< ProteinIdentification > &protein_identifications, const std::vector< const PeptideIdentification * > &peptide_identification_ptrs, const std::string &filename, bool export_all_psms=false, size_t batch_size=1000000, const ParquetWriteConfig &config=ParquetWriteConfig{}, int n_threads=1)
 Stream PSMs to a QPX Parquet file in row-batches to cap peak memory.
 
static bool importFromArrow (const std::shared_ptr< arrow::Table > &table, std::vector< ProteinIdentification > &protein_identifications, PeptideIdentificationList &peptide_identifications)
 Import PSMs from a PSMSchema Arrow table.
 

Detailed Description

Export PSM (Peptide Spectrum Match) data to Apache Arrow format following QPX PSM schema.

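This class provides static methods to export PeptideIdentification/ProteinIdentification data to Apache Arrow tables and Parquet files, and to import PSMs back from a PSMSchema Arrow table. Exchange output (QPXPSMSchema) follows the QPX (Quantitative Proteomics Exchange) PSM format; exportToArrow() and importFromArrow() use PSMSchema for lossless round-trips.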

Experimental classes:
This API is experimental and may change in future versions.

Member Function Documentation

◆ exportPSMsToQPXArrow()

static std::shared_ptr< arrow::Table > exportPSMsToQPXArrow ( const std::vector< ProteinIdentification > &  protein_identifications,
const PeptideIdentificationList &  peptide_identifications,
bool  export_all_psms = false 
)
static

Export PSMs to QPX Parquet eXchange format Arrow table (QPXPSMSchema).

Unlike exportToArrow(), which produces a PSMSchema table for lossless round-trips, this method produces a QPXPSMSchema table optimized for cross-tool exchange (quantms format).

Parameters
protein_identifications  Protein identifications (for file name lookup)
peptide_identifications  Peptide identifications to export
export_all_psms  If true, export all PSM hits; if false, only best hit per spectrum
Returns
Arrow table with QPXPSMSchema columns, or nullptr on failure
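
The effect of export_all_psms on the number of rows can be illustrated with a minimal sketch. The types Hit, SpectrumId and Row and the function buildRows are hypothetical stand-ins, not OpenMS classes, and the sketch assumes that hits are stored best-first and that identifications without hits emit no row; neither assumption is confirmed by this page.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the OpenMS types.
struct Hit
{
  std::string sequence;
  double score;
};

struct SpectrumId
{
  std::string spectrum_ref;
  std::vector<Hit> hits; // ASSUMPTION: sorted best-first
};

struct Row
{
  std::string spectrum_ref;
  std::string sequence;
  double score;
};

// Flatten identifications into rows: every hit if export_all_psms is true,
// otherwise only the first (assumed best) hit of each spectrum.
std::vector<Row> buildRows(const std::vector<SpectrumId>& ids, bool export_all_psms)
{
  std::vector<Row> rows;
  for (const SpectrumId& id : ids)
  {
    if (id.hits.empty()) continue; // ASSUMPTION: no hits means no row
    const std::size_t n = export_all_psms ? id.hits.size() : 1;
    for (std::size_t i = 0; i < n; ++i)
    {
      rows.push_back({id.spectrum_ref, id.hits[i].sequence, id.hits[i].score});
    }
  }
  return rows;
}
```

With the default (false) the row count is at most the number of identifications; with true it is the total number of hits.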

◆ exportToArrow()

static std::shared_ptr< arrow::Table > exportToArrow ( const std::vector< ProteinIdentification > &  protein_identifications,
const PeptideIdentificationList &  peptide_identifications,
bool  export_all_psms = false 
)
static

Export PSMs to Arrow table using PSMSchema for lossless round-trips.

Produces a table with PSMSchema columns (score, score_type, rank, etc.) suitable for FeatureMapArrowIO and ConsensusMapArrowIO round-trips. For QPX exchange format output, use exportPSMsToQPXArrow() instead.

◆ exportToParquet() [1/2]

static bool exportToParquet ( const std::shared_ptr< arrow::Table > &  table,
const std::string &  filename,
const ParquetWriteConfig &  config = ParquetWriteConfig{} 
)
static

Write a pre-built QPX PSM Arrow table to a Parquet file.

The table is expected to follow QPXPSMSchema (e.g., from exportPSMsToQPXArrow). Attaches QPX file metadata (qpx_version, file_type="psm", UUID, creation_date) before writing. Use this overload when the caller already has the table built (e.g., for merged output) to avoid rebuilding it.

Parameters
[in] table  QPX PSM Arrow table (must not be null)
[in] filename  Output file path
[in] config  Parquet writing options
Returns
true on success, false on error

◆ exportToParquet() [2/2]

static bool exportToParquet ( const std::vector< ProteinIdentification > &  protein_identifications,
const PeptideIdentificationList &  peptide_identifications,
const std::string &  filename,
bool  export_all_psms = false,
const ParquetWriteConfig &  config = ParquetWriteConfig{} 
)
static

Export PSM data to Parquet file.

Parameters
[in] protein_identifications  Vector of protein identifications
[in] peptide_identifications  List of peptide identifications
[in] filename  Output file path
[in] export_all_psms  If true, export all hits per spectrum (default: false, only best hit)
[in] config  Parquet writing options
Returns
true on success, false on error

◆ exportToParquetStreaming()

static bool exportToParquetStreaming ( const std::vector< ProteinIdentification > &  protein_identifications,
const std::vector< const PeptideIdentification * > &  peptide_identification_ptrs,
const std::string &  filename,
bool  export_all_psms = false,
size_t  batch_size = 1000000,
const ParquetWriteConfig &  config = ParquetWriteConfig{},
int  n_threads = 1 
)
static

Stream PSMs to a QPX Parquet file in row-batches to cap peak memory.

Builds and flushes the QPXPSMSchema table in batches through a persistent Parquet writer instead of materializing the entire table in memory. Intended for very large inputs (e.g. millions of PSMs), where the one-shot exportToParquet would spike memory by holding all columns for all rows at once. Output is equivalent to exportToParquet (same schema and QPX metadata), but written as multiple Parquet row groups.
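
How batch_size and config.row_group_size interact can be modelled with a small sketch. planRowGroups is a hypothetical helper, not part of QPXFile, and it assumes that each batch is flushed on its own and split into chunks of at most row_group_size rows, so a row group never spans two batches. The actual writer may behave differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model: rows_per_id[i] is the number of rows emitted by the i-th
// identification (1 for best-hit export, possibly more when exporting all hits).
// Returns the size of every row group that would be written.
// ASSUMPTION: a batch is flushed independently, so row groups never span batches.
std::vector<std::size_t> planRowGroups(const std::vector<std::size_t>& rows_per_id,
                                       std::size_t batch_size,
                                       std::size_t row_group_size)
{
  const std::size_t default_batch_size = 1000000;       // documented default
  if (batch_size == 0) batch_size = default_batch_size; // 0 is treated as the default
  if (row_group_size == 0) row_group_size = 1;          // guard for this sketch only

  std::vector<std::size_t> row_groups;
  const std::size_t n = rows_per_id.size();
  std::size_t begin = 0;
  while (begin < n)
  {
    const std::size_t end = begin + std::min(batch_size, n - begin);

    // Only this batch is held in memory: count the rows it contributes.
    std::size_t batch_rows = 0;
    for (std::size_t i = begin; i < end; ++i) batch_rows += rows_per_id[i];

    // Flush the batch in chunks of at most row_group_size rows.
    while (batch_rows > 0)
    {
      const std::size_t chunk = std::min(batch_rows, row_group_size);
      row_groups.push_back(chunk);
      batch_rows -= chunk;
    }
    begin = end;
  }
  return row_groups;
}
```

In this model batch_size bounds how many identifications are held in memory at once, while row_group_size bounds the rows per row group; the two are independent knobs.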

Parameters
[in] protein_identifications  Protein identifications (for run-file-name lookup)
[in] peptide_identification_ptrs  NON-OWNING pointers to the PSMs to export. The caller guarantees they outlive the call and are non-null.
[in] filename  Output file path
[in] export_all_psms  If true, export all hits per spectrum (default: best hit only)
[in] batch_size  Number of PeptideIdentifications materialized into one in-memory Arrow table at a time. This is the peak-memory knob ONLY; it does not determine row-group count (with export_all_psms a single PeptideIdentification can emit several rows). 0 is treated as the default.
[in] config  Parquet writing options. config.row_group_size is the maximum number of rows per Parquet row group (the WriteTable chunk size).
[in] n_threads  OpenMP threads used to build each batch's partitions in parallel. 1 = serial (default, preserves prior behaviour); 0 = auto (all available cores, i.e. omp_get_max_threads(), which honours the OMP_NUM_THREADS environment variable); N = fixed count. The per-row build dominates export cost, so parallelism here is the main speedup. Output is identical in row content and order regardless of n_threads (contiguous partitions are written in index order). The Parquet write itself stays serial. Without OpenMP support the export always runs serially.
Note
On hyper-threaded CPUs, using all logical cores (0) can be slower than the physical-core count because the build is memory-bandwidth bound; set OMP_NUM_THREADS (or pass N) to the physical-core count for best throughput on such machines.
Returns
true on success, false on error (errors are logged)
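
The n_threads convention and the order guarantee can be sketched without OpenMP. resolveThreads and contiguousPartitions are hypothetical helpers, not QPXFile members; available_cores stands in for the value omp_get_max_threads() would return, and treating negative values as serial is an assumption, since negative counts are not documented here.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: map the documented n_threads convention to a thread count.
// available_cores stands in for omp_get_max_threads().
unsigned resolveThreads(int n_threads, unsigned available_cores)
{
  if (n_threads > 0) return static_cast<unsigned>(n_threads); // N = fixed count
  if (n_threads == 0) return std::max(1u, available_cores);   // 0 = auto
  return 1u; // ASSUMPTION: negative values are undocumented; fall back to serial
}

using Range = std::pair<std::size_t, std::size_t>; // half-open [first, second)

// Split [0, n_items) into at most n_threads contiguous ranges in index order.
// Each range can be built independently; writing the results in range order
// reproduces the serial row order regardless of the thread count.
std::vector<Range> contiguousPartitions(std::size_t n_items, unsigned n_threads)
{
  std::vector<Range> parts;
  if (n_items == 0) return parts;
  const std::size_t n_parts = std::min<std::size_t>(std::max(1u, n_threads), n_items);
  const std::size_t base = n_items / n_parts;
  const std::size_t extra = n_items % n_parts; // the first `extra` parts get one more item
  std::size_t begin = 0;
  for (std::size_t p = 0; p < n_parts; ++p)
  {
    const std::size_t end = begin + base + (p < extra ? 1 : 0);
    parts.push_back(Range(begin, end));
    begin = end;
  }
  return parts;
}
```

Because every partition is a contiguous index range and the ranges are emitted in ascending order, concatenating the per-partition results yields the same sequence as a serial pass.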

◆ importFromArrow()

static bool importFromArrow ( const std::shared_ptr< arrow::Table > &  table,
std::vector< ProteinIdentification > &  protein_identifications,
PeptideIdentificationList &  peptide_identifications 
)
static

Import PSMs from a PSMSchema Arrow table.

Reads PSMSchema-conformant rows and appends PeptideIdentifications to peptide_identifications. Each row's run_identifier column links PSMs back to the matching ProteinIdentification already present in protein_identifications by run identifier. If no match exists, a new ProteinIdentification shell is appended.

Parameters
[in] table  PSMSchema Arrow table (must not be null)
[in,out] protein_identifications  Existing protein identifications (used for higher_score_better lookup; new shells appended for unknown run_identifiers)
[in,out] peptide_identifications  Peptide identifications appended to (caller may pass an empty or pre-populated list)
Returns
true on success, false on schema mismatch or unrecoverable error (errors are logged)
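
The run lookup described above amounts to a find-or-append over the existing runs. The following is a minimal sketch with a hypothetical Run struct in place of ProteinIdentification; the default of higher_score_better = true for a new shell is an assumption made for the example, not documented behaviour.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ProteinIdentification; NOT the OpenMS type.
struct Run
{
  std::string identifier;
  bool higher_score_better;
};

// Return the index of the run whose identifier matches run_identifier.
// If none matches, append an empty "shell" run and return its index.
std::size_t findOrAppendRun(std::vector<Run>& runs, const std::string& run_identifier)
{
  for (std::size_t i = 0; i < runs.size(); ++i)
  {
    if (runs[i].identifier == run_identifier) return i;
  }
  Run shell;
  shell.identifier = run_identifier;
  shell.higher_score_better = true; // ASSUMPTION: default chosen for this sketch
  runs.push_back(shell);
  return runs.size() - 1;
}
```

A caller that wants score orientation preserved on import would, under this model, pre-populate the run list before importing rather than rely on the shell defaults.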