Export PSM (peptide-spectrum match) data to Apache Arrow format following the QPX PSM schema.

#include <OpenMS/FORMAT/QPXFile.h>

Static Public Member Functions
static std::shared_ptr< arrow::Table >	exportToArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false)
	Export PSMs to Arrow table using PSMSchema for lossless round-trips.

static std::shared_ptr< arrow::Table >	exportPSMsToQPXArrow (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, bool export_all_psms=false)
	Export PSMs to an Arrow table in the QPX exchange format (QPXPSMSchema).

static bool	exportToParquet (const std::vector< ProteinIdentification > &protein_identifications, const PeptideIdentificationList &peptide_identifications, const std::string &filename, bool export_all_psms=false, const ParquetWriteConfig &config=ParquetWriteConfig{})
	Export PSM data to Parquet file.

static bool	exportToParquet (const std::shared_ptr< arrow::Table > &table, const std::string &filename, const ParquetWriteConfig &config=ParquetWriteConfig{})
	Write a pre-built QPX PSM Arrow table to a Parquet file.

static bool	exportToParquetStreaming (const std::vector< ProteinIdentification > &protein_identifications, const std::vector< const PeptideIdentification * > &peptide_identification_ptrs, const std::string &filename, bool export_all_psms=false, size_t batch_size=1000000, const ParquetWriteConfig &config=ParquetWriteConfig{}, int n_threads=1)
	Stream PSMs to a QPX Parquet file in row-batches to cap peak memory.

static bool	importFromArrow (const std::shared_ptr< arrow::Table > &table, std::vector< ProteinIdentification > &protein_identifications, PeptideIdentificationList &peptide_identifications)
	Import PSMs from a PSMSchema Arrow table.

Detailed Description

Export PSM (peptide-spectrum match) data to Apache Arrow format following the QPX PSM schema.

This class provides static methods to export PeptideIdentification/ProteinIdentification data to Apache Arrow Tables and Parquet files. The schema follows the QPX (Quantitative Proteomics Exchange) PSM format.

Experimental classes: This API is experimental and may change in future versions.

Member Function Documentation

◆ exportPSMsToQPXArrow()

static std::shared_ptr< arrow::Table > exportPSMsToQPXArrow	(	const std::vector< ProteinIdentification > &	protein_identifications,
		const PeptideIdentificationList &	peptide_identifications,
		bool	export_all_psms = `false`
	)

static

Export PSMs to an Arrow table in the QPX exchange format (QPXPSMSchema).

Unlike exportToArrow(), which produces a PSMSchema table for lossless round-trips, this method produces a QPXPSMSchema table optimized for cross-tool exchange (quantms format).

Parameters

protein_identifications	Protein identifications (for file name lookup)
peptide_identifications	Peptide identifications to export
export_all_psms	If true, export all PSM hits; if false, only best hit per spectrum

Returns: Arrow table with QPXPSMSchema columns, or nullptr on failure

◆ exportToArrow()

static std::shared_ptr< arrow::Table > exportToArrow	(	const std::vector< ProteinIdentification > &	protein_identifications,
		const PeptideIdentificationList &	peptide_identifications,
		bool	export_all_psms = `false`
	)

static

Export PSMs to Arrow table using PSMSchema for lossless round-trips.

Produces a table with PSMSchema columns (score, score_type, rank, etc.) suitable for FeatureMapArrowIO and ConsensusMapArrowIO round-trips. For QPX exchange format output, use exportPSMsToQPXArrow() instead.

◆ exportToParquet() [1/2]

static bool exportToParquet	(	const std::shared_ptr< arrow::Table > &	table,
		const std::string &	filename,
		const ParquetWriteConfig &	config = `ParquetWriteConfig{}`
	)

static

Write a pre-built QPX PSM Arrow table to a Parquet file.

The table is expected to follow QPXPSMSchema (e.g., from exportPSMsToQPXArrow). Attaches QPX file metadata (qpx_version, file_type="psm", UUID, creation_date) before writing. Use this overload when the caller already has the table built (e.g., for merged output) to avoid rebuilding it.

Parameters

[in]	table	QPX PSM Arrow table (must not be null)
[in]	filename	Output file path
[in]	config	Parquet writing options

Returns: true on success, false on error
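
The file-level metadata described above can be pictured as a small set of key/value strings. The following is a minimal sketch only, not the OpenMS implementation: the key spellings, the ISO 8601 timestamp format, the use of a random (version 4) UUID, and the helper names makeUuidV4 and buildQpxFileMetadata are all assumptions made for illustration. The qpx_version value is passed in by the caller because this page does not state it.

```cpp
#include <cstdio>
#include <ctime>
#include <map>
#include <random>
#include <string>

// Hypothetical helper: random (version 4) UUID in 8-4-4-4-12 hex form.
std::string makeUuidV4()
{
  std::random_device rd;
  std::mt19937_64 gen(rd());
  std::uniform_int_distribution<int> byte(0, 255);
  unsigned char b[16];
  for (auto& x : b) x = static_cast<unsigned char>(byte(gen));
  b[6] = static_cast<unsigned char>((b[6] & 0x0F) | 0x40); // version nibble = 4
  b[8] = static_cast<unsigned char>((b[8] & 0x3F) | 0x80); // RFC 4122 variant bits
  std::string out;
  char buf[3];
  for (int i = 0; i < 16; ++i)
  {
    if (i == 4 || i == 6 || i == 8 || i == 10) out += '-';
    std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(b[i]));
    out += buf;
  }
  return out;
}

// Hypothetical helper: the four metadata entries named in the description.
// Key names and the UTC timestamp format are assumptions of this sketch.
std::map<std::string, std::string> buildQpxFileMetadata(const std::string& qpx_version)
{
  char date[32] = "";
  const std::time_t now = std::time(nullptr);
  if (const std::tm* utc = std::gmtime(&now))
  {
    std::strftime(date, sizeof(date), "%Y-%m-%dT%H:%M:%SZ", utc);
  }
  return {{"qpx_version", qpx_version},
          {"file_type", "psm"},
          {"uuid", makeUuidV4()},
          {"creation_date", date}};
}
```

A fresh UUID and timestamp on every call means that two writes of the same table would carry different metadata in this model, even though the row data is identical.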

◆ exportToParquet() [2/2]

static bool exportToParquet	(	const std::vector< ProteinIdentification > &	protein_identifications,
		const PeptideIdentificationList &	peptide_identifications,
		const std::string &	filename,
		bool	export_all_psms = `false`,
		const ParquetWriteConfig &	config = `ParquetWriteConfig{}`
	)

static

Export PSM data to Parquet file.

Parameters

[in]	protein_identifications	Vector of protein identifications
[in]	peptide_identifications	List of peptide identifications
[in]	filename	Output file path
[in]	export_all_psms	If true, export all hits per spectrum (default: false, only best hit)
[in]	config	Parquet writing options

Returns: true on success, false on error

◆ exportToParquetStreaming()

static bool exportToParquetStreaming	(	const std::vector< ProteinIdentification > &	protein_identifications,
		const std::vector< const PeptideIdentification * > &	peptide_identification_ptrs,
		const std::string &	filename,
		bool	export_all_psms = `false`,
		size_t	batch_size = `1000000`,
		const ParquetWriteConfig &	config = `ParquetWriteConfig{}`,
		int	n_threads = `1`
	)

static

Stream PSMs to a QPX Parquet file in row-batches to cap peak memory.

Builds and flushes the QPXPSMSchema table in batches through a persistent Parquet writer instead of materializing the entire table in memory. Intended for very large inputs (e.g. millions of PSMs), where the one-shot exportToParquet would spike memory by holding all columns for all rows at once. Output is equivalent to exportToParquet (same schema and QPX metadata), but written as multiple Parquet row groups.

Parameters

[in]	protein_identifications	Protein identifications (for run-file-name lookup)
[in]	peptide_identification_ptrs	NON-OWNING pointers to the PSMs to export. The caller guarantees they outlive the call and are non-null.
[in]	filename	Output file path
[in]	export_all_psms	If true, export all hits per spectrum (default: best hit only)
[in]	batch_size	Number of PeptideIdentifications materialized into one in-memory Arrow table at a time. This is the peak-memory knob ONLY; it does not determine row-group count (with `export_all_psms` a single PeptideIdentification can emit several rows). 0 is treated as the default.
[in]	config	Parquet writing options. config.row_group_size is the maximum number of rows per Parquet row group (the WriteTable chunk size).
[in]	n_threads	OpenMP threads used to build each batch's partitions in parallel. 1 = serial (default, preserves prior behaviour); 0 = auto (all available cores, i.e. omp_get_max_threads(), which honours the OMP_NUM_THREADS environment variable); N = fixed count. The per-row build dominates export cost, so parallelism here is the main speedup. Output is identical in row content and order regardless of `n_threads` (contiguous partitions are written in index order). The Parquet write itself stays serial. Without OpenMP support the export always runs serially.
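
The interplay between batch_size (counted in PeptideIdentifications) and config.row_group_size (counted in rows) can be sketched with a small standalone model. This is an illustration under stated assumptions, not the OpenMS code: planRowGroups is a hypothetical function, each PeptideIdentification is reduced to its hit count, and the model assumes that a row group never spans two batches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Documented default for batch_size; a value of 0 falls back to it.
constexpr std::size_t kDefaultBatchSize = 1000000;

// Hypothetical model: returns the size (in rows) of every row group that
// would be written, in order. hits_per_id[i] is the number of hits of the
// i-th PeptideIdentification.
std::vector<std::size_t> planRowGroups(const std::vector<std::size_t>& hits_per_id,
                                       bool export_all_psms,
                                       std::size_t batch_size,
                                       std::size_t row_group_size)
{
  assert(row_group_size > 0); // precondition of this sketch
  if (batch_size == 0) batch_size = kDefaultBatchSize;

  std::vector<std::size_t> groups;
  for (std::size_t begin = 0; begin < hits_per_id.size(); begin += batch_size)
  {
    // One batch = up to batch_size identifications held in memory at once.
    const std::size_t end = std::min(begin + batch_size, hits_per_id.size());
    std::size_t rows = 0;
    for (std::size_t i = begin; i < end; ++i)
    {
      // All hits, or at most one (the best hit) per identification.
      rows += export_all_psms ? hits_per_id[i]
                              : std::min<std::size_t>(hits_per_id[i], 1);
    }
    // Assumption: the batch is cut into chunks of at most row_group_size rows.
    for (; rows > 0; rows -= std::min(rows, row_group_size))
    {
      groups.push_back(std::min(rows, row_group_size));
    }
  }
  return groups;
}
```

For hit counts {2, 1, 3, 1, 2} with batch_size 2 and row_group_size 2, the model yields row groups of 2, 1, 2, 2, 2 rows when export_all_psms is true and 2, 2, 1 rows when it is false. The same batch_size therefore produces a different number of row groups depending on how many rows each identification emits.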

Note: On hyper-threaded CPUs, using all logical cores (0) can be slower than the physical-core count because the build is memory-bandwidth bound; set OMP_NUM_THREADS (or pass N) to the physical-core count for best throughput on such machines.
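
The n_threads rules and the order-preserving partitioning can be sketched without OpenMP. This is a hedged illustration, not the library's code: resolveThreadCount and contiguousPartitions are hypothetical names, the available thread count is passed in rather than queried from omp_get_max_threads(), and treating negative values as serial is an assumption, since this page does not say how they are handled.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of the documented n_threads semantics.
int resolveThreadCount(int n_threads, int available_threads, bool has_openmp)
{
  if (!has_openmp || n_threads == 1) return 1;  // serial
  if (n_threads == 0)                           // auto: all available cores
  {
    return available_threads > 0 ? available_threads : 1;
  }
  return n_threads > 0 ? n_threads : 1;         // assumption: negative -> serial
}

// Splits [0, n_items) into at most n_parts contiguous half-open ranges.
// Writing the ranges in index order reproduces the original row order,
// whichever thread built each range.
std::vector<std::pair<std::size_t, std::size_t>> contiguousPartitions(std::size_t n_items,
                                                                      int n_parts)
{
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  const std::size_t parts = n_parts > 0 ? static_cast<std::size_t>(n_parts) : 1;
  const std::size_t base = n_items / parts;
  const std::size_t extra = n_items % parts; // first `extra` ranges get one more item
  std::size_t begin = 0;
  for (std::size_t p = 0; p < parts && begin < n_items; ++p)
  {
    const std::size_t len = base + (p < extra ? 1 : 0);
    ranges.emplace_back(begin, begin + len);
    begin += len;
  }
  return ranges;
}
```

With 10 items and 3 threads the ranges are [0, 4), [4, 7) and [7, 10). Because each range is contiguous and the ranges are concatenated in index order, the result does not depend on which thread finishes first.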

Returns: true on success, false on error (errors are logged)

◆ importFromArrow()

static bool importFromArrow	(	const std::shared_ptr< arrow::Table > &	table,
		std::vector< ProteinIdentification > &	protein_identifications,
		PeptideIdentificationList &	peptide_identifications
	)

static

Import PSMs from a PSMSchema Arrow table.

Reads PSMSchema-conformant rows and appends PeptideIdentifications to peptide_identifications. Each row's run_identifier value links the PSM to the ProteinIdentification in protein_identifications that has the same run identifier. If no match exists, a new ProteinIdentification shell is appended.

Parameters

[in]	table	PSMSchema Arrow table (must not be null)
[in,out]	protein_identifications	Existing protein identifications (used for higher_score_better lookup; new shells appended for unknown run_identifiers)
[in,out]	peptide_identifications	Peptide identifications appended to (caller may pass an empty or pre-populated list)

Returns: true on success, false on schema mismatch or unrecoverable error (errors are logged)
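
The run lookup described above amounts to a find-or-append step. The sketch below models it with a plain struct instead of ProteinIdentification. RunModel and findOrAppendRun are hypothetical names, and the default of higher_score_better = true for a new shell is an assumption of this sketch, not documented behaviour.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the parts of ProteinIdentification used here.
struct RunModel
{
  std::string identifier;
  bool higher_score_better;
};

// Returns the index of the run with the given identifier. If none exists,
// appends an otherwise empty "shell" entry and returns its index.
std::size_t findOrAppendRun(std::vector<RunModel>& runs, const std::string& run_identifier)
{
  for (std::size_t i = 0; i < runs.size(); ++i)
  {
    if (runs[i].identifier == run_identifier) return i;
  }
  runs.push_back(RunModel{run_identifier, true}); // assumption: default orientation
  return runs.size() - 1;
}
```

Existing entries are never modified by the lookup, so a caller that pre-populates the list keeps control over the score orientation of known runs.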
