OpenMS
XICParquetFile Class Reference

Reader for OpenSWATH chromatogram Parquet files (.xic).

#include <OpenMS/FORMAT/XICParquetFile.h>


Classes

struct  XICAnalyte
 Analyte metadata container.
 
struct  XICChromatogram
 Lightweight chromatogram container for XIC parquet rows.
 
struct  XICRunInfo
 Unique run information (run_id, source_file).
 

Public Member Functions

 XICParquetFile (const String &filename)
 Construct from a single .xic file.
 
 XICParquetFile (const std::vector< String > &filenames)
 Construct from multiple .xic files.
 
 XICParquetFile (const XICParquetFile &rhs)=default
 
XICParquetFile & operator= (const XICParquetFile &rhs)=default
 
const String & getFilename () const
 Return the primary filename.
 
const std::vector< String > & getFilenames () const
 Return all filenames associated with this instance.
 
void load (std::vector< XICChromatogram > &output) const
 Load all chromatograms from the file(s).
 
void getChromatograms (std::vector< XICChromatogram > &output, Int64 precursor_id=-1, Int64 transition_id=-1, const String &modified_sequence="", Int64 precursor_charge=-1, Int64 product_charge=-1, Int64 ms_level=-1, Int64 run_id=-1, const String &filter="") const
 Load chromatograms with optional filtering.
 
void getChromatograms (std::vector< XICChromatogram > &output, const ParquetFilter &filter) const
 Return chromatograms using a typed filter expression.
 
void getChromatograms (std::vector< XICChromatogram > &output, const ParquetFilterBuilder &filter) const
 Return chromatograms using a typed filter builder.
 
void getRuns (std::vector< XICRunInfo > &output) const
 Return unique run metadata (run_id, source_file).
 
void getAnalytes (std::vector< XICAnalyte > &output, const std::vector< String > &columns={}, bool nest_transitions=true) const
 Return unique analyte metadata.
 
void getColumns (std::vector< String > &output) const
 Return the parquet schema column names.
 

Private Member Functions

void getChromatograms_ (std::vector< XICChromatogram > &output, const FilterExpression &extra_filter, Int64 precursor_id, Int64 transition_id, const String &modified_sequence, Int64 precursor_charge, Int64 product_charge, Int64 ms_level, Int64 run_id, const String &filter) const
 

Private Attributes

String filename_
 
std::vector< String > filenames_
 

Detailed Description

Reader for OpenSWATH chromatogram Parquet files (.xic).

Supports loading single or multiple files and filtering on metadata columns (e.g., precursor id, transition id, annotations). Filters are applied before decoding RT/intensity binary arrays.

Filter syntax

The filter argument in getChromatograms() accepts simple boolean expressions over column names. Supported operators are:

  • Comparison: =, ==, !=, <, <=, >, >=
  • Set membership: in [v1, v2, ...]
  • Boolean: AND/OR (also accepts &&, ||, &, |)

Values can be integers or strings; strings may be unquoted if they contain no spaces or commas (e.g., annotation=y3^1), otherwise use quotes.

Supported filter columns (case-insensitive): RUN_ID, SOURCE_FILE, MS_LEVEL, PRECURSOR_ID, TRANSITION_ID, MODIFIED_SEQUENCE, PRECURSOR_CHARGE, PRODUCT_CHARGE, DETECTING_TRANSITION, PRECURSOR_DECOY, PRODUCT_DECOY, TRANSITION_ORDINAL, TRANSITION_TYPE, ANNOTATION. RT and INTENSITY are not filterable because they are stored as compressed binary arrays.
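The equality and set-membership forms described above can be illustrated with a minimal, self-contained sketch. It is not the OpenMS parser: `Row`, `matchesClause` and the helper functions are hypothetical names, the row is modelled as a map from upper-case column name to text value, and only a single clause using `=`, `==`, `!=` or `in [...]` is handled (no ordering operators, no AND/OR). Values are compared as text, so `01` and `1` are treated as different here.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for one metadata row: UPPER-CASE column name -> value as text.
using Row = std::map<std::string, std::string>;

static std::string trim(const std::string& s)
{
  const auto b = s.find_first_not_of(" \t");
  if (b == std::string::npos) return "";
  const auto e = s.find_last_not_of(" \t");
  return s.substr(b, e - b + 1);
}

static std::string upper(std::string s)
{
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  return s;
}

// Remove one pair of matching surrounding quotes, if present.
static std::string unquote(const std::string& s)
{
  if (s.size() >= 2 && (s.front() == '"' || s.front() == '\'') && s.back() == s.front())
  {
    return s.substr(1, s.size() - 2);
  }
  return s;
}

// Evaluate one clause: "COLUMN = v", "COLUMN == v", "COLUMN != v" or "COLUMN in [v1, v2, ...]".
// Returns false for malformed clauses and for columns missing from the row.
bool matchesClause(const Row& row, const std::string& clause)
{
  std::string column;
  std::vector<std::string> values;
  bool negate = false;

  const auto lb = clause.find('[');
  if (lb != std::string::npos)
  {
    // Set membership: the text before '[' must end in a separate word "in".
    const auto rb = clause.rfind(']');
    const std::string head = trim(clause.substr(0, lb));
    if (rb == std::string::npos || rb < lb || head.size() < 4) return false;
    if (upper(head.substr(head.size() - 2)) != "IN") return false;
    if (!std::isspace(static_cast<unsigned char>(head[head.size() - 3]))) return false;
    column = trim(head.substr(0, head.size() - 2));
    std::stringstream list(clause.substr(lb + 1, rb - lb - 1));
    for (std::string item; std::getline(list, item, ',');) values.push_back(unquote(trim(item)));
  }
  else
  {
    // Comparison: try "!=", then "==", then a single "=".
    auto pos = clause.find("!=");
    std::size_t len = 2;
    if (pos != std::string::npos) negate = true;
    else if ((pos = clause.find("==")) == std::string::npos) { pos = clause.find('='); len = 1; }
    if (pos == std::string::npos) return false;
    column = trim(clause.substr(0, pos));
    values.push_back(unquote(trim(clause.substr(pos + len))));
  }

  const auto it = row.find(upper(column)); // column names are case-insensitive
  if (it == row.end()) return false;
  const bool hit = std::find(values.begin(), values.end(), it->second) != values.end();
  return hit != negate;
}
```

With a row holding PRECURSOR_ID 1 and TRANSITION_ID 3, both `precursor_id=1` and `TRANSITION_ID in [2, 3]` match in this sketch, while a clause on RT does not, because RT is not among the row's metadata columns.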

Internal processing notes

The implementation uses an Arrow-based pipeline:

  • If Arrow Dataset is available, filters are translated into Arrow expressions and pushed down via dataset scanning.
  • If dataset filtering is unavailable or fails, the same filter expression is evaluated in-memory using Arrow compute.
  • RT/intensity binary arrays are decoded only after filtering.

These steps are implemented in helper functions in the corresponding .cpp file (e.g., dataset scan vs. compute filter fallback and filter parsing). Keeping the helpers in the implementation file avoids exposing Arrow types in the public header.
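The ordering described above (filter on metadata first, decode binary arrays only for surviving rows) can be sketched without Arrow. Everything in this block is a hypothetical stand-in: `StoredRow`, `DecodedRow`, `encode`, `decode` and `filterThenDecode` are illustrative names, and the payload is stored as raw uncompressed bytes, whereas the real format uses compressed arrays.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stored row: cheap metadata plus an undecoded binary payload.
struct StoredRow
{
  std::int64_t precursor_id;
  std::string rt_blob; // raw bytes; the real format is compressed, this sketch is not
};

// Hypothetical decoded row, produced only for rows that pass the filter.
struct DecodedRow
{
  std::int64_t precursor_id;
  std::vector<double> rt;
};

// Stand-in encoder: copies the doubles into a byte string.
std::string encode(const std::vector<double>& values)
{
  if (values.empty()) return std::string();
  return std::string(reinterpret_cast<const char*>(values.data()), values.size() * sizeof(double));
}

// Stand-in decoder: copies the bytes back into doubles.
std::vector<double> decode(const std::string& blob)
{
  std::vector<double> values(blob.size() / sizeof(double));
  if (!values.empty()) std::memcpy(values.data(), blob.data(), values.size() * sizeof(double));
  return values;
}

// Apply the metadata predicate first; decode the payload only for rows that are kept.
// decode_calls reports how many payloads were actually decoded.
std::vector<DecodedRow> filterThenDecode(const std::vector<StoredRow>& rows,
                                         const std::function<bool(const StoredRow&)>& keep,
                                         std::size_t& decode_calls)
{
  std::vector<DecodedRow> out;
  decode_calls = 0;
  for (const auto& row : rows)
  {
    if (!keep(row)) continue; // rejected rows never reach the decoder
    out.push_back({row.precursor_id, decode(row.rt_blob)});
    ++decode_calls;
  }
  return out;
}
```

The point of the design is that the cost of decoding scales with the number of matching rows rather than with the size of the file.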

Note
The .xic schema is defined by MSChromatogramParquetConsumer.
See also
OpenMS::MSChromatogramParquetConsumer

Class Documentation

◆ OpenMS::XICParquetFile::XICAnalyte

struct OpenMS::XICParquetFile::XICAnalyte

Analyte metadata container.

If nest_transitions is false in getAnalytes(), transition-level fields are stored in the scalar members (transition_id, product_charge, etc.). If nest_transitions is true, transition-level fields are stored in the vector members (transition_ids, product_charges, etc.), with one entry per unique transition belonging to the precursor.

Class Members
String annotation
vector< String > annotations
Int64 detecting_transition {0}
vector< Int64 > detecting_transitions
bool has_detecting_transition {false}
bool has_precursor_charge {false}
bool has_precursor_decoy {false}
bool has_precursor_id {false}
bool has_product_charge {false}
bool has_product_decoy {false}
bool has_transition_id {false}
bool has_transition_ordinal {false}
String modified_sequence
Int64 precursor_charge {0}
Int64 precursor_decoy {0}
Int64 precursor_id {0}
Int64 product_charge {0}
vector< Int64 > product_charges
Int64 product_decoy {0}
vector< Int64 > product_decoys
Int64 transition_id {0}
vector< Int64 > transition_ids
Int64 transition_ordinal {0}
vector< Int64 > transition_ordinals
String transition_type
vector< String > transition_types

◆ OpenMS::XICParquetFile::XICChromatogram

struct OpenMS::XICParquetFile::XICChromatogram

Lightweight chromatogram container for XIC parquet rows.

Class Members
String annotation
Int64 detecting_transition {0}
bool has_detecting_transition {false}
bool has_precursor_charge {false}
bool has_precursor_decoy {false}
bool has_precursor_id {false}
bool has_product_charge {false}
bool has_product_decoy {false}
bool has_transition_id {false}
bool has_transition_ordinal {false}
vector< double > intensity
String modified_sequence
Int64 ms_level {0}
Int64 precursor_charge {0}
Int64 precursor_decoy {0}
Int64 precursor_id {0}
Int64 product_charge {0}
Int64 product_decoy {0}
vector< double > rt
Int64 run_id {0}
String source_file
Int64 transition_id {0}
Int64 transition_ordinal {0}
String transition_type

◆ OpenMS::XICParquetFile::XICRunInfo

struct OpenMS::XICParquetFile::XICRunInfo

Unique run information (run_id, source_file).

Class Members
Int64 run_id {0}
String source_file

Constructor & Destructor Documentation

◆ XICParquetFile() [1/3]

XICParquetFile ( const String &  filename)
explicit

Construct from a single .xic file.

Parameters
[in]  filename  Path to an OpenSWATH chromatogram parquet file.

◆ XICParquetFile() [2/3]

XICParquetFile ( const std::vector< String > &  filenames)
explicit

Construct from multiple .xic files.

Parameters
[in]  filenames  Paths to OpenSWATH chromatogram parquet files.

◆ XICParquetFile() [3/3]

XICParquetFile ( const XICParquetFile &  rhs)
default

Member Function Documentation

◆ getAnalytes()

void getAnalytes ( std::vector< XICAnalyte > &  output,
const std::vector< String > &  columns = {},
bool  nest_transitions = true 
) const

Return unique analyte metadata.

If nest_transitions is false, each row represents a unique precursor-transition pair. If nest_transitions is true, each row represents a unique precursor with transition-level fields aggregated into vectors.

This method never decodes RT/intensity arrays and always returns distinct entries.
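The difference between the two output shapes can be shown with a small sketch that folds flat precursor-transition rows into one entry per precursor. `FlatAnalyte`, `NestedAnalyte` and `nestTransitions` are hypothetical stand-ins with only a few fields; they are not the XICAnalyte struct, and the grouping strategy is an assumption rather than the library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flat row: one precursor-transition pair (the nest_transitions = false shape).
struct FlatAnalyte
{
  std::int64_t precursor_id;
  std::int64_t transition_id;
  std::string annotation;
};

// Hypothetical nested row: one precursor, transition fields as parallel vectors.
struct NestedAnalyte
{
  std::int64_t precursor_id;
  std::vector<std::int64_t> transition_ids;
  std::vector<std::string> annotations;
};

// Group flat rows by precursor_id, keeping one entry per unique transition.
std::vector<NestedAnalyte> nestTransitions(const std::vector<FlatAnalyte>& flat)
{
  std::map<std::int64_t, NestedAnalyte> by_precursor; // iterates in precursor_id order
  for (const auto& row : flat)
  {
    NestedAnalyte& nested = by_precursor[row.precursor_id];
    nested.precursor_id = row.precursor_id;
    // Skip transitions that were already recorded for this precursor.
    const auto& ids = nested.transition_ids;
    if (std::find(ids.begin(), ids.end(), row.transition_id) != ids.end()) continue;
    nested.transition_ids.push_back(row.transition_id);
    nested.annotations.push_back(row.annotation);
  }

  std::vector<NestedAnalyte> out;
  for (auto& entry : by_precursor) out.push_back(std::move(entry.second));
  return out;
}
```

Because the vectors are filled in lockstep, index i of `transition_ids` and index i of `annotations` describe the same transition in this sketch.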

Parameters
[out] output            Output analyte metadata
[in]  columns           Optional list of analyte columns to return (empty for defaults)
[in]  nest_transitions  Aggregate transition fields per precursor

◆ getChromatograms() [1/3]

void getChromatograms ( std::vector< XICChromatogram > &  output,
const ParquetFilter &  filter 
) const

Return chromatograms using a typed filter expression.

Parameters
[out] output  Output chromatograms
[in]  filter  Typed filter builder expression

◆ getChromatograms() [2/3]

void getChromatograms ( std::vector< XICChromatogram > &  output,
const ParquetFilterBuilder &  filter 
) const

Return chromatograms using a typed filter builder.

Parameters
[out] output  Output chromatograms
[in]  filter  Typed filter builder

◆ getChromatograms() [3/3]

void getChromatograms ( std::vector< XICChromatogram > &  output,
Int64  precursor_id = -1,
Int64  transition_id = -1,
const String &  modified_sequence = "",
Int64  precursor_charge = -1,
Int64  product_charge = -1,
Int64  ms_level = -1,
Int64  run_id = -1,
const String &  filter = "" 
) const

Load chromatograms with optional filtering.

Parameters
[out] output             Output chromatograms
[in]  precursor_id       Optional precursor id (-1 to ignore)
[in]  transition_id      Optional transition id (-1 to ignore)
[in]  modified_sequence  Optional sequence filter (empty to ignore)
[in]  precursor_charge   Optional charge filter (-1 to ignore)
[in]  product_charge     Optional product charge filter (-1 to ignore)
[in]  ms_level           Optional MS level filter (-1 to ignore)
[in]  run_id             Optional run_id filter (-1 to ignore)
[in]  filter             Optional filter expression on columns (e.g., "PRECURSOR_ID=1 OR TRANSITION_ID in [2,3]")
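The "-1 or empty means ignore" convention of these parameters can be illustrated by a helper that turns the set arguments into a filter string in the documented syntax. This is only a sketch: `buildFilter` is a hypothetical function, it covers a subset of the parameters, and it assumes the individual arguments combine with AND, which this page does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: convert sentinel-style arguments into one filter expression.
// Assumption: active arguments are combined conjunctively (AND).
std::string buildFilter(std::int64_t precursor_id = -1,
                        std::int64_t transition_id = -1,
                        const std::string& modified_sequence = "",
                        std::int64_t run_id = -1)
{
  std::vector<std::string> clauses;
  // -1 (or an empty string) means "do not filter on this column".
  if (precursor_id != -1) clauses.push_back("PRECURSOR_ID=" + std::to_string(precursor_id));
  if (transition_id != -1) clauses.push_back("TRANSITION_ID=" + std::to_string(transition_id));
  // Quote the sequence so that spaces or commas inside it cannot split the clause.
  if (!modified_sequence.empty()) clauses.push_back("MODIFIED_SEQUENCE=\"" + modified_sequence + "\"");
  if (run_id != -1) clauses.push_back("RUN_ID=" + std::to_string(run_id));

  std::string out;
  for (std::size_t i = 0; i < clauses.size(); ++i)
  {
    if (i > 0) out += " AND ";
    out += clauses[i];
  }
  return out; // empty string: no filtering at all
}
```

Calling the helper with no arguments yields an empty string, which mirrors the default-argument call that applies no filtering.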

◆ getChromatograms_()

void getChromatograms_ ( std::vector< XICChromatogram > &  output,
const FilterExpression &  extra_filter,
Int64  precursor_id,
Int64  transition_id,
const String &  modified_sequence,
Int64  precursor_charge,
Int64  product_charge,
Int64  ms_level,
Int64  run_id,
const String &  filter 
) const
private

◆ getColumns()

void getColumns ( std::vector< String > &  output) const

Return the parquet schema column names.

Parameters
[out] output  Column names.

◆ getFilename()

const String & getFilename ( ) const

Return the primary filename.

For multi-file instances this is the first file in the list.

Returns
Primary filename.

◆ getFilenames()

const std::vector< String > & getFilenames ( ) const

Return all filenames associated with this instance.

Returns
All filenames associated with this instance.

◆ getRuns()

void getRuns ( std::vector< XICRunInfo > &  output) const

Return unique run metadata (run_id, source_file).

This method never decodes RT/intensity arrays and always returns distinct rows.

◆ load()

void load ( std::vector< XICChromatogram > &  output) const

Load all chromatograms from the file(s).

Parameters
[out] output  Output chromatograms.

◆ operator=()

XICParquetFile & operator= ( const XICParquetFile &  rhs)
default

Member Data Documentation

◆ filename_

String filename_
private

◆ filenames_

std::vector<String> filenames_
private