OpenMS
UniProtXMLHandler Class Reference

SAX handler for UniProtKB XML <entry> documents.

#include <OpenMS/FORMAT/HANDLERS/UniProtXMLHandler.h>


Public Types

enum class  CaptureTarget {
  None , Accession , EntryName , FullName ,
  PrimaryGene , TaxName , Sequence , FeatureOriginal ,
  FeatureVariation
}
 
using EntryCallback = std::function< void(UniProtEntry &&)>
 
- Public Types inherited from XMLHandler
enum  ActionMode { LOAD , STORE }
 Action to set the current mode (for error messages)
 
enum  LOADDETAIL { LD_ALLDATA , LD_RAWCOUNTS , LD_COUNTS_WITHOPTIONS }
 

Public Member Functions

 UniProtXMLHandler (const std::string &filename, EntryCallback callback)
 Build a handler that delivers each parsed UniProtEntry to callback.
 
 ~UniProtXMLHandler () override
 Destructor.
 
void startElement (const XMLCh *const uri, const XMLCh *const local_name, const XMLCh *const qname, const xercesc::Attributes &attrs) override
 
void endElement (const XMLCh *const uri, const XMLCh *const local_name, const XMLCh *const qname) override
 
void characters (const XMLCh *const chars, const XMLSize_t length) override
 
- Public Member Functions inherited from XMLHandler
 XMLHandler (const std::string &filename, const std::string &version)
 Constructor.
 
 ~XMLHandler () override
 Destructor.
 
void reset ()
 Release internal memory used for parsing.
 
void fatalError (const xercesc::SAXParseException &exception) override
 
void error (const xercesc::SAXParseException &exception) override
 
void warning (const xercesc::SAXParseException &exception) override
 
void fatalError (ActionMode mode, const std::string &msg, UInt line=0, UInt column=0) const
 Fatal error handler. Throws a ParseError exception.
 
void error (ActionMode mode, const std::string &msg, UInt line=0, UInt column=0) const
 Error handler for recoverable errors.
 
void warning (ActionMode mode, const std::string &msg, UInt line=0, UInt column=0) const
 Warning handler.
 
void characters (const XMLCh *const chars, const XMLSize_t length) override
 Parsing method for character data.
 
void startElement (const XMLCh *const uri, const XMLCh *const localname, const XMLCh *const qname, const xercesc::Attributes &attrs) override
 Parsing method for opening tags.
 
void endElement (const XMLCh *const uri, const XMLCh *const localname, const XMLCh *const qname) override
 Parsing method for closing tags.
 
virtual void writeTo (std::ostream &)
 Writes the contents to a stream.
 
virtual LOADDETAIL getLoadDetail () const
 Handlers that support partial loading implement this method.
 
virtual void setLoadDetail (const LOADDETAIL d)
 Handlers that support partial loading implement this method.
 
DataValue cvParamToValue (const ControlledVocabulary &cv, const std::string &parent_tag, const std::string &accession, const std::string &name, const std::string &value, const std::string &unit_accession) const
 Convert the value of a <cvParam value=.> (as commonly found in PSI schemata) to the DataValue with the correct type (e.g. int) according to the type stored in the CV (usually PSI-MS CV), as well as set its unit.
 
DataValue cvParamToValue (const ControlledVocabulary &cv, const CVTerm &raw_term) const
 Convert the value of a <cvParam value=.> (as commonly found in PSI schemata) to the DataValue with the correct type (e.g. int) according to the type stored in the CV (usually PSI-MS CV), as well as set its unit.
 
void checkUniqueIdentifiers_ (const std::vector< ProteinIdentification > &prot_ids) const
 

Private Member Functions

void resetEntry_ ()
 Clear all per-entry state so the next <entry> starts fresh.
 
void resetFeature_ ()
 Clear all per-feature state so the next <feature> starts fresh.
 

Static Private Member Functions

static int parsePosition_ (const std::string &attr)
 

Private Attributes

EntryCallback callback_
 
UniProtEntry current_entry_
 
int depth_ {0}
 Depth of the most recent element start, used to scope subtree handling.
 
int entry_depth_ {0}
 <entry>
 
int recommended_name_depth_ {0}
 <protein>/<recommendedName>
 
int gene_depth_ {0}
 <gene>
 
int organism_depth_ {0}
 <organism>
 
int alt_products_depth_ {0}
 <comment type="alternative products"> (whole subtree skipped)
 
int feature_depth_ {0}
 <feature>
 
int sequence_depth_ {0}
 <sequence length="..."> (only the canonical sequence; isoform <sequence> ignored)
 
bool full_name_captured_ {false}
 
std::string char_buf_
 Buffer for character data; appended-to in characters(), consumed in endElement().
 
CaptureTarget capture_ {CaptureTarget::None}
 Destination for the next buffered run of character data (None == not capturing).
 
UniProtFeature current_feature_
 Per-feature working state (used between startElement("feature") and endElement("feature")).
 
bool gene_name_is_primary_ {false}
 Whether the next <name> encountered inside the current <gene> subtree is a primary name.
 
bool organism_name_is_scientific_ {false}
 Whether the next <name> encountered inside the current <organism> subtree is the scientific name.
 

Additional Inherited Members

- Static Public Member Functions inherited from XMLHandler
static std::string writeXMLEscape (const std::string &to_escape)
 Escapes a string and returns the escaped string.
 
static DataValue fromXSDString (const std::string &type, const std::string &value)
 Convert an XSD type (e.g. 'xsd:double') to a DataValue.
 
- Protected Member Functions inherited from XMLHandler
void writeUserParam_ (const std::string &tag_name, std::ostream &os, const MetaInfoInterface &meta, UInt indent) const
 Writes the content of MetaInfoInterface to the file.
 
Int asInt_ (const std::string &in) const
 Conversion of a std::string to an integer value.
 
Int asInt_ (const XMLCh *in) const
 Conversion of a Xerces string to an integer value.
 
UInt asUInt_ (const std::string &in) const
 Conversion of a std::string to an unsigned integer value.
 
double asDouble_ (const std::string &in) const
 Conversion of a std::string to a double value.
 
float asFloat_ (const std::string &in) const
 Conversion of a std::string to a float value.
 
bool asBool_ (const std::string &in) const
 Conversion of a string to a boolean value.
 
DateTime asDateTime_ (std::string date_string) const
 Conversion of an xs:datetime string to a DateTime value.
 
bool equal_ (const XMLCh *a, const XMLCh *b) const
 Returns whether two Xerces strings are equal.
 
SignedSize cvStringToEnum_ (const Size section, const std::string &term, const char *message, const SignedSize result_on_error=0)
 
std::string attributeAsString_ (const xercesc::Attributes &a, const char *name) const
 Converts an attribute to a String.
 
Int attributeAsInt_ (const xercesc::Attributes &a, const char *name) const
 Converts an attribute to an Int.
 
double attributeAsDouble_ (const xercesc::Attributes &a, const char *name) const
 Converts an attribute to a double.
 
DoubleList attributeAsDoubleList_ (const xercesc::Attributes &a, const char *name) const
 Converts an attribute to a DoubleList.
 
IntList attributeAsIntList_ (const xercesc::Attributes &a, const char *name) const
 Converts an attribute to an IntList.
 
StringList attributeAsStringList_ (const xercesc::Attributes &a, const char *name) const
 Converts an attribute to a StringList.
 
bool optionalAttributeAsString_ (std::string &value, const xercesc::Attributes &a, const char *name) const
 Assigns the attribute content to the String value if the attribute is present.
 
bool optionalAttributeAsInt_ (Int &value, const xercesc::Attributes &a, const char *name) const
 Assigns the attribute content to the Int value if the attribute is present.
 
bool optionalAttributeAsUInt_ (UInt &value, const xercesc::Attributes &a, const char *name) const
 Assigns the attribute content to the UInt value if the attribute is present.
 
bool optionalAttributeAsDouble_ (double &value, const xercesc::Attributes &a, const char *name) const
 Assigns the attribute content to the double value if the attribute is present.
 
bool optionalAttributeAsDoubleList_ (DoubleList &value, const xercesc::Attributes &a, const char *name) const
 Assigns the attribute content to the DoubleList value if the attribute is present.
 
bool optionalAttributeAsStringList_ (StringList &value, const xercesc::Attributes &a, const char *name) const
 Assigns the attribute content to the StringList value if the attribute is present.
 
bool optionalAttributeAsIntList_ (IntList &value, const xercesc::Attributes &a, const char *name) const
 Assigns the attribute content to the IntList value if the attribute is present.
 
std::string attributeAsString_ (const xercesc::Attributes &a, const XMLCh *name) const
 Converts an attribute to a String.
 
Int attributeAsInt_ (const xercesc::Attributes &a, const XMLCh *name) const
 Converts an attribute to an Int.
 
double attributeAsDouble_ (const xercesc::Attributes &a, const XMLCh *name) const
 Converts an attribute to a double.
 
DoubleList attributeAsDoubleList_ (const xercesc::Attributes &a, const XMLCh *name) const
 Converts an attribute to a DoubleList.
 
IntList attributeAsIntList_ (const xercesc::Attributes &a, const XMLCh *name) const
 Converts an attribute to an IntList.
 
StringList attributeAsStringList_ (const xercesc::Attributes &a, const XMLCh *name) const
 Converts an attribute to a StringList.
 
bool optionalAttributeAsString_ (std::string &value, const xercesc::Attributes &a, const XMLCh *name) const
 Assigns the attribute content to the String value if the attribute is present.
 
bool optionalAttributeAsInt_ (Int &value, const xercesc::Attributes &a, const XMLCh *name) const
 Assigns the attribute content to the Int value if the attribute is present.
 
bool optionalAttributeAsUInt_ (UInt &value, const xercesc::Attributes &a, const XMLCh *name) const
 Assigns the attribute content to the UInt value if the attribute is present.
 
bool optionalAttributeAsDouble_ (double &value, const xercesc::Attributes &a, const XMLCh *name) const
 Assigns the attribute content to the double value if the attribute is present.
 
bool optionalAttributeAsDoubleList_ (DoubleList &value, const xercesc::Attributes &a, const XMLCh *name) const
 Assigns the attribute content to the DoubleList value if the attribute is present.
 
bool optionalAttributeAsIntList_ (IntList &value, const xercesc::Attributes &a, const XMLCh *name) const
 Assigns the attribute content to the IntList value if the attribute is present.
 
bool optionalAttributeAsStringList_ (StringList &value, const xercesc::Attributes &a, const XMLCh *name) const
 Assigns the attribute content to the StringList value if the attribute is present.
 
- Protected Attributes inherited from XMLHandler
std::string file_
 File name.
 
std::string version_
 Schema version.
 
StringManager sm_
 Helper class for string conversion.
 
std::vector< std::string > open_tags_
 Stack of open XML tags.
 
LOADDETAIL load_detail_
 parse only until total number of scans and chroms have been determined from attributes
 
std::vector< std::vector< std::string > > cv_terms_
 Array of CV term lists (one sublist denotes one term and its children)
 

Detailed Description

SAX handler for UniProtKB XML <entry> documents.

Streams individual <entry> elements into UniProtEntry POD instances and delivers each one via a callback at </entry>. The handler implements the same scoping rules as the upstream C# UniPEFF tool. Isoform sequences (<sequence> elements without a "length" attribute) and the <comment type="alternative products"> subtree are skipped. <gene> and <organism> are scanned via bounded depth tracking so that their inner <name> elements never leak into the top-level "entry mnemonic" capture. <feature> contents are buffered and classified at the closing tag rather than incrementally.
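The rule for skipping isoform sequences can be illustrated without Xerces. The following is a minimal sketch, not the OpenMS implementation: SequenceGateDemo, DemoAttrs and runSequenceGateDemo are hypothetical names, attributes are modelled as a plain std::map, and the attribute values are illustrative. It assumes that a <sequence> is treated as canonical only when it carries a "length" attribute.

```cpp
#include <map>
#include <string>

// Hypothetical attribute container; the real handler receives xercesc::Attributes.
using DemoAttrs = std::map<std::string, std::string>;

struct SequenceGateDemo
{
  int depth = 0;
  int sequence_depth = 0; // 0 == not inside a canonical <sequence>
  std::string sequence;

  void start(const std::string& tag, const DemoAttrs& attrs)
  {
    ++depth;
    // Only a <sequence> that carries "length" opens the gate.
    if (tag == "sequence" && attrs.count("length") != 0) sequence_depth = depth;
  }

  void characters(const std::string& chunk)
  {
    // Text is kept only while the gate is open at the current depth.
    if (sequence_depth != 0 && depth == sequence_depth) sequence += chunk;
  }

  void end()
  {
    if (depth == sequence_depth) sequence_depth = 0; // closing the gated element
    --depth;
  }
};

// Feeds one isoform-style and one canonical <sequence> through the demo handler.
inline std::string runSequenceGateDemo()
{
  SequenceGateDemo h;
  h.start("entry", {});
  h.start("sequence", {{"type", "displayed"}}); // no "length": ignored
  h.characters("SHOULD-NOT-APPEAR");
  h.end();
  h.start("sequence", {{"length", "6"}});       // canonical
  h.characters("MKT");
  h.characters("AYI");
  h.end();
  h.end();
  return h.sequence;
}
```

In this sketch only the text of the second <sequence> is retained, because the first one never opens the gate.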

Member Typedef Documentation

◆ EntryCallback

using EntryCallback = std::function<void(UniProtEntry&&)>

Callback invoked once per </entry>: receives ownership of the populated UniProtEntry. The handler resets its working state immediately after, so the callback is the only place the entry is reachable.
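The ownership hand-off described above can be sketched with stand-in types. This is an illustration under assumptions, not the OpenMS API: DemoEntry, DemoEntryCallback, DemoEmitter and collectDemoEntries are hypothetical, the fields of the real UniProtEntry are not shown on this page, and the accessions are placeholder values.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for UniProtEntry; the real field names may differ.
struct DemoEntry
{
  std::string accession;
  std::string sequence;
};

// Same shape as EntryCallback, but for the stand-in type.
using DemoEntryCallback = std::function<void(DemoEntry&&)>;

// Mimics what a handler might do at </entry>: hand the entry over, then reset.
struct DemoEmitter
{
  DemoEntryCallback callback;
  DemoEntry current;

  void finishEntry()
  {
    callback(std::move(current));
    current = DemoEntry{}; // working state is cleared; only the callback saw the entry
  }
};

// Collects every delivered entry by moving it into a vector owned by the caller.
inline std::vector<DemoEntry> collectDemoEntries()
{
  std::vector<DemoEntry> out;
  DemoEmitter emitter{[&out](DemoEntry&& e) { out.push_back(std::move(e)); }, DemoEntry{}};
  emitter.current = DemoEntry{"P12345", "MKT"};
  emitter.finishEntry();
  emitter.current = DemoEntry{"Q67890", "MSA"};
  emitter.finishEntry();
  return out;
}
```

Because the entry arrives as an rvalue reference, a callback that wants to keep it has to move or copy it before returning, as the lambda does here.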

Member Enumeration Documentation

◆ CaptureTarget

enum class CaptureTarget
strong

Where the next character-data run, if buffered, should be deposited at the closing tag of its enclosing element.

Enumerator
None 
Accession 
EntryName 
FullName 
PrimaryGene 
TaxName 
Sequence 
FeatureOriginal 
FeatureVariation 
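The buffer-then-deposit pattern that CaptureTarget supports can be sketched as follows. SAX parsers may deliver one text node in several characters() calls, which is why the text is appended to a buffer and only assigned at the closing tag. The sketch uses hypothetical names (DemoTarget, CaptureDemo, runCaptureDemo), a reduced set of targets, and plain strings instead of XMLCh; it is not the OpenMS implementation.

```cpp
#include <string>

// Reduced, hypothetical version of the capture enumeration.
enum class DemoTarget { None, Accession, Sequence };

struct CaptureDemo
{
  DemoTarget capture = DemoTarget::None;
  std::string buf;
  std::string accession;
  std::string sequence;

  // Choose the destination for the text of the element that just opened.
  void start(const std::string& tag)
  {
    buf.clear();
    if (tag == "accession") capture = DemoTarget::Accession;
    else if (tag == "sequence") capture = DemoTarget::Sequence;
    else capture = DemoTarget::None;
  }

  // One text node may arrive in several chunks, so append rather than assign.
  void characters(const std::string& chunk)
  {
    if (capture != DemoTarget::None) buf += chunk;
  }

  // Deposit the buffered text at the closing tag, then stop capturing.
  void end()
  {
    switch (capture)
    {
      case DemoTarget::Accession: accession = buf; break;
      case DemoTarget::Sequence:  sequence = buf;  break;
      case DemoTarget::None:      break;
    }
    capture = DemoTarget::None;
    buf.clear();
  }
};

// Drives the demo handler with chunked text, including one ignored element.
inline CaptureDemo runCaptureDemo()
{
  CaptureDemo h;
  h.start("accession"); h.characters("P12"); h.characters("345"); h.end();
  h.start("comment");   h.characters("ignored");                  h.end();
  h.start("sequence");  h.characters("MKT"); h.characters("AYI"); h.end();
  return h;
}
```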

Constructor & Destructor Documentation

◆ UniProtXMLHandler()

UniProtXMLHandler ( const std::string &  filename,
EntryCallback  callback 
)

Build a handler that delivers each parsed UniProtEntry to callback.

◆ ~UniProtXMLHandler()

~UniProtXMLHandler ( )
override

Destructor.

Member Function Documentation

◆ characters()

void characters ( const XMLCh *const  chars,
const XMLSize_t  length 
)
override

◆ endElement()

void endElement ( const XMLCh *const  uri,
const XMLCh *const  local_name,
const XMLCh *const  qname 
)
override

◆ parsePosition_()

static int parsePosition_ ( const std::string &  attr)
static private

Parse a UniProt position attribute string; returns 0 when the attribute is absent or non-numeric (UniProt encodes "unknown" by omitting the attribute).
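One way to obtain the documented behaviour is shown below. This is a sketch under assumptions: parsePositionSketch is a hypothetical free function, the use of std::from_chars (C++17) is a choice made for this example, and treating trailing non-digits such as "12x" as non-numeric is an assumption that the page does not confirm.

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical free-function version of the documented contract:
// returns 0 when the text is empty or is not a plain decimal integer.
inline int parsePositionSketch(const std::string& attr)
{
  if (attr.empty()) return 0; // absent attribute modelled as an empty string

  int value = 0;
  const char* first = attr.data();
  const char* last = first + attr.size();
  const auto result = std::from_chars(first, last, value);

  // Reject conversion errors and (by assumption) trailing garbage.
  if (result.ec != std::errc() || result.ptr != last) return 0;
  return value;
}
```

Note that 0 then serves both as "unknown" and as a parsed zero, so callers in this sketch cannot tell the two apart.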

◆ resetEntry_()

void resetEntry_ ( )
private

Clear all per-entry state so the next <entry> starts fresh.

◆ resetFeature_()

void resetFeature_ ( )
private

Clear all per-feature state so the next <feature> starts fresh.

◆ startElement()

void startElement ( const XMLCh *const  uri,
const XMLCh *const  local_name,
const XMLCh *const  qname,
const xercesc::Attributes &  attrs 
)
override

Member Data Documentation

◆ alt_products_depth_

int alt_products_depth_ {0}
private

<comment type="alternative products"> (whole subtree skipped)

◆ callback_

EntryCallback callback_
private

◆ capture_

CaptureTarget capture_ {CaptureTarget::None}
private

Destination for the next buffered run of character data (None == not capturing).

◆ char_buf_

std::string char_buf_
private

Buffer for character data; appended-to in characters(), consumed in endElement().

◆ current_entry_

UniProtEntry current_entry_
private

◆ current_feature_

UniProtFeature current_feature_
private

Per-feature working state (used between startElement("feature") and endElement("feature")).

◆ depth_

int depth_ {0}
private

Depth of the most recent element start, used to scope subtree handling.

◆ entry_depth_

int entry_depth_ {0}
private

<entry>

Subtree gates. Each holds the depth at which the corresponding element opened so we know when its closing tag fires (depth_ == gate value during endElement). 0 means "not currently inside this subtree".
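The gate mechanism can be sketched with a single gate for <gene>. The names GateDemo and runGateDemo are hypothetical, and the sketch only illustrates the depth comparison described above, not the actual OpenMS code. It assumes that the depth counter is incremented on every element start and decremented after the end-of-element check.

```cpp
#include <string>
#include <vector>

// Minimal stand-in for the depth-gate pattern with one gate.
struct GateDemo
{
  int depth = 0;
  int gene_depth = 0; // 0 == not currently inside <gene>
  std::vector<std::string> log;

  void start(const std::string& tag)
  {
    ++depth;
    if (tag == "gene" && gene_depth == 0) gene_depth = depth; // open the gate
    // A <name> is classified by whether the gate is open when it starts.
    if (tag == "name") log.push_back(gene_depth != 0 ? "gene name" : "entry name");
  }

  void end()
  {
    // The gated element closes when the current depth equals the stored depth.
    if (gene_depth != 0 && depth == gene_depth) gene_depth = 0;
    --depth;
  }
};

// Walks <entry><name/><gene><name/></gene><name/></entry> through the demo.
inline std::vector<std::string> runGateDemo()
{
  GateDemo h;
  h.start("entry");
  h.start("name"); h.end(); // top-level <name>
  h.start("gene");
  h.start("name"); h.end(); // <name> inside <gene>
  h.end();                  // </gene> closes the gate
  h.start("name"); h.end(); // gate is closed again
  h.end();
  return h.log;
}
```

Storing the opening depth rather than a boolean lets the closing tag of the gated element be recognised even when elements with other names are nested inside it.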

◆ feature_depth_

int feature_depth_ {0}
private

<feature>

◆ full_name_captured_

bool full_name_captured_ {false}
private

Set once the first <fullName> inside <protein>/<recommendedName> has been captured; later <fullName> elements (e.g. alternative names) are skipped.

◆ gene_depth_

int gene_depth_ {0}
private

<gene>

◆ gene_name_is_primary_

bool gene_name_is_primary_ {false}
private

Whether the next <name> encountered inside the current <gene> subtree is a primary name.

◆ organism_depth_

int organism_depth_ {0}
private

<organism>

◆ organism_name_is_scientific_

bool organism_name_is_scientific_ {false}
private

Whether the next <name> encountered inside the current <organism> subtree is the scientific name.

◆ recommended_name_depth_

int recommended_name_depth_ {0}
private

<protein>/<recommendedName>

◆ sequence_depth_

int sequence_depth_ {0}
private

<sequence length="..."> (only the canonical sequence; isoform <sequence> ignored)