OpenMS
2.8.0
|
An Aho Corasick trie (a set of nodes with suffix links mainly) More...
#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>
Public Member Functions | |
ACTrie (uint32_t max_aaa=0, uint32_t max_mm=0) | |
Default C'tor which just creates a root node. More... | |
~ACTrie () | |
D'tor. More... | |
void | addNeedle (const std::string &needle) |
void | addNeedles (const std::vector< std::string > &needles) |
void | addNeedlesAndCompress (const std::vector< std::string > &needles) |
void | compressTrie () |
Traverses the trie in BFS order and makes it more compact and efficient to traverse. More... | |
size_t | getNeedleCount () const |
How many needles were added to the trie? More... | |
void | setMaxAAACount (const uint32_t max_aaa) |
uint32_t | getMaxAAACount () const |
Maximum number of ambiguous amino acids allowed during search. More... | |
void | setMaxMMCount (const uint32_t max_mm) |
uint32_t | getMaxMMCount () const |
Maximum number of mismatches allowed during search. More... | |
bool | nextHits (ACTrieState &state) const |
void | getAllHits (ACTrieState &state) const |
Private Member Functions | |
bool | nextHitsNoClear_ (ACTrieState &state) const |
Index | add_ (const Index from, const AA edge) |
bool | addHits_ (Index i, const size_t text_pos, std::vector< Hit > &hits) const |
Add all hits occurring in node i (including all its suffix hits) More... | |
bool | addHitsSpawn_ (Index i, const ACSpawn &spawn, const size_t text_pos, std::vector< Hit > &hits, const int current_spawn_depths) const |
same as addHits_, but only follows the suffix chain until the spawn loses its prefix More... | |
Index | follow_ (const Index i, const AA edge) const |
bool | followSpawn_ (ACSpawn &spawn, const AA edge, ACTrieState &state) const |
Index | stepMaster_ (const Index i, const AA edge, ACTrieState &state) const |
bool | stepSpawn_ (ACSpawn &spawn, ACTrieState &state) const |
void | createSpawns_ (const Index i, const AA fromAA, const AA toAA, ACTrieState &state, const uint32_t aaa_left, const uint32_t mm_left) const |
void | createSubSpawns_ (const ACSpawn &prototype, const AA fromAA, const AA toAA, ACTrieState &state) const |
Create spawns from a spawn with an AAA or MM, using prototype as template, following edges in range fromAA to toAA . More... | |
void | createMMSpawns_ (const Index i, const AA except_fromAA, const AA except_toAA, const AA except_edge, ACTrieState &state, const uint32_t aaa_left, const uint32_t mm_left) const |
Same as createSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself. More... | |
void | createMMSubSpawns_ (const ACSpawn &prototype, const AA except_fromAA, const AA except_toAA, const AA except_edge, ACTrieState &state) const |
Same as createSubSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA to except_toAA and the except_edge itself. More... | |
Index | findChildNaive_ (Index parent, AA child_label) |
During needle addition (naive trie), obtain the child with edge child_label from parent ; if it does not exist, an invalid Index is returned. More... | |
Index | findChildBFS_ (const Index parent, const AA child_label) const |
After compression (BFS trie), obtain the child with edge child_label from parent ; if it does not exist, an invalid Index is returned. More... | |
Private Attributes | |
std::vector< ACNode > | trie_ |
the trie, in either naive structure or BFS order (after compressTrie) More... | |
uint32_t | needle_count_ {0} |
total number of needles in the trie More... | |
uint32_t | max_aaa_ {0} |
maximum number of ambAAs allowed More... | |
uint32_t | max_mm_ {0} |
maximum number of mismatches allowed More... | |
std::unordered_map< Index, std::vector< uint32_t > > | umap_index2needles_ |
maps a node to which needles end there (valid for both naive and BFS tree) More... | |
std::unordered_map< Index, std::vector< Index > > | umap_index2children_naive_ |
maps the child nodes of a node for the naive tree; only needed during naive trie construction; storing children in the BFS trie modeled in the ACNodes directly More... | |
An Aho Corasick trie (a set of nodes with suffix links mainly)
ACTrie | ( | uint32_t | max_aaa = 0 , |
uint32_t | max_mm = 0 |
||
) |
Default C'tor which just creates a root node.
max_aaa | Maximum number of ambiguous amino acids (B,J,Z,X) allowed in a hit |
max_mm | Maximum number of mismatched amino acids allowed in a hit |
~ACTrie | ( | ) |
D'tor.
Insert a new child node into the trie (unless already present) when starting at parent node from
and following the edge labeled edge
. Return the index of the (new) child node. Note: This operates on the naive trie, not the BFS.
Add all hits occurring in node i
(including all its suffix hits)
i | The ACNode where a needle ends (also all its suffices are checked) | |
text_pos | current position in query (i.e. end of matched hit) | |
[out] | hits | Result vector which will be expanded with hits (if any) |
|
private |
same as addHits_, but only follows the suffix chain until the spawn loses its prefix
void addNeedle | ( | const std::string & | needle | ) |
Add a needle to build up the trie. Call compressTrie() after the last needle was added before searching
Exception::InvalidValue | if needle contains an invalid amino acid (such as '*') |
void addNeedles | ( | const std::vector< std::string > & | needles | ) |
Convenience function; adds needles to build up the trie. Call compressTrie() after the last needle was added before searching
Exception::InvalidValue | if needles contains an invalid amino acid (such as '*'); you can use getNeedleCount() to trace which needle did cause the exception |
void addNeedlesAndCompress | ( | const std::vector< std::string > & | needles | ) |
Convenience function which adds needles and immediately compresses the trie (i.e. no more needles can be added)
Exception::InvalidValue | if needles contains an invalid amino acid (such as '*'); you can use getNeedleCount() to trace which needle did cause the exception |
void compressTrie | ( | ) |
Traverses the trie in BFS order and makes it more compact and efficient to traverse.
Also creates the suffix links.
Call this function after adding all needles, and before searching any queries.
|
private |
Same as createSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA
to except_toAA
and the except_edge
itself.
|
private |
Same as createSubSpawns_, but instantiate all possible AA's except for those in the range from except_fromAA
to except_toAA
and the except_edge
itself.
|
private |
Create spawns from an AAA or MM, starting at trie node i
, following edges in range fromAA
to toAA
The number of AAA's/MM's left for the spawn must be passed (these numbers already reflect the original edge label)
|
private |
Create spawns from a spawn with an AAA or MM, using prototype
as template, following edges in range fromAA
to toAA
.
After compression (BFS trie), obtain the child with edge child_label
from parent
; if it does not exist, an invalid Index is returned.
During needle addition (naive trie), obtain the child with edge child_label
from parent
; if it does not exist, an invalid Index is returned.
Starting at node i
, find the child with label edge
If no child exists, follow the suffix link and try again (until root is reached) Note: This operates on the BFS trie (after compressTrie()), not the naive one.
|
private |
Advances spawn
by consuming edge
; same as follow_(), just for a spawn Returns true if spawn is still alive, false otherwise
void getAllHits | ( | ACTrieState & | state | ) | const |
Collects all hits from the current position in the query until the end of the query I.e. similar to while(next(state)) merge(hits_all, state.hits);
uint32_t getMaxAAACount | ( | ) | const |
Maximum number of ambiguous amino acids allowed during search.
uint32_t getMaxMMCount | ( | ) | const |
Maximum number of mismatches allowed during search.
size_t getNeedleCount | ( | ) | const |
How many needles were added to the trie?
bool nextHits | ( | ACTrieState & | state | ) | const |
Resume search at the last position in the query and node in the trie. If a node (or any suffices) are a hit, then state.hits
is cleared & filled and true is returned. If the query ends and there is no hit, false is returned.
|
private |
Resume search at the last position in the query and node in the trie. If a node (or any suffices) are a hit, then state.hits
is NOT cleared, but filled and true is returned. If the query ends and all spawns are processed, false is returned (but hits might still have changed)
void setMaxAAACount | ( | const uint32_t | max_aaa | ) |
Set maximum number of ambiguous amino acids allowed during search. This must not be called in the middle of a search. Otherwise search results will be mixed.
void setMaxMMCount | ( | const uint32_t | max_mm | ) |
Set maximum number of mismatches allowed during search. This must not be called in the middle of a search. Otherwise search results will be mixed.
|
private |
Same as follow_, but considers trying mismatches and AAA's if possible (by adding spawns to state
)
edge
|
private |
Same as follow_, but considers trying mismatches and AAA's if possible (by adding spawns to state
)
|
private |
maximum number of ambAAs allowed
|
private |
maximum number of mismatches allowed
|
private |
total number of needles in the trie
|
private |
the trie, in either naive structure or BFS order (after compressTrie)
maps the child nodes of a node for the naive tree; only needed during naive trie construction; storing children in the BFS trie modeled in the ACNodes directly
|
private |
maps a node to which needles end there (valid for both naive and BFS tree)