OpenMS provides generic hierarchical clustering. The example (Tutorial_Clustering.cpp) shows how to build a rudimentary clustering pipeline.
Any type of data can be clustered, as long as a SimilarityComparator for that type is provided. The comparator's ()-operator has to return a similarity in the range [0,1] for any two elements of the type, so that it can be transformed into a distance. Some SimilarityComparators are already implemented; for example, OpenMS::PeakSpectrumCompareFunctor is the base class for SimilarityComparators of the PeakSpectrum type.
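The contract described above (a functor returning a similarity in [0,1] that is then turned into a distance) can be illustrated without OpenMS. The following is a minimal sketch only: pairwiseDistances is a hypothetical helper invented for this illustration, not an OpenMS function, and a plain nested std::vector stands in for a distance matrix.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of OpenMS): builds a symmetric distance
// matrix from any functor whose operator() returns a similarity in [0,1].
template <typename Data, typename SimilarityComparator>
std::vector<std::vector<double>> pairwiseDistances(const std::vector<Data>& data,
                                                   const SimilarityComparator& similarity)
{
  const std::size_t n = data.size();
  std::vector<std::vector<double>> dist(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
  {
    for (std::size_t j = i + 1; j < n; ++j)
    {
      const double s = similarity(data[i], data[j]);
      if (s < 0.0 || s > 1.0)
      {
        // The comparator violated the [0,1] contract.
        throw std::range_error("similarity must lie in [0,1]");
      }
      dist[i][j] = dist[j][i] = 1.0 - s; // distance = 1 - similarity
    }
  }
  return dist;
}
```

Because the helper is a template, any element type works as long as the functor accepts two elements of that type, which mirrors the requirement stated in the text.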
The SimilarityComparator in this example is very basic: it takes one-dimensional input of doubles in the range [0,1]. Real input will generally be more complex, and so will the corresponding SimilarityComparator. Note that the example calculates similarity as 1 - distance, whereas in general the distance is derived from the similarity and not the other way round.
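A comparator of the kind described here might look roughly as follows. The class name LowLevelComparator is taken from the text; the body is an assumption based on the description (similarity = 1 - distance for doubles in [0,1]) and is not guaranteed to match the tutorial source line by line.

```cpp
#include <cmath>

// Sketch of a similarity comparator for doubles in [0,1].
// The name follows the text; the body is an assumption.
class LowLevelComparator
{
public:
  double operator()(const double first, const double second) const
  {
    // For inputs in [0,1] the absolute difference is itself in [0,1].
    const double distance = std::fabs(first - second);
    // Identical values give similarity 1, maximally distant values give 0.
    return 1.0 - distance;
  }
};
```

The functor is stateless and symmetric in its arguments, so it can be passed by const reference and evaluated for each pair only once.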
Clustering is carried out by the OpenMS::ClusterHierarchical class, which offers a simple interface for it.
The ClusterHierarchical functions need at least these arguments; setting the threshold is optional (it defaults to 1.0). The template arguments have to be set to the type of the clustered data and the type of the CompareFunctor used, in this example double and LowLevelComparator.
This function will create a hierarchical clustering up to the threshold. See Output.
If it is known at what threshold (see OpenMS::ClusterHierarchical::cluster) a reasonable clustering is produced, setting that threshold can speed up the clustering process. Once the threshold is exceeded, the rest of the resulting tree (a std::vector of OpenMS::BinaryTreeNode) is filled with dummy nodes. The tree represents the hierarchy of clusters by storing the stepwise merging process. It can then be transformed into a tree representation in Newick format and/or analysed with the other methods the OpenMS::ClusterAnalyzer class provides.
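To make the role of the threshold and the dummy nodes concrete, here is a self-contained sketch of naive single-linkage clustering that stops merging once the closest pair of clusters is farther apart than the threshold. It is not the OpenMS implementation: MergeStep and singleLinkage are hypothetical names, and marking dummy steps with a distance of -1.0 is a convention chosen for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for one merge step; not the OpenMS BinaryTreeNode.
struct MergeStep
{
  std::size_t left;  // smallest element index of the first cluster
  std::size_t right; // smallest element index of the second cluster
  double distance;   // merge distance, or -1.0 for a dummy step
};

// Naive single linkage on a symmetric distance matrix (taken by value,
// because it is updated while clusters are merged).
std::vector<MergeStep> singleLinkage(std::vector<std::vector<double>> dist,
                                     double threshold = 1.0)
{
  const std::size_t n = dist.size();
  std::vector<bool> active(n, true);
  std::vector<MergeStep> tree;
  for (std::size_t step = 0; step + 1 < n; ++step)
  {
    // Find the closest pair of still-active clusters.
    double best = std::numeric_limits<double>::infinity();
    std::size_t bi = 0, bj = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
      if (!active[i]) continue;
      for (std::size_t j = i + 1; j < n; ++j)
      {
        if (active[j] && dist[i][j] < best)
        {
          best = dist[i][j];
          bi = i;
          bj = j;
        }
      }
    }
    if (best > threshold) break; // no pair is close enough: stop early
    tree.push_back({bi, bj, best});
    // Merge cluster bj into bi; single linkage keeps the minimum distance.
    for (std::size_t k = 0; k < n; ++k)
    {
      if (active[k] && k != bi && k != bj)
      {
        dist[bi][k] = dist[k][bi] = std::min(dist[bi][k], dist[bj][k]);
      }
    }
    active[bj] = false;
  }
  // Pad with dummy steps so the result always has n - 1 entries.
  while (tree.size() + 1 < n) tree.push_back({0, 0, -1.0});
  return tree;
}
```

With the default threshold of 1.0 and distances in [0,1], every merge is carried out and no dummy steps appear. A lower threshold ends the search early, which is where the potential speed-up comes from.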
The output will look something like this (it may vary, since random numbers are used in this example):
For a closer look at the clustering process, one can also view the whole hierarchy by opening the tree in Newick format with a tree viewer such as TreeViewX. A particular clustering step (which gives rise to a certain partition of the clustered data) can be visualized with heatmaps (for example with gnuplot 4.3 heatmaps and the corresponding distance matrix).
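How a stored merge sequence turns into a Newick string can be sketched in a few lines. This is an illustration under stated assumptions, not the OpenMS::ClusterAnalyzer implementation: MergeStep and toNewick are hypothetical names, a negative distance marks a dummy step, and clusters left unmerged are simply joined under one root. The exact formatting produced by OpenMS may differ.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one merge step; not the OpenMS BinaryTreeNode.
struct MergeStep
{
  std::size_t left;  // smallest element index of the first cluster
  std::size_t right; // smallest element index of the second cluster
  double distance;   // merge distance, negative for a dummy step
};

// Builds a Newick string (without branch lengths) for n elements.
std::string toNewick(const std::vector<MergeStep>& tree, std::size_t n)
{
  // label[i] is the subtree text of the cluster whose smallest element is i.
  std::map<std::size_t, std::string> label;
  for (std::size_t i = 0; i < n; ++i) label[i] = std::to_string(i);

  for (const MergeStep& m : tree)
  {
    if (m.distance < 0.0) break; // dummy step: the hierarchy was cut here
    label[m.left] = "(" + label[m.left] + "," + label[m.right] + ")";
    label.erase(m.right);
  }

  // Join whatever clusters remain; more than one means the tree was cut.
  std::string out;
  for (const auto& entry : label)
  {
    if (!out.empty()) out += ",";
    out += entry.second;
  }
  if (label.size() > 1) out = "(" + out + ")";
  return out + ";";
}
```

For four elements merged as (0,1), then 2, then 3, the function returns `(((0,1),2),3);`, which a Newick-capable tree viewer should be able to display.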
OpenMS / TOPP release 2.3.0 | Documentation generated on Tue Jan 9 2018 18:22:05 using doxygen 1.8.13 |