Clustering

OpenMS provides generic hierarchical clustering. The example (Tutorial_Clustering.cpp) shows how to build a rudimentary clustering pipeline.

Input data

Any type of data can be clustered, as long as a SimilarityComparator for that type is provided. The comparator's ()-operator has to return a similarity in the range [0,1] for any two elements of the type, so that it can be transformed into a distance. Some SimilarityComparators are already implemented; for example, OpenMS::PeakSpectrumCompareFunctor is the base class for SimilarityComparators of the PeakSpectrum type.

class LowLevelComparator
{
public:
  double operator()(const double first, const double second) const
  {
    double x, y;
    x = min(second, first);
    y = max(first, second);
    if ((y - x) > 1)
    {
      throw Exception::InvalidRange(__FILE__, __LINE__, OPENMS_PRETTY_FUNCTION);
    }
    return 1 - (y - x);
  }
}; // end of LowLevelComparator

This example of a SimilarityComparator is very basic and takes one-dimensional input of doubles in the range [0,1]. Real input will generally be more complex, and so will the corresponding SimilarityComparator. Note that the example calculates similarity as 1 - distance, whereas in general the distance is derived from the similarity, not the other way round.
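The relationship between similarity and distance can be sketched without any OpenMS dependency. The names AbsDiffSimilarity and toDistance below are hypothetical and exist only for illustration; the sketch assumes that the distance is obtained as 1 - similarity, which is the inverse of what the example comparator does internally.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Hypothetical stand-in for a SimilarityComparator on doubles in [0,1].
// It mirrors the LowLevelComparator idea but does not depend on OpenMS.
struct AbsDiffSimilarity
{
  double operator()(double first, double second) const
  {
    const double diff = std::fabs(first - second);
    if (diff > 1.0)
    {
      throw std::invalid_argument("inputs must lie within [0,1]");
    }
    return 1.0 - diff; // similarity in [0,1]
  }
};

// Assumed transformation: distance = 1 - similarity.
// Works with any functor that returns a similarity in [0,1].
template <typename Data, typename Similarity>
double toDistance(const Data& a, const Data& b, const Similarity& sim)
{
  const double s = sim(a, b);
  assert(s >= 0.0 && s <= 1.0);
  return 1.0 - s;
}
```

Calling toDistance(0.25, 0.75, AbsDiffSimilarity()) yields a distance of 0.5, while two identical values yield 0.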

Clustering

The OpenMS::ClusterHierarchical class performs the clustering and offers an easy-to-use interface for it.

int main()
{
  // data
  vector<double> data; // must be filled
  LowLevelComparator llc;
  CompleteLinkage sl;
  vector<BinaryTreeNode> tree;
  DistanceMatrix<float> dist; // will be filled
  ClusterHierarchical ch;
  ch.setThreshold(0.15);

The ClusterHierarchical functions need at least these arguments; setting the threshold is optional (the default is 1.0). The template arguments have to be set to the type of the clustered data and the type of the CompareFunctor used, in this example double and LowLevelComparator.

  // clustering
  ch.cluster<double, LowLevelComparator>(data, llc, sl, tree, dist);

This function will create a hierarchical clustering up to the threshold. See Output.
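To illustrate what clustering "up to the threshold" can mean, here is a minimal, OpenMS-independent sketch of complete-linkage agglomerative clustering on doubles. MergeStep and completeLinkage are hypothetical names, and the stopping rule (stop as soon as the closest pair of clusters is farther apart than the threshold) is an assumption made for this sketch, not a description of the internals of ClusterHierarchical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical record of one merge; not the OpenMS BinaryTreeNode.
struct MergeStep
{
  std::size_t left;   // representative element index of the first cluster
  std::size_t right;  // representative element index of the second cluster
  double distance;    // complete-linkage distance at which they were merged
};

std::vector<MergeStep> completeLinkage(const std::vector<double>& data, double threshold)
{
  assert(threshold >= 0.0);
  // Every cluster is a list of element indices; start with singletons.
  std::vector<std::vector<std::size_t>> clusters;
  for (std::size_t i = 0; i < data.size(); ++i) clusters.push_back({i});

  std::vector<MergeStep> steps;
  while (clusters.size() > 1)
  {
    double best = std::numeric_limits<double>::max();
    std::size_t bi = 0, bj = 0;
    for (std::size_t i = 0; i < clusters.size(); ++i)
    {
      for (std::size_t j = i + 1; j < clusters.size(); ++j)
      {
        // Complete linkage: cluster distance is the largest pairwise distance.
        double d = 0.0;
        for (std::size_t a : clusters[i])
          for (std::size_t b : clusters[j])
            d = std::max(d, std::fabs(data[a] - data[b]));
        if (d < best) { best = d; bi = i; bj = j; }
      }
    }
    if (best > threshold) break; // assumed stopping rule
    steps.push_back({clusters[bi].front(), clusters[bj].front(), best});
    // Absorb cluster bj into cluster bi and drop bj.
    clusters[bi].insert(clusters[bi].end(), clusters[bj].begin(), clusters[bj].end());
    clusters.erase(clusters.begin() + static_cast<std::ptrdiff_t>(bj));
  }
  return steps;
}
```

With the data {0.0, 0.1, 0.5, 0.55} and a threshold of 0.15, this sketch performs two merges (elements 2 and 3, then elements 0 and 1) and stops, because the two remaining clusters are 0.55 apart. A lower threshold therefore means fewer merge steps have to be computed.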

Output

If it is known at what threshold (see OpenMS::ClusterHierarchical::cluster) a reasonable clustering is produced, setting that threshold can speed up the clustering process. Once the threshold is exceeded, the remainder of the resulting tree (a std::vector of OpenMS::BinaryTreeNode) is filled with dummy nodes. The tree represents the hierarchy of clusters by storing the stepwise merging process. It can then be transformed into a tree representation in Newick format and/or analysed with the other methods that the OpenMS::ClusterAnalyzer class provides.

  ClusterAnalyzer ca;
  std::cout << ca.newickTree(tree) << std::endl;
  return 0;
} //end of main

The output will look something like this (it may vary, since random numbers are used in this example):

( ( ( ( ( 0 , 1 ) , ( 2 , ( 7 , 8 ) ) ) , ( ( 3 , 10 ) , ( 4 , 5 ) ) ) , ( 6 , 9 ) ) , 11 )

To examine the clustering process more closely, one can view the whole hierarchy by opening the tree in Newick format with a tree viewer such as TreeViewX. A particular clustering step (which gives rise to a certain partition of the clustered data) can be visualized with heatmaps (for example with gnuplot 4.3 heatmaps and the corresponding distance matrix).
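How a stepwise merge list maps to a Newick-style string can be sketched as follows. This is not the ClusterAnalyzer implementation: Merge and toNewick are hypothetical names, and the sketch assumes that each merge refers to its two clusters by a representative element index and that the left index goes on to represent the merged cluster.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical merge record: the clusters represented by elements
// `left` and `right` are joined, and `left` represents the result.
struct Merge
{
  std::size_t left;
  std::size_t right;
};

// Builds a Newick-style string (topology only, no branch lengths)
// from merges given in the order in which they happened.
std::string toNewick(const std::vector<Merge>& merges)
{
  std::map<std::size_t, std::string> subtree; // representative -> Newick text
  std::string last;                           // most recently built subtree
  for (const Merge& m : merges)
  {
    assert(m.left != m.right);
    // A representative that has not been merged yet is still a single leaf.
    if (!subtree.count(m.left)) subtree[m.left] = std::to_string(m.left);
    if (!subtree.count(m.right)) subtree[m.right] = std::to_string(m.right);
    last = "( " + subtree[m.left] + " , " + subtree[m.right] + " )";
    subtree[m.left] = last;
    subtree.erase(m.right);
  }
  return last; // the whole tree if the merges form a complete hierarchy
}
```

For the merges (2,3), (0,1), (0,2) the function returns "( ( 0 , 1 ) , ( 2 , 3 ) )", which a Newick-capable tree viewer can display once a terminating semicolon is appended where the viewer requires it.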


OpenMS / TOPP release 2.3.0 Documentation generated on Tue Jan 9 2018 18:22:05 using doxygen 1.8.13