OpenMS provides generic hierarchical clustering. The example (Tutorial_Clustering.cpp) shows how to build a rudimentary clustering pipeline.

Input data

Any type of data can be clustered, as long as a SimilarityComparator for that type is provided. The comparator's ()-operator must return a similarity in the range [0,1] for any two elements of the type, so that the similarity can be transformed into a distance. Some SimilarityComparators are already implemented; for example, OpenMS::PeakSpectrumCompareFunctor is the base class of the SimilarityComparators for the PeakSpectrum type.

class LowLevelComparator
{
public:
  // Returns the similarity of two values from [0,1] as 1 - |first - second|.
  double operator()(const double first, const double second) const
  {
    const double x = min(second, first);
    const double y = max(first, second);
    if ((y - x) > 1)
    {
      // a difference above 1 would yield a similarity below 0
      throw Exception::InvalidRange(__FILE__, __LINE__, OPENMS_PRETTY_FUNCTION);
    }
    return 1 - (y - x);
  }
}; // end of LowLevelComparator

This SimilarityComparator is very basic: it takes one-dimensional input of doubles in the range [0,1]. Real input will generally be more complex, and so will the corresponding SimilarityComparator. Note that the example computes the similarity as 1 - distance, whereas in general the distance is derived from the similarity, not the other way round.
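As an illustration of a comparator for more complex input, the following is a minimal sketch that does not use OpenMS. The class CosineComparator and the helper toDistance are hypothetical names introduced here; the sketch assumes equally long vectors of non-negative values (for example binned intensities), for which the cosine similarity lies in [0,1], and derives the distance as 1 - similarity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical comparator (not part of OpenMS): cosine similarity of two
// equally long vectors of non-negative values. For such input the result
// lies in [0,1].
class CosineComparator
{
public:
  double operator()(const std::vector<double>& first,
                    const std::vector<double>& second) const
  {
    assert(first.size() == second.size());
    double dot = 0.0, norm_first = 0.0, norm_second = 0.0;
    for (std::size_t i = 0; i < first.size(); ++i)
    {
      dot += first[i] * second[i];
      norm_first += first[i] * first[i];
      norm_second += second[i] * second[i];
    }
    if (norm_first == 0.0 || norm_second == 0.0)
    {
      return 0.0; // convention chosen for this sketch: an all-zero vector matches nothing
    }
    const double similarity = dot / std::sqrt(norm_first * norm_second);
    return std::min(1.0, similarity); // guard against rounding slightly above 1
  }
};

// The distance is derived from the similarity, not the other way round.
template <typename Data, typename Comparator>
double toDistance(const Data& a, const Data& b, const Comparator& cmp)
{
  return 1.0 - cmp(a, b);
}
```

Any functor with this call signature and value range could play the role of LowLevelComparator for its data type; the choice of cosine similarity is only an example.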

Clustering

Clustering is performed by the OpenMS::ClusterHierarchical class, which offers an easy way to run it.

Int main()
{
  // data
  vector<double> data; // must be filled
  LowLevelComparator llc;
  SingleLinkage sl;
  vector<BinaryTreeNode> tree;
  DistanceMatrix<float> dist; // will be filled
  ClusterHierarchical ch;
  ch.setThreshold(0.15);

The ClusterHierarchical functions need at least these arguments; setting the threshold is optional (the default is 1.0). The template arguments must be set to the type of the clustered data and the type of the CompareFunctor used, in this example double and LowLevelComparator.

// clustering

ch.cluster<double, LowLevelComparator>(data, llc, sl, tree, dist);

This function will create a hierarchical clustering up to the threshold. See Output.
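To make "clustering up to the threshold" concrete, here is a minimal standalone sketch of agglomerative clustering with single linkage. It is not the OpenMS implementation: MergeStep, distanceMatrix and clusterUpTo are hypothetical names, the distance is assumed to be 1 - similarity of the LowLevelComparator (that is, the absolute difference), and the sketch simply stops at the threshold instead of appending dummy nodes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical merge record for this sketch; OpenMS::BinaryTreeNode may differ.
struct MergeStep
{
  std::size_t left;  // smallest element index in the first cluster
  std::size_t right; // smallest element index in the second cluster
  double distance;   // distance at which the two clusters were merged
};

// Pairwise distances for values in [0,1]: 1 - similarity = |a - b|.
std::vector<std::vector<double>> distanceMatrix(const std::vector<double>& data)
{
  const std::size_t n = data.size();
  std::vector<std::vector<double>> dist(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
  {
    for (std::size_t j = 0; j < n; ++j)
    {
      dist[i][j] = std::fabs(data[i] - data[j]);
    }
  }
  return dist;
}

// Repeatedly merges the two closest clusters until the smallest remaining
// distance exceeds the threshold (single linkage: cluster distance = minimum).
std::vector<MergeStep> clusterUpTo(std::vector<std::vector<double>> dist, double threshold)
{
  const std::size_t n = dist.size();
  for (const auto& row : dist) { assert(row.size() == n); }
  std::vector<bool> active(n, true); // row i represents a cluster while active[i]
  std::vector<MergeStep> merges;
  while (merges.size() + 1 < n)
  {
    double best = std::numeric_limits<double>::max();
    std::size_t bi = 0, bj = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
      if (!active[i]) continue;
      for (std::size_t j = i + 1; j < n; ++j)
      {
        if (active[j] && dist[i][j] < best)
        {
          best = dist[i][j];
          bi = i;
          bj = j;
        }
      }
    }
    if (best > threshold) break; // no pair is close enough any more
    merges.push_back({bi, bj, best});
    for (std::size_t k = 0; k < n; ++k)
    {
      if (active[k] && k != bi && k != bj)
      {
        dist[bi][k] = dist[k][bi] = std::min(dist[bi][k], dist[bj][k]);
      }
    }
    active[bj] = false; // cluster bj now lives in row bi
  }
  return merges;
}
```

With a threshold of 0.15, only pairs of clusters closer than 0.15 are merged; a lower threshold therefore ends the search earlier, which is where the potential speed-up comes from.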

Output

If it is known at what threshold (see OpenMS::ClusterHierarchical::cluster) a reasonable clustering is produced, setting that threshold can speed up the clustering process. Once the threshold is exceeded, the rest of the resulting tree (a std::vector of OpenMS::BinaryTreeNode) is filled with dummy nodes. The tree represents the hierarchy of clusters by storing the stepwise merging process. It can then be converted to a tree representation in Newick format and/or be analysed with the other methods that the OpenMS::ClusterAnalyzer class provides.

  ClusterAnalyzer ca;
  std::cout << ca.newickTree(tree) << std::endl;
  return 0;
} //end of main

The output will look something like this (it may vary, since random numbers are used in this example):

( ( ( ( ( 0 , 1 ) , ( 2 , ( 7 , 8 ) ) ) , ( ( 3 , 10 ) , ( 4 , 5 ) ) ) , ( 6 , 9 ) ) , 11 )
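The relation between the stored merge steps and a string like the one above can be sketched as follows. This is a standalone illustration, not the code behind ClusterAnalyzer::newickTree: MergeStep and toNewick are hypothetical names, and the sketch assumes that every merge refers to a cluster by its smallest element index and that the merged cluster keeps the index stored in left.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical merge record for this sketch; OpenMS::BinaryTreeNode may differ.
struct MergeStep
{
  std::size_t left;  // smallest element index in the first cluster
  std::size_t right; // smallest element index in the second cluster
  double distance;   // distance at which the two clusters were merged
};

// Replays the merges in order and builds the nested-parentheses text of the
// most recently merged cluster. With element_count - 1 merges this is the
// whole tree; with no merges the result is empty.
std::string toNewick(const std::vector<MergeStep>& merges, std::size_t element_count)
{
  // label[i] is the text of the cluster currently represented by index i
  std::vector<std::string> label(element_count);
  for (std::size_t i = 0; i < element_count; ++i)
  {
    label[i] = std::to_string(i);
  }
  std::string last;
  for (const MergeStep& m : merges)
  {
    assert(m.left < element_count && m.right < element_count);
    label[m.left] = "( " + label[m.left] + " , " + label[m.right] + " )";
    label[m.right].clear(); // this cluster was absorbed into m.left
    last = label[m.left];
  }
  return last;
}
```

For example, the merges (0,1), (2,3) and then (0,2) over four elements give "( ( 0 , 1 ) , ( 2 , 3 ) )".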

For a closer look at the clustering process, the whole hierarchy can be inspected by opening the Newick-format tree in a tree viewer such as TreeViewX. A particular clustering step, which corresponds to a certain partition of the clustered data, can be visualized with heatmaps (for example with gnuplot 4.3 heatmaps and the corresponding distance matrix).
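How a clustering step gives rise to a partition can be sketched by replaying only the first few merges. The code below is a standalone illustration under the same assumptions as before (clusters are referred to by their smallest element index, and the merged cluster keeps the index stored in left); MergeStep and partitionAfter are hypothetical names and not part of OpenMS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical merge record for this sketch; OpenMS::BinaryTreeNode may differ.
struct MergeStep
{
  std::size_t left;  // smallest element index in the first cluster
  std::size_t right; // smallest element index in the second cluster
  double distance;   // distance at which the two clusters were merged
};

// Returns the partition of the elements 0..element_count-1 that exists after
// the first `steps` merges have been applied.
std::vector<std::vector<std::size_t>> partitionAfter(const std::vector<MergeStep>& merges,
                                                     std::size_t element_count,
                                                     std::size_t steps)
{
  assert(steps <= merges.size());
  // members[i] lists the elements of the cluster represented by index i
  std::vector<std::vector<std::size_t>> members(element_count);
  for (std::size_t i = 0; i < element_count; ++i)
  {
    members[i].push_back(i);
  }
  for (std::size_t s = 0; s < steps; ++s)
  {
    const MergeStep& m = merges[s];
    assert(m.left < element_count && m.right < element_count);
    members[m.left].insert(members[m.left].end(),
                           members[m.right].begin(), members[m.right].end());
    members[m.right].clear(); // absorbed into the cluster at m.left
  }
  std::vector<std::vector<std::size_t>> partition;
  for (const auto& cluster : members)
  {
    if (!cluster.empty())
    {
      partition.push_back(cluster);
    }
  }
  return partition;
}
```

Reordering the rows and columns of the distance matrix by such a partition is one way to prepare the data for a heatmap of that step.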